Most people who use Azure Application insights to monitor their applications will not look at it until something is wrong and only then will they look for what exceptions are thrown to see what is going on. In my opinion if you want to build high available systems you also want to be able to see if everything is working as normal when there are no problems.
When you build a monolithic application it’s often quite easy to find where certain performance bottle necks are by monitoring cpu and memory usage. When we look into distributed systems and microservice architectures an application will often span multiple services with even more instances running into thousands of machines, service busses, APIs you name it. How do we monitor this by looking into CPU, memory and all other traditional monitoring measures. You simply can’t.
In these types of scenarios where you have several instances or maybe even thousands of instances we have to look for other things. One thing you could do is come up with a KPI of measuring a service that your application is providing and seeing how often this is completed. To make this a bit simpler to understand lets look at an example:
Netflix is famous for their micro service architecture spanning thousands of machines and they monitor on SPS (Starts per second). With the millions of subscribers they have this number is something that should be fairly predictable. That’s why they monitor for this and if this number is affected something must be wrong (If people start more often maybe playback isn’t working so they keep pressing play? If less people press play maybe the UI is broken and the event is not coming down to the server or something else might be wrong.) By just monitoring 1 number they can use this if the overall health of the system is OK or not. You can learn more at the Netflix technology blog.
So how do you start with something like this yourself?
Finding the right KPI
There is not 1 solution to find the right KPI that is best for measuring. But there are some things you might consider. First of all it has to be important for your business. Next to that it would be nice if the number was somewhat stable or has clear patterns. This all depends on your business and application.
Maybe it’s best to start with another example we used for one of our clients. We’ll take this example from the initial idea to how we actually monitor it using Azure Log Analytics and Application Insights.
The application we worked on had to do calculations every few minutes and these calculations could take up from 10 seconds to about a minute. It was really important that the end result of these calculations were send customers / other systems every X minutes. Because of this the development team added logging to Application Insights that stored the calculation time for each cycle. During the day the calculation time ranged from fast (10 seconds) to slow (1 minute) because of several parameters. I’ve drawn a picture of what the graph looked that took all the App Insights calculation times and plotted it over time.
The Graph looked like this. Initially the dev team only created this view to monitor health of the calculation times. A big problem in here is that it provides no information of what is “normal”. As humans we are quite good at recognizing patterns and after showing this picture to several people they all noted. Wow somewhere between 9:00 and 12:00 in the morning there must be something wrong.
The problem is that this data is only the data of 1 day. It does not even have a pattern. There are several external influences that have impact on calculation times. One of them is customer orders being created. This application is a business to business application and a majority of orders is created during the morning of european working hours. This is why we need more data in our graph so we can actually see if there are some patterns.
In the next graph I’ve plotted the data of a full work week on the same area to see if we can find patterns.
When we plot this full week of calculation times we can see that there is quite a pattern to be found. Next to that we can also very easily spot where something is not following our pattern. Is the high curve just before 12:00 still an anomaly? Guess not… But what is happening in the afternoon? Data that first looked like being part of some pattern in our heads does stand out all of a sudden. I think we’ve found our KPI that we want to measure.
When developing an application adding counters and logging information is important to be able to create these kinds of dashboards. If you are not sure on what to measure. Just start with business functions start/completed and each service start/completed/retried. This gives you a starting point. from there on you can come up with new measures and counters.
An important area of DevOps is as developers we have to start thinking more like Ops. what are good things to measure, monitor etc. In the past few years I often come across Devs telling Ops to become more like Developers by adding automation and doing stuff as code but it’s also important to focus on the other way around. Devs taking ownership of what they are building and making it easy to see if the application is still working like it is supposed to.
As Devs you have far more knowledge of what could cause certain delays, outages etc because you know how the application is working internally. So join forces and work together.
Implement it using Azure Application Insights and Azure Log Analytics
So now we have a pretty good idea of what we want on a dashboard. How do we implement this? Since the title of this post is about Application insights and Azure log analytics i’m assuming you already have Azure Application insights in place. If not here is a guide. When we have access to an Application insights instance we can start doing our custom measurements. In this post we’ll focus on measuring calculation times similar to the example above but you could do this with any type of measurement.
How to track timing in app insights?
We can use the code above to track custom timing of pieces of code. We’ll create a DependencyTelemetry object, Fill in the name and type properties call Start, do your calculation and if it succeeds set the success to true and then finally call the Stop method so the timer is always stopped. This is all the code you need. When you run your app now and go to Application insights open the Analytics tab and run a query showing all “dependencies with name “CalculationCycle”. Since we haven’t logged anything else we’ll just query all dependencies and voila there are our timings in the duration field. So our application is logging the calculation times. Now it is time to create a dashboard that shows the “normal” state and values from the last 24 hours.
Creating a kusto Query in log analytics:
We want to create a similar graph as i drew in this post earlier. We could have all these colored lines for all the different days but what is even better is that we can take the data for the last month and combine it. When we create the query we’re actually building 2 series and we will combine them in the end to display a graph. The first series we will create will be called “Today” and will show all the values of the calculation time and will summarize them per hour. The second series we create is called “LastMonth” and will take all the values of the last 30 days and will group them by hours of the day as well. We also only take the 90 percentile of the values so we remove values that are special cases.
Run the query to get the graph below. You can pin this graph to a dashboard and now you can see your calculation times compared to average calculation times of the last month on a per hour basis.
For our scenario this worked really well. If you create something similar make sure that the last 30 days is a good comparison. Should calculations be the same every day of the week or are your calculations taking longer on a Monday compared to the Friday? if that is the case you might want to tweak your query so you are actually comparing to your “normal” state.
Hopefully this post helped you set up a dashboard view of viewing a “normal” state of your application that you could have displayed near your team working area to see if everything is still working as you expected it to.
Happy Coding (and observing)
Geert van der Cruijsen