by David Lutz
For those working in operations teams, looking at graphs is a daily ritual. Not just glancing at a dashboard occasionally (although that is useful) but really exploring the data represented by those graphs.
The amount of data generated by a busy production website is astounding. You have everything from low-level metrics such as CPU and disk utilization, through custom metrics generated by your application, to high-level business metrics such as page views or sales.
These metrics can be viewed in many ways. Sometimes histograms or scatter plots are useful and give great insight into what's happening in your systems. But probably the most common way to view data is a plot of values on the Y axis against time on the X axis: the ubiquitous time series graph.
Have you really thought about what you are searching for when looking at these graphs?
Probably you are doing one of two quite different things: looking for anomalies or looking for trends. An anomaly is a dip or spike in the graph that isn't usually there. It's probably an indication that something is broken, and it's usually seen over a short period of time. A trend is some kind of behavior over a longer period of time that you might describe in words like "the CPU utilization is going up".
By looking at graphs we can make predictions like “at the current rate of increase we’re going to run out of disk space in a week”.
Our brains are really good at finding patterns like this and understanding what’s going on.
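That kind of "run out of disk space in a week" prediction is just a linear extrapolation. Here's a minimal sketch in Python: fit a straight line to recent usage samples and solve for the day the line crosses capacity. The sample data is made up for illustration.

```python
def days_until_full(samples, capacity_gb):
    """Estimate when disk fills up, assuming roughly linear growth.

    samples: list of (day_number, used_gb) pairs.
    Returns the projected day number, or None if usage isn't growing.
    """
    n = len(samples)
    mean_x = sum(x for x, _ in samples) / n
    mean_y = sum(y for _, y in samples) / n
    # Ordinary least-squares slope and intercept
    slope = (sum((x - mean_x) * (y - mean_y) for x, y in samples)
             / sum((x - mean_x) ** 2 for x, _ in samples))
    if slope <= 0:
        return None  # usage flat or shrinking: no projected exhaustion
    intercept = mean_y - slope * mean_x
    # Solve capacity = slope * day + intercept for day
    return (capacity_gb - intercept) / slope

# Hypothetical data: 400 GB used, growing 10 GB/day, on a 500 GB disk
usage = [(day, 400 + 10 * day) for day in range(5)]
print(days_until_full(usage, 500))  # → 10.0 (full in ten days)
```

In practice you'd eyeball this straight off the graph, but the arithmetic is the same.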
At 99designs we have lots of graphs of data from many sources. We also make use of a number of third party services such as New Relic. The data provided by the New Relic web interface is great and we like the service. However, it has some limitations. They smooth the data out to give you a good idea of trends, but this summarizing makes it harder to see spikes.
These two graphs are of exactly the same data: response time from our application servers over a 24 hour period. It's surprising how different a story smoothing can tell. Did we have one incident on the site or two?
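The effect is easy to reproduce. This sketch (with made-up numbers) averages one-minute response times into half-hour buckets; a 900 ms spike that dominates the raw series nearly vanishes after averaging.

```python
# Hypothetical one-minute response times (ms) with one short spike
minute_data = [120] * 30 + [900] + [120] * 29  # spike at minute 30

def smooth(series, bucket):
    """Average consecutive groups of `bucket` points."""
    return [sum(series[i:i + bucket]) / bucket
            for i in range(0, len(series), bucket)]

print(max(minute_data))              # → 900 (spike obvious in raw data)
print(max(smooth(minute_data, 30)))  # → 146.0 (spike mostly averaged away)
```

A 900 ms incident becomes a 146 ms blip once it's averaged with 29 quiet minutes, which is exactly why a smoothed graph can show one incident where the raw data shows two.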
The bottom graph was created from our Graphite server. New Relic has an API you can use to extract the metrics they collect. We use some scripts to import this data, which we have open sourced here: https://github.com/99designs/vacuumetrix
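Once you've pulled a value out of an API, getting it into Graphite is simple: the carbon daemon accepts metrics over a plaintext protocol, one `metric.path value timestamp` line per data point, on port 2003 by default. Here's a minimal sketch; the hostname and metric path are placeholders, not part of vacuumetrix.

```python
import socket
import time

def format_metric(path, value, timestamp=None):
    """Build one line of carbon's plaintext protocol."""
    if timestamp is None:
        timestamp = int(time.time())
    return "%s %s %d\n" % (path, value, timestamp)

def send_metric(host, path, value, port=2003):
    """Send a single data point to a carbon daemon."""
    line = format_metric(path, value)
    with socket.create_connection((host, port)) as sock:
        sock.sendall(line.encode("ascii"))

# Example (placeholder host and metric path):
# send_metric("graphite.example.com", "newrelic.app.response_time", 146.0)
print(format_metric("newrelic.app.response_time", 146.0, 1234567890))
```

Anything that can open a TCP socket can feed Graphite, which is what makes pulling data out of third-party APIs and into your own store so easy.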
Once you have the data in your own tool you can create graphs just how you like them. Graphite is a fantastic tool for exploring data in an ad hoc way, or for building dashboards.
To control how spiky your data appears you can use Graphite's summarize() function.
This is throughput data from the last couple of months. If you're looking for a trend you can summarize it into daily buckets.
Or into a weekly summary.
It’s exactly the same data pulled from New Relic every minute and stored in Graphite’s whisper datastore at whatever resolution and retention period you like.
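In render API terms, those three graphs differ only in the target expression. Something like the following, where the metric path is a placeholder:

```
# Raw series (one-minute points pulled from New Relic)
target=newrelic.app.throughput

# The same series bucketed into daily averages
target=summarize(newrelic.app.throughput, "1day", "avg")

# Or weekly
target=summarize(newrelic.app.throughput, "1week", "avg")
```

Note that summarize() defaults to summing each bucket, which is often what you want for counters like throughput; for gauges like response time you'd pass "avg" as above.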
Another extremely useful Graphite function is timeShift(), which moves your data along the time axis.
This lets you easily compare (for example) this week to last week.
Just clone your graph, time shift one and drag to merge the two.
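The same comparison can be expressed directly as render API targets. Again the metric path is a placeholder:

```
# This week's throughput
target=newrelic.app.throughput

# The same series shifted by one week, overlaid for comparison
target=timeShift(newrelic.app.throughput, "7d")
```

Overlaying a series with its week-shifted self makes weekly seasonality (and genuine week-on-week change) jump straight out of the graph.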
Graphite isn’t the only tool we use to make graphs but it is a very useful one. Check it out!