Monthly Archives: July 2012

Spiky Graphs

by David Lutz

For those working in operations teams, looking at graphs is a daily ritual.  Not just glancing at a dashboard occasionally (although that is useful), but really exploring the data represented by those graphs.

The amount of data generated by a busy running website is astounding.  You have metrics ranging from the very low level, such as CPU and disk utilization, through custom metrics generated by your application, up to high level business metrics such as page views or sales.

These metrics can be viewed in many ways.  Sometimes histograms or scatter plots are useful and give great insight into what’s happening in your systems.  Probably the most common way to view data is a plot of values on the Y axis and time on the X axis: the ubiquitous time series graph.

Have you really thought about what you are searching for when looking at these graphs?

Probably you are doing one of two quite different things: looking for anomalies or looking for trends.  An anomaly would be a dip or spike in the graph that isn’t usually there.  It’s probably an indication that something is broken and is usually seen over a short period of time.  A trend is some kind of behavior over a longer period of time that you might describe in words like “the CPU utilization is going up”.

By looking at graphs we can make predictions like “at the current rate of increase we’re going to run out of disk space in a week”.
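
That kind of prediction is just a straight line drawn through recent points.  A minimal sketch in Python, assuming made-up daily disk usage figures and a hypothetical helper called days_until_full (neither comes from our real tooling):

```python
def days_until_full(samples, capacity):
    """Fit a least-squares line to (day, used) samples and estimate how
    many days after the last sample the line reaches capacity.
    Returns None if usage is flat or shrinking."""
    n = len(samples)
    mean_x = sum(x for x, _ in samples) / n
    mean_y = sum(y for _, y in samples) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in samples)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in samples)
    slope = sxy / sxx  # growth per day
    if slope <= 0:
        return None
    intercept = mean_y - slope * mean_x
    full_at = (capacity - intercept) / slope  # day on which the line hits capacity
    return full_at - samples[-1][0]


# Hypothetical data: GB used on four consecutive days, on an 80 GB disk.
usage = [(0, 60.0), (1, 62.0), (2, 64.0), (3, 66.0)]
print(days_until_full(usage, 80.0))  # 7.0, i.e. about a week
```

Real disks rarely fill in a straight line, so treat the answer as a rough guide rather than a forecast.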

Our brains are really good at finding patterns like this and understanding what’s going on.

At 99designs we have lots of graphs of data from many sources.  We also make use of a number of third party services such as New Relic.  The data provided by the New Relic web interface is great and we like the service.  However, it has some limitations.  The graphs smooth the data out to give you a good idea of trends, but this summarizing makes it harder to see spikes.

These two graphs are of exactly the same data. It is response time from our application servers over a 24 hour period.  It’s surprising how much smoothing tells a different story.  Did we have one incident on the site or two?
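
To see how averaging can merge two incidents into one, here is a small Python sketch.  The numbers are invented and the bucket averaging is only a stand-in for smoothing in general; I am not claiming this is how New Relic does it.

```python
def bucket_average(values, size):
    """Average consecutive groups of `size` samples (simple smoothing)."""
    buckets = [values[i:i + size] for i in range(0, len(values), size)]
    return [sum(b) / len(b) for b in buckets]


def count_peaks(values, threshold):
    """Count separate runs of consecutive samples above `threshold`."""
    peaks = 0
    above = False
    for v in values:
        if v > threshold and not above:
            peaks += 1
        above = v > threshold
    return peaks


# Hypothetical per-minute response times (ms): two short spikes, 5 minutes apart.
response_ms = [100] * 10 + [900] + [100] * 4 + [900] + [100] * 14

smoothed = bucket_average(response_ms, 10)
print(count_peaks(response_ms, 200))  # 2 incidents in the raw data
print(smoothed)                       # [100.0, 260.0, 100.0]
print(count_peaks(smoothed, 200))     # 1 incident after smoothing
```

The smoothed series also tops out at 260 ms instead of 900 ms, so the incident looks milder as well as less frequent.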

The bottom graph was created from our Graphite server.  New Relic have an API you can use to extract the metrics they collect.  We use some scripts to import this data which we have open sourced here:  https://github.com/99designs/vacuumetrix

Once you have the data in your own tool you can create graphs just how you like them. Graphite is a fantastic tool for exploring data in an ad hoc way, or for building dashboards.

To control how spiky your data appears you can use the Summarize function in Graphite.

This is throughput data from the last couple of months.  If you’re looking for a trend you can summarize by day.

Or into a weekly summary.
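
If you prefer building the graph URLs yourself rather than clicking through the composer, the same thing can be sketched in Python.  The host and metric name below are hypothetical, and I’m assuming averaging is what you want (summarize defaults to sum if you leave the function out).

```python
from urllib.parse import urlencode

GRAPHITE = "http://graphite.example.com"  # hypothetical Graphite host
METRIC = "newrelic.throughput"            # hypothetical metric name


def render_url(target, time_from="-60d"):
    """Build a Graphite render URL for a single target expression."""
    query = urlencode({"target": target, "from": time_from, "format": "png"})
    return f"{GRAPHITE}/render?{query}"


# One point per day, then one point per week, over the same raw data.
daily = render_url(f'summarize({METRIC}, "1day", "avg")')
weekly = render_url(f'summarize({METRIC}, "1week", "avg")')

print(daily)
print(weekly)
```

Swapping "avg" for "max" is worth trying when you care about spikes, since the worst minute in each bucket survives the summarizing.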

It’s exactly the same data, pulled from New Relic every minute and stored in Graphite’s whisper datastore at whatever resolution and retention period you like.

Another extremely useful function of Graphite is to time shift your data.

This lets you easily compare (for example) this week to last week.

Just clone your graph, time shift one and drag to merge the two.
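
The same comparison can be expressed as a single render URL with two targets.  A rough Python sketch, again with a hypothetical host and metric name:

```python
from urllib.parse import urlencode

GRAPHITE = "http://graphite.example.com"  # hypothetical Graphite host


def comparison_url(metric, time_from="-7d"):
    """Plot a metric alongside the same metric shifted back by 7 days."""
    params = [
        ("target", f'alias({metric}, "this week")'),
        ("target", f'alias(timeShift({metric}, "7d"), "last week")'),
        ("from", time_from),
    ]
    return f"{GRAPHITE}/render?{urlencode(params)}"


print(comparison_url("newrelic.response_time"))
```

The alias calls are only there to give the two lines readable legend labels.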

Graphite isn’t the only tool we use to make graphs but it is a very useful one.  Check it out!

http://graphite.wikidot.com/

OpenTSDB

I’ve been hearing good things about OpenTSDB for some time and I’ve been meaning to take it for a spin.

Well I’ve finally gotten around to it.

First impressions:

  • Pretty damn easy to set up, considering the technology behind it.

The instructions are easy to follow and the build.sh script does all the hard work.  I’m running a single HBase server. Not a cluster. HBase is the Hadoop database.  Wooo Big Data. I’m running it on an AWS t1.micro (small Big Data). Just evaluating it at the moment.

  • UI probably designed by an Engineer.

Functional but not so pretty.  Hey I’m not saying I can do better!  Frontend is not something I’m good at.  You should see my horrible HTML.  It would make your eyes bleed.  Blink tag is still cool yeah?  The thing that is kinda annoying is that it seems to use the time zone from my browser, which is not the time zone of my servers (UTC).  Graphite’s UI took a long time for me to warm to.  Now I’d like to see something like that for OpenTSDB.  I believe there are some projects along these lines but I haven’t investigated.

  • NoSQL to be Web Scale.  Ok I can push billions of data points into HBase and never lose data like I do in Graphite (whisper) or Ganglia (RRD).  Cool.

As I understand it, there’s no real difference between the concept of a Metric, and a key/value pair Tag.  It’s all just data in NoSQL magic land.

This flexibility makes it hard for me to work out how best to get data in.  I’m working on a project in my hack days at 99designs to suck metrics in from a variety of services and spit them out into a variety of backend systems.  Ganglia and Graphite so far, and now OpenTSDB.  Just to see where it leads.  It’s open source and on github here:  https://github.com/99designs/vacuumetrix  But I’m not sure how to use tags yet.  There’s some talk on the mailing list about changing how this works, so I’m not going to fret too much.  https://groups.google.com/d/msg/opentsdb/llAVqkKqFPw/_vQbf1MSX6sJ
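
For what it’s worth, getting a data point in comes down to writing a line of text to the TSD: put, then the metric, a Unix timestamp, a value and at least one tag.  Here is a minimal Python sketch of that, not what vacuumetrix actually does.  The tag names (app, type) and the helper names are made up for illustration, and 4242 is assumed to be the port the TSD listens on.

```python
import socket


def format_put(metric, timestamp, value, tags):
    """Format one data point as an OpenTSDB 'put' line.
    OpenTSDB expects at least one tag per data point."""
    if not tags:
        raise ValueError("OpenTSDB needs at least one tag")
    tag_str = " ".join(f"{k}={v}" for k, v in sorted(tags.items()))
    return f"put {metric} {int(timestamp)} {value} {tag_str}\n"


def send_lines(lines, host="localhost", port=4242):
    """Send pre-formatted put lines to a TSD (host and port are assumptions)."""
    with socket.create_connection((host, port)) as sock:
        sock.sendall("".join(lines).encode("ascii"))


# Hypothetical tags: which app the number came from and what it measures.
line = format_put("newrelic", 1343000000, 42.5, {"app": "web", "type": "throughput"})
print(line, end="")  # put newrelic 1343000000 42.5 app=web type=throughput
# send_lines([line])  # uncomment once a TSD is actually running
```

Whether "throughput" belongs in the metric name or in a tag is exactly the question I haven’t settled yet.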

  • Downsampling.  Ok this is really useful, and is something the likes of Ganglia and Graphite don’t do so well.  Sometimes to see what is going on in the data you want to downsample.
  • Aggregation is the default.  OpenTSDB is designed to handle huge amounts of data, so aggregating metrics is naturally on.
  • You have to manually register each new metric, e.g.
        # tsdb mkmetric newrelic
    This is a bit sucky.  Is there a way to do it automatically?
  • tcollector.  I’ve no idea if this is working or not.  Admittedly I’ve only been looking at OpenTSDB for a couple of hours.  It’s not immediately clear how this works or what the metrics it may or may not be injecting are called.  I have not yet looked at the source.  I just tried to run something called startstop?!?  If I have a few moments I’ll investigate further.
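
To make the downsampling and aggregation points a bit more concrete, this is roughly what a query for hourly averages looks like when built by hand.  It’s a sketch based on my reading of the HTTP /q interface, so treat the host, port, tag and exact parameters as assumptions to check against your own TSD.

```python
from urllib.parse import urlencode


def query_url(metric, aggregator="sum", downsample="1h-avg", tags=None,
              start="1d-ago", host="http://localhost:4242"):
    """Build an OpenTSDB /q URL of the form aggregator:downsample:metric{tags}."""
    tag_part = ""
    if tags:
        pairs = ",".join(f"{k}={v}" for k, v in sorted(tags.items()))
        tag_part = "{" + pairs + "}"
    m = f"{aggregator}:{downsample}:{metric}{tag_part}"
    # 'ascii' asks for plain text output instead of a rendered graph.
    return f"{host}/q?{urlencode({'start': start, 'm': m})}&ascii"


# Hypothetical tag: only the series tagged app=web, averaged into 1 hour buckets.
print(query_url("newrelic", tags={"app": "web"}))
```

The aggregator (sum here) combines the matching series into one line, while the 1h-avg part controls how each series is thinned out before that happens.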

In conclusion, the prospect of not throwing away data is attractive.  I will keep playing with this tool, get some real data in and see if it’s useful.

I recommend watching this video about OpenTSDB.
