Spiky Graphs

by David Lutz

For those working in operations teams, looking at graphs is a daily ritual.  Not just glancing at a dashboard occasionally (although that is useful), but really exploring the data represented by those graphs.

The amount of data generated by a busy running website is astounding.  You have metrics from the very low level, such as CPU and disk utilization, through custom metrics generated by your application, to high level business metrics such as page views or sales.

These metrics can be viewed in many ways.  Sometimes histograms or scatter plots are useful and give great insight into what’s happening in your systems.  Probably the most common way to view data is a plot of values on the Y axis and time on the X axis.  The ubiquitous time series graph.

Have you really thought about what you are searching for when looking at these graphs?

Probably you are doing one of two quite different things: looking for anomalies or looking for trends.  An anomaly would be a dip or spike in the graph that isn’t usually there.  It’s probably an indication that something is broken, and is usually seen over a short period of time.  A trend is some kind of behavior over a longer period of time that you might describe in words like “the CPU utilization is going up”.

By looking at graphs we can make predictions like “at the current rate of increase we’re going to run out of disk space in a week”.

Our brains are really good at finding patterns like this and understanding what’s going on.
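The disk space prediction above is the kind of thing you can also do with a few lines of code.  Here is a rough sketch in Python that fits a straight line through some readings and extrapolates.  The function name, the sample readings and the 72 GB capacity are all made up for illustration, and a straight line is only a reasonable model if growth really is steady.

```python
def days_until_full(samples, capacity):
    """Estimate the days left until `capacity` is reached.

    `samples` is a list of (day, used) readings.  A least-squares straight
    line is fitted through them and extrapolated.  Returns None when the
    readings are flat, falling, or too few to fit a line.
    """
    n = len(samples)
    mean_x = sum(x for x, _ in samples) / n
    mean_y = sum(y for _, y in samples) / n
    num = sum((x - mean_x) * (y - mean_y) for x, y in samples)
    den = sum((x - mean_x) ** 2 for x, _ in samples)
    if den == 0:
        return None
    slope = num / den
    if slope <= 0:
        return None
    intercept = mean_y - slope * mean_x
    last_day = samples[-1][0]
    # Day on which the fitted line crosses capacity, relative to the last reading
    return (capacity - intercept) / slope - last_day


# Hypothetical readings: disk used in GB on days 0 to 4, growing 2 GB a day
usage = [(0, 50.0), (1, 52.0), (2, 54.0), (3, 56.0), (4, 58.0)]
print(days_until_full(usage, capacity=72.0))  # prints 7.0
```

With these invented numbers the disk fills up a week after the last reading.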

At 99designs we have lots of graphs of data from many sources.  We also make use of a number of third party services such as New Relic.  The data provided by the New Relic web interface is great and we like the service.  However, it has some limitations.  Its graphs smooth the data out to give you a good idea of trends, but this summarizing makes it harder to see spikes.

These two graphs are of exactly the same data. It is response time from our application servers over a 24 hour period.  It’s surprising how much smoothing tells a different story.  Did we have one incident on the site or two?
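To see why smoothing can change the story, here is a small illustrative sketch in Python.  The numbers are invented, not real response times, and the `bucket_average` helper is hypothetical rather than anything New Relic or Graphite provides.

```python
def bucket_average(values, size):
    """Average consecutive groups of `size` points into one point each."""
    buckets = [values[i:i + size] for i in range(0, len(values), size)]
    return [sum(b) / len(b) for b in buckets]


# Made-up response times in ms, one per minute.  There are two separate
# spikes to 960 ms with a healthy minute in between.
response_ms = [120, 120, 960, 120, 960, 120, 120, 120, 120, 120, 120, 120]

print(bucket_average(response_ms, 3))  # prints [400.0, 400.0, 120.0, 120.0]
```

The raw series shows two distinct spikes.  Averaged into three minute buckets, it looks like a single, lower and wider bump.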

The bottom graph was created from our Graphite server.  New Relic have an API you can use to extract the metrics they collect.  We use some scripts to import this data which we have open sourced here:  https://github.com/99designs/vacuumetrix
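For a feel of what "importing into Graphite" involves, here is a minimal sketch in Python.  It is not the vacuumetrix code.  It only shows Graphite's plaintext protocol, where each data point is a line of the form "metric.path value timestamp" sent to the carbon listener (port 2003 by default).  The host name and metric name are placeholders.

```python
import socket
import time


def graphite_line(path, value, timestamp=None):
    """Format one data point for Graphite's plaintext protocol."""
    if timestamp is None:
        timestamp = int(time.time())
    return f"{path} {value} {int(timestamp)}\n"


def send_to_graphite(lines, host="graphite.example.com", port=2003):
    """Send formatted lines to a carbon listener.  The host is a placeholder."""
    with socket.create_connection((host, port), timeout=5) as sock:
        sock.sendall("".join(lines).encode("ascii"))


# Hypothetical metric name and value, with a fixed timestamp for the example
line = graphite_line("newrelic.app.response_time", 0.235, 1356998400)
print(line, end="")  # prints: newrelic.app.response_time 0.235 1356998400
```

Calling `send_to_graphite([line])` would push the point to your own server once the host is changed to a real one.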

Once you have the data in your own tool you can create graphs just how you like them. Graphite is a fantastic tool for exploring data in an ad hoc way, or for building dashboards.

To control how spiky your data appears you can use the summarize() function in Graphite.

This is throughput data from the last couple of months.  If you’re looking for a trend, you can summarize it by day.

Or into a weekly summary.

It’s exactly the same data, pulled from New Relic every minute and stored in Graphite’s whisper datastore at whatever resolution and retention period you like.
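The daily and weekly summaries can also be requested straight from Graphite's render URL.  Below is a small sketch in Python that builds such a URL using Graphite's summarize() function.  The server address and the metric name are placeholders, and "avg" is just one choice of aggregation (summarize() uses "sum" unless told otherwise).

```python
from urllib.parse import urlencode


def summarize_target(metric, interval, func="avg"):
    """Build a Graphite target expression such as summarize(m,"1d","avg")."""
    return f'summarize({metric},"{interval}","{func}")'


def render_url(base, target, period="-60d"):
    """Build a render URL for one target over a relative time period."""
    return f"{base}/render?{urlencode({'target': target, 'from': period})}"


# Placeholder server and metric names
base = "http://graphite.example.com"
metric = "newrelic.app.throughput"

daily = summarize_target(metric, "1d")
weekly = summarize_target(metric, "1w")
print(render_url(base, daily))
print(render_url(base, weekly))
```

Swapping "1d" for "1w" is the only difference between the two graphs, which makes it easy to try a few intervals and see which one tells the story best.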

Another extremely useful function of Graphite is to time shift your data.

This lets you easily compare (for example) this week to last week.

Just clone your graph, time shift one and drag to merge the two.
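If you prefer a URL to dragging graphs around, the same comparison can be sketched in Python using Graphite's timeShift() function, which takes a series and a quoted length of time.  The server address and metric name below are placeholders.

```python
from urllib.parse import urlencode


def shifted_target(metric, shift="1w"):
    """Build a target that draws `metric` shifted back in time by `shift`."""
    return f'timeShift({metric},"{shift}")'


def compare_url(base, metric, shift="1w", period="-7d"):
    """Build a render URL that overlays a metric on its time shifted self."""
    params = [
        ("target", metric),
        ("target", shifted_target(metric, shift)),
        ("from", period),
    ]
    return f"{base}/render?{urlencode(params)}"


# Placeholder server and metric names
print(compare_url("http://graphite.example.com", "newrelic.app.throughput"))
```

The resulting graph has two lines: this week, and last week drawn on top of it.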

Graphite isn’t the only tool we use to make graphs but it is a very useful one.  Check it out!




I’ve been hearing good things about OpenTSDB for some time and I’ve been meaning to take it for a spin.

Well I’ve finally gotten around to it.

First impressions:

  • Pretty damn easy to set up, considering the technology behind this.

The instructions are easy to follow and the build.sh script does all the hard work.  I’m running a single HBase server. Not a cluster. HBase is the Hadoop database.  Wooo Big Data. I’m running it on an AWS t1.micro (small Big Data). Just evaluating it at the moment.

  • UI probably designed by an Engineer.

Functional but not so pretty.  Hey I’m not saying I can do better!  Frontend is not something I’m good at.  You should see my horrible HTML.  It would make your eyes bleed.  Blink tag is still cool yeah?  The thing that is kinda annoying is that it seems to use the time zone from my browser, which is not the time zone of my servers (UTC).  Graphite’s UI took a long time for me to warm to.  Now I’d like to see something like that for OpenTSDB.  I believe there are some projects along these lines but I haven’t investigated.

  • NoSQL to be Web Scale.  Ok I can push billions of data points into HBase and never lose data like I do in Graphite (whisper) or Ganglia (RRD).  Cool.

As I understand it, there’s no real difference between the concept of a Metric, and a key/value pair Tag.  It’s all just data in NoSQL magic land.

This flexibility makes it difficult for me to work out how to get data in.  I’m working on a project in my hack days at 99designs to suck metrics in from a variety of services and spit them out into a variety of backend systems.  Ganglia and Graphite so far, and now OpenTSDB.  Just to see where it leads.  It’s open source and on GitHub here:  https://github.com/99designs/vacuumetrix  But I’m not sure how to use tags yet.  There’s some talk on the mailing list about changing how this works, so I’m not going to fret too much.  https://groups.google.com/d/msg/opentsdb/llAVqkKqFPw/_vQbf1MSX6sJ

  • Downsampling.  Ok this is really useful, and is something the likes of Ganglia and Graphite don’t do so well.  Sometimes to see what is going on in the data you want to downsample.
  • Aggregation is the default.  OpenTSDB is designed to handle huge amounts of data, so aggregating metrics is naturally on.
  • You have to manually register each new metric, e.g.
        # tsdb mkmetric newrelic
    This is a bit sucky.  Is there a way to do it automatically?
  • tcollector.  I’ve no idea if this is working or not.  Admittedly I’ve only been looking at OpenTSDB for a couple of hours.  It’s not immediately clear how this works or what the metrics it may or may not be injecting are called.  I have not yet looked at the source.  I just tried to run something called startstop?!?  If I have a few moments I’ll investigate further.
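On the question of getting data in with tags, here is a hedged sketch in Python of the line based "put" command that a TSD accepts on its listening port: metric name, Unix timestamp, value, then one or more key=value tags.  The metric name and tags are invented for the example, and the metric would still need registering with mkmetric as described above.

```python
def opentsdb_put(metric, timestamp, value, tags):
    """Format one data point as an OpenTSDB put line.

    `tags` is a dict of tag names to tag values.  At least one tag is
    expected on every data point.
    """
    if not tags:
        raise ValueError("at least one tag is required")
    tag_str = " ".join(f"{k}={v}" for k, v in sorted(tags.items()))
    return f"put {metric} {int(timestamp)} {value} {tag_str}\n"


# Hypothetical metric and tags
line = opentsdb_put(
    "newrelic.response_time", 1356998400, 0.235,
    {"host": "app01", "app": "web"},
)
print(line, end="")
# prints: put newrelic.response_time 1356998400 0.235 app=web host=app01
```

One way to think about the split: the metric says what is being measured, and the tags say where it came from, so the same metric can later be aggregated across hosts or filtered down to one.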

In conclusion, the prospect of not throwing away data is attractive.  I will keep playing with this tool, get some real data in and see if it’s useful.

I recommend watching this video about OpenTSDB.



Apply Solarized color scheme to Graphite graphs

I spend at least part of every work day looking at graphs.  It’s standard operating procedure for those on the Ops side of things.

I really like using Graphite for exploring my data. It’s such a useful tool.

Recently I’ve been thinking about color schemes. It started off when I heard about the Solarized project from Ethan Schoonover on The Changelog podcast.  Ethan knows more about the effective use of color than anyone I’ve come across. It really expanded my mind to hear about Solarized and its choice of colors “to ensure perceptual uniformity in terms of lightness”.

The default graph colors from Graphite are, well, quite highly contrasting to my uneducated eye. Here’s an example graph using the Solarized color scheme. How does it look to you? Easier on the eye or not?


You can specify the colors for a graph by adding this parameter to the URL used to render it:

&colorList=6c71c4,859900,d33682,b58900,cb4b16,268bd2,2aa198,dc322f
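If you build your graph URLs in a script, a small helper saves retyping the list.  This is just a sketch in Python.  The colors are the ones given above, while the helper name and the example render URL are made up.

```python
# The Solarized accent colors listed above, in the same order
SOLARIZED = [
    "6c71c4", "859900", "d33682", "b58900",
    "cb4b16", "268bd2", "2aa198", "dc322f",
]


def with_colors(render_url, colors=SOLARIZED):
    """Append a colorList parameter to an existing Graphite render URL."""
    separator = "&" if "?" in render_url else "?"
    return f"{render_url}{separator}colorList={','.join(colors)}"


# Placeholder server and metric names
print(with_colors("http://graphite.example.com/render?target=servers.web01.load"))
```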


Tips for Ops to work better with Devs

Recently I re-read Evan Bottcher’s presentation and blog post DevOps – Ten tips for loveless developers.  It’s good advice.

This is a response to Evan’s blog post with some advice for Operations people and teams.

* Earn their trust. If you can form good working relationships with your developers, friendships even, rather than client/patron (or master/slave), they’ll treat you with respect. Try to build equality in the relationship. Help them out, and you may find you can even ask for help when the time comes that you need it.

* Educate them in the mysterious ways of operations. Developers are generally intelligent and curious people. They may not know as much about operating systems and networking protocols as you do. That’s ok. Teach them something!  A lesser-known Unix command, how to copy files more quickly, something interesting or useful. Tell them about your challenges and constraints and they’ll understand you better. They will think twice about bothering you about a broken printer during a Sev1 incident-in-progress, for example.

* Get an understanding of the motivations and challenges of developers. They may have time constraints and other pressures you’re not aware of. Sometimes it may be ok to bend the “no deploys on a Friday afternoon” rule to meet a deadline.

* Enable them to do their job better. Be proactive about meeting their needs, whether that means provisioning servers or upgrading their laptops. If you’re not meeting their needs they’ll find ways around you and it’ll end badly for all. Remember Operations is a service department (that’s a fact of life, it’s not a negative thing). The developers are our customers.

* Learn about their tools and language. Become at least a bit familiar with continuous integration, testing frameworks, Agile software development methodologies. Ask to pair program for a day on a typical development task. Ops often have latent programming skills too! Definitely understand source control and use it.

* Share the metrics with them. Enable easy access to metrics. Preferably without requiring authentication. Teach them what the graphs mean. Celebrate with them when they go in the right direction. Work with them when they go in the wrong direction. Get developers excited about how they can do their jobs better by better integrating their code with Operations’ metrics and monitoring.

* Security is a means to an end. Not the main goal. (Unless that is your main business). It is tempting to lock systems down so well that it prevents people from doing their job. Use appropriate levels of security, keeping the big picture in mind.

* Configuration management is a cross-team communication piazza. Open up and provide full access to the config management code repository. The more you try and hide what you’re doing from the devs the more they’ll mistrust you.

* As Operations we live by the principle of least privilege. Developers may not understand this concept as well. It’s our job to educate just as much as to enforce.

* Invite yourself along to their meetings, standups, kickoffs, retros etc. Be actively involved in the development process from as early as possible. This will help prevent surprises and operations bottlenecks. You may even be invited to their launch parties.

* Don’t resist change. Developers are employed to make changes. These changes to the code presumably will make money. If the organization is more efficient at making money Operations may get new toys to play with, and you may even get paid every week/month. This is a good thing. Enable change.

* Make them aware of the pain of operations. If you’ve been up all night keeping the site alive let everyone know about it. Especially if it’s a code related problem. If you’ve earned their trust they will make it a high priority to fix the root cause. Outages cause loss of confidence in the business, and therefore money. Availability is a shared responsibility.

* The best way of making developers responsible for the code they produce is to allow them to deploy to production. Are you still deploying code manually? Deployment is work for bots (or devs). Building systems to enable this to happen safely is far more interesting work for humans.

And to reiterate Evan’s final comment, use common courtesy.  Say please and thank you, and when you’ve screwed up say sorry!


Running IT like a Rock band

I’d like to take you on a little journey dear reader.

My background

I have two driving passions in my life. Music and computers. Since I was a kid I have tinkered with computers and noodled on musical instruments. For the past 10 or so years I have been working in IT, mostly as a sysadmin, mostly on busy websites. And I’ve been playing in bands, on and off. I’d like to draw some parallels between running an IT shop and playing in a rock band. Humour me?

Back in the day I went to University. I enrolled in a Science degree and studied Physics and Chemistry and all that jazz, and I took all the Computer Science classes. Mainly programming in C and networking. My first year assignments were done on real VT100 terminals connected to some wonderful old SGI and Sun hardware. Man, they were some good lookin’ servers! But the classes that really piqued my interest were those from the History and Philosophy of Science department, and a great CS subject called Professional Issues in Computing. I guess it started a long term interest in the culture that underpins our various pastimes and endeavours.

The analogy

So, in my day job, I’ve worked in Web 2.0 start-ups, from small teams where everyone does everything through to bigger Enterprisy teams where people are specialised and siloed in departments. Although I studied to be a software developer, I have always been attracted to the more operational side of IT. I’ve hired and fired people and built teams. I have worked with some great people along the way. I’ve worked in some effective, collaborative environments and also in some dysfunctional ones. I’ve seen good environments go bad and even bad ones get good again. Outside of my day job I’ve played in a bunch of bands that have these characteristics too. I can’t help thinking there are some similarities.

I’ll use a typical rock band as a framework. 2 Guitars, bass, drums, vocals etc.

* The roles

The Drummer – Operations.
In the band, he’s the one who keeps the beat. He drives the music along and gives it energy. No other member of the band is under such constant pressure to perform. He rarely gets a break. If he’s missing it leaves a huge hole. You want him to be consistent and always available. Similarly the operations role (which may be shared in a small start-up, or a dedicated person, or team in a larger org) drives and maintains IT. Drummers and Ops don’t need a lot of glory and spotlight time, but they don’t like to be ignored either. They both like hardware.

Lead guitarist – Development.
The lead guitarist on the other hand does need to show off regularly. He’s practised till his fingers bled and needs to have his solos heard. Can be temperamental at times. Will definitely have strong opinions on how the song should go. The lead guitarist really adds colour and showmanship to the band but can have a tense relationship with the drummer. Sometimes gets ahead of the beat or even loses it. Probably ambitious and might be moonlighting with another band. Developers tend to move jobs more than operations. Interested in shiny things and new programming languages. Short attention spans, and don’t like repetition. However, they drive change and build features and need to be looked after.

Rhythm Guitarist – QA
Often overlooked. The rhythm guitarist is more comfortable out of the spotlight. Often highly accomplished and reliable though. They keep things together. They can back up the lead guitarist at any time, filling in gaps when the song’s getting out of hand. They listen to and speak the same language as the lead guitarist and the bass player. A small band/small IT shop may not have a dedicated rhythm/QA role. Instead, the responsibilities may be shared by others. Really good QA staff can catch bugs and performance regressions before they hit production. Big picture people who care for the end user.

Bass Player – DBA
The database administrator is a key bridge between operations and development. The bass player is the link between the drums and lead guitar. He has just as much commitment to the beat as the drummer, but has much to contribute to the melody and harmony too. Drums and guitars will definitely be listening to the bass player if they’re to play together well. The developer might find it convenient to SELECT * from bigtable but the DBA will insist on a WHERE clause for sane performance. Interesting characters if you get to know ’em.

Lead Singer – Product Owner
Now this is where things can really go wrong. Or right. The singer thinks they run the whole show. They have the personality and ego to be the main focus for the audience. They are the one who leads, who sets the style of the band, who gets the gigs, and writes the songs. They can be a temperamental prima donna, but audiences love them. Product owners can come into conflict with anyone in the organisation, but watch out for their relationship with the lead guitarist/developer. Most important and they know it.

Keyboards – Security
Ok, how far will the analogy stretch… Not every rock band has a keyboard player. Nor does every IT shop have a security guy. The bigger ones of each do. Good to have one if you can.

Roadie – Network Engineer
Roadie sets things up before the show. Plugs in all the wires. Sets up lights, drums, smoke machines. Roadie makes sure the microphones and amplifiers are cabled. Network guy makes sure all the servers and switches are cabled. Probably knows a whole lot more than you think about how stuff works. Knows stuff about how stuff works that you don’t need to know. Arguably not needed so much when the band is playing (or the data centre is built), but if something goes wrong, it’s really really handy to have him around to fix it. Can fix a broken mic stand or router with duct tape in a crisis. Drives a fast car or motorcycle. Loves the hardware. Really loves the hardware.

* Songs, features.
What is it that a band or a typical IT organisation produces? Bands write songs. IT produces features on the website. Let’s say the lead singer is the creative force in the band and the Product Manager/Product Owner is the one accountable for driving the features. We can start to see immediately that the contributions of all members of the band or team are important. But if the creative vision is wrong, or the execution of the vision fails, it’s just not going to work well is it?

* What can go right
Ideally a harmonious working relationship exists between all members. If everyone’s pulling together, the band locks into the groove, the team morale is high, there’s nothing better. It’s a pleasure to play together or come to work. Amazing things can be accomplished. Hit records. Hit applications on the web. Fame and fortune.

A good leader realises that the contributions of all members of the team have worth. The singer may write most of the songs, but sometimes the drummer comes up with a hit tune too. Don’t disregard anyone on the team. Perhaps the QA guy has a better feel for which features will make money and which will annoy the crap out of the users.

A sign that the team is really working well together is that people start backing each other up, never undermining. In the band, members might cover each other’s parts if there’s a problem. In a work environment Ops might try being proactive about provisioning new development environments, or Dev might come over to chat with Ops about a new technology they’re thinking about implementing way in advance.

* What can go wrong
In the band, it’s often the prima donna lead singer who is out of step with the rest of the band. They have creative differences and decide to go solo. Then they realise what they’ve lost and never capture the same vibe again. The singer/product owner needs to listen to what the rest of the band/team are saying. They need to rein in the ego off stage. There’s a tendency for these creative types to become too attached to their ideas. They might think a song is the best thing ever, or a new application feature is exactly what the users need. Sometimes they might be right. Sometimes they’ll be horribly wrong. The music industry has well established metrics to measure success or failure of a tune. Top 40 charts, downloads and the like. The IT industry has some catching up to do in this area. Smart players are using feature flipping and A/B testing to understand the impact of features.

In a band it’s pretty obvious when personnel are out of balance. The rock band formula has been around for more than 60 years. You don’t see (often) a band with 2 drummers. Or 10 guitarists. Yet I have sometimes seen a distinct lack of balance in IT. I’ve seen one team grow in numbers without adding the supporting head count for the other teams. For example, if Dev grows too fast they might become frustrated that Ops can’t keep up with their perfectly reasonable requests. If Ops proportionally outnumber Dev, they might also constrict the rate of innovation by adding too much process. Take a look at the balance of your wider team. Are there resource gaps, or even an oversupply? Too many Product Owners or persnickety Project Managers can be positively poisonous!

Sometimes in a band or a workplace there are just plain old personality clashes. Some people aren’t team players. They have their own career agendas and ambitions. Most people want to do good, but some, not many mind you, are just evil. Identify these people and work out if you can bear to spend a significant amount of your time with them and if not, leave before you become bitter and twisted yourself. Life’s too short to suffer assholism.

* Methodologies
Company culture and certain methodologies can vary widely depending on age and size.

An established band. They’ve been around for some years. Now they put on an occasional stadium show. They go into the recording studio and produce an album every 5 or so years. They’re worried about upsetting fans by changing too much.
Compare to an older Enterprise IT shop. They’ve built up processes and red tape. Have implemented ITIL. Play it slow and safe. They may still use a waterfall software development methodology and have armies of business analysts and project managers. They have legacy systems to maintain and are worried about changes or outages upsetting their users.

New band. They’re actively gigging around the traps. May be reliable and show up at the local pub on Saturday night. Or not. They’re producing songs quickly trying new sounds. They’re listening to their fans.
Like a fast young startup. Using any of lean, Agile, XP, scrum, Kanban.

* ITIL vs Devops
I guess it’s my amateur interest in organisational culture that has me excited by the devops movement, kicked off by maestro Patrick Debois a couple of years ago. I’ve seen first hand how big waterfall projects can fail to produce anything except used printer cartridges. And I’ve seen Agile fail too. Ultimately even more slowly and painfully. Agile and scrum methodologies are focussed on software development, and while great for regularly delivering incremental features, can miss the bigger picture. Agilistas need to ask themselves, are they continuously delivering the wrong things? Do they use the metrics Ops have or involve Ops in sprint planning? All the while they sang Kumbaya at stand up while the ship went down…

ITIL, Agile, Devops and Lean all try to give IT a clue to how to do things better. ITIL has so much potential, yet is so widely hated. I actually think ITIL is misunderstood and often misimplemented. The ideas are sound. It’s about delivering communication and change, safely. I’ve done the training. It’s not about filling in a form just to reboot a server. That’s a bad implementation of it. I think some managers see ITIL as about maintaining law and order in the wild west of IT land. About keeping the cowboys under control. It fails because smart people don’t like to be told what to do.

I think devops will succeed where others have failed because it is a holistic yet nebulous idea and defies definition.

But every devopstician has a devops definition anyway, so here’s mine. This week anyway.

Devops is putting the band back together. The C in CAMS is about culture, which I think is the most interesting bit of devops. It’s about an inclusive way of looking at IT and even the wider business (not just Dev and Ops) and getting teams working together again. It’s about software development and infrastructure and deployment, and feeding back data into product development so we end up building the right things. It’s a business solution not a technology problem (paraphrasing Damon Edwards). It’s about balance and harmony and having fun.




Vagrant really kicks arse for creating multiple virtual machine environments.


Performance problem with VirtualBox

I spent a good few hours trying to squeeze some extra performance out of our Vagrant / VirtualBox Development environments.

I was benchmarking different OSs, and virtual hardware configurations.

My guest OS is Ubuntu 11. I was experimenting with adding virtual CPUs and suddenly noticed a massive network performance degradation. It seems changing the network adapter to Paravirtualized Network (virtio-net) wasn’t a good idea. Even the Intel PRO/1000 NICs were problematic.

Best performance (with multiple CPUs) was PCnet-FAST III (Am79C973).
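If you want to script the adapter change rather than click through the VirtualBox GUI, something like the following Python sketch builds the VBoxManage command for it.  The VM name is a placeholder, the helper is hypothetical, and the VM needs to be powered off before modifyvm will apply the change.  Whether this adapter is also the fastest on your hardware is something to benchmark for yourself.

```python
def nic_type_command(vm_name, nic_type="Am79C973", adapter=1):
    """Build the VBoxManage command that sets the emulated NIC for a VM.

    Am79C973 is the PCnet-FAST III adapter.  `adapter` is the number of
    the virtual network adapter to change, starting at 1.
    """
    return ["VBoxManage", "modifyvm", vm_name, f"--nictype{adapter}", nic_type]


# Placeholder VM name
cmd = nic_type_command("dev-box")
print(" ".join(cmd))  # prints: VBoxManage modifyvm dev-box --nictype1 Am79C973

# To actually apply it (with the VM powered off):
#   import subprocess
#   subprocess.run(cmd, check=True)
```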

Hope this helps someone.

