infracoders hot topics

It’s been 4 years since Matt Jones and I started the infracoders group.


4 years is a long time in our industry.  4 years ago I spent most of my work day writing automation code in Chef and Puppet.  I worked on VMware virtual machines and physical infrastructure.  I worked closely with the developers on build pipelines and was primarily responsible for production availability and performance.

These days I still work closely with developers, but my focus is more on AWS’s CloudFormation for automation.  I work on more of the administrative side of things, like keeping a lid on the cost of cloud convenience.  It’s a daily battle to keep up with all of the new shiny technologies people keep inventing.

When we created infracoders I thought we would mostly talk about automation with Chef and Puppet (and whatever else people were using), but we wanted to keep the scope of the meetup open enough to talk about whatever was cool and new in the broad infrastructure as code area.

The meetup is all about building a community for sysadmins in particular, but we welcome others! Our speakers are the practitioners of our craft who write the infrastructure code and get woken up in the middle of the night when it fails. The 88 fabulous talks we’ve had over the 4 years have been interesting and, above all, useful and educational for our members.

I often wonder what’s next?  What technologies should we be learning about next?

I thought I’d spend a couple of hours analysing the kinds of topics we’ve had presented at infracoders meetups so far.  I took the publicly available topics from each meetup and put them in a big ole spreadsheet.  Then I classified them into categories as best I could from the topic names and my own fallible memory of the meetups.  I’ve been to most of the 48 to date!

Here are the results of my analysis.


category count
puppet 4
automation/orchestration 3
sysadmin culture 2
cd/ci 1
web architecture 1
testing 1
packaging 1
chef 1
babushka 1
ansible 1
hardware hacking/arduino 1
windows powershell 1


category count
cd/ci 6
metrics/monitoring 4
sysadmin culture 3
aws 2
networking 2
security 1
web architecture 1
gradle 1
graph database 1
elasticsearch 1


category count
aws 4
sysadmin culture 3
docker 2
metrics/monitoring 2
ansible 2
puppet 1
cd/ci 1
support 1
coreos 1
cloud 1
elasticsearch 1
terraform 1
vagrant 1


category count
docker 10
metrics/monitoring 3
puppet 2
cd/ci 2
storage 2
security 2
networking 1
cdn 1
paas 1
sysadmin culture 1
windows powershell 1
aws 1

“Sysadmin culture” is a category that covers the talks that were case studies or organisational transformation talks.  These are more the staple topic of devops meetups.

I salute our members who spend their own time (or their employers’!) preparing talks.  It is surprisingly time consuming to do (and that’s just looking for pictures of cats).

Do we follow the technology trends in our industry?

This leads me on to the question that I ask at every meetup.  What do you (the community) want to hear about next?  I eagerly await what the rest of 2016 and beyond brings.


Filed under Uncategorized

Twas the night before 1.0

’Twas the night before 1.0, and all through the cloud
Not an instance was flapping, reserved instances vowed
All the deployments were done via chef with great care
Idempotent deployments, so quick redeploys aren’t a bear
All the devs were home, nestled tight in their beds
while visions of stateless sessions danced in their heads
I with my maven and friends with their ant
Both waiting in earnest to give gradle a chance
When up on the office dashboard, came the red in a smatter
I sprang from my Aeron, and told pound-ops: “Hold the chatter!”
To the Puppetmaster, I SSH’d in a jiffy
sudo’ing my way to /var/log, searching for entries that were iffy
When, what to my wondering eyes should appear
But run-away processes, a load average of 80 on my gear
With some special tools on my old thumb stick
My debugging started lickety-split
With phone all a ringing the SMSes came
From PagerDuty dutifully calling my name
Now! ssh, now! tmux, now! grep and awk
On Python, on Perl. Ruby and gmock!
I desperately tailed log files, tailed them all
to find what had caused my servers to fall
When I looked at Graphite my consternation grew
My brand new database server’s gone down too!
And to my dismay load went through the roof
On every last node of my clustered Hadoop
Back through the changelog I looked to the sound
Of more and more alerts going off all around
Hardware failure? Software bug? Configuration kaput?
Until it was fixed I would have to stay put
A bundle exec rake capistrano deploy command, in fact
Were the few key presses I knew would rebuild my stack
Yet running the command failed causing much apprehension
ERROR: Failed to build gem native extension
Packet loss was rising now, 40 percent, oh no!
I was wondering how much worse it could go
Every cluster had problems, I gritted my teeth
Kernels were panicking, JVMs out of memory, No LOLcat relief
Surely our architecture wasn’t that smelly
For complex failures even beyond Machiavelli
My nerves shattered, I reached for the Scotch on the shelf
With panic almost overwhelming myself
I picked up the phone to wake the CTO from his bed
The product launch the next morning, must go ahead
He spoke not a word, as I told him “It’s borked”
But sighed down the line, as his mind went to work
“Ah ha, I have it! A foolproof plan”
“Have you tried turning it off and then on again?”
I did exactly what the great man suggested
And lo and behold, without needing to test it
Everything started working, Nagios said it’s alright
Happy launching tomorrow, and for me it’s good night.


Written by David Lutz, J. Paul Reed, Seth Thomas and Edward Ciramella for the Ship Show podcast.



incident severity sev1 sev2 sev3 sev4 sev5

by David Lutz

A standard classification for incidents gives all involved a common language to describe what’s going on.

Why bother? And why have so many levels?

I think it’s important to track the kinds of things engineers are being woken up for and to deliver a response that’s suited to the problem.

severity levels defined

  • Sev1 Complete outage
  • Sev2 Major functionality broken and revenue affected
  • Sev3 Minor problem, bug
  • Sev4 Redundant component failure
  • Sev5 False alarm or alert for something you can’t fix

Whenever the pager goes off, it’s an incident.  All these kinds of incidents need different responses.

Classifying them might appear difficult.  But it isn’t really.  Here’s an automotive example.

  • Your car runs out of fuel. = Sev1
  • Your clutch is busted. You can drive but only in first gear. = Sev2
  • One headlight has blown. = Sev3
  • You find your car has a flat tyre. You change the tyre and drive to your destination. = Sev4
  • The low fuel warning light is stuck on even though you just filled the tank. = Sev5

Everyone in your organization should be trained to use this terminology.  Especially front line support people.  They should feel comfortable saying “Guys we have a Sev1, call the on-call engineer immediately” if that’s the case.

Track the frequency of these every week.  Put ’em in a spreadsheet.  Make sure people know what’s going on.  If you’re getting alerts for Sev4 and Sev5, you need to change something to stop them.  Sleep is precious. We have !SPOF for a reason.  Some things are best left till morning to fix.  Perhaps the thresholds are set wrong? Don’t alert on something you can’t fix.  That’s a deeper problem that you need to address as an organization, not the responsibility of the guy on call.
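
The weekly tally doesn’t need anything fancier than awk. Here’s a minimal sketch, assuming a hypothetical plain-text incident log with one incident per line in the form "date severity description". The log format and the function name tally_severities are made up for illustration and don’t come from any particular paging tool.

```shell
#!/bin/sh
# Assumed (hypothetical) log format, one incident per line:
#   2016-02-01 sev4 disk failed in RAID set
# tally_severities reads such lines on stdin and prints "severity count",
# sorted by severity name.
tally_severities() {
    awk '{ count[$2]++ } END { for (s in count) print s, count[s] }' | sort
}
```

Run it as tally_severities < incidents.log (the file name is also an assumption). If the sev4 and sev5 rows are the biggest numbers, the alerting needs attention rather than the on-call engineer.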


DevOps is a job title

by David Lutz

Because I am interested in and amused by such things (and because I am a stirrer) I decided to do some analysis of the job titles of those attending the devops un-conference next week in Silicon Valley.

DevOpsDays Mountain View, in Santa Clara.

It’s an amazing event.  I attended it for the last two years and if you are able to, I highly recommend you go too.  Without doubt the brightest and most forward thinking innovators in Web Scaling will be there.

Alas, I can’t make it this time 😦

There’s been a fair bit of snarkiness around the Internet about using devops in job titles.  And while I can see both sides of the debate, I really object to people saying “you can’t do X”.  If people want to call themselves DevOps Engineers, or devopsticians, or any other made up words, I don’t think it is really hurting anyone.

So, who attends devopsdays?  I took the publicly available attendance list and manually normalised it a bit: removed things like “Senior”, made the spellings more consistent (Systems vs System), expanded some acronyms and so on.  Then I fed it into sort | uniq -c | sort -nr | head -n 20 and came up with this table.

Top 20 job titles of devopsdays attendees.

rank job title attendees with this title
1 Software Engineer 27
2 DevOps Engineer 19
3 Systems Engineer 18
4 Operations Engineer 13
5 Consultant 13
6 Release Engineer 10
7 Engineer 9
8 CTO 9
9 CEO 8
10 Architect 8
11 Site Reliability Engineer 7
12 Manager 7
13 Systems Administrator 6
14 Software Architect 5
15 Product Manager 5
16 Manager Engineering 5
17 Infrastructure Engineer 5
18 Co-founder 5
19 Systems Architect 4
20 Sales Engineer 4
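
For anyone wanting to repeat the exercise, the normalise-and-count step can be sketched roughly like this. It assumes one job title per line on standard input. The two sed rules are only examples of the kind of clean-up described above (the real normalisation was done by hand), and the function name top_titles is hypothetical.

```shell
#!/bin/sh
# Read one job title per line on stdin and print the 20 most common ones,
# each prefixed by its count.
top_titles() {
    # Example normalisation rules only: drop a leading "Senior" and make
    # "System ..." consistent with "Systems ...".
    sed -e 's/^Senior //' -e 's/^System /Systems /' \
        | sort | uniq -c | sort -nr | head -n 20
}
```

Usage would be something like top_titles < attendees.txt, where attendees.txt is whatever file holds the titles.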

I leave all conclusions to you, dear reader.  Or if you wish to flame me, please do so below. 😉


The Phoenix Project

by David Lutz

The other day I read The Phoenix Project by Gene Kim, Kevin Behr and George Spafford.

This is not a review as such, but may contain spoilers, so stop reading now if you haven’t read the book. Then go read it! I thoroughly enjoyed it.

The characters and scenarios in the book are easily recognisable for anyone who has worked in IT. Sometimes painfully so.

I’ve since found myself pondering some of the decisions made by the main character, Bill. I spend a fair bit of time thinking about what motivates people at work and what’s the secret sauce that makes for a highly effective team.

Bill is the low level team manager suddenly given a promotion and real power in a highly dysfunctional operations team.

Bill wonders what to do about Brent. Brent is the archetypal brilliant guy who knows how everything works. Later Bill realises Brent is also the biggest bottleneck to getting work done. Every small project involves Brent somehow. He makes the technical decisions and has made other people depend on him. The throughput of work for the entire operations team is constrained.

I find myself wondering what the outcome of the project would have been if Bill’s first action was to fire Brent. Would the project have been finished earlier? (I don’t have a moral problem firing a fictitious character in a thought experiment. I wouldn’t do this in real life of course!)

Brent is a rockstar engineer. Lots of organisations have them. Often brilliant. Sometimes erratic. Sometimes unwilling or unable to share their knowledge. I can understand the thinking. “It’ll take me two hours to explain how to do job Y, but 10 minutes if I do it myself. Let me just do it.”

I’ve also come across otherwise smart guys who are of the mistaken belief that if they hold on to a task, something only they know how to do, it’ll ensure job security. These people are knowledge Hoarders.

It doesn’t work. Everyone is replaceable. No matter how talented they are. Sure it may take longer at first to find out how to do that special task, but it will happen without them.

The other kind of people are Sharers. They believe job security comes from sharing knowledge with colleagues. If everyone in a team thinks this way, people are able to watch each other’s backs and easily step into each other’s roles. The sum of team knowledge and capability increases. This is a highly effective team. People can even go on vacation in these ones!

I guess this post is playing devil’s advocate against hiring rockstars.

I can’t count the times I’ve seen job ads stating “Rockstar Engineer required.”

Really? Do you want a whole team of them? That might not work out so well.

Interviewing applicants for a role in my team is something that keeps me up at night. Just like Nagios! Making a mistake can mean misery for all.

It’s tempting to hire the person who appears to be the smartest, but I urge you to look deeper into what kind of person is best for the team.

Will they be a Sharer or a Hoarder of knowledge?



Infrastructure as code != Agile != devops.

by David Lutz

Melbourne has a pretty happening technology meetup scene. You can drink pizza and eat beer almost every night of the week. Oh, and learn about all kinds of topics from people who just want to share. Cloud computing, startups, Agile, JavaScript, Web Dev, Ruby, whatever tickles your fancy. It’s great.

One of those meetups is the Infrastructure Coders (aka infracoders) Melbourne group, founded by the multi-talented Matthew Jones and me, specifically for those who want to talk about “Infrastructure as Code” including things like infrastructure automation and configuration management.

For me at least (not speaking on behalf of Matt of course) the impetus for starting our group came partly from a conversation after one of the DevOps Melbourne meetups. Someone said something along the lines of… “This devops idea sounds cool but how do I get started? Can you show me how to write a Puppet manifest?”

DevOps Melbourne meetups were really popular and had grown big. I wondered if there was room for a smaller group more focussed on tools like Chef and Puppet and how to use them.

Almost a year later I was reflecting on Infracoders. We had had some brilliant talks on Infrastructure as Code, automation and configuration management with Chef, Puppet, Ansible and Babushka. But also fantastic topics on the physics of LEDs, the psychology of failure, the challenges of monitoring, how to manage your clouds and the joy of MySQL schema changes that don’t take the whole site down.

Although some of our talks had been devops related, many weren’t specifically. The thought dawned on me that while I strongly identified with the devops community, at least some of the Infracoders members didn’t. Some of our members came from big Enterprise environments that weren’t as far down that path (yet anyway). Basically I noticed that although there was some cross-over with DevOps Melbourne, this was a different group of people.

This was the same with the other meetup groups I went along to. The AWS Melbourne meetup had a couple of familiar faces, but was mostly a different group. The Web Dev meetup was a different bunch again.

Pretty obvious I know. But I started thinking about the difference between the ideas of infrastructure as code and devops.
Are they different things?
Or phrased another way.
Can you do one without the other?
Yes. Your sysadmin team can be doing infrastructure as code in a non-devops siloed environment. It’s possible (but unlikely, granted) that an organization could have a good devops environment but have manually built infrastructure.

How about Agile? Can you have Agile software development without infrastructure as code? Yes. I’ve seen environments like this. And you could have Agile infrastructure along with Waterfall software development. Possibly. They’re different things.

Is devops just an extension of Agile software development? No, I think they’re different. Agile software development is just about software development. Devops tends to be at a more holistic level. Concerned with inter-group collaboration and culture and monitoring and metrics.

Continuous delivery is naively seen by some as being devops. Back to the litmus test. Can you have one without the other? Devops without continuous delivery? Sure. There are companies that have long release cycles that are devopsie. Continuous delivery without devops? Difficult, but I’d argue yes, possible. (NoOps anyone?)

Certainly you can do devops without virtualization or cloud computing. Even though these things can help smooth the way.


To be a well rounded and adaptable professional in the technology field we should take Bruce Lee’s advice: “Use only that which works, and take it from any place you can find it”.

Whether the lesson comes from a field of engineering, or psychology, or fire-fighting or ITIL, military or medicine. Use it to be better at your job.


Test driven systems administration

by David Lutz

The practice of test driven development has been around for some time and has proven to be a useful programming technique.

The general idea is
1. write a test case first (naturally it will fail)
2. write the minimal amount of code required to make the test pass (don’t over-engineer)
3. iterate and improve the code as required (without breaking the test)

As we are thinking about infrastructure as code it seems to make sense in theory to apply the same idea to systems administration.

In practice it’s not simple or straightforward. Assuming we’re using one of the popular frameworks for configuration management such as Chef or Puppet, there are a number of different ways of going about testing the code, and frankly, if you’re starting out it’s not at all clear which is the best way to do it.  If you’re using Chef, for example, do you look at minitest, cucumber-chef, test-kitchen, foodcritic? Something else? All of the above?  It’s a hot topic in the configuration management communities. Not a solved problem.

The learning curve for either Chef or Puppet is pretty steep. You have DSLs to learn, the Ruby programming language, unintuitive directory paths, servers, certs, convergence…

What if you don’t need something as complex as Chef/Puppet, don’t need a centralized server, but do want the ability to test stuff and write scripts that have “idempotent” properties?

First, let’s take a step back and see what we can do with plain old bash shell.

The example I’ll use is a typical task for a sysadmin. Install a package on a system. I’ve picked the text based web browser, links.

So let’s go and install it straight away, with something like apt-get install links.

Woah, hold your horses!

Let’s think about it from the perspective of a configuration management application. Actually we’ve made an assumption that the package is not already installed. As a human sysadmin wouldn’t you check to see if links is installed before blindly trying to install it again?

So what’s a nice way to check to see if an application is installed? We might just type the command, or use dpkg-query… how about the ‘which’ command? This’ll give us a nice return code that we can use programmatically.

So here’s our first go at test driven sysadmin in bash.
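
A minimal sketch of such a script, assuming a Debian or Ubuntu system with apt-get and sudo available, and assuming the package has the same name as the command it installs. The function name ensure_installed and the INSTALLER variable are hypothetical, introduced only for this illustration.

```shell
#!/bin/sh
# Install a package only if its command is not already on the PATH.
# Assumption: Debian/Ubuntu with apt-get and sudo. INSTALLER can be
# overridden (for example with "echo") to do a dry run.
INSTALLER=${INSTALLER:-"sudo apt-get install -y"}

ensure_installed() {
    pkg=$1
    # The test: which returns 0 when the command is found.
    if which "$pkg" >/dev/null 2>&1; then
        echo "$pkg is already installed, nothing to do"
    else
        # The test failed, so take the action that should make it pass.
        echo "$pkg is not installed, installing now..."
        $INSTALLER "$pkg"
    fi
}
```

Calling ensure_installed links installs the browser on the first run and does nothing on later runs.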

This script has a few properties of test driven systems administration.

  • It has a test. Is the package installed? I’m relying on the return code of which being 0 (true) if it is.
  • If the test passes the script does nothing. It’s safe to run multiple times.
  • If the test doesn’t pass, the script does something to make it pass. It installs the package.

It has some shortcomings. It assumes something about the environment: that it’s a Debian or Ubuntu based Linux distribution. A better script would not make that assumption, but would work on other *nix environments. Portability is good. I typically work on OSX (BSD with homebrew) or Linux distributions with APT or RPM packages.
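
A more portable version might look something like the sketch below. It assumes only three package managers (apt-get, yum and Homebrew) and that the package name matches the command name. The function names first_available and portable_install are hypothetical.

```shell
#!/bin/sh
# Print the first command from the argument list that exists on this system.
first_available() {
    for candidate in "$@"; do
        if which "$candidate" >/dev/null 2>&1; then
            echo "$candidate"
            return 0
        fi
    done
    return 1
}

# Install a package with whichever supported package manager is present.
# Assumption: only apt-get, yum and brew are considered.
portable_install() {
    pkg=$1
    # Same test as before: do nothing if the command already exists.
    if which "$pkg" >/dev/null 2>&1; then
        echo "$pkg is already installed, nothing to do"
        return 0
    fi
    case $(first_available apt-get yum brew) in
        apt-get) sudo apt-get install -y "$pkg" ;;
        yum)     sudo yum install -y "$pkg" ;;
        brew)    brew install "$pkg" ;;
        *)       echo "no supported package manager found" >&2; return 1 ;;
    esac
}
```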

Ughhh, no one really writes scripts like this. It’s getting messy. The purpose of the script above is just to illustrate my point.

Enter Babushka. Test driven out of the box.

Babushka is a simple configuration management system. Sort of like Chef and Puppet in that the app can be used in a similar way to provision and configure a workstation, a development machine, or even a production server. But with a different focus. Babushka doesn’t have a server component. It’s a tool for automating the typical manual tasks a sysadmin would perform on a server.

The building blocks in Babushka are dependencies (aka deps), which are either met or not. If a dependency is not met (a failing test), then a block of code is executed to meet the dependency. Met and Meet!
Here’s how you’d approach the task of installing links with Babushka.

Create a subdirectory called babushka-deps

In this directory create a file. The name isn’t important but give it a .rb extension.
The format of the dep is something like this:

Now we can fill in our test (met?) and our action to take if it’s false (meet).

# babushka 'links is installed'
 links is installed {
 meet {
 Test failed. links is not installed, installing now...
 } ✓ links is installed
# babushka 'links is installed'
 links is installed {
 } ✓ links is installed

Because installing a package is such a common task, there’s a platform independent way to do it. Like this. Change the name of the dep to links.bin

.bin is a template. There are others, such as .src and .app.

# babushka 'links.bin'
 links.bin {
 apt {
 package manager {
 'apt-get' runs from /usr/bin.
 } ✓ package manager
 apt source {
 } ✓ apt source
 apt source {
 } ✓ apt source
 } ✓ apt
 meet {
 Installing links via apt... done.
 } ✓ links.bin

# babushka 'links.bin'
 links.bin {
 apt {
 package manager {
 'apt-get' runs from /usr/bin.
 } ✓ package manager
 apt source {
 } ✓ apt source
 apt source {
 } ✓ apt source
 } ✓ apt
 } ✓ links.bin

or to be platform independent

Or even

Now you don’t even need the do, end.

This will work on deb or rpm based Linux boxen or OSX. The important thing to remember is that even in its briefest form, the script above is following the test driven methodology. Test something and take action if the test fails.

The test can be something more complex than checking the return code from “which”. The idea is that you model it on the real commands you would type on the console. As a human sysadmin you might check for the existence of a file, or a process, or connect to a port, or otherwise check the response from a program.
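
The same test-then-act pattern can be sketched in plain shell for any check you like. This is only an illustration of the idea, not how Babushka itself is implemented. The function names ensure, marker_exists and create_marker and the MARKER path are all hypothetical.

```shell
#!/bin/sh
# ensure TEST ACTION: run ACTION only when TEST fails, then run TEST again
# so that an action which did not fix things is reported as a failure.
ensure() {
    test_fn=$1
    action_fn=$2
    if "$test_fn"; then
        echo "met: $test_fn"
        return 0
    fi
    echo "not met: $test_fn, running $action_fn"
    "$action_fn" || return 1
    "$test_fn"
}

# Hypothetical example: the dependency is that a marker file exists.
MARKER=${MARKER:-/tmp/example-marker}
marker_exists() { [ -f "$MARKER" ]; }
create_marker() { touch "$MARKER"; }
```

Running ensure marker_exists create_marker creates the file the first time and reports the dependency as met on every later run. The test function could just as well check for a running process or an open port, using whatever command you would normally type on the console.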
