Startups @ Scale: Make the abstract actionable

startups | 10/09/2011

This post is pretty technical. If you don’t cross your 1s or circle your 0s, then it’s probably best to move on to something more fruitful for you, business monkey.

I’ve always thought that a major challenge in building a dev team is continuously improving how effectively you can respond to day-to-day changes in your metrics. One of the tasks I face is a daily sweep of our error logs. If you, like us, run a website and log everything properly, your error logs probably look something like this:

(These error logs have been scrubbed of any actual usage or error data, and their use has been approved by my employer)

Oct 7 01:34:14 10.182.41.217 httpd: app14 JSError http://www.DOMAIN.com/Inbox/ 31732521Script error.0 Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.9.2.23) Gecko/20110920 Firefox/3.6.23 [version1.xxx] [OnErr:] [_GET:uerystring:EL/,E:JSError,EI:http://www.DOMAIN,EI2:31723 Script erro,type:error]

Oct 7 01:34:22 10.182.36.33 httpd: app1 404 http://ww1.DOMAIN.com/http://ww1.DOMAIN.com/Referred from UID Email: None IP:66.249.171.179 Location – Los Altos, CA [version1.xxx]

… etc. ad nauseam …

The problem: a lack of actionable information

Confusing? Probably. Boring? Absolutely. I’d be willing to bet that your eyes just skipped right past those error logs to the beginning of this paragraph. Getting one of these in my inbox every morning is pretty much a recipe for a wild goose chase (and a not-so-fun one at that). It’s a real gumption trap for my team.

Providing Value

This post is all about turning that wild goose chase into an effective process. With that in mind, I recently delivered a project to make Ignighter’s error log maintenance process more effective. It automatically answers the following questions for my team:

What’s the root cause of each error?
What is its urgency?
How often is it happening?
What does the timeline for these errors look like?
Who’s in charge of figuring them out?

If you ever work for an Ignighter development team, here is an example of an error log email you might get from me.

Subject: Error Action Items for Thursday 10/06/2011 (109629 items)

LEADERBOARD:

  <strong>kevin</strong> : 941   (801 phpErr's,140 SQL errs )    
  <strong>steve</strong> : 56427   (56408  unserializableClsoure errs, 2 Redis Errs, )    - you're way past <a href="http://imgs.xkcd.com/comics/ballmer_peak.png">your ballmer peak bro</a>
  <strong>mike</strong> : 59529   (56468 EmailWithoutMessageException errs, 3061 404 errs)    - every day this happens, a fairy dies
  <strong>john</strong> : 353   (353 phpOutOfMem errs )    - less than 500.  keep it up!
  <strong>joe</strong> : 20975   (20975 JSError's )    - you MUST watch <a href="http://www.youtube.com/watch?v=oHg5SJYRHA0">this video</a> today ==&gt;

The error log aggregator is the arbiter of cleanup responsibility. The first thing in the email is a ‘leaderboard’, which heckles a developer if a service they maintain has many errors! That means I no longer need to send needy emails asking developers to clean things up. It’s worth noting that heckling your team only works if you have the kind of culture that supports lighthearted fun and constructive criticism (we do).
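As a rough illustration of how a leaderboard like this could be assembled, here is a minimal Python sketch. It is not Ignighter’s actual code: the `OWNERS` map is only inferred from the sample email above, and `PRAISE_BELOW`, `build_leaderboard`, and the “time to clean up” message are hypothetical names and choices made for this example.

```python
from collections import defaultdict

# Hypothetical ownership map, inferred from the sample email above.
# The real assignment of error types to developers is an assumption.
OWNERS = {
    "phpErr": "kevin",
    "SQL": "kevin",
    "unserializableClsoure": "steve",
    "RedisConnectInstance": "steve",
    "EmailWithoutMessageException": "mike",
    "404": "mike",
    "phpOutOfMem": "john",
    "JSError": "joe",
}

# Hypothetical threshold: the sample email praises a total under 500.
PRAISE_BELOW = 500


def build_leaderboard(counts):
    """Group per-type error counts by owner and return one text line each."""
    per_owner = defaultdict(dict)
    for error_type, count in counts.items():
        owner = OWNERS.get(error_type, "unassigned")
        per_owner[owner][error_type] = count

    lines = []
    # Sorting by total (largest first) is a choice made for this sketch.
    ranked = sorted(per_owner.items(), key=lambda kv: -sum(kv[1].values()))
    for owner, types in ranked:
        total = sum(types.values())
        by_size = sorted(types.items(), key=lambda kv: -kv[1])
        detail = ", ".join(f"{n} {t}" for t, n in by_size)
        if total < PRAISE_BELOW:
            note = f"less than {PRAISE_BELOW}. keep it up!"
        else:
            note = "time to clean up"
        lines.append(f"{owner} : {total} ({detail}) - {note}")
    return lines
```

Feeding it a dictionary of counts per error type returns one line per developer, and error types with no known owner land under “unassigned” so they don’t silently disappear.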

Next, we display the most prevalent items in the error log. Nothing fancy here, but I can now see at a glance what the most pressing issues are.

(Again, these error logs have been scrubbed of any actual usage or error data)

LOG SUMMARY (20975 items):

  56468  <a href="#EmailWithoutMessageException">EmailWithoutMessageException</a>
  56408  <a href="#unserializableClsoure">unserializableClsoure</a>
  20975  <a href="#JSError">JSError</a>
  2466  <a href="#404">404</a>
  800  <a href="#phpErr">phpErr</a>
  353  <a href="#phpOutOfMem">phpOutOfMem</a>
  140  <a href="#SQL">SQL</a>
  2  <a href="#RedisConnectInstance">RedisConnectInstance</a>
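A summary like the one above boils down to counting log lines per error type. The sketch below shows one way to do that in Python. It assumes every line follows the shape of the two sample log lines earlier in this post (timestamp, host, `httpd:`, app name, error type). The regular expression, the function names, and the `UNPARSED` bucket are illustrative assumptions, not the real aggregator.

```python
import re
from collections import Counter

# Assumed line shape, based on the two sample lines earlier in the post:
#   "<Mon> <day> <HH:MM:SS> <host> httpd: <app> <ErrorType> <rest>"
LINE_RE = re.compile(
    r"^(?P<month>\w{3})\s+(?P<day>\d{1,2})\s+(?P<time>\d{2}:\d{2}:\d{2})\s+"
    r"(?P<host>\S+)\s+httpd:\s+(?P<app>\S+)\s+(?P<type>\S+)\s*(?P<rest>.*)$"
)


def summarize(lines):
    """Count log lines per error type; lines that don't parse are kept visible."""
    counts = Counter()
    for line in lines:
        match = LINE_RE.match(line.strip())
        counts[match.group("type") if match else "UNPARSED"] += 1
    return counts


def format_summary(counts):
    """Render counts most-frequent first, in the style of the LOG SUMMARY."""
    total = sum(counts.values())
    body = [f"  {n}  {t}" for t, n in counts.most_common()]
    return [f"LOG SUMMARY ({total} items):", ""] + body
```

Counting unparseable lines under their own label, rather than dropping them, keeps the total honest when the log format drifts.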

From there, I aim to provide as much actionable detail as possible about each critical error type. See the example of MySQL errors below (from which much Ignighter-specific information has been scrubbed).

So, what?

Now that we’re all aware of the volume, priority, and assignee of each type of issue our system is encountering, we’re all much more efficient.

Want to build your own? If you’re any good with awk or graphite, you could probably do the same for your team with a modest investment of an hour or two.
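If Python is more your speed than awk, the timeline question (“what does the timeline for these errors look like?”) can be sketched in a few lines. This is only an illustration under assumptions: it expects syslog-style lines shaped like the samples in this post, and `STAMP_RE`, `hourly_timeline`, and `render` are names made up for this example.

```python
import re
from collections import Counter

# Assumes syslog-style lines like the samples in this post:
#   "Oct 7 01:34:14 <host> httpd: <app> <ErrorType> ..."
# Group 1 is the hour, group 2 is the error type.
STAMP_RE = re.compile(
    r"^\w{3}\s+\d{1,2}\s+(\d{2}):\d{2}:\d{2}\s+\S+\s+httpd:\s+\S+\s+(\S+)"
)


def hourly_timeline(lines, error_type):
    """Return a 24-slot list of counts per hour for one error type."""
    buckets = Counter()
    for line in lines:
        match = STAMP_RE.match(line)
        if match and match.group(2) == error_type:
            buckets[int(match.group(1))] += 1
    return [buckets[hour] for hour in range(24)]


def render(timeline, width=40):
    """Draw one text bar per hour, scaled so the busiest hour fills `width`."""
    peak = max(timeline) or 1  # avoid dividing by zero on an empty timeline
    return [
        f"{hour:02d}h {'#' * (count * width // peak)} {count}"
        for hour, count in enumerate(timeline)
    ]
```

Printing `render(hourly_timeline(lines, "JSError"))` gives a crude text chart that fits in a plain-text email, which is usually enough to tell a steady drip from a sudden spike.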

How does your team keep on top of application metrics and logs? Leave a comment below.

If you’re a developer who is looking to work in an efficient, fun environment that empowers you, check out Ignighter’s open positions.

Note: The usage of any information of proprietary value to my employer has been removed, and this post has been approved by my employer.