Startups @ Scale: Log Everything, then you can Manage Anything.

One thing that hasn’t changed during my time at Ignighter is the importance of our in-house analytics.  Ever since our first lecture at Techstars 2008, when we were prodded to “obsess over core metrics”, we’ve been obsessed with our usage data.  Having the right information on-demand is essential to being nimble in your decision making as a management team.

“If you aren’t measuring it, you can’t manage it” — Greg Tisch

As CTO, I’m the one responsible for maintaining our business intelligence infrastructure, so I’ve been logging anything that’s remotely significant to our decision-making process.  We’ve been doing this for years and it’s not my first rodeo, but shit, so many things have changed as we’ve scaled.

Many of these things may evolve for your project too.

  1. The business model
  2. The volume of data
  3. The usage patterns
  4. The systems
  5. The technical architecture
  6. The reporting system
  7. The market

When we first started the company, I was logging all of our usage data to our MySQL database.  A DB write (maybe several) on every pageload!?  Boy, was that dumb!  It didn’t take long before the site was buckling under the load.  One of the first rules of disk-bound databases is that writes are the most expensive operation you can perform.  Even when you set up master-slave MySQL replication you can’t scale writes by much, since every write query must also be replayed on each slave to keep it up to date; replication only spreads your reads.

Let me give you a snapshot of how things look nowadays, when we’re on the order of many many millions (maybe more – I’m not at liberty to say) of loggable operations daily.

  1. We’ve built a Logger class which lets us route error, debug, user usage, user statistics, and general info messages to either the filesystem or the database.
  2. For flat filesystem logs, we use the open-source syslog-ng to log operations as they happen.  Since syslog-ng can handle on the order of 150k loggable operations per second, it’s an ideal tool.  (There’s a minimal sketch of this path right after the list.)
  3. For database logs, we use memcached as a buffer.  Basically, what you want is an ‘Aggregated Stat’ class with an interface that updates a counter in memcache every time an action happens, then periodically flushes the results to the database.  (There’s a sketch of this below too.)
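
To make item 2 a little more concrete, here’s a minimal sketch of what the flat-log side of a Logger class might look like in python: one tab-separated stat line per event, written through the local syslog socket so syslog-ng can route it to a stats file.  The logger name, field layout, and file names here are assumptions for illustration, not our actual code.

import logging
import logging.handlers

# Hypothetical example: a 'stats' logger that writes one tab-separated line per
# event to the local syslog socket (Linux '/dev/log'), where syslog-ng can route
# it to a flat stats file such as LOG_STATS.log.
stat_logger = logging.getLogger('stats')
stat_logger.setLevel(logging.INFO)
stat_logger.addHandler(logging.handlers.SysLogHandler(address='/dev/log'))

def log_stat(segment1, segment2, value=1):
    # e.g. log_stat('LoginsByUser', uid) on every successful login
    stat_logger.info('\t'.join(['STAT', str(segment1), str(segment2), str(value)]))

log_stat('LoginsByUser', 12345)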
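
And here’s a minimal sketch of the ‘Aggregated Stat’ idea from item 3: bump a counter in memcached on every action, then flush the counters to the database on a timer.  The table and column names match the query shown further down; the class interface itself is an assumption, not our production code.

import time
import memcache  # python-memcached; assumes a memcached daemon is running locally

class AggregatedStat:
    """Buffer counters in memcached, then flush them to MySQL on a timer."""

    def __init__(self, servers=('127.0.0.1:11211',)):
        self.mc = memcache.Client(list(servers))
        self.touched = set()  # keys bumped since the last flush (per-process simplification)

    def bump(self, segment1, segment2, delta=1):
        # Count one action, e.g. bump('LoginsByUser', uid); no DB write on the request path.
        key = 'aggstat:%s:%s' % (segment1, segment2)
        self.mc.add(key, 0)      # create the counter if it doesn't exist yet
        self.mc.incr(key, delta)
        self.touched.add(key)

    def flush(self, db):
        # Write the buffered counters into AggregatedStats and zero them out.
        # 'db' is any DB-API connection that uses %s placeholders (MySQLdb, pymysql, ...).
        cur = db.cursor()
        now = int(time.time())
        for key in list(self.touched):
            value = self.mc.get(key)
            if not value:
                continue
            _, seg1, seg2 = key.split(':', 2)
            cur.execute(
                "INSERT INTO AggregatedStats (Segment1, Segment2, Value, AddDate)"
                " VALUES (%s, %s, %s, %s)",
                (seg1, seg2, int(value), now))
            self.mc.set(key, 0)  # reset so the next window starts from zero
        db.commit()
        self.touched.clear()

The per-request path only ever touches memcached; a cron job (or a long-running worker) calls flush() with a normal DB connection every minute or so.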

After this you just need to decide what’s relevant and loggable, and whether to put it in the filesystem or the database.  Filesystem logging is more scalable, but there are advantages to having the data in our database too: there’s no easy way to query your flat logs from within your application.  For example, in the application I like being able to know how many times user x has logged in over the past month.  That’s as easy as:

mysql> SELECT SUM(`Value`) AS `NumLogins` FROM `AggregatedStats` WHERE `Segment1` = 'LoginsByUser' AND `Segment2` = '[uid]' AND `AddDate` > (UNIX_TIMESTAMP() - 60 * 60 * 24 * 30) LIMIT 1;

Whereas, with the flat logs, it looks much more like:

$ cat LOG_STATS.log | awk -F'\t' '{print $3$4$5}' | grep LoginsByUser | grep [uid] | wc -l

Now I can access data about anything at any time.  This system scales, and it’s nimble enough to handle queries you didn’t foresee.  Of course, having the ability to view this information doesn’t mean anyone’s actually going to do it.

As a matter of practicality, I’ve found it useful to provide the following tools (and to make sure they are blazing fast).  All of them are plain-vanilla open-source and 100% FREE too!

  • A nightly email script that rolls up the ERROR and STAT logs and sends the most interesting tidbits to the team.
  • Make the data available in our open-source graphing system, Graphite.  I’m a huge, huge fan of graphite.  Importing data into it is as easy as writing a script that scrapes the flat logs periodically and passes the numbers into an included python script (a minimal sketch of that loop follows this list).  Big ups to Etsy for letting me know about graphite.
    Did I mention yet that I’m a super-fan of graphite?  It’s nimble, fast, and it scales.  If you choose one tool from this post to implement, choose graphite.
  • Plug the data into your team’s private twitter-bot.
    (A sample tweet appeared here; the numbers in it were not actual usage data.)
  • Make the data available in the admin section of your application.  I’ve found it useful to take the queries I run frequently, give them a name that even the business monkeys can understand, and make them available to everyone via a ‘reports’ section.  (Just kidding, Adam and Dan, you’re not monkeys.)
  • Make the data available via a board-level reporting system that includes ONLY the key metrics.  The exclusivity of this reporting system is what makes it special: only the KEY metrics make it in!
  • I have a data porn (get it, cause data is fun to look at? 😉) box with several monitors in the office that shows me how everything’s been going over the past 24 hours.  I especially like chartbeat for this.
  • Nagios is a great tool for alerting your team to systems issues.  Munin lets you see system-level information (CPU usage, load average, network transfer, swap i/o) over time.
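
Since a couple of the bullets above mention scraping the flat logs and feeding the numbers into graphite, here’s a minimal sketch of that loop using carbon’s plain-text protocol (one ‘metric value timestamp’ line per data point, sent to port 2003).  The host name, metric path, and log format are assumptions for illustration, not our actual setup.

import socket
import time

CARBON_HOST = 'graphite.internal'  # hypothetical carbon host
CARBON_PORT = 2003                 # carbon's default plain-text listener

def count_logins(log_path):
    # Count LoginsByUser lines in the flat STAT log (same spirit as the awk one-liner above).
    with open(log_path) as f:
        return sum(1 for line in f if 'LoginsByUser' in line)

def send_metric(metric, value, timestamp=None):
    # Push a single data point to carbon as 'metric value timestamp\n'.
    ts = int(timestamp or time.time())
    with socket.create_connection((CARBON_HOST, CARBON_PORT), timeout=5) as sock:
        sock.sendall(('%s %s %d\n' % (metric, value, ts)).encode('ascii'))

if __name__ == '__main__':
    send_metric('stats.logins.total', count_logins('LOG_STATS.log'))

Run it from cron every minute or so and graphite handles the rest: retention, aggregation, and the graphs themselves.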

 

Informed decision making made easy!  Watch out, Zoltar: now even we mere mortals can tell you anything about anything.

The usual warnings apply.  These are all just ideas, and your mileage may vary based on your technical ability, execution, and gumption.  I’d love to hear what your team uses and how it compares to what I’ve outlined in this post!  Leave me a tweet or a comment below.

Did you know? Ignighter is hiring.  We’re based in NYC, work hard, have a lot of fun, build cool shit, and we’re backed by some of the best investors in the business.  Check out our open development positions.

Note: Any information of proprietary value to my employer has been removed or approved, and this post has been approved by my employer.
