Things I learnt tracking a billion events in 24 hours

Posted by ben 06 Sep 2010

As I wrote yesterday, an insanely popular little game integrated Playtomic with pretty interesting results.

Here's what I learnt:

Know your host

Don't be anonymous hosting customer #7822. Because I have Dave from Hivelocity on my MSN Messenger, I had extra VPS machines being set up very soon after I realized what was going on. Just figuring out what was going on was a challenge, since hundreds of thousands of people were connecting to each server and I couldn't even get into them. The wrong way to do things is lodging tickets that take time to get read, let alone acknowledged and responded to ... having a direct line to your host rocks.

Scaling isn't just servers

The real bottleneck for Playtomic is the log processing software. I'd been thinking about what the next step was for the log processing software for a couple weeks although I expected I would have the rest of this year to redesign it. Turns out I only needed a late night and then most of today to get it sorted after all. The software now operates in two parts:

First the logs are pre-processed, during which all the little bits and pieces are taken care of, like adding new sources and metrics and making sure all the necessary rows are in the database. This reduces a few hundred megabytes of raw log files into one file full of database updates that need to be done - add x views to this row in this table.

The actual processing now works from those lists of database operations, and it also merges as many as it can to reduce them down to the least amount of work before executing them.
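To make the two parts concrete, here is a minimal sketch in Python. It is not the actual Playtomic code: the `game_id,event` log format, the `game_stats` table name and both function names are assumptions made up for illustration. The point is only that many small "add 1 to this row" operations collapse into a few bigger ones before the database is touched.

```python
from collections import defaultdict


def preprocess(log_lines):
    """Part 1: turn raw log lines into a list of database operations.

    Each line is assumed to look like "game_id,event" (hypothetical format).
    Each operation is (table, row_key, column, amount).
    """
    ops = []
    for line in log_lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        game_id, event = line.split(",", 1)
        ops.append(("game_stats", game_id, event, 1))
    return ops


def merge(ops):
    """Part 2: add together operations that hit the same row and column."""
    totals = defaultdict(int)
    for table, row_key, column, amount in ops:
        totals[(table, row_key, column)] += amount
    return [(t, r, c, n) for (t, r, c), n in totals.items()]


if __name__ == "__main__":
    raw = ["g1,views", "g1,views", "g2,plays", "g1,views"]
    for op in merge(preprocess(raw)):
        print(op)
```

With this shape, four log lines become two updates instead of four, and the saving grows with how often the same rows get hit.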

Prior to last night it was all being done in a single phase which was OK with ~200 million events per day... during peak loads reports were still updating every 20 minutes or so and in quiet times just 2 or 3 minutes. Yesterday though once the servers were stabilized data was arriving much faster than it was being processed.

My servers need to talk to me more

This is an annoying one. Because the load until now has almost always been very manageable, especially as I've tuned and refined the system, I hadn't really anticipated a sudden leap from a couple thousand events per second to a couple tens of thousands. My servers need to be able to tell me when they reach a certain threshold so I can hopefully resolve the situation.

Currently I track CPU, memory, requests per second and connections on all my servers using a great little dashboard but a more proactive early warning system is obviously going to be vital in the future. It wouldn't have actually helped yesterday at all since that went from normal to FUBAR in about 60 seconds, but it will help in more natural scenarios.
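A rough idea of what that early warning check could look like, sketched in Python. The metric names, the limits and the `check_thresholds` function are all assumptions for illustration, not something the dashboard provides; real limits would have to come from watching normal load for a while.

```python
# Hypothetical limits; the real values would depend on each server.
THRESHOLDS = {
    "cpu_percent": 85.0,
    "memory_percent": 90.0,
    "requests_per_second": 5000,
    "connections": 20000,
}


def check_thresholds(server, metrics, thresholds=THRESHOLDS):
    """Return one warning message per metric that has reached its limit."""
    alerts = []
    for name, limit in thresholds.items():
        value = metrics.get(name)
        if value is not None and value >= limit:
            alerts.append(f"{server}: {name} is {value}, limit is {limit}")
    return alerts


if __name__ == "__main__":
    sample = {"cpu_percent": 97.0, "connections": 1200}
    for message in check_thresholds("tracker1", sample):
        print(message)  # in practice this would go out by email, SMS or IM
```

Something like this would run every minute or so against the same numbers the dashboard already collects.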

Kill switches for users

I obviously need a way to deactivate games so this can't happen in the future. Having all the traffic routed through a single subdomain means I can't disable any individual game; whatever I do still leaves them using resources. In the not-too-distant future I'll set things up so every game uses its own unique subdomain which I can re-route to nowhere, so it's one person's problem instead of everybody's. The API in games is self-disabling: when it encounters connectivity issues it just stops trying to send, so this solution should make a repeat of yesterday very easy to avoid.
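The self-disabling behaviour is what makes the subdomain trick work, so here is a small sketch of the idea in Python. This is not the real game API: the `Tracker` class, the `send` callback and the "give up after 3 failures in a row" rule are assumptions for illustration only.

```python
class Tracker:
    """Client-side tracker that switches itself off when sends keep failing."""

    def __init__(self, send, max_failures=3):
        self._send = send                  # function that delivers one event
        self._max_failures = max_failures  # assumed cut-off, not the real one
        self._failures = 0
        self.enabled = True

    def log(self, event):
        """Try to send an event; return True only if it was delivered."""
        if not self.enabled:
            return False  # already disabled, no more network traffic
        try:
            self._send(event)
        except OSError:
            # Connection errors (refused, unreachable, ...) are OSError subclasses.
            self._failures += 1
            if self._failures >= self._max_failures:
                self.enabled = False
            return False
        self._failures = 0  # a success resets the failure count
        return True


if __name__ == "__main__":
    def dead_subdomain(event):
        raise ConnectionError("subdomain routed to nowhere")

    tracker = Tracker(dead_subdomain)
    for _ in range(5):
        tracker.log("view")
    print(tracker.enabled)
```

Once a game's subdomain points at nothing, every copy of that game stops sending after a few attempts and the rest of the servers never see the traffic.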

What you don't know is the problem

I always imagined the killer moment for Playtomic would be some widely popular social app tossing the API in and unleashing a ton of activity. It turns out there's this site called Conduit which I'd never even heard of - although they're Alexa rank #32 - and they have a toolbar, and in that toolbar you can add a kind of game I've never even seen before. Conduit has 170 million users, and PiTSi has 40 million installs. Usually I'm pretty skeptical when sites or, worse, PR say bla bla bla we have eleventy billion users. For some reason I don't doubt Conduit's numbers at all.

Don't mix server roles

At the moment the main server handles websites and databases and shares the tracking load. Then there are 2 (although currently 4) servers dedicated exclusively to tracking. Because the main server shares the tracking load, when it got hit, it and everything else went down too.

This is a really tough problem that I'm sure every startup faces ... it's not like anyone wants to do things this way, it comes down to money. I want separate database servers and another one for the web sites but it's just not feasible until I roll out the billing system and get some cash flowing in.

Know your most important users outside of your site

I spend most of the day signed into the chatroom at FlashGameLicense, and many more of my users are on my MSN, which allowed me to explain to at least some of them why nothing was working and what was being done about it.

In general I'm pleased with how it turned out in the end. A massive load was thrown at Playtomic and it hurt, but it limped through, and now I'm much better prepared for the future.
