John P.

Our First Ever Complete Server Crash!

Woopra News, September 6th, 2009 by John P.

Wow! Today we had our first ever complete server failure. After two years of continuous operation, and with a whole slew of servers churning through over 500 million pieces of data daily, it was bound to happen! So I’m glad that is over with. Now, we have to fix it and move on.

The Summary

I’d rather give you the summary than bore you with the details, so let me just say that one of our “Engines” (that’s what we call the servers that do the heavy statistics gathering) had a RAID controller failure today, which corrupted all of the data on it. This, my friends, is about as much damage as can occur in a single server.

RAID is a technology we utilize in our servers that mirrors the data across multiple hard drives. This means that if one hard drive fails, there is a copy of the data on another, and the machine continues operating. Unfortunately, in rare instances you can have a controller failure, which damages the data on all of the drives at once. This is when you pray to the God of backups that you have another copy of the data elsewhere.
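For the curious, here is a toy Python sketch of why mirroring survives a dead drive but not a bad controller. It is a simplified model, not how a real RAID controller works: the `MirroredArray` class, its methods, and the sample data are all made up for illustration, and the "fault" is modeled as the controller mangling data on its way to the drives.

```python
# Toy model of a two-disk mirror. Illustrative only: the class and method
# names are hypothetical and do not reflect any real RAID controller.

class MirroredArray:
    def __init__(self, disk_count=2):
        # Each "disk" is a dict mapping block number -> bytes.
        self.disks = [dict() for _ in range(disk_count)]
        self.failed = set()             # indexes of dead disks
        self.controller_faulty = False  # assumed failure mode, see write()

    def write(self, block, data):
        if self.controller_faulty:
            # A faulty controller mangles the data before it reaches ANY
            # disk, so every surviving mirror gets the same bad copy.
            data = b"\x00" * len(data)
        for index, disk in enumerate(self.disks):
            if index not in self.failed:
                disk[block] = data

    def fail_disk(self, index):
        self.failed.add(index)

    def read(self, block):
        # Serve the block from the first healthy disk.
        for index, disk in enumerate(self.disks):
            if index not in self.failed:
                return disk[block]
        raise IOError("no healthy disks left")


array = MirroredArray()
array.write(0, b"visits=42")

# One drive dies: the mirror still holds a good copy.
array.fail_disk(0)
survived_disk_failure = array.read(0)

# The controller goes bad: the mirror faithfully stores garbage.
array.controller_faulty = True
array.write(0, b"visits=43")
after_controller_fault = array.read(0)
```

In this model the read after the drive failure still returns the original bytes, while the read after the controller fault returns garbage. That second case is the one only a separate backup can fix.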

The Good, The Bad, and the Ugly

The bad news is that as I write this update, a small – but still significant – percentage of the Woopra user base is without their live analytics fix! This can also cause some related issues such as WordPress plugin errors communicating with the server if you are trying to check your stats in the WordPress dashboard.

The ugly news is that for as long as the server stays down, there will be no statistics to report for that period once it comes back up. This is because when the server tracking your site is out of operation, our other servers do not temporarily pick up the slack. For now, we simply have to get the server back online as fast as possible.

The good news is – we do have the historical information backed up. Thank God that we are paranoid and built a whole system for the purpose of backing up the other systems. Our backup systems have backup systems. :-) So, once the physical server rebuild is completed we will restore the data from backup and be back in operation for the thousands of you who are currently affected.

What We’re Going to Do About It!

First of all, let me say how sorry we are for any inconvenience this is causing those of you who are affected. I’m going to guess that for some of you this probably came at the worst possible time! Like, your site just hit the Digg homepage and you wanted to watch the glorious traffic in real time! Or you were watching people as they shop during this Labor Day holiday.

Well, we feel your pain. And we are actually already several steps ahead of the curve on this one. We’ve been developing a network of internal systems to ensure redundancy and scalability in cases where these events occur so that they become transparent to the end user (that’s you guys).

In the near future, a complete failure of one machine will not interrupt service, because another machine will temporarily take over. Unfortunately, this failure exposed the weakness just before it went away.
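To give a rough idea of what "another machine would temporarily take over" means, here is a minimal Python sketch of failover. It assumes a simple primary/standby pair. The `Engine` class, the `record_with_failover` function, and the engine names are hypothetical, and this is not a description of the actual Woopra system.

```python
# Hypothetical failover sketch: try the primary engine, fall back to a standby.

class Engine:
    def __init__(self, name):
        self.name = name
        self.online = True
        self.events = []

    def record(self, event):
        if not self.online:
            raise ConnectionError(f"{self.name} is down")
        self.events.append(event)


def record_with_failover(event, primary, standby):
    """Record the event on the primary engine, or on the standby if the primary is down."""
    try:
        primary.record(event)
        return primary.name
    except ConnectionError:
        standby.record(event)
        return standby.name


engine_a = Engine("engine-a")
engine_b = Engine("engine-b")

first = record_with_failover("pageview /home", engine_a, engine_b)

engine_a.online = False  # simulate the crash
second = record_with_failover("pageview /pricing", engine_a, engine_b)
```

With a setup like this, the visit that arrives during the outage lands on the standby instead of being lost, which is exactly the gap described above.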

The End

I suppose I didn’t really need to say that. But in summary just know that we are working very hard to get service restored to those of you who are affected, and we will be updating you via the normal Woopra Status Twitter account.

Elie Khoury, Lorelle and I (@johnpoz) also routinely give updates in case you wish to follow us or communicate directly with us.

Thanks for your patience and understanding as we work to resolve this matter.

22 Responses to “Our First Ever Complete Server Crash!”

  1. In the land of Poz, even Woopra is not invincible it seems! But with great service and communication to the users, I’m one of many whom I’m certain will forgive and forget. Keep up the great work.

  2. Wurreker says:

    Great stuff and well handled. Thanks for the update.

  3. It’s a good sign! It’s a sign of growth :)

  4. planetmitch says:

    Thanks for the update – I can hardly imagine what you’re going thru… my websites are never that critical. Thanks for letting us know why we can’t connect.

  5. Ron Pare says:

    Oh boy, I bet this caused some tense moments still. This is Murphy’s Law here to make us think. Yeah right, redundancy… (-=

    My service is running fine thankfully.
    Ron Pare
    http://www.modelersguild.com

  6. It is not a problem. Staying blind for a short period is not bothering if we count what we get at the rest of the time.
    Thanks again for your spent time fixing this problem.

    I just cannot imagine monitoring my web page without woopra :D

  7. John P. says:

    Thanks for the understanding and kind words all. It has been a little tense around here, but we’ll be back to normal soon.

    Cheers!

    John P.

  8. Sir Marky says:

    I don’t n’n’n’n’need it. I’m not really shaking in the corner with withdrawal symptoms. So alone…so very alone….

  9. Sir Marky says:

    It’s alive! I’m back in! I can stop shaking now! :-)

  10. tmpatton says:

    Well at least now I know why it hasn’t been working well since I moved to wordpress a week ago I guess LOL could this have been the cause days leading up to the crash or has it crashed several times

  11. My service seems up and down today?

  12. aTc says:

    wooow… after 2 years of running and first breakdown this is nothing lads.. Tho I feel the pain with you for server..
    keep up the proud face, you are doing amassing job over there
    Thanks for update!

  13. atc says:

    wooow… after 2 years of running and first breakdown this is nothing lads..
    keep up the proud face, you are doing amassing job over there
    Thanks for update!

  14. Llion says:

    I think we can overlook it on such a great product – especially in Beta. Thanks for everything.

    BTW, although the backup has worked for our site, we have lost all traffic data for this morning. No problem as we also run GA but is it lost forever?

    Thanks again.

    Llion

  15. Lightsource Media says:

    Hmm…

    I had perfectly working stats for today until a couple of hours ago (so midnight until 3pm UK time), but now they’ve been wiped and it’s only counting from 3pm onwards (ie the last 2 hours). So it was working but now they’ve been wiped?

  16. stellan says:

    Hi John!
    I must admit, I was quite pissed off there for a while, especially since one of our websites have broken its traffic record today.
    But I have to give thumbs up for your openness. My confidence in Woopra is intact – keep up the good work!

    /Stellan in Sweden

  17. E-TARD says:

    its cool
    things like this happen
    you just got to roll with it :)

  18. mitch1321 says:

    Just got off a conference call with my boss (who watches Woopra all day but has no idea what any of this info means) who said and I quote “They could crash once a month and I’d still love them. We are never going back to Google.”

  19. You guys handled it so well, I didn’t even notice it was down! Great job handling the outage!

  20. Erik says:

    You guys did a good job rebuilding the broken server that fast.

    Is there a chance for the old traffic information that was stored on the server to be recovered / restored?

  21. Gonzague says:

    very good job at communicating with your users !
