Downtime Postmortem

During our launch at DEMO, we experienced some flapping availability that lasted several hours. We promised to write more about it once DEMO was over. It’s been quite a while since then, but we still want to explain what happened and what we’re doing to give you the best possible experience on Upverter.

What happened:

Early in the morning of Sept 13th, around 4:30am, our monitoring system detected that our production server wasn’t responding. The box came back up after a full reboot through the AWS console, but it continued to freeze up every 30 minutes or so for the next several hours.

When the box froze up, we noticed that the SSH daemon would still complete its handshake. By running several commands directly over SSH (instead of starting an interactive shell), we were able to ascertain that the primary disk on the instance was suffering periodic write failures. Whenever a write failed, the kernel would hang any process that required disk access.
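
As a rough illustration of running one command over SSH without opening an interactive shell, here is a minimal Python sketch. The host name, the remote command, and the helper names `build_ssh_command` and `run_remote` are hypothetical; the post does not say which commands were actually run. `BatchMode` and `ConnectTimeout` are standard OpenSSH client options, and the local timeout matters because a command that touches a failing disk may never return.

```python
import subprocess


def build_ssh_command(host, remote_command, connect_timeout=10):
    """Return an argv list that runs one command on `host` without an interactive shell."""
    return [
        "ssh",
        "-o", "BatchMode=yes",                      # never stop to prompt for a password
        "-o", f"ConnectTimeout={connect_timeout}",  # give up if the connection stalls
        host,
        remote_command,
    ]


def run_remote(host, remote_command, timeout=30):
    """Run the command; return (exit_code, output). exit_code is None on a local timeout."""
    try:
        result = subprocess.run(
            build_ssh_command(host, remote_command),
            capture_output=True,
            text=True,
            timeout=timeout,  # a hung remote process must not hang us too
        )
    except subprocess.TimeoutExpired:
        return None, ""
    return result.returncode, result.stdout + result.stderr
```

For example, `run_remote("prod.example.com", "dmesg | tail -n 20")` would try to fetch the most recent kernel messages, which is one place disk write errors tend to show up.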

Experience has taught us the hard way that disk errors are rarely isolated. So, just to be safe, we switched over to our hot spare/DB slave. During the switchover we noticed that our DNS TTLs were set to 3600 seconds. So that clients with cached DNS entries still pointing at the original server wouldn’t fail, we set up the original server to reverse proxy to our new server.
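
To make the TTL problem concrete, here is a small Python sketch of the arithmetic. It assumes resolvers honor the TTL, so a resolver that cached the old record just before the change may keep returning it for up to one full TTL. The function names and any timestamps passed in are hypothetical; only the 3600-second TTL comes from the incident.

```python
from datetime import datetime, timedelta

DNS_TTL_SECONDS = 3600  # the TTL we found on our records


def stale_cache_deadline(record_changed_at, ttl_seconds=DNS_TTL_SECONDS):
    """Latest time a TTL-honoring resolver may still hand out the old address."""
    return record_changed_at + timedelta(seconds=ttl_seconds)


def old_server_still_needed(now, record_changed_at, ttl_seconds=DNS_TTL_SECONDS):
    """True while some clients may still be sent to the old server."""
    return now < stale_cache_deadline(record_changed_at, ttl_seconds)
```

With a record changed at 12:00 and a 3600-second TTL, the deadline is 13:00, so the old server has to keep answering (or proxying) for up to an hour after the DNS change.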

We weren’t at risk of losing any data. We make full, off-site backups in addition to partial backups multiple times a day; our database is also replicated on multiple servers.

What we’ve learned from the failure and improved:

  • Higher-granularity alerting
  • High-granularity, real-time machine stats monitoring
  • Full Puppet definition for production configurations with pre-built packages
  • Elastic IP addresses for front-end servers for faster failover
  • Faster deployment process

The real-time monitoring gives us quicker feedback that something is going wrong, and its historical machine metrics help us hunt down the root cause.

Using Puppet (puppetlabs.com) allows us to provision a full production instance in one click. The entire process takes just over 5 minutes. All of our packages (both binaries and Python eggs) are installed at specific versions. No more worrying that APT or easy_install/pip will bring in an incompatible package.
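
One way to keep Python dependencies honest about version pinning is a small check like the sketch below. It assumes a pip-style requirements list, which is not necessarily how our Puppet definitions express versions, and the function name `unpinned_requirements` is made up for illustration. It treats a requirement as pinned only if it uses `==`.

```python
def unpinned_requirements(lines):
    """Return requirement lines that do not pin an exact version with '=='."""
    unpinned = []
    for raw in lines:
        line = raw.split("#", 1)[0].strip()  # drop comments and surrounding whitespace
        if not line or line.startswith("-"):
            continue  # skip blank lines and pip options such as "-r other.txt"
        if "==" not in line:
            unpinned.append(line)
    return unpinned
```

Running it over a requirements file before a deploy and failing on a non-empty result would catch a loose specifier such as `>=` before it can pull in an incompatible release.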

Elastic IPs are great for taking front-line servers out of rotation: switchover is near instantaneous, and there is no need to wait for DNS changes to propagate.

We now have a single-command deploy mechanism. With it we can push changes and fixes several times a day.
