When the box froze up, we noticed that the SSH daemon would still complete its handshake. By running several commands directly over SSH (instead of starting a shell), we were able to determine that the instance's primary disk was suffering periodic write failures. When a write failure occurred, the kernel would hang any process that touched the disk.
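A minimal sketch of that kind of probe: running a single command over SSH (rather than opening a login shell) means a hung disk can't trap the session, and a `timeout` in front turns a stalled write into a clean non-zero exit instead of a freeze. The probe is shown here against the local disk; in practice you'd prefix it with `ssh db1` (a hypothetical hostname) to check the remote machine.

```shell
# Write a small block and fsync it; if the disk is refusing or hanging
# writes, timeout kills the probe and we fall through to the else branch.
if timeout 10 sh -c 'dd if=/dev/zero of=/tmp/write-probe bs=4k count=1 conv=fsync 2>/dev/null'; then
    echo "disk writes OK"
else
    echo "disk write probe FAILED or hung"
fi
```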
Experience has taught us the hard way that disk errors are rarely isolated, so just to be safe we switched over to our hot spare/DB slave. During the switchover we noticed that our DNS TTLs were set to 3600 seconds. To make sure that cached DNS entries wouldn't fail, we set up the original server to reverse proxy to the new server.
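The post doesn't say which proxy was used, but the core of such a bridge with nginx (an assumption) is just a `proxy_pass` on the old box that forwards everything to the new one until the hour-long TTL runs out:

```nginx
# On the failing server: forward all HTTP traffic to the new box while
# cached DNS entries (TTL 3600s) drain. 203.0.113.20 is a placeholder.
server {
    listen 80;
    location / {
        proxy_pass http://203.0.113.20;
        # Preserve the original Host header so the app routes correctly,
        # and record the real client address for logging.
        proxy_set_header Host $host;
        proxy_set_header X-Forwarded-For $remote_addr;
    }
}
```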
We weren’t at risk of losing any data. We make full, off-site backups in addition to partial backups multiple times a day; our database is also replicated on multiple servers.
What we’ve learned from the failure and improved:
- Higher-granularity alerting
- High-granularity, real-time machine stats monitoring
- Full puppet definition for production configurations with pre-built packages
- Elastic IP addresses for front end servers for faster failover
- Faster deployment process
The real-time monitoring gives us quicker feedback that something is going wrong, and the historical machine metrics help us hunt down the root cause.
The use of Puppet (puppetlabs.com) allows us to provision a full production instance in one click; the entire process takes just over five minutes. All of our packages (both binaries and Python eggs) are installed at specific versions, so there's no more worrying that APT or easy_install/pip will bring in an incompatible package.
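Version pinning in Puppet amounts to giving each `package` resource an exact `ensure` version instead of `latest`. A hypothetical manifest fragment (the package names and versions below are placeholders, not ours):

```puppet
# Pin packages to known-good versions so provisioning never pulls in an
# incompatible release via APT or pip.
package { 'nginx':
  ensure => '1.4.6-1ubuntu3',   # exact APT package version (placeholder)
}

package { 'simplejson':
  ensure   => '3.3.0',          # exact Python package version (placeholder)
  provider => 'pip',
}
```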
Elastic IPs are great for taking front-line servers out of rotation: switchover is near instantaneous, with no need to wait for DNS changes to propagate.
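The failover itself is a single API call that detaches the address from the failed server and attaches it to the standby, so clients keep connecting to the same IP. A sketch with the AWS CLI (the instance ID and address are placeholders; this needs live AWS credentials to run):

```shell
# Re-point the Elastic IP at the standby instance. Clients reconnect to
# the same public address, so no DNS propagation delay is involved.
aws ec2 associate-address \
    --instance-id i-0standby123 \
    --public-ip 203.0.113.10
```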
We now have a single-command deploy mechanism, which lets us push changes and fixes several times a day.