Outages and improvements...

This past week, the primary Haskell.org server hosting the wiki and mailing systems went down due to RAID failures in its underlying hardware. Thanks to some (real) luck and hard work, we managed to recover, move off the failing server to a new system, get away from some faulty hardware, and hopefully gain some improved reliability.

The rundown on what happened

Before we started using Rackspace, everything ran on a single machine hosted by Hetzner in Germany. This machine was named rock, and it ran several KVM virtual machines, each with its own IP, hosting Hackage, MediaWiki (a.k.a. www), and ghc.haskell.org.

This server had a RAID1 setup across two drives. The drives were partitioned into a primary 1.7TB partition holding most of the data, plus another small partition, both mirrored between the two drives.

We had degradation on both partitions and essentially lost one drive completely. This is why the system had been so slow over the past week, frequently grinding to a halt.

A quick move

We began investigating when the drives became almost entirely unable to service I/O. We found something really, really bad: one of the drives had dropped out of the array nearly two weeks earlier, and we hadn't noticed!

We had neglected to add SMART and RAID monitoring to our Nagios setup: an enormous and nearly disastrous blunder. We also hadn't been checking the SFTP backup space Hetzner provides, which was close to full the last time we looked. Despite all that, the largely read-only workload seemed to be holding up.
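
To make the monitoring gap concrete: the kind of check we were missing is a small Nagios plugin that inspects the kernel's software-RAID state and raises an alert as soon as an array is running degraded. Below is a minimal, hypothetical sketch in Python; it is not what we have deployed, and the path and output format simply follow common Nagios plugin conventions:

    #!/usr/bin/env python
    # Hypothetical Nagios check for Linux software RAID, based on /proc/mdstat.
    # Nagios plugin exit codes: 0 = OK, 1 = WARNING, 2 = CRITICAL, 3 = UNKNOWN.
    import re
    import sys

    def check_mdstat(path="/proc/mdstat"):
        try:
            with open(path) as f:
                mdstat = f.read()
        except IOError as err:
            print("RAID UNKNOWN - cannot read %s: %s" % (path, err))
            sys.exit(3)

        problems = []
        # Status lines look like "... [2/2] [UU]"; an underscore marks a
        # missing or failed member, e.g. "[2/1] [U_]".
        for array, status in re.findall(r"^(md\d+) :.*\n.*(\[[U_]+\])",
                                        mdstat, re.MULTILINE):
            if "_" in status:
                problems.append("%s degraded %s" % (array, status))

        if problems:
            print("RAID CRITICAL - " + ", ".join(problems))
            sys.exit(2)
        print("RAID OK - all arrays healthy")
        sys.exit(0)

    if __name__ == "__main__":
        check_mdstat()

A similar wrapper around smartctl --health, plus matching Nagios service definitions, would cover the SMART side of things.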

Overall, these were amateur mistakes that almost cost us dearly. We had been so focused on moving things out that we didn't take the time to check the old server's health when we moved Hackage off it a few weeks prior, or to make sure the things still running on it kept working.

We had already been planning this move, but now it obviously couldn't wait. So Davean spent most of his Tuesday migrating all the services to a new VM at Rackspace; we moved the data to our shared MySQL server and got the mail relay running again.

The whole process took close to 12 hours, but overall it went quite smoothly: the remaining drive held up without any read errors or other problems, and service was restored.

Improvements

In the midst of preparing for this move, I migrated a lot of download data - specifically GHC and the Haskell Platform - to a new server, https://downloads.haskell.org. It's powered by a brand new CDN via Fastly and serves the Haskell Platform and GHC downloads and documentation. Don't worry: redirects from the main server are still in place.

We've also finally fixed a few bugs that were stopping us from deploying the new main website. https://try.haskell.org is now deployed as an official instance, and https://new-www.haskell.org has been fixed to use it.

We've also moved most of the MediaWiki data to our consolidated MariaDB server. However, because we were moving data in such a hurry, we didn't put the new wiki in the same DC as the database! As a result, we're taking something of a performance hit. In the meantime, we've deployed Memcached to help offset this a bit. We'll be moving things again soon, after which the wiki should hopefully be faster and much less problematic.
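
MediaWiki has its own memcached support, so the real configuration lives inside the wiki; purely to illustrate the idea of hiding cross-DC database round-trips behind a cache that sits next to the wiki, here is a tiny cache-aside sketch in Python (the address, key scheme, and fetch function are all made up):

    # Illustrative cache-aside pattern: answer from a memcached node in the wiki's
    # own DC when possible, and only pay the cross-DC database round-trip on a miss.
    # (Hypothetical addresses and keys; not our actual MediaWiki configuration.)
    from pymemcache.client.base import Client

    cache = Client(("127.0.0.1", 11211))  # memcached running next to the wiki

    def fetch_page_from_db(title):
        # Placeholder for the expensive query against the remote database
        # (imagine it returns the rendered page as a string).
        raise NotImplementedError

    def get_page(title, ttl=300):
        key = "wiki:page:" + title.replace(" ", "_")
        cached = cache.get(key)
        if cached is not None:
            return cached                     # fast path: stays within the DC
        page = fetch_page_from_db(title)      # slow path: cross-DC query
        cache.set(key, page, expire=ttl)
        return page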

We've also taken the time to archive a lot of our data from rock onto DreamHost, who have joined us and given us free S3-compatible object storage! We'll be moving a lot more data there soon.

Future improvements

We've got several improvements we're going to deploy in the near future:

  • We're obviously going to overhaul our Nagios setup, which was clearly inadequate and is now out of date after all the recent changes.
  • We'll be moving Fastly in front of Hackage soon, which should dramatically increase download speeds for all packages and save us a lot of bandwidth in the long run.
  • Beyond memcached, our MediaWiki instance will be using Fastly as a CDN, since the wiki is one of the most bandwidth-heavy parts of the site besides Hackage. We'll be working towards integrating purge support from MediaWiki with the upstream CDN (a rough sketch of what such a purge hook could look like follows this list).
  • We'll also be moving our new wiki server into the same DC as the database, so it can talk to the MySQL server more efficiently.
  • We're planning some other tech-stack upgrades: moving the remaining Apache servers to nginx, upgrading MediaWiki, etc.
  • We'll be working towards better backups for all our servers, possibly using a solution like Attic and S3 synchronization.
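
As mentioned above, here is a rough sketch of what a CDN purge hook could look like. It is only an illustration: the wiki URL scheme is an assumption, and the real integration would live in a MediaWiki hook rather than a standalone script. Fastly does accept an HTTP PURGE request against a cached URL, which is the mechanism sketched here:

    # Hypothetical CDN purge helper: when a wiki page changes, ask the CDN to drop
    # its cached copy by sending an HTTP PURGE request for that page's URL.
    # (The base URL is a placeholder, not our real configuration.)
    import requests

    WIKI_BASE = "https://www.haskell.org/haskellwiki/"  # assumed URL scheme

    def purge_page(title):
        url = WIKI_BASE + title.replace(" ", "_")
        # Fastly supports purging a single object via the HTTP PURGE method.
        resp = requests.request("PURGE", url, timeout=10)
        resp.raise_for_status()
        return resp

    # e.g. invoked from a MediaWiki edit/delete hook:
    # purge_page("Haskell")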

And some other stuff I'll keep secret.

Conclusion

Overall, the past few weeks have seen several improvements but also a lot of problems, including some horrible mistakes that could have cost us weeks of data if we hadn't been careful. Most of our code, repositories, configurations, and critical data were mirrored at the time - but we still got hit where we were vulnerable: on old hardware we'd had issues with before.

I take responsibility, on behalf of the administration team (as its de-facto lead, it seems), for the outage this past week, which cost us nearly a day of email and main site availability.

The past week hasn't been great, but here's to hoping the next one will be better. Upwards and onwards.

Written by austin on Nov 19 2014, 10:22 PM.