Last week we pulled off two major changes to the infrastructure which runs Lanyrd, and both with no downtime - but what exactly did we do, and why?
The first major change was perhaps the tougher of the two - we changed the main database that Lanyrd runs on from MySQL to PostgreSQL.
For those unfamiliar with the intricacies of databases, this is tough because it means converting the entire site's data to a new format in one go - and if it goes wrong, we have to roll back, with a small risk of losing some data if things aren't monitored closely.
The main reason for this move was database features - MySQL lacks a full transaction model and fast column addition, and it's quite bad at using multiple CPU cores, whereas PostgreSQL handles all of these well - allowing us to make changes to the site in the future with no downtime or read-only mode at all.
The second major change was us moving from Amazon Web Services - EC2 and RDS in particular - to running on dedicated hardware, which we rent through Softlayer.
There's nothing wrong with AWS - indeed, we still run a staging environment there - but our database benefits greatly from the low latencies of physical disks, and there aren't very many hosted PostgreSQL services on EC2 that fit our needs.
As part of the move, we also rearranged our services so that we have no single point of failure - everything is either running on multiple servers (like our Django code) or has a warm standby (the databases and load-balancer).
Both changes required us to stop saving new data to Lanyrd during the move, so we opted to do them at the same time. There's some risk involved here, of course - doing two major changes at once requires more careful planning and rehearsal - but we want to minimise the time we spend in read-only mode (in fact, this is only the second time we've used it since Lanyrd's launch).
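For the curious, here is a rough idea of what a read-only mode can look like. This is a minimal sketch rather than our actual implementation: it assumes a plain WSGI wrapper, and the names `ReadOnlyMiddleware`, `is_read_only`, `demo_app` and `call` are made up for illustration. Requests that only read data pass through untouched, while anything that would write gets a polite 503.

```python
from wsgiref.util import setup_testing_defaults

# HTTP methods that should never change stored data.
SAFE_METHODS = {"GET", "HEAD", "OPTIONS"}


class ReadOnlyMiddleware:
    """Wrap a WSGI app and refuse write requests while read-only mode is on."""

    def __init__(self, app, is_read_only):
        self.app = app
        # Hypothetical hook: any callable returning True while the site is
        # read-only (it could check a settings flag, a file, a Redis key...).
        self.is_read_only = is_read_only

    def __call__(self, environ, start_response):
        method = environ.get("REQUEST_METHOD", "GET").upper()
        if self.is_read_only() and method not in SAFE_METHODS:
            body = b"The site is temporarily read-only; please try again shortly.\n"
            start_response("503 Service Unavailable", [
                ("Content-Type", "text/plain; charset=utf-8"),
                ("Content-Length", str(len(body))),
                ("Retry-After", "600"),
            ])
            return [body]
        return self.app(environ, start_response)


def demo_app(environ, start_response):
    """Stand-in for the real site: always answers 200."""
    start_response("200 OK", [("Content-Type", "text/plain")])
    return [b"ok\n"]


def call(app, method):
    """Send one fake request through a WSGI app; return (status, body)."""
    environ = {}
    setup_testing_defaults(environ)
    environ["REQUEST_METHOD"] = method
    captured = {}

    def start_response(status, headers):
        captured["status"] = status

    body = b"".join(app(environ, start_response))
    return captured["status"], body
```

Wrapping the application as `ReadOnlyMiddleware(demo_app, lambda: True)` keeps pages browsable while refusing POSTs. A real site would more likely show a friendly page than a bare 503, and would also need to stop background jobs from writing.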
During the week before the move, I scripted as much of the move process as possible, giving us one command that would sync our main database, our Redis data and our search data. I then did several dry runs onto a test environment, with Tom and Simon helping out with checking and ideas over the week.
We caught a few bugs, mostly with the database conversion. The conversion was performed by a dump converter I'd written myself, and we had a few problems with escaping and missing indexes, but those were both spotted by the eagle-eyed Lanyrd team during the testing phase.
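To give a flavour of the escaping problem: MySQL dumps use backslash escapes inside string literals, whereas PostgreSQL (with `standard_conforming_strings` on, the default since 9.1) treats backslashes literally and expects single quotes to be doubled. The sketch below is not the converter we used - the function names are invented for illustration and it only handles the body of a single string literal - but it shows the kind of translation involved.

```python
# Backslash escapes MySQL recognises inside string literals.
MYSQL_ESCAPES = {
    "0": "\x00", "'": "'", '"': '"', "b": "\b", "n": "\n",
    "r": "\r", "t": "\t", "Z": "\x1a", "\\": "\\",
    # Outside pattern matching, \% and \_ keep their backslash.
    "%": "\\%", "_": "\\_",
}


def mysql_unescape(body):
    """Turn the inside of a MySQL string literal into the raw text it means."""
    out = []
    i = 0
    while i < len(body):
        ch = body[i]
        if ch == "\\" and i + 1 < len(body):
            nxt = body[i + 1]
            # Unknown escapes simply drop the backslash, as MySQL does.
            out.append(MYSQL_ESCAPES.get(nxt, nxt))
            i += 2
        elif ch == "'" and body[i + 1:i + 2] == "'":
            # A doubled quote is also a valid way to write one quote.
            out.append("'")
            i += 2
        else:
            out.append(ch)
            i += 1
    return "".join(out)


def pg_literal(value):
    """Quote raw text as a PostgreSQL literal (standard-conforming strings)."""
    if "\x00" in value:
        raise ValueError("PostgreSQL text values cannot contain NUL bytes")
    return "'" + value.replace("'", "''") + "'"


def convert_literal(mysql_body):
    """Convert the body of one MySQL string literal to a PostgreSQL literal."""
    return pg_literal(mysql_unescape(mysql_body))


print(convert_literal(r"O\'Reilly"))  # prints 'O''Reilly'
```

Decoding to raw text first and re-quoting afterwards, rather than doing search-and-replace on the dump, avoids the classic mistake of mangling an escaped backslash that happens to sit next to a quote.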
We'd analysed our traffic and picked Tuesday morning as the time that would affect the fewest people. One of the advantages of being a UK startup is that the time difference means that the US and Canada are asleep during our morning, giving a nice low-traffic window that's still in working hours.
With that set, the Monday was spent on a final dry run and a quick load test of the new site using a traffic-replaying system we have, and then the move took place on the Tuesday, at 10am.
Apart from one minor hiccup with getting read-only mode turned on, the move went quite smoothly, and we were back up and out of read-only mode before midday. Lanyrd stayed available throughout the move, and read-only mode did its job admirably.
With only one more minor problem over the past week - which we were able to deal with swiftly - things do seem to have gone rather well, and we're eager to start putting our more powerful servers and new database features to good use!
(If you're interested in more precise technical details on how the move went, there's a more in-depth article on Andrew's blog)