Today I set a new personal record: Ten full production deployments in a single day. All on a Friday afternoon, and at the end of it all I'm left feeling exhausted.
Why choose “ten deploys per day” as the title of this post? Ten has been something of a magic number to me: After seeing John Allspaw and Paul Hammond's talk on the subject and reading Gene Kim's excellent Phoenix Project, ten seems like the number at which deployments stop really being a problem. It becomes the point at which developers feel like they have full autonomy over shipping their code.
First of all, some context: Around midday on Thursday, my team received a notification that, for legal reasons, certain changes needed to be made across an entire estate of web products by this weekend. Those changes were trivial, but their complexity was compounded by the fact that they needed to be applied separately to ten distinct products. Altering the various codebases was relatively easy, and aside from a slight hiccup with a third-party vendor we were quickly able to get those changes out of the way.
That's not what I want to talk about. Instead, I want to talk about how to go from having barely achieved ten, to making it a boring norm.
- Change management shouldn't feel like an inconvenience or a chore. Most immediately, I felt incredibly frustrated by the number of hoops involved in filling out change requests. For a high-risk deployment, it's useful to know in advance that something is going to happen. In this case, however, I was releasing trivial changes, so having to fill out ten forms in order to release felt like an unnecessary extra step. Maybe this is something which, for further trivial changes, should be automated as part of the deployment process.
- Automate all the things! I was incredibly lucky that a few weeks ago I moved the last of these products to Jenkins. While many were running simple Capistrano wrappers, the remainder used a streamlined set of shell scripts designed to build, run tests and then deploy between dev, test and production environments. Moving the last of the Capistrano deployments to Jenkins cut deployment times down from ~15-25 minutes to ~5-10, which made all the difference. Across these products, I was able to stick to 15 minute deployment windows with minimal hiccups. Essentially this halved the amount of time the whole exercise would have taken. Some of the more ancient deployment processes used by other teams can take hours, so automating those will make everybody happier and will save significantly more time. I'll end this with a thought from Camille Fournier: “When you see a process which is more about 'How?' than 'What?', then ask yourself what can be automated”.
- Automated testing would have helped. Unfortunately, for most of these products the test suites extended no further than syntax-checking tools, meaning that while everything seemed okay I had to manually check each product as it was deployed, usually after having to bust the cache. With a high volume of deployments, this manual step feels unnecessary. For low-volume deployments the usual spiel applies: functional tests allow for quicker and safer development, with manual testing boiling down to a simple "rubber stamp". Again, ensuring that these are instituted across the board will make everybody happier, especially when faced with days as challenging as today.
- Always ensure that there is a fallback for when third-party resources become unavailable. While this might seem incredibly obvious, in my experience not many people really pay attention to it. A few months ago the Node.js world was rocked when the author of a very simple module unpublished it from npmjs.org. This immediately broke a lot of build processes and caused a lot of confusion, but the longer-term consequences were potentially more severe, as anybody could have reclaimed that namespace, which the Snyk team have written about in detail. However, security and practicality concerns aside, what about when the network fails? What about when internal DNS is down for maintenance? Tests, builds and deployments should remain unaffected by this, and having to rely on developer heroics in order to get some code deployed to production is flat-out scary! (Ask me over a drink sometime!)
- Ensure that there's a fallback plan for when external resources fail. Am I repeating myself? Maybe, but this time I'd like to ask: What if your authentication service (or git repository, or CI server) goes down? In my case I needed to make a late change and unfortunately LDAP had gone wrong with nobody around to fix it, meaning that I effectively had no access except to a single Jenkins box. Suddenly that service has become a single point of failure, and when there are multiple streams of deployments ongoing, that outage can become a huge, stressful blocker. While GitHub was under attack earlier this year, many were at a standstill as they did not have fallbacks for their SCM and deployments. I'd wager that a great many were quickly able to set up alternatives on Bitbucket thanks to the distributed nature of git, but even then, it should not have happened. N+1 resiliency is nothing new, but it seems to have been forgotten in the age of SaaS and SOA.
- I couldn't have done it alone. From doing the original work and getting it all signed off, to preparing everything for release, I simply could not have done any of this without the team. Ten deployments in a day might seem like a solo feat, but having the confidence and trust that what I was releasing was up to my own high standards saved enough time, and reassured me enough, to let this happen.
- Be prepared to delay a deployment if something is wrong. This one is definitely my fault: I take automated deployments for granted, and if something goes wrong with the automation (likely for one of the reasons above) I will immediately attempt to fix it instead of moving on to the next deployment when its window rolls around. Given that it's all automated, it's probably not too late to back out!
- What's so special about Fridays anyway? Oh no he didn't! I am going to continue adhering as strictly as possible to Read-Only-Fridays, but I feel like the security and processes in place today are close to the point at which we won't need to worry about what day of the week it is, or indeed what time it is. If it's not okay to break a system on a Friday because it'll require you to stay late, why is it okay to do the same on the Monday or Wednesday of that week? The day of the week seems like a scapegoat for dysfunctional development processes.
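To make the two fallback points above a little more concrete, here's a minimal sketch in Python of the "try the primary, then fall back" idea. It is an illustration only, not what my deployments actually run: the function name `first_available`, the source names and the `fake_fetch` helper are all made up for the example, and in real life `fetch` would be whatever pulls a package, clones a repository or talks to your auth service.

```python
from typing import Callable, Iterable, List, Tuple, TypeVar

T = TypeVar("T")


def first_available(sources: Iterable[str], fetch: Callable[[str], T]) -> Tuple[str, T]:
    """Try each source in order; return (source, result) for the first that works."""
    failures: List[str] = []
    for source in sources:
        try:
            return source, fetch(source)
        except Exception as exc:  # deliberately broad: any failure means "try the next one"
            failures.append(f"{source}: {exc}")
    # Every source failed, so surface all of the reasons at once.
    raise RuntimeError("all sources failed: " + "; ".join(failures))


if __name__ == "__main__":
    # Hypothetical sources: a public registry, an internal mirror and a local cache.
    def fake_fetch(source: str) -> str:
        # Simulate the public registry being unreachable.
        if source == "public-registry":
            raise ConnectionError("registry unreachable")
        return f"package fetched from {source}"

    used, result = first_available(
        ["public-registry", "internal-mirror", "local-cache"], fake_fetch
    )
    print(used, "->", result)
```

The ordering of the list is the whole design: the build only cares that something in the chain answers, and if nothing does, the error lists every source that was tried so nobody has to guess which dependency was the single point of failure.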
It wasn't all sunshine and roses, but ultimately I'm pretty proud of having accomplished this. My first full-scale deployment was a whole-day ordeal which involved several people, a snowflake environment and a week of clean-up for the whole team. To have gone from that to having achieved the magical number of ten in a single afternoon feels like we've come a very long way.
For next steps, I'm hoping to tackle some of the points I've outlined above and to take the lessons we've learned back to other teams, and maybe even to push things further still. I'm particularly excited by the prospects of ChatOps and canary deployments using containers, so watch this space. Onwards and upwards!