Lecture: Eight rollouts a day keeping downtime away

The philosophy of shared space in a development environment


Booking.com violates best practice in every conceivable way, taking "agile" to the next level. The talk describes why and how we are doing it, and how we are getting away with it.

Booking.com is an online travel agency which uses a very agile approach to development and operations. Company culture, internal process and technology are mixed together in a way that both enables and requires about eight code rollouts a day, putting a sequence of very small changes live in very short cycle times.

All code changes are encapsulated in a framework for conditional code with instrumentation, which allows us to expose changes gradually to a growing subset of our users, and which tracks the behavioral and system impact of each change.
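
As an illustration of the idea only, and not Booking.com's actual framework, the following minimal sketch shows one common way to gate a change behind an experiment: each user is hashed into a stable bucket, and the change is exposed to every bucket below a configurable threshold. The names `Experiment`, `bucket`, `exposure` and `enabled` are hypothetical and do not come from the talk.

```python
import hashlib

BUCKETS = 10_000  # resolution of the exposure setting (assumption: 0.01% steps)


def bucket(experiment: str, user_id: str) -> int:
    """Map a user to a stable bucket in [0, BUCKETS) for one experiment."""
    digest = hashlib.sha256(f"{experiment}:{user_id}".encode("utf-8")).hexdigest()
    return int(digest, 16) % BUCKETS


class Experiment:
    """Hypothetical wrapper for conditional code with simple instrumentation."""

    def __init__(self, name: str, exposure: float = 0.0, enabled: bool = True):
        self.name = name
        self.exposure = exposure  # fraction of users who see the change, 0.0 to 1.0
        self.enabled = enabled    # switch to turn the change off without a rollback
        self.counts = {"base": 0, "variant": 0}

    def is_exposed(self, user_id: str) -> bool:
        if not self.enabled:
            return False
        # Raising the exposure only adds users: anyone below the old
        # threshold is also below the new one.
        return bucket(self.name, user_id) < round(self.exposure * BUCKETS)

    def run(self, user_id: str, base, variant):
        """Run the variant for exposed users and the base code for everyone else."""
        if self.is_exposed(user_id):
            self.counts["variant"] += 1
            return variant()
        self.counts["base"] += 1
        return base()


if __name__ == "__main__":
    exp = Experiment("new_search_box", exposure=0.10)
    result = exp.run("user-42", base=lambda: "old box", variant=lambda: "new box")
    print(result, exp.counts)
```

Hashing the experiment name together with the user id keeps a user's assignment stable across requests while keeping assignments independent between experiments. A real system would also record behavioral and system metrics per group rather than plain counters.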

Using measurements and metrics we can show quantitatively how certain features affect our performance. Using our framework for rollouts and rollbacks we make sure that measurements for changes come in quickly and accurately. This gives developers and business owners feedback on the real-world impact of any change as quickly as possible, and often the results are counterintuitive or otherwise surprising.

Ideas from business owners, managers and other HIPPOs ("Highest Paid Person's Opinion") fare no better than anyone else's ideas, and so implementing and testing every change as part of the experiments framework has become mandatory, calling for hard data and sound statistics as the basis for any change. That in turn makes it more important to implement an idea quickly, so that it becomes testable, than to implement it efficiently so that it can scale - code quality is totally secondary to implementation speed.
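
The abstract does not say which statistical methods are used. As one hedged example of what "hard data and sound statistics" can mean for an experiment, the sketch below applies a standard two-sided two-proportion z-test to conversion counts from a base group and a variant group. The function name and the sample numbers are made up for illustration.

```python
import math


def two_proportion_z_test(conv_a: int, n_a: int, conv_b: int, n_b: int):
    """Return (z, two-sided p-value) for the difference in conversion rates.

    conv_a / n_a is the base group, conv_b / n_b is the variant group.
    """
    p_a = conv_a / n_a
    p_b = conv_b / n_b
    # Pooled rate under the null hypothesis that both groups convert equally.
    pooled = (conv_a + conv_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    if se == 0:
        # Nobody (or everybody) converted in both groups: no measurable difference.
        return 0.0, 1.0
    z = (p_b - p_a) / se
    # Two-sided p-value from the standard normal distribution.
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value


if __name__ == "__main__":
    # Illustrative numbers only, not real data.
    z, p = two_proportion_z_test(conv_a=100, n_a=1000, conv_b=150, n_b=1000)
    print(f"z = {z:.2f}, p = {p:.4f}")
```

A test like this lets a HIPPO's idea and anyone else's idea be judged by the same measured criterion. It assumes large, independent samples and a single look at the data.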

For successful experiments, that makes a scalable clean-room reimplementation necessary, ideally by a second team that does not consult the poisonous original code but only the specs.

Because rollouts happen so often and so quickly, individual changes are small, and errors are usually spotted very quickly in the diff between the last known good code and the broken rollout. Often it is easier to disable the experiment with broken code, fix it and roll out a repaired version than to roll back. On the other hand, we are now dependent on a quick rollout process to keep diffs small, because our debugging methods are optimized for this way of working: if we ever hold back rollouts for a few days, the first rollout after such a break is critical and will be very hard to debug.

Working in such an environment requires a certain culture, a specific sort of supporting technology and quite some getting used to - all of these are also part of the mix. The resulting culture violates a lot of rules from current "best practice" manuals, but works beautifully and efficiently.

The talk gives a more complete overview of the philosophy behind this, the actual process and the why and how.


Day: 2011-08-21
Start time: 11:15
Duration: 01:00
Track: Devops

