Paving the Road to Production at Coinbase: QCon Plus Q&A

As Coinbase scaled both their number of engineers and customers, they needed more projects, faster iteration, and more control over their growing infrastructure. By looking at what developers were doing and trying to help them, they developed an in-house deployment tool and created a culture of self-service.

Graham Jenson, infrastructure tech lead at Coinbase, spoke about how they paved the road to production at QCon Plus 2020.

On Heroku (used at the time), changing configurations, setting up new resources, or creating new projects required access to the admin panel, as well as knowledge of how to use it. This was a blocker to creating and deploying projects, Jenson mentioned, so it was slowing the company down. “We wanted our developers to be productive, we wanted them to not be blocked,” he said.

To scale deployment, they wrote down all the steps developers needed to take to get a new project deployed. Then they focused on removing blockers and simplifying steps:

The biggest blocker was Heroku, but what steps could replace Heroku? What would that look like, and how did we want developers to interact with it? We wanted to give developers more control over configuration and to build tools that help them deploy safely and securely, but what specific steps did we want/need them to take to go from commit to release? Then we had to build those tools, systems, and processes.

Coinbase wanted to create a culture of self-service where developers could manage their own configuration, resources and deployments. Jenson mentioned that they wanted deploys to have no ceremony, to be very safe, and encourage developers to deploy often. They wanted to invest in usability so that the information that was needed would be easily accessible.

Jenson explained why Coinbase decided to develop their own deployment tool, called Codeflow:

With Codeflow our developers could commit their code, have it automatically built, and be ready to deploy within minutes. We built it in-house because of our unique security requirements, but it was designed to leverage a lot of familiar technologies, like Docker, to make it easy for developers to use.
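Codeflow's internals are not public, but the commit-to-ready-to-deploy step Jenson describes can be sketched with plain Docker commands. The registry URL, image name, and tagging scheme below are illustrative assumptions, not Coinbase's actual setup:

```shell
# Hypothetical sketch of a Docker-based build step triggered on commit;
# registry and image names are illustrative, not Coinbase's.

# Tag the image with the short commit SHA so each build is traceable
# back to the commit that produced it.
TAG=$(git rev-parse --short HEAD)

# Build the project's Docker image from the repository's Dockerfile.
docker build -t registry.example.com/my-service:"$TAG" .

# Push it to an internal registry, where it sits ready to deploy.
docker push registry.example.com/my-service:"$TAG"
```

Tagging images by commit SHA, rather than `latest`, keeps the mapping from commit to deployable artifact unambiguous, which matters when deploys need to be audited or rolled back.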

The most important task is to look at what developers are doing and try to help them, Jenson argued. A lot of projects just shift complexity around and don’t actually remove anything, or worse, they try to fix the wrong issue altogether. Jenson suggested that one clear way to find the right problems to fix is to ask a developer the steps they go through to get something deployed, then start simplifying and fixing from there.

InfoQ interviewed Graham Jenson about building and enhancing paved roads for deployment.

InfoQ: How would you describe the concept of paved roads to production?

Graham Jenson: A “road” is all the steps a developer has to take to get from committing their code to getting it deployed in production. Smoothing out this road, “paving” it, is done by identifying and removing steps that are slowing or blocking developers from seeing their work realized. The reward for this work is that the pace of iteration is increased and, somewhat counterintuitively, the risk to deploy code is reduced.

I like the metaphor of “paved roads” when talking about release infrastructure because it can help convey that although the cost is high to build and maintain them, roads are shared by many so improving them has massive value.

InfoQ: How did the deployment road look when Coinbase started?

Jenson: Deployment in early Coinbase used Heroku, a hosted Platform as a Service (PaaS). Heroku handled most of the complexity of deployment. You would build an application to match their specification of how a Heroku application should look, typically following "The Twelve-Factor App" guidelines. Then you could use their admin panel to configure the application's datastores, environment variables, and security, and deploy with a simple "git push" to a remote branch.

This was all fine with only a small number of developers, but as we grew it caused issues with deployment.
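The Heroku workflow Jenson describes can be sketched, roughly, as follows. The app name and config values are illustrative placeholders:

```shell
# Hypothetical sketch of the Heroku workflow described above;
# app name and config values are illustrative.

# Create the Heroku app once; this also adds a "heroku" git remote.
heroku create my-service

# Twelve-Factor style: configuration lives in environment variables,
# set via the CLI or the admin panel rather than in code.
heroku config:set LOG_LEVEL=info

# Deploying is just a git push to Heroku's remote; the platform
# builds and releases the new version automatically.
git push heroku main
```

The appeal, and the limitation, is the same: almost everything goes through Heroku's CLI and admin panel, which is simple for a small team but becomes a bottleneck when access and knowledge of that panel gate every new project.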

InfoQ: What challenges did you face when the number of projects at Coinbase grew? How did you deal with them?

Jenson: Coinbase has been growing constantly during the five years I have been working there. With both the number of projects and the number of engineers ever increasing, our deploy pipelines are always being stretched somewhere. Our constant playbook is to look at what developers are doing and try to automate tasks or simplify them.

The most recent challenge has been that scaling the number of projects across multiple repositories was getting cumbersome. Updating thousands of projects was taking a lot of effort even for minor upgrades. This led us to create a monorepo where we could have all our projects in a single repository so large that changes could happen atomically in a single commit. This has massively reduced the number of steps developers need to take if they update some fundamental component across the company.

InfoQ: You developed your own tool for deployment, called Codeflow. What does Codeflow do, and how did you migrate from Heroku to Codeflow?

Jenson: Codeflow was a tool we built to replace Heroku. It is the self-service tool for developers to configure and deploy their projects to the cloud. The goal was to have all projects in Codeflow so that we would have a single paved road to invest in so improvements would scale with the company.

Migrating from Heroku to Codeflow was scary. It was a big shift, not only in how we did things, but also in moving our main application from Heroku to AWS, so there were layers of risk. To mitigate them, we made a lot of lists: what to do before, during, and after; who was responsible for each action item; and what to do at any step if something went wrong. Sharing all these lists got us input from the stakeholders and helped us not miss anything.

InfoQ: What did Coinbase do to create a culture of self-service?

Jenson: Culture isn’t created with software, but it can be hindered by it. So the first thing was to plan on removing any blockers that were stopping that culture from spreading. That meant first removing Heroku (which had become a significant blocker at that point) from our stack and moving onto something we could control. Once Heroku was gone, we could migrate to a new suite of tools that would encourage self-service, frequent deploys, and scalable deployments. Then the hard work would begin: gaining developers’ trust that these tools could work and could be relied on.