In the early stages of a product, deployment seems like a trivial issue, a detail with an easy solution. However, as a product grows, the deployment steps become more complicated. Rather than pushing code changes to a single server, code pushes now require coordination across multiple servers and server roles. To further complicate matters, cloud servers are ephemeral (they last only a short time), which means that the number of servers and their host names change frequently. In short, deployment becomes more challenging.
Common Deployment Problems in the Cloud
Cloud deployment becomes more complicated as the product grows. To achieve a Realscale architecture, the application must be able to scale out to handle traffic bursts. Add the need to deploy fixes and new features multiple times per day or week, and the deployment process becomes difficult to manage because the servers are a moving target. Let’s look at some of the common problems of deploying a cloud native application with an outdated deployment strategy:
Scaling out becomes difficult: Unless the deployment strategy supports adding new servers on demand, teams cannot scale out without manual work. This is a common problem for teams transitioning from a fixed server mindset, where all servers are long-lived rather than scaled on demand. As long as deployment depends on manual steps, additional server capacity cannot be added automatically when traffic increases.
Manual recovery from server outages: Cloud native applications must be resilient to server outages, with automated deployment to the replacement server. If manual deployment steps are required to replace a failed server, then recovery times will go from minutes to hours or days.
Releases only during maintenance windows: If deployments take too long, they will require scheduled downtime, which prevents teams from deploying features or bug fixes immediately.
Runtime errors during deployment: All servers must run exactly the same version of the codebase. Otherwise, some requests may yield different results than others, causing problems that seem to happen “randomly” and are difficult to troubleshoot. This commonly happens during long deployment processes, or when a large number of servers requires a rolling deployment.
Unpredictable deployments: When deployment processes vary across environments, a deployment may succeed in one environment while unknown issues arise in the UAT or production environment, where they may be more expensive or too late to fix.
Deployment failures: If a bug is discovered after a deployment and the deployment cannot be rolled back, the application is forced to run the buggy version until a fix is applied, rather than returning to a known stable version. In the meantime, the application may be unavailable or operating in a degraded state.
Big-bang integrations: If applications are deployed into production infrequently, each release bundles a larger number of changes, and the chance of errors after deployment increases dramatically.
Goals for Cloud Deployment
To overcome these problems, new deployment goals are required that fit better in a cloud native environment.
Deployment Goal #1: Minimize Downtime
There are several approaches that can reduce the downtime for an application, including:
- Removing servers from load balancers, performing the deployment steps, then re-adding the servers back to the load balancers (sometimes called “serialized deployments”). This is a common strategy, as it works for nearly every combination of programming language, framework, and server environment
- Deploying updates to multiple servers at a time (sometimes called “parallelized deployments”). Note that this only works if the underlying services support zero downtime deployments, to prevent having limited or no available servers during the deployment window
- Executing multiple, independent deployment steps in parallel
- Swapping auto scale groups running older versions of the application with server groups containing the latest version
- Swapping entire infrastructure stacks, including all necessary infrastructure components, for a fresh stack that is running the latest version of the application (sometimes called “blue-green deployment”, “stack swapping”, or “immutable stacks”)
Not every approach will be a good fit for the application and the team process. Selecting the appropriate approach to minimize downtime requires careful thought and automation support.
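As an illustration of the first approach in the list above, here is a minimal sketch of a serialized deployment. It assumes an in-memory LoadBalancer class standing in for a real load balancer API, and the names Server, LoadBalancer, serialized_deploy, and min_in_rotation are hypothetical. The actual deployment steps are reduced to a version assignment.

```python
from dataclasses import dataclass, field


@dataclass
class Server:
    name: str
    version: str


@dataclass
class LoadBalancer:
    """In-memory stand-in for a real load balancer API (hypothetical)."""
    in_rotation: list = field(default_factory=list)

    def remove(self, server):
        self.in_rotation.remove(server)

    def add(self, server):
        self.in_rotation.append(server)


def serialized_deploy(lb, servers, new_version, min_in_rotation=1):
    """Update one server at a time while the others keep serving traffic."""
    for server in servers:
        # Refuse to proceed if pulling this server would leave too few in rotation.
        if len(lb.in_rotation) - 1 < min_in_rotation:
            raise RuntimeError("not enough servers to deploy without downtime")
        lb.remove(server)              # stop sending traffic to this server
        server.version = new_version   # placeholder for the real deployment steps
        lb.add(server)                 # return the server to rotation


servers = [Server("web-1", "1.0"), Server("web-2", "1.0"), Server("web-3", "1.0")]
lb = LoadBalancer(in_rotation=list(servers))
serialized_deploy(lb, servers, "1.1")
```

The capacity check is one possible design choice: it makes the script fail fast rather than take the last serving instance out of rotation.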
Deployment Goal #2: Rollback on Failure
Once processes have been automated, most deployments will work without error. However, the occasional deployment failure may occur due to a severe bug, a failed deployment step, or an infrastructure issue. Without the ability to roll back to a stable version of the application, applications may experience prolonged downtime.
If something goes wrong, the team must be able to revert to the previous, stable version easily. The rollback process often requires coordinating a variety of steps, including:
- Quickly reverting to a previous stable version of the application and restarting associated processes
- Updating any DNS entries to reference previous versions of infrastructure resources and services
- Reverting recent database migration scripts/changes to match the previous stable version
Applications that are not designed to roll back to previous versions make this process more difficult. Be sure to build rollback support into the application, and perform rollback tests in pre-production environments to build confidence in the rollback and database migration scripts.
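The coordination described above can be sketched as follows. This is a simplified illustration, not a definitive implementation: the state dictionary, the Migration class with its up and down callables, and the deploy_with_rollback function are all hypothetical, and a set of table names stands in for a real database schema.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Migration:
    name: str
    up: Callable[[set], None]     # applies the schema change
    down: Callable[[set], None]   # reverts the schema change


def deploy_with_rollback(state, version, migrations, health_check):
    """Apply migrations and switch versions; undo both if anything fails."""
    previous_version = state["version"]
    applied = []
    try:
        for migration in migrations:
            migration.up(state["schema"])
            applied.append(migration)
        state["version"] = version
        if not health_check(state):
            raise RuntimeError(f"health check failed for {version}")
        return True
    except Exception:
        # Revert migrations in reverse order, then restore the stable version.
        for migration in reversed(applied):
            migration.down(state["schema"])
        state["version"] = previous_version
        return False


state = {"version": "1.0", "schema": {"users"}}
add_orders = Migration(
    "add_orders",
    up=lambda schema: schema.add("orders"),
    down=lambda schema: schema.discard("orders"),
)
# Simulate a failed release: the health check always reports a problem.
succeeded = deploy_with_rollback(state, "1.1", [add_orders], lambda s: False)
```

After the failed attempt, the state is back at version 1.0 with the original schema. Running this kind of scenario in a pre-production environment is one way to exercise the rollback path before it is needed.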
Deployment Goal #3: Script Everything
Any step in the deployment process that isn’t scripted is a step that can introduce human error. As the deployment process is established, build scripts that perform repetitive tasks. Doing so will prevent skipped steps or errors in typing that can sabotage a deploy.
Deployment scripting may be handled using server configuration automation tools or build automation tools such as Jenkins, Codeship, Bamboo, GoCD, and others.
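Whatever tool is chosen, the core idea is an ordered list of steps that always runs the same way and stops at the first failure. A minimal sketch, assuming each step is a plain Python callable (in practice a step would usually invoke a shell command or a tool-specific task); run_steps and the step names are hypothetical:

```python
def run_steps(steps):
    """Run named steps in order; stop at the first failure.

    Returns the names of the completed steps and an error message (or None).
    """
    completed = []
    for name, action in steps:
        try:
            action()
        except Exception as exc:
            return completed, f"{name} failed: {exc}"
        completed.append(name)
    return completed, None


log = []
steps = [
    ("fetch_release", lambda: log.append("fetched release artifact")),
    ("install_dependencies", lambda: log.append("installed dependencies")),
    ("restart_app", lambda: log.append("restarted application")),
]
completed, error = run_steps(steps)
```

Because the steps live in a list rather than in someone's memory, none can be skipped or run out of order, and the list itself can be reviewed like any other code.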
Deployment Goal #4: Version Control Everything
Code should be versioned and tagged upon release to ensure a complete snapshot of the application is available at any time. Additionally, version and tag deployment scripts alongside application releases. This will provide insight into changes over time and allow for application rollback using the proper set of deployment scripts. Script versioning will also serve to capture the change history of the deployment process over time.
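One way to picture this pairing is a release manifest that records, for every release, the application tag together with the tag of the deployment scripts used for it. The sketch below is an assumption-laden illustration: the Release class, the tag names, and rollback_target are hypothetical, and a real setup would read the tags from the version control system.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Release:
    app_tag: str              # tag of the application code
    deploy_scripts_tag: str   # tag of the deployment scripts used for that release


# Releases in the order they were deployed (illustrative values).
RELEASES = [
    Release("v1.0.0", "deploy-v3"),
    Release("v1.1.0", "deploy-v4"),
    Release("v1.2.0", "deploy-v4"),
]


def rollback_target(releases, current_app_tag):
    """Return the release before the current one, with its matching scripts."""
    tags = [release.app_tag for release in releases]
    index = tags.index(current_app_tag)
    if index == 0:
        raise ValueError("no earlier release to roll back to")
    return releases[index - 1]


target = rollback_target(RELEASES, "v1.1.0")
```

Here rolling back from v1.1.0 selects v1.0.0 together with the older deploy-v3 scripts, rather than running the newer scripts against an older application.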
Deployment Goal #5: Continuous Integration and Deployment
As the application grows, it is important to know when changes to the code break the application. Automated test coverage is a great way to ensure that the application is functioning as expected and that fixed bugs don’t regress. By automating the build and integration testing of the application whenever code changes are committed to a central branch, teams can know immediately if those changes broke any tests. This technique is known as continuous integration (“CI”). CI builds upon the practice of using automated tests, automated deployment scripts, and version control for application deployment. It has become commonplace for many software product companies.
Continuous delivery is the practice of automating the complete process of building and deploying a release to a specific environment that may require additional review or acceptance before final deployment. The goal is to deploy early and often to minimize the number of changes between releases, thus avoiding the “big bang” deployment problems of major releases.
Continuous deployment differs from continuous delivery in that the goal is to fully automate the flow from code change to production deployment through a series of automated processes within each application environment. Although a feature may be deployed into production, its exposure may be limited to internal teams, select customers, or all customers through the use of feature toggling.
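A feature toggle of this kind can be sketched in a few lines. The example below is a simplified illustration under stated assumptions: the FeatureToggle class, the exposure levels "off", "internal", "select", and "all", and the user dictionary keys are hypothetical, and the levels are treated as progressively wider audiences (internal users also see features released to select customers).

```python
from dataclasses import dataclass, field


@dataclass
class FeatureToggle:
    name: str
    exposure: str = "off"   # "off", "internal", "select", or "all"
    allowed_customers: set = field(default_factory=set)

    def is_enabled(self, user):
        """Decide whether this user should see the feature."""
        if self.exposure == "all":
            return True
        is_internal = user.get("internal", False)
        if self.exposure == "internal":
            return is_internal
        if self.exposure == "select":
            # Assumption: internal users keep access at the "select" stage.
            return is_internal or user.get("customer_id") in self.allowed_customers
        return False


toggle = FeatureToggle("new_checkout", exposure="select", allowed_customers={"acme"})
employee = {"internal": True}
pilot_customer = {"customer_id": "acme"}
other_customer = {"customer_id": "globex"}
```

With this configuration the feature is visible to the employee and the pilot customer but not to the other customer. Widening the audience is then a configuration change rather than a new deployment.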
Deployment Goal #6: Repeatable Deployment Across Environments
Applications commonly have more than one environment:
- Development/Integration – where developers deploy the most recent features for integration and developer testing
- QA/UAT – where internal testing and customer acceptance testing (where applicable) verify quality and expected behavior
- Staging/Pre-Production – mirrors a production environment, including copies of production data when possible to surface any final issues or data migration failures
- Production – the customer environment with production data
As the latest changes to the application move forward through each environment, different teams qualify the changes to ensure a stable release into production. If the cloud infrastructure, resources, or settings vary greatly between environments, bugs may be introduced that are difficult to troubleshoot (or that are missed completely until a production release). To avoid this possibility, apply version control to the infrastructure automation scripts as well as the deployment scripts. This will prevent differences between environments from interfering with QA testing, particularly in environments closer to production, where troubleshooting and fixing a bug may become more expensive.
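A simple automated check can surface such differences before they cause trouble. The sketch below compares two environment descriptions and reports every setting that differs, apart from an explicit list of settings that are allowed to vary. The config_drift function, the setting names, and the values are hypothetical; in practice the descriptions would come from the versioned infrastructure scripts.

```python
def config_drift(reference, candidate, allowed=()):
    """Return settings that differ between two environments.

    Keys listed in `allowed` are expected to vary and are ignored.
    """
    keys = (set(reference) | set(candidate)) - set(allowed)
    return {
        key: (reference.get(key), candidate.get(key))
        for key in sorted(keys)
        if reference.get(key) != candidate.get(key)
    }


# Illustrative environment descriptions.
production = {
    "runtime": "python3.12",
    "db_engine": "postgres15",
    "cache": "redis7",
    "instance_count": 6,
}
staging = {
    "runtime": "python3.11",
    "db_engine": "postgres15",
    "cache": "redis7",
    "instance_count": 2,
}

drift = config_drift(production, staging, allowed={"instance_count"})
```

Here the instance count is permitted to differ, so the only reported drift is the runtime version, which is exactly the kind of mismatch that can hide a bug until the production release.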
Deployment Goal #7: Document the Process
While the goal is to automate every step of the process, sometimes things just go wrong. Be sure to document each step of the deployment process, including common manual or one-off tasks required to do things such as rebuilding CDN content, resetting caches, and refreshing system settings.