Gaining Confidence in Deploying Your Systems
Deploying a new version of a service can be a challenge for an organisation.
Firstly, downtime can cause loss of sales, violate an SLA, damage a company's image, corrupt customer data, and have a myriad of other consequences.
Moreover, bad rollouts can have a severe impact on team morale. Nobody wants to be the one who “broke production”. Some teams even develop deployment anxiety. This can lead to less frequent, larger sets of changes being rolled out, or to an overly strict manual process that makes deployment extremely time-consuming.
In my opinion, a team should feel confident with their deployment procedure, and there are several ways to build this confidence. That’s why I will publish a series of blog posts describing practices that helped me build confidence when deploying large and complex systems.
Visibility
Knowing when and what is going wrong with a system is the first step towards being confident deploying it. When I need to deploy a system that I am new to, or whose automated test suite I have low confidence in, I try to have dashboards of the most crucial metrics open. Usually I will also keep an eye on the system’s error reports. The same applies when I am rolling out a risky change.
Metrics
Real-time metrics are crucial for knowing how well an application is performing. As is often the case with metrics, it is essential to know which ones to pay the most attention to. Some metrics are business-specific — e.g. the number of orders placed is usually important for an ecommerce site. Other metrics are specific to a part of the system: the number of successful user logins is a great indicator of the health of the authentication system. Infrastructure-level metrics include CPU, memory and network utilisation. For a web application, HTTP response rates grouped by response code are an easy way to quickly determine whether the application is working as expected.
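As a minimal sketch of the last idea — grouping responses by status code — here is a small in-process counter. The class name and `error_rate` helper are my own illustration, not a real library API; in practice you would export these counts to a metrics system such as Prometheus or StatsD.

```python
from collections import Counter


class ResponseMetrics:
    """Counts HTTP responses grouped by status code (illustrative sketch)."""

    def __init__(self) -> None:
        self.by_status: Counter = Counter()

    def record(self, status_code: int) -> None:
        """Record one response with the given HTTP status code."""
        self.by_status[status_code] += 1

    def error_rate(self) -> float:
        """Fraction of recorded responses that were server errors (5xx)."""
        total = sum(self.by_status.values())
        if total == 0:
            return 0.0
        errors = sum(n for code, n in self.by_status.items() if code >= 500)
        return errors / total
```

A sudden jump in `error_rate()` right after a rollout is exactly the kind of signal that tells you to stop and investigate.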
Metrics that are especially useful for quickly spotting a degradation in system performance follow a predictable pattern. For example, your website might usually send 150 HTTP 200 responses per second, or a specific endpoint might respond with a 200 status code 80 times per second. Except at 12:15 UTC every day, when people eat lunch and use your service less, the rate might drop to 50 requests per second. It is important to recognize these patterns in your metrics in order to know what “normal” looks like. Then you can spot abnormal behavior which might indicate that your deployment broke something.
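To make the idea of “normal” concrete, here is a deliberately simple baseline check, assuming a rolling average with a fixed relative tolerance (both the class and its parameters are hypothetical). Note that a naive baseline like this would flag the expected lunchtime dip too; real monitoring systems account for such seasonality.

```python
from collections import deque


class RateBaseline:
    """Tracks a rolling baseline of a per-second rate and flags deviations.

    A minimal sketch: the baseline is the mean of the last `window` samples,
    and a reading is abnormal if it deviates by more than `tolerance`
    (a relative fraction, e.g. 0.5 = ±50%) from that mean.
    """

    def __init__(self, window: int = 60, tolerance: float = 0.5) -> None:
        self.samples: deque = deque(maxlen=window)
        self.tolerance = tolerance

    def observe(self, rate: float) -> None:
        """Feed one observed rate sample into the baseline."""
        self.samples.append(rate)

    def is_abnormal(self, rate: float) -> bool:
        """Return True if `rate` deviates too far from the rolling mean."""
        if not self.samples:
            return False  # no baseline yet, nothing to compare against
        baseline = sum(self.samples) / len(self.samples)
        return abs(rate - baseline) > self.tolerance * baseline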
Error Reporting
Recording metrics for every part of an application is rarely achieved in the real world; depending on your system, it can even be impractical. Error reporting helps reveal parts of a system that might not be important enough to produce metrics for, but can still break functionality. The same goes for parts of the system whose performance is difficult to measure. A good example is sending a welcome email when a customer signs up: measuring whether that email has actually been delivered can be quite complicated.
On the other hand, reporting an error if there was a network problem when communicating with the SMTP server is quite straightforward. That’s why error reporting is a great tool to spot issues that might not be reflected in your metrics.
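A rough sketch of that idea, using Python's standard `smtplib` and `logging` modules (the function name and host parameter are my own, and a real setup would forward the log record to an error tracker rather than just the logger):

```python
import logging
import smtplib

logger = logging.getLogger("signup")


def send_welcome_email(smtp_host: str, message) -> bool:
    """Try to send the welcome email; report failures instead of swallowing them."""
    try:
        # SMTP() connects in the constructor, so DNS/network failures
        # surface here as OSError subclasses.
        with smtplib.SMTP(smtp_host, timeout=10) as smtp:
            smtp.send_message(message)
        return True
    except (smtplib.SMTPException, OSError):
        # A failure here would never show up in delivery metrics,
        # but the error report makes it visible immediately.
        logger.exception("failed to send welcome email")
        return False
```

You will not know from metrics alone that welcome emails stopped going out, but a burst of these reports right after a deploy tells you exactly what broke.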
To make error reporting really useful, it is important to keep the noise to a minimum. If your application keeps reporting multiple errors every second, a newly introduced error can be difficult to spot.
Conclusion
Ideally, the most crucial business, system and infrastructure metrics are tracked over time, and the team can recognize when the metrics indicate normal behavior of the application. Additionally, error reporting is not noisy, so newly introduced errors can be spotted quickly. This is a great first step towards confident deployments.