Getting critical systems production-ready: an overview

Andy Gonçalves
Talkdesk Engineering
Dec 17, 2019


Distributed systems are hard to design, implement, and maintain. Systems of this kind are often at the heart of your business, and keeping them running is critical to keeping that business afloat. My team at Talkdesk developed and maintained one such system over the last year, and things got quite bumpy at times. This is a collection of the best practices we have learned from running it in production.

While our experience is with distributed systems that have soft real-time requirements, we believe many of these lessons are valid for other kinds of systems.

The system

For context, the kind of system we are talking about is an actor-model system, deployed on a multi-node environment. It exposes a REST API for the creation of jobs and consumes updates via RabbitMQ messages. Jobs’ results are announced via messages published to RabbitMQ; a result can be produced either by a process triggered by a consumed message or by a job timing out. Job timeouts are controlled by long-lived processes. A database is used to store and retrieve operational data. The system is implemented in Elixir.
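To make this more concrete, below is a minimal sketch of what one of those long-lived job processes could look like. The Jobs.Publisher module and the job fields are hypothetical stand-ins, not our actual implementation:

# One process per job; if no update arrives before the job's timeout,
# the process announces a timeout result instead.
defmodule Jobs.Worker do
  use GenServer

  def start_link(job) do
    GenServer.start_link(__MODULE__, job)
  end

  @impl true
  def init(job) do
    # Returning a timeout here makes handle_info(:timeout, ...) fire if no
    # message is received within job.timeout milliseconds.
    {:ok, job, job.timeout}
  end

  @impl true
  def handle_cast({:update, update}, job) do
    # An update consumed from RabbitMQ resolves the job and publishes its result.
    Jobs.Publisher.publish_result(job, update)
    {:stop, :normal, job}
  end

  @impl true
  def handle_info(:timeout, job) do
    # No update arrived in time: announce the timeout as the job's result.
    Jobs.Publisher.publish_result(job, :timed_out)
    {:stop, :normal, job}
  end
end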

The deployment is done using Kubernetes on Google Cloud Platform’s Kubernetes Engine. The database is a PostgreSQL instance. The actors are distributed across the Kubernetes pods, which makes a multi-node deployment look like this:

Before going to production

Your main concern before going to production should be visibility. The same way you would not drive a car with a blindfold on, you don’t want to have your system in production without knowing how it is behaving. Ideally, you want to know about things going south before your users start complaining, so that in the worst-case scenario you or your on-call team are at least ready to take mitigating measures if the customer experience is seriously impacted.

Three things are required for this:

  • Metrics: both real-time and historical, so you can monitor your system and build timelines when reporting the impact of an issue
  • Alarms: will let you or your on-call team know when an undesired issue is happening
  • Logging: enables you to know where in the code the issue occurred, and the cause of the misbehavior

Metrics

There are three kinds of metrics you want to keep track of: resource, boundary, and business metrics.

Resource

Resource metrics show the health of your infrastructure. They track the resources the system uses, like CPU, memory, and the number of running instances. Keeping track of these metrics helps you understand whether a bug impacted performance, or whether a hardware failure could have been the cause of an issue.

In our system’s scenario, we keep track of each BEAM’s* resources. These are pretty straightforward to obtain, as the VM’s stats are available via Erlang functions whose output can then be exposed to a monitoring system like Prometheus.
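For illustration, here is a rough sketch of the kind of values you can read straight from the BEAM; a Prometheus exporter would typically expose numbers like these on a /metrics endpoint (the module name is just illustrative):

defmodule Metrics.VM do
  # All of these are plain Erlang calls, so collecting them is cheap.
  def snapshot do
    %{
      total_memory_bytes: :erlang.memory(:total),
      process_memory_bytes: :erlang.memory(:processes),
      process_count: :erlang.system_info(:process_count),
      run_queue_length: :erlang.statistics(:run_queue)
    }
  end
end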

Besides typical infrastructure metrics, there are also others that might be of interest to your specific use case. In our distributed system, for instance, a metric we keep track of is how many nodes each node can reach at any given time. This was useful when we ran into buggy behavior caused by a network issue that prevented some machines from seeing the whole topology, so two (or more) clusters of nodes were formed (typically called a netsplit). With this metric, we were able to pinpoint when the netsplit occurred and which nodes were impacted.

*The Erlang VM where Elixir runs. Each Kubernetes pod runs an instance of the BEAM.
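That reachability metric can be as simple as periodically sampling Node.list/0, which returns the nodes currently connected to the local one. A minimal sketch, not our exact implementation:

defmodule Metrics.Cluster do
  # Number of nodes this node can currently see, including itself.
  # Sampling this periodically lets you chart cluster visibility and spot netsplits.
  def reachable_nodes do
    length(Node.list()) + 1
  end
end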

Boundaries

Boundary metrics indicate how the system is interacting with its surrounding ecosystem. These are your typical “number of HTTP errors”, “API response times” and the like. The information you get from these metrics gives you a picture of how the system is/was performing as a piece of the overall architecture.

In our case, the boundaries are the API, RabbitMQ, and the DB: the contact points between systems.
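As an example of capturing an API boundary metric, a plug can time every request and emit the duration and status code. This is a hedged sketch that assumes the :telemetry library is available; the event name is illustrative:

defmodule MyApp.Plugs.RequestMetrics do
  @behaviour Plug
  import Plug.Conn

  def init(opts), do: opts

  def call(conn, _opts) do
    start = System.monotonic_time()

    register_before_send(conn, fn conn ->
      duration_ms =
        System.convert_time_unit(System.monotonic_time() - start, :native, :millisecond)

      # A metrics reporter attached to this event can turn it into
      # "API response time" and "number of HTTP errors" series.
      :telemetry.execute(
        [:my_app, :http, :request],
        %{duration_ms: duration_ms},
        %{status: conn.status, path: conn.request_path}
      )

      conn
    end)
  end
end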

Business

Business metrics are measurements of your system’s behavior that help fulfill a business need. These metrics help you understand how well the business is being handled and, in the event of any issues, how it was impacted from a customer’s point of view.

In our case, for instance, we kept track of jobs created, as we managed them in memory (the actors in the diagrams) and they mapped directly to the real-time needs of the customers. We wanted to keep track of how long the jobs lived, the availability of the resources they used, and on which nodes they were being handled. This helped us build timelines of how the system’s behavior impacted customers:

  • how the job handling time is affected
  • whether we have more/fewer jobs than at the same time on other days
  • whether certain groups of jobs are somehow not being prioritized, or not handled at all
  • SLO and SLA verification
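As an illustration of the job-lifetime part, a business metric can be emitted whenever a job finishes. The event name and job fields below are hypothetical, not from our actual system:

defmodule Jobs.Instrumentation do
  # Emit the job's lifetime and the node that handled it when it finishes,
  # so dashboards can chart handling time per node and per account.
  def job_finished(job, started_at_ms) do
    :telemetry.execute(
      [:my_app, :jobs, :finished],
      %{lifetime_ms: System.monotonic_time(:millisecond) - started_at_ms},
      %{node: Node.self(), account_id: job.account_id}
    )
  end
end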

Using the metrics

In my experience, boundary metrics are what usually grab your attention, business metrics help you see how your system is solving the problem it is meant to solve, and resource metrics are the ones you check on occasion when solving bugs.

Obviously, you might not know everything you need to measure before your first deployment. An approach for having some visibility from the start is: measure your boundaries and resources well, as those are usually straightforward, and measure the most basic business things you might need to know about if there is an incident. Then tweak what you measure as you run into production issues and better understand what you need visibility into. Reports of production incidents (called “post-mortems”) will help you understand this, as explained in a later section.

Alarms

As I said earlier, you want to be aware of issues before clients start complaining. Alarms help you achieve this so that you don’t have to be constantly watching your metrics and logs.

You will typically fire alarms from the code (e.g. an unexpected error popped up that breaks a customer’s flow) and when metrics go over/under certain thresholds (e.g. a RabbitMQ queue has 300 unprocessed messages).
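For the code-triggered kind, one built-in option on the BEAM is OTP’s :alarm_handler module (part of SASL), which can set and clear named alarms; in practice you might instead page through your metrics and alerting stack. The alarm id and description below are illustrative:

defmodule MyApp.Alarms do
  # Raise an alarm when a queue backlog is detected...
  def raise_queue_backlog(queue, count) do
    :alarm_handler.set_alarm({{:queue_backlog, queue}, "#{count} unprocessed messages"})
  end

  # ...and clear it once the backlog is drained.
  def clear_queue_backlog(queue) do
    :alarm_handler.clear_alarm({:queue_backlog, queue})
  end
end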

The main thing about alarms is being conservative. This might not make sense at first glance, because the more alarms you have, the more aware you are of things, right? Wrong. The more alarms you have, the more unfazed you will be by them. If everything is an alarm, you or your on-call team will be constantly distracted, to the point where you simply start ignoring them. Then, when trouble really hits the fan, awareness might take longer to set in.

You want to have as few alarms as possible, but for as many critical things as possible. A recoverable error when committing a query to the DB? Not an alarm; retries are in place. HTTP error count just spiked in the last 5 minutes? Totally an alarm; an issue is probably happening.

Logging

Logs should already be a part of your development process. If they are not, make them so. No pull request should be approved without appropriate logging.

When it comes to helping you with incidents, error-level logs are the ones you will be looking at. Lower levels should not be enabled in production, as they create issues like low disk space and add so much noise that tracing an issue becomes a lot harder.

The important thing to remember about logs is: make them meaningful and attach relevant IDs and error structs. Instead of:

Logger.error("Error consuming message")

you can write something like:

Logger.error("Error consuming message for account #{account_id} for user #{user_id}. Message: #{inspect(message)}")

Adding the account and user IDs greatly helps when searching for logs pertaining to those IDs. Be careful not to include any sensitive user information the system might handle, like names or emails, due to privacy concerns and regulations like GDPR.

An even better approach is using structured logging. This means that you decouple the message from the relevant metadata, enabling you to later query logs by metadata fields. Applied to the previous example, it would look something like this:

Logger.error("Error consuming message. Message: #{inspect(message)}", account_id: account_id, user_id: user_id)

Your logging system would then be able to filter messages by account_id and/or user_id and find this message.
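One note if you are using Elixir’s default console backend: metadata keys are only printed if they are whitelisted in config (by default, none are). A minimal sketch in config/config.exs:

import Config

config :logger, :console,
  format: "$time [$level] $message $metadata\n",
  metadata: [:account_id, :user_id]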

Deploying to production

Deploying to production should follow a process. The main thing to remember here is that your process should be known by the whole team and be used as a checklist of sorts. Do you need to update ENV vars? Scale the application beforehand? Warn your QA team? Document the process, make sure the whole team is aware of it, and follow it strictly when deploying to production.

Two other things to bear in mind:

  • Keep a log of production changes. Note what commit you deployed, at what time, and what accounts/scope it affected. This will help others understand what was deployed and when
  • If your deployment can cause downtime, schedule it to a low-traffic time and make sure to create awareness with other affected parties (both customers and internal teams)

Maintaining the system in production

If you follow the advice from the previous sections, your interaction with the production system will be mostly reactive: when an incident happens, you will make use of your metrics, alarms, and logs to understand what happened and how to prevent it from happening again.

Post-mortems

Post-mortems are reports you write after an incident occurs. The goal is documenting how the system misbehaved, what the root cause was, and the stabilization/resolution measures that were taken. They work as a learning and improvement tool, not as a way to assign blame. At Talkdesk, this is the structure we use for our post-mortems:

  • a summary of the incident
  • who handled the incident
  • a timeline (depicting when: alerts were triggered; relevant metrics started showing signs; customer complaints happened; stabilization/resolution measures were taken; the system went back to regular behavior)
  • the customer impact, both feature and performance-wise
  • root cause analysis
  • stabilization measures detail

By far, the most important section is the root cause analysis, as it gives you insight into the system and how to improve it.

Post-mortems are great learning tools, and the lessons learned from them should be used to update your playbooks.

Playbooks

Your company probably has an on-call team to deal with incidents that happen outside office hours, who may or may not be the people who implemented and maintain the system. Regardless of who is responding to incidents, documentation about how to operate the system is essential. This documentation is the system’s playbook.

Playbooks should contain both general information (like how to restart a node or scale to more nodes), and how to handle specific issues that might happen due to unfortunate circumstances (like the underlying infrastructure becoming degraded).

It is particularly useful to have instructions about each alarm so that an on-call person knows what it means, and what measures should be taken.

Summary

I hope this post has been insightful if you are thinking of starting a new distributed system in your company or maintaining an existing one.

The main takeaways are:

  1. Do not go to production without metrics, alarms, and logging
  2. Have a deployment process/checklist and follow it every time
  3. Write post-mortems and keep an updated playbook of your system
