A first glimpse of production constraints for developers

Bertrand Florat - May 24, 2021

(This article has also been published on DZone)

Sahara and sandbox

In most organizations (except truly DevOps teams), developers are not allowed to access the production environment for stability, security, or regulatory reasons. A major drawback of this approach is the mental distance created between developers and the real world. Likewise, monitoring is usually managed only by operators, and developers receive little feedback except when they must fix application bugs (ASAP, of course). As a result, most developers have very little idea of what a real production environment looks like and, more importantly, of the non-functional requirements needed to write production-proof code.

Involving developers in resolving production issues is beneficial for two main reasons:

It is highly motivating to see tangible evidence of a real running system on a large infrastructure (data centers, clusters, SAN, etc.) and gain insights about the performance or business metrics of their applications (number of transactions per second, number of concurrent users, etc.). It’s also common for developers to feel unaccountable as they are rarely directly contacted when an outage occurs.
It can significantly improve the quality of delivered code by guiding design considerations for operational aspects like logs, monitoring, performance, and integration.

What Do Developers Often Misunderstand about Production?

Concurrency Is Omnipresent

Production is highly concurrent, whereas development is mostly single-threaded. Concurrency can occur among threads of the same process (e.g., on an application server) or across different processes running locally or on other machines (e.g., among n instances of an application server running across different nodes). This concurrency can cause various issues like starvation (slowdowns when concurrently waiting for a shared resource), deadlocks, or scope issues (data overriding).

What can I do?

Perform minimal stress tests on the DEV environment or even on your own machine using injectors like JMeter or Gatling. When using frameworks like Spring, make sure to understand and correctly apply scoping best practices (for instance, don't store mutable state in a Singleton).
Simulate concurrency using breakpoints or sleeps in your code and check the context of each stalled thread.
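
To make the stateful-singleton pitfall concrete, here is a minimal Python sketch. The HitCounter class and the hammer helper are hypothetical names invented for this illustration, not part of any framework. A single shared object is updated by several threads; the lock protects the read-modify-write, and without it some updates may be lost under load.

```python
import threading
from concurrent.futures import ThreadPoolExecutor


class HitCounter:
    """A shared (singleton-like) object holding mutable state."""

    def __init__(self) -> None:
        self._count = 0
        self._lock = threading.Lock()

    def increment(self) -> None:
        # The lock makes the read-modify-write atomic across threads.
        with self._lock:
            self._count += 1

    @property
    def count(self) -> int:
        with self._lock:
            return self._count


def hammer(counter: HitCounter, threads: int, hits_per_thread: int) -> int:
    """Call increment() from several threads and return the final count."""

    def worker() -> None:
        for _ in range(hits_per_thread):
            counter.increment()

    # Leaving the 'with' block waits for every submitted worker to finish.
    with ThreadPoolExecutor(max_workers=threads) as pool:
        for _ in range(threads):
            pool.submit(worker)
    return counter.count
```

Calling hammer(HitCounter(), 8, 1000) is a tiny stress test you can run on your own machine; the better option, when possible, is to avoid shared mutable state altogether.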

Volume Is Huge: You Must Add Limits

In production, everything is XXXL (number of log lines written, number of RPC calls, number of database queries...). This has major implications on performance but also on operability. For instance, writing an Entering function x/Leaving function x type of log could help in development but can flood the Ops team with GiB of logs in production.

Keep in mind this metaphor: if your DEV environment is a sandbox, production is the Sahara.

In production, real users or external systems will massively stress your application. If (for instance) you don't set a maximum size for attachment files, you will soon encounter network and storage issues (as well as CPU and memory as collateral damage). Many limits can be set at the infrastructure level (like circuit breakers in API Gateways), but most of them have to be coded into the application itself.

What can I do?

Make sure nothing is 'open bar': always paginate results from databases (using OFFSET and LIMIT for instance or using the seek method), restrict input data sizes, set timeouts on any remote call, ...
Think carefully about logs. Perform operability acceptance tests with real operators and Site Reliability Engineers (SRE).
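
As an illustration of the "nothing is open bar" rule, the sketch below paginates with the seek (keyset) method and caps the page size whatever the caller asks for. It uses SQLite from the Python standard library only to stay self-contained; the client table, MAX_PAGE_SIZE value, and function names are assumptions made for this example.

```python
import sqlite3

MAX_PAGE_SIZE = 100  # hard upper bound, whatever the caller requests


def fetch_page(conn, after_id=0, page_size=50):
    """Return up to page_size rows whose id is greater than after_id."""
    limit = max(1, min(page_size, MAX_PAGE_SIZE))
    # Seek method: filter on the last id seen instead of using OFFSET,
    # so the database does not have to scan and discard skipped rows.
    return conn.execute(
        "SELECT id, name FROM client WHERE id > ? ORDER BY id LIMIT ?",
        (after_id, limit),
    ).fetchall()


def demo_connection(row_count=250):
    """Build an in-memory database with row_count sample clients."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE client (id INTEGER PRIMARY KEY, name TEXT NOT NULL)")
    conn.executemany(
        "INSERT INTO client (id, name) VALUES (?, ?)",
        [(i, f"client-{i}") for i in range(1, row_count + 1)],
    )
    return conn
```

To read the next page, pass the id of the last row received as after_id. The same idea applies to timeouts and input sizes: pick an explicit bound rather than relying on a default that may be unlimited.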

Production Is Distributed and Redundant

While in DEV most components (like an application server and a database) run on the same node, in production they are usually distributed (i.e., some network link exists between them). The network is very slow compared with local memory (to put it to scale: if a local CPU instruction took one second, a LAN network call would take a full year).

In DEV, the instantiation factor is 1: every component is instantiated only once. In any production environment dealing with serious high availability, performance, or fail-over requirements, every component is redundant. There are not only servers but clusters.

What can I do?

Don't hardcode URLs or make assumptions about the colocalization of components (I've seen code where the localhost hostname was hardcoded).
If possible, increase dev/prod parity (i.e., reduce the gap between environments) by using a locally distributed system on your workstation, like a local Kubernetes cluster (see K3s, for instance).
Even if this kind of issue should be detected in an integration testing environment, try to keep in mind that your code will eventually run concurrently on several threads and even nodes. This has implications for tuning the number of connections in datasources, among other considerations.
Always favor stateless architectures.
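
One simple way to avoid hardcoded URLs is to read them from the environment and fail fast when they are missing or malformed. The sketch below assumes a hypothetical BACKEND_URL variable; the name and the validation rules are only an illustration.

```python
import os
from urllib.parse import urlparse


def backend_url(env=os.environ):
    """Read the backend URL from the environment instead of hardcoding it."""
    url = env.get("BACKEND_URL")  # hypothetical variable name
    if not url:
        raise RuntimeError("BACKEND_URL is not set")
    parsed = urlparse(url)
    if parsed.scheme not in ("http", "https") or not parsed.hostname:
        raise RuntimeError(f"BACKEND_URL is not a valid HTTP(S) URL: {url!r}")
    return url
```

Failing at startup with a clear message is usually far cheaper than discovering at the first request that the code silently pointed to localhost.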

Anything Can Happen in Production

One of the most common phrases I’ve heard from developers dealing with a production issue is, "This is impossible, this can't happen." But it does happen. Due to the very nature of production (high concurrency, unexpected user behaviors, attacks, hardware failures...), very strange things can and will happen.

Even after thorough postmortem studies, the root cause of a significant proportion of production issues will never be diagnosed or solved (from my own experience, about 10% of cases). Some abeyant defects occur only with a rare combination of exceptional events. Some bugs happen once in 10 years or, by chance (or misfortune?), never occur during the entire application lifetime.

A small story: I recently encountered a bug in a node.js job that occurred about once every 10,000 runs (when a randomly generated password contained an unescaped double dollar character sequence).

Check out any production log, and you’ll probably see erratic errors here or there (this is rather scary, trust me).

Preventing expected issues is good practice, but truly good code should control and handle the unexpected correctly.

Hardware or network failures are very common. For instance, network micro-cuts can occur (see the 8 Fallacies of Distributed Computing), servers can crash, and filesystems can fill up.
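
A common way to survive transient failures such as network micro-cuts is to retry with an increasing delay. The following is a minimal sketch, not a complete resilience strategy: call_with_retries and its parameters are hypothetical names, and it only makes sense for operations that are safe to repeat.

```python
import time


def call_with_retries(operation, attempts=3, base_delay=0.1, sleep=time.sleep,
                      retry_on=(ConnectionError, TimeoutError)):
    """Run operation(), retrying transient failures with exponential backoff."""
    for attempt in range(1, attempts + 1):
        try:
            return operation()
        except retry_on:
            if attempt == attempts:
                raise  # give up and let the caller handle the failure
            # Wait base_delay, then 2x, then 4x... before the next attempt.
            sleep(base_delay * 2 ** (attempt - 1))
```

The sleep function is injectable so that tests do not have to wait; in real code you would probably also add a global timeout and some random jitter.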

Don't trust data coming from other modules, even your own. For example, an integration error could make a module call a deprecated version of your API. You might also get corrupted files with wrong encoding or incorrect dates. We recently received a payload from a partner containing a date of February 30th...
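
The February 30th case is easy to catch if dates are parsed into real date objects at the boundary instead of being passed around as strings. A minimal sketch (parse_iso_date is a hypothetical helper name):

```python
from datetime import date


def parse_iso_date(value):
    """Parse an ISO 8601 date string; reject malformed or impossible dates."""
    try:
        return date.fromisoformat(value)
    except (TypeError, ValueError) as exc:
        raise ValueError(f"rejected date {value!r}: {exc}") from exc
```

With this in place, parse_iso_date("2021-02-30") raises a ValueError at the entry point rather than letting the bad value travel further into the system.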

Don't even trust your own database (add as many constraints as possible, like NOT NULL, CHECK, ...): corrupted data can appear due to bugs in previous module versions, migrations, administration script issues, stalled transactions, integration errors on encoding or timezones... Let any application run for several years and perform some data sanity checks against your own database—you may be surprised.
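
Here is a small sketch of what such constraints can look like, using SQLite from the Python standard library so that it runs anywhere. The payment table and its rules are invented for the example; the exact constraint syntax and behavior vary from one database to another.

```python
import sqlite3

SCHEMA = """
CREATE TABLE payment (
    id       INTEGER PRIMARY KEY,
    amount   INTEGER NOT NULL CHECK (amount > 0),
    currency TEXT    NOT NULL CHECK (length(currency) = 3),
    status   TEXT    NOT NULL CHECK (status IN ('PENDING', 'PAID', 'REFUNDED'))
)
"""


def new_database():
    """Create an in-memory database with the constrained table."""
    conn = sqlite3.connect(":memory:")
    conn.execute(SCHEMA)
    return conn


def try_insert(conn, amount, currency, status):
    """Return True if the row is accepted, False if a constraint rejects it."""
    try:
        with conn:  # commits on success, rolls back on error
            conn.execute(
                "INSERT INTO payment (amount, currency, status) VALUES (?, ?, ?)",
                (amount, currency, status),
            )
        return True
    except sqlite3.IntegrityError:
        return False
```

Even if a buggy module version or an administration script bypasses the domain layer, the database itself refuses a negative amount, a missing currency, or an unknown status.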

Users and external batch systems should be treated as monkeys (with all due respect).

Don’t rely on human processes; assume that users can do anything. For example, two common PEBCAK problems I observed recently on front-end components:

Double submit (some users double-click instead of single-clicking). Some REST RPC calls are hence done twice, and concurrency oddities occur in the backend.
Private browsing: for various reasons, users switch to this mode, and strange things happen (like local data loss or browser extensions being disabled).

Most of the time, users won’t admit to or even realize these errors. They might use an unsupported browser, use a personal machine instead of a professional one, open the web app in multiple tabs, and do many other things you wouldn’t expect.
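
One possible defense against double submits is to make the backend operation idempotent: the client sends a unique request ID, and a repeated ID returns the stored result instead of running the operation again. The sketch below is only an assumption about how this could look (IdempotentHandler is a hypothetical class); it keeps results in memory, which works for a single process only. With several nodes, the IDs would have to live in a shared store, for example behind a unique constraint in the database.

```python
import threading


class IdempotentHandler:
    """Process each request at most once, keyed by a client-supplied ID."""

    def __init__(self, process):
        self._process = process
        self._results = {}
        self._lock = threading.Lock()

    def handle(self, request_id, payload):
        # Holding the lock while processing keeps the sketch simple: a
        # duplicate arriving concurrently waits, then gets the stored result.
        with self._lock:
            if request_id not in self._results:
                self._results[request_id] = self._process(payload)
            return self._results[request_id]
```

A double-click then results in two calls with the same request ID but only one order, payment, or email.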

What can I do?

Make your code as robust as possible, write anti-corruption layers, and normalize strings. When parsing data, check date and time formats, encodings, and data formats (if using hexagonal architecture, perform these checks as early as possible, in the 'in' adapters).
Add as many constraint checks in your database as possible. Don’t rely solely on the domain layer code.
When possible, instead of writing your own checks, rely on a shared contract (like a JSON Schema or an XSD) and ensure you actually validate not only every inbound stream but also your own outbound streams.
Think about retries, robust error handling, double submission, replays from save points in batch jobs, etc.
When writing tests, consider as many borderline or seemingly impossible cases as possible.
Use chaos-engineering tools (like Simian Army) that randomly generate errors to test your code's resiliency.
Plan for rejected data handling.
To address human errors, identify problematic users, book a meeting, and observe them using your application before asking any direct questions to avoid guiding them.
Build a realistic testing dataset and maintain it. Add new data as soon as you're aware of a previously unconsidered special case. Manage these datasets like your code (versioning, cleanup, refactoring, documentation...).
Don't ignore weak signals. When something unusual happens in development, it will probably happen in production as well, and it will be far worse then.
When fixing an issue, make sure to identify all the places it can occur; don't just fix it in the localized instance.
Add clever logs to your code. A clever log includes:
- A canonical identifier in the message (an error code like ERR123 or an event ID like NEW_CLIENT). This greatly eases monitoring by enabling regex matching.
- All required debugging context (like a person UUID, timestamp, etc.).
- The appropriate verbosity level.
- Stack traces for errors, so developers can easily locate the problem in the code.
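
The clever-log checklist above can be sketched with Python's standard logging module. The logger name, the NEW_INVOICE event ID, and the helper functions are assumptions made for this example; ERR123 is reused from the list above. Timestamps are not built by hand here because a handler's formatter can add them (for instance with %(asctime)s).

```python
import logging

logger = logging.getLogger("billing.invoice")  # hypothetical module name


def log_event(level, event_id, message, exc_info=False, **context):
    """Log 'EVENT_ID message key=value ...' at the given verbosity level."""
    details = " ".join(f"{key}={value}" for key, value in sorted(context.items()))
    logger.log(level, "%s %s %s", event_id, message, details, exc_info=exc_info)


def create_invoice(client_uuid, amount):
    """Toy business function showing one INFO event and one ERROR event."""
    try:
        if amount <= 0:
            raise ValueError(f"invalid amount: {amount}")
    except ValueError:
        # exc_info=True attaches the stack trace to the log record.
        log_event(logging.ERROR, "ERR123", "invoice rejected", exc_info=True,
                  client_uuid=client_uuid, amount=amount)
        return False
    log_event(logging.INFO, "NEW_INVOICE", "invoice created",
              client_uuid=client_uuid, amount=amount)
    return True
```

Because every message starts with a stable identifier, an operator can match ^ERR123 with a simple regex instead of guessing at free-text wording.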

Issues Never Walk Alone

In production, things never get better on their own: hope is not a strategy. Due to Murphy’s law, anything that can fail will fail.

Worse: issues often occur simultaneously. An initial incident can induce another, even if they seem unrelated at first glance (for instance, an Out Of Memory error can create pressure on the JVM Garbage Collector, which in turn increases CPU usage, leading to queued work latency and ultimately generating client timeouts).

Sometimes, it’s even worse: truly unrelated issues may occur simultaneously by sheer misfortune, making diagnosis much more challenging and potentially leading down the wrong path during a post-mortem.

What can I do?

Don’t leave issues found in production or DEV logs unresolved. Most issues can be detected in development or acceptance environments. Often, we observe problems and ignore them, thinking they’re transient due to some integration issue or intermittent network glitch. This type of issue should instead be taken as an opportunity to reveal a real underlying problem and should not be ignored.
When you observe something unusual, stop immediately and take a few minutes to analyze the issue or add new test cases. Consider that you may have discovered an abeyant defect that could take days to diagnose and resolve later in production.

In Production, Everything Is Complicated and Time-consuming

For reasons both good and not so good, every change must be controlled and traced in a regulated IS (information system). A single SQL statement must be tested in several testing or pre-production environments and then applied by a DBA.

Any simple Unix command has to be documented in a procedure and executed by the Ops team, who are the only ones with access to the servers. Most of these operations must be planned, documented in depth, motivated, and traced in one or more ticket systems. Changing a simple file or a single row in a database can hardly take less than half a day when counting all involved personnel.

The costs increase exponentially as we approach production. See [Capers Jones, 1996] or [Marco M. Morana, 2006]: a bug can cost as little as $25 to fix in DEV but as much as $16,000 in a live production environment.

Even though modern software engineering promotes CD (Continuous Deployment) and the use of IaC (Infrastructure As Code) tools like Kubernetes, Terraform, or Ansible, deploying to production is still a significant event in most organizations, and many DevOps concepts remain theoretical. Releases usually can’t be deployed daily; they typically go out about once a week or even once a month. Each release usually must be validated by the product owner's acceptance tests (which involve many manual and repetitive operations). Any blocking issue requires a hotfix, involving significant administrative and build work.

What can I do?

Perform as much unit, integration, and system testing as possible before reaching the production environment.
Add hot-reloading configuration capabilities to your modules (such as changing log verbosity via a simple REST call to an administrative endpoint).
Make sure that all processes and operations (ticket system, contacts, alert methods, etc.) are documented and quickly accessible. If not, document them to save significant time in future incidents.
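
As a sketch of the hot-reloading idea, the function below changes a logger's verbosity at runtime using Python's standard logging module. The administrative REST endpoint that would call it is deliberately left out, and set_log_level and ALLOWED_LEVELS are hypothetical names; such an endpoint obviously has to be protected.

```python
import logging

ALLOWED_LEVELS = {"DEBUG", "INFO", "WARNING", "ERROR"}


def set_log_level(logger_name, level_name):
    """Change a logger's verbosity at runtime; return the level now in effect."""
    level_name = level_name.upper()
    if level_name not in ALLOWED_LEVELS:
        raise ValueError(f"unsupported log level: {level_name!r}")
    logger = logging.getLogger(logger_name)
    logger.setLevel(getattr(logging, level_name))
    return logging.getLevelName(logger.getEffectiveLevel())
```

Being able to switch one module to DEBUG for ten minutes without a redeployment can save the half-day of procedures described above.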

In Production, an Error is an Error

Correctly monitored systems trigger alerts for each significant error, and someone is generally responsible for analyzing it.

If too many false positives occur, supervisors will lower their attention level and may increasingly overlook actual problems.

They may also add filters or exceptions, which take time to write and test and can sometimes be buggy, leading to missed genuine alerts.

When dealing with hypervision (a single centralized and consolidated view of large systems used in control rooms), things are mostly binary: it either works or it doesn’t. False positives can quickly disrupt this view and defeat its very purpose.

What can I do?

Ensure that your logs don’t generate false positives. For instance, if you’re using lazy authentication to re-authenticate to an API when a token has expired, don’t log an error when the token expires—only if re-authentication fails.
Make sure your logs are well-structured and categorized by package, module, etc., so that false positives can be filtered out if necessary.
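
The lazy authentication example can be sketched as follows. The two exception classes and the call_api and authenticate callables are hypothetical stand-ins for whatever client library you actually use; the point is only where the ERROR log is placed.

```python
import logging

logger = logging.getLogger("partner.client")  # hypothetical module name


class TokenExpired(Exception):
    """Raised by the (hypothetical) API call when the token is no longer valid."""


class AuthenticationFailed(Exception):
    """Raised by the (hypothetical) login call when credentials are rejected."""


def call_with_lazy_auth(call_api, authenticate, token):
    """Call the API, re-authenticating once if the token has expired."""
    try:
        return call_api(token)
    except TokenExpired:
        # Expected in normal operation: not an error, so no ERROR log.
        logger.debug("TOKEN_EXPIRED re-authenticating")
    try:
        token = authenticate()
    except AuthenticationFailed:
        # This one is a real problem and deserves an alert.
        logger.error("ERR_AUTH re-authentication failed", exc_info=True)
        raise
    return call_api(token)
```

An expired token produces at most a DEBUG line, so supervision only sees an error when somebody really needs to act.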

Production Is Very Stressful

When an incident occurs in production, your stress level may vary depending on the industry you’re working in. But even if you work for a medium-sized e-commerce company rather than a nuclear facility or hospital, any problem generates a lot of pressure from customers, management, and other teams depending on you. Ops teams are accustomed to this and most are impressively calm when dealing with such events—it’s part of their job, after all. When the issue originates from your code, you may have to work alongside them and shoulder part of the pressure.

What can I do?

Make sure to be prepared before an incident by writing or familiarizing yourself with procedures (for instance, see the excellent SRE Book by Google, Chapter 14).
Be confident in your logs and monitoring metrics to help you find the root cause (for example, prepare insightful dashboards and centralized log queries in advance).
For any complex issue, begin the investigation by creating a post-mortem document to centralize notes, stack traces, logs, or graphs that illustrate your hypothesis.

In Production, You Don't Have a Single Version to Manage

In practice, it's typically not feasible to force all your internal or external clients to upgrade to your latest API or data model at the same time. You must handle complex migration paths that require supporting multiple versions concurrently. For example, in the extensive French Tax Information System, the most crucial central APIs (such as the Person API) provide three versions of each endpoint. Approximately every year, a new version is introduced, the second is deprecated and becomes the third, and the third is decommissioned. All three versions must coexist with a shared data model.

What can I do?

Always include a version in your API URLs (for instance, /v1/foo/bar).
Incorporate model versions into your data model. For example, in NoSQL, add a modelVersion attribute that allows your code or ETL tools to determine the version of each data item individually.
Consider managing data of varying versions; for instance, write conditional code based on the data model version.
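
Here is a minimal sketch of conditional code driven by a modelVersion attribute. The three versions and the changes between them (a name split into two fields, then a country attribute) are invented for the example; only the modelVersion idea comes from the text above.

```python
CURRENT_MODEL_VERSION = 3


def _v1_to_v2(doc):
    # Hypothetical change: 'name' was split into first and last name.
    first, _, last = doc.pop("name").partition(" ")
    return {**doc, "firstName": first, "lastName": last, "modelVersion": 2}


def _v2_to_v3(doc):
    # Hypothetical change: a mandatory 'country' attribute was added.
    return {**doc, "country": doc.get("country", "UNKNOWN"), "modelVersion": 3}


UPGRADES = {1: _v1_to_v2, 2: _v2_to_v3}


def upgrade(doc):
    """Bring a stored document up to the current model version, step by step."""
    doc = dict(doc)  # work on a copy, never mutate the caller's data
    version = doc.get("modelVersion", 1)  # assumption: unversioned means v1
    while version < CURRENT_MODEL_VERSION:
        doc = UPGRADES[version](doc)
        version = doc["modelVersion"]
    if version != CURRENT_MODEL_VERSION:
        raise ValueError(f"unsupported modelVersion: {version}")
    return doc
```

Chaining small one-step upgrades means that each new model version only requires one new function, while old items are converted lazily when they are read.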

In Production, You Usually Don't Start from Scratch

In development, when your database structure (DDL) evolves, you simply drop and recreate it. In production, in most cases, data is already present, and you have to perform migrations or adaptations (using ETL or other tools). Similarly, if clients are already using your API, you can't change the signature without careful consideration; instead, you must think about backward compatibility. If you need to, you can deprecate certain code, but then you must plan for end-of-service.

What can I do?

In development, don't simply drop DDL; use incremental change tools like Liquibase to 'code' changes. These same tools should be used in production.
Check that your libraries or API remain backward compatible through integration tests.
Use Semantic Versioning conventions to signal breaking changes.
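
With Semantic Versioning, a change of the MAJOR number is what signals a breaking change, which makes it easy to check automatically (for instance in a build step). The sketch below is deliberately simplified: the function names are hypothetical, and it ignores pre-release and build metadata as well as the special status of 0.x versions.

```python
def parse_version(version):
    """Split 'MAJOR.MINOR.PATCH' into a tuple of three integers."""
    parts = version.split(".")
    if len(parts) != 3 or not all(p.isascii() and p.isdigit() for p in parts):
        raise ValueError(f"not a MAJOR.MINOR.PATCH version: {version!r}")
    return tuple(int(p) for p in parts)


def is_breaking_upgrade(current, candidate):
    """True if moving from current to candidate crosses a major version."""
    return parse_version(candidate)[0] > parse_version(current)[0]
```

For example, is_breaking_upgrade("1.4.2", "2.0.0") is true, whereas an upgrade from 1.4.2 to 1.5.0 is expected to remain backward compatible.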

Security Is More Significant in Production

In any seriously protected production environment, many security systems are set up. These are often absent from other environments due to their added complexity and costs. For instance, you may find additional layer 3 and 4 firewalls, WAFs (Web Application Firewalls operating at layer 7 on HTTP(S) calls), API Gateways, IAM systems (SAML, OIDC, etc.), and HTTP(S) proxies or reverse proxies. Internet calls are usually forbidden from servers, which can only use replicated and cached data (such as local package repositories). Consequently, many security infrastructure differences can mask issues that will only be discovered in pre-production or even production.

What can I do?

Don't use the same values for different credential parameters. This can mask some integration issues in production, where parameters are more likely to differ, and unique passwords are typically used for each resource.
Make sure to understand the limitations of the security infrastructure before coding related user stories.
Test security infrastructure using containers.

Conclusion

It's beneficial for developers to be curious and learn about production on their own by reading blogs, books, or simply asking colleagues. As a developer, do you know how many cores a medium-range server has (per socket)? How much RAM is on each blade? Have you considered where the data centers running your code are located? How much energy your modules consume, in kWh, each day? How data is stored on a SAN? Are you familiar with fail-over systems like load balancers, RAID, standby databases, virtual infrastructure management, and SAN replications? You don't have to be an expert, but knowing the basics is important and rewarding.

I hope this provided developers with a first glimpse of production constraints. Production is a world where everything is amplified: the severity of issues, costs, and time required to fix systems. Always keep in mind that your code will eventually run in production, and simply functioning is not enough: your code must be production-proof to keep the organization’s IS running smoothly. Then, everything will be fine, and everyone will be home early instead of pulling out their hair late into the night.