A first glimpse of production constraints for developers
(This article has also been published on DZone)
In most organizations, developers are not allowed to access the production environment for stability, security, or regulatory reasons. Restricting access to production is quite a good practice (enforced by many frameworks like COBIT or ITIL), but a major drawback is the mental distance it creates between developers and the real world. Likewise, monitoring is usually managed only by operators, and very little feedback is provided to developers except when they have to fix application bugs (ASAP, of course). As a matter of fact, most developers have very little idea of what a real production environment looks like and, more importantly, of the non-functional requirements that production-proof code has to meet.
Involving developers in resolving production issues is a good thing for two main reasons:
- It is highly motivating to get tangible evidence of a real system running on a large infrastructure (data centers, clusters, SAN...) and to get insights into the performance of their applications or business facts about them (number of transactions per second, number of concurrent users, and so on). It is also very common for developers to feel overwhelmed when an outage occurs, as they are rarely called in directly.
- It may substantially improve the quality of the delivered code by helping developers properly design operational aspects like logs, monitoring, performance, and integration.
So, What Do Developers Most Often Misunderstand about Production?
Concurrency Is Omnipresent
Production is highly concurrent while development is mostly single-threaded. Concurrency can happen among the threads of a single process (of an application server for instance) but also among different processes running locally or on other machines (e.g., among n instances of an application server running across different nodes). This concurrency can generate various issues like starvation (slowdowns when waiting concurrently for a shared resource), deadlocks or scope issues (data overriding).
What can I do?
- Perform minimal stress tests in the DEV environment or even on your own machine using injectors like JMeter or Gatling. When using frameworks like Spring, make sure to understand and correctly apply scoping best practices (for instance, don't use a stateful singleton).
- Simulate concurrency using breakpoints or sleeps in your code and check the context of each stalled thread.
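To make the scoping advice concrete, here is a minimal Python sketch (not tied to Spring or any other framework) in which a sleep stands in for a breakpoint. The class and function names (StatefulGreeter, StatelessGreeter, greet_two_users) are hypothetical and only serve the illustration.

```python
import threading
import time


class StatefulGreeter:
    """Shared service that wrongly keeps per-request data in a field."""

    def __init__(self):
        self.current_user = None

    def greet(self, user, work_seconds):
        self.current_user = user      # shared state, visible to every thread
        time.sleep(work_seconds)      # stands in for a breakpoint or slow I/O
        return f"Hello {self.current_user}"


class StatelessGreeter:
    """Same service, but request data only lives in local variables."""

    def greet(self, user, work_seconds):
        time.sleep(work_seconds)
        return f"Hello {user}"


def greet_two_users(greeter):
    """Call the same greeter instance from two overlapping threads."""
    replies = {}

    def call(user, work_seconds):
        replies[user] = greeter.greet(user, work_seconds)

    slow = threading.Thread(target=call, args=("alice", 0.3))
    fast = threading.Thread(target=call, args=("bob", 0.0))
    slow.start()
    time.sleep(0.1)   # let alice's request get in flight before bob arrives
    fast.start()
    slow.join()
    fast.join()
    return replies
```

With the stateful version, alice's request is answered with bob's name because bob overwrote the shared field while alice was still being processed. The stateless version answers each user correctly, which is the whole point of avoiding state in shared objects.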
Volume Is Huge: You Must Add Limits
In production, everything is XXXL (number of log lines written, number of RPC calls, number of database queries...). This has major implications for performance but also for operability. For instance, writing an "Entering function x" / "Leaving function x" type of log could help in development but can flood the Ops team with GiB of logs in production. Likewise, when dealing with monitoring, make sure alerts are usable. If your application generates tens of alerts every day, nobody will notice them anymore after a few days.
Keep in mind this metaphor: if your DEV environment is a sandbox, production is the Sahara.
In production, real users or external systems will massively stress your application. If (for instance) you don't set a maximum size for attached files, you will soon get network and storage issues (with CPU and memory issues as collateral damage). Many limits can be set at the infrastructure level (like circuit breakers in API Gateways) but most of them have to be coded into the application itself.
What can I do?
- Make sure nothing is 'open bar': always paginate results from databases (using LIMIT for instance, or using the seek method), restrict input data sizes, set timeouts on any remote call, ...
- Think carefully about logs. Perform operability acceptance tests with real operators and Site Reliability Engineers (SRE).
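As an illustration of bounded queries, here is a small sketch of seek (keyset) pagination using Python's built-in sqlite3 module. The item table, its columns, PAGE_SIZE and the helper names are all assumptions made up for the example. The idea carries over to other databases, but the exact SQL may differ.

```python
import sqlite3

PAGE_SIZE = 3  # hypothetical hard limit on the number of rows per query


def fetch_page(conn, after_id=0, page_size=PAGE_SIZE):
    """Return at most page_size rows whose id is greater than after_id."""
    return conn.execute(
        "SELECT id, label FROM item WHERE id > ? ORDER BY id LIMIT ?",
        (after_id, page_size),
    ).fetchall()


def iter_pages(conn, page_size=PAGE_SIZE):
    """Walk the whole table one bounded page at a time."""
    after_id = 0
    while True:
        page = fetch_page(conn, after_id, page_size)
        if not page:
            return
        yield page
        after_id = page[-1][0]   # seek: restart just after the last id seen


def demo_connection(row_count=7):
    """Build an in-memory database with row_count sample rows."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE item (id INTEGER PRIMARY KEY, label TEXT NOT NULL)")
    conn.executemany(
        "INSERT INTO item (label) VALUES (?)",
        [(f"item-{n}",) for n in range(1, row_count + 1)],
    )
    return conn
```

Unlike an OFFSET-based approach, each query here filters on the last id already seen, so the cost of fetching a page does not grow with the page number. No single call can ever return more than page_size rows.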
Production Is Distributed and Redundant
While in DEV most components (like an application server and a database) run on the same node, in production they are usually distributed (i.e., some network link exists between them). The network is very slow in comparison with local memory (on a human scale, if a CPU cycle took one second, a round trip on a LAN would take days and a call across the Internet would take years).
In DEV, the instantiation factor is 1: every component is instantiated only once. In any production environment having to deal with serious high availability, performance or fail-over requirements, every component is redundant. There are not only servers but clusters.
What can I do?
- Don't hardcode URLs or make assumptions about the co-location of components (I have already seen code where the localhost hostname was hardcoded).
- If possible, improve dev/prod parity by running a locally distributed system on your own workstation, like a local Kubernetes cluster (see K3s for instance).
- Even if this kind of issue should be detected in the integration testing environment, keep in mind that your code will eventually run concurrently on several threads and even on several nodes. This has implications for the tuning of the number of datasource connections, among other considerations.
- Always favor stateless architectures.
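One simple way to avoid hardcoded hosts is to resolve every remote component from the environment and to fail fast when the configuration is missing. The sketch below assumes a hypothetical naming convention (a variable such as ORDERS_URL for a component called "orders"). Your platform may well use another mechanism, like a configuration server or service discovery.

```python
import os
from urllib.parse import urlsplit


def service_url(name, env=None):
    """Look up the base URL of a remote component by name.

    Reads a variable such as ORDERS_URL (hypothetical convention) and
    fails fast when it is missing or malformed.
    """
    env = os.environ if env is None else env
    key = f"{name.upper()}_URL"
    value = env.get(key, "").strip()
    parts = urlsplit(value)
    if parts.scheme not in ("http", "https") or not parts.hostname:
        raise RuntimeError(f"{key} must be set to an http(s) URL, got {value!r}")
    return value.rstrip("/")
```

On a workstation, ORDERS_URL may well point to localhost, but that choice now lives in the configuration and not in the code.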
Anything Can Happen in Production
One of the most common sentences I hear from developers dealing with a production issue is "This is impossible, this can't happen". But it actually does. Due to the very nature of production (high concurrency, unexpected user behavior, attacks, hardware failures...), very strange things can and will happen.
Even after serious post-mortem studies, the root cause of a significant proportion of production issues will never be diagnosed or solved (from my own experience, I would say about 10% of cases). Some abeyant defects only show up under a combination of exceptional events. Some bugs happen once in 10 years or even, by chance (or misfortune?), never occur during the entire application lifetime. Small story: I was faced very recently with a bug in a Node.js job that occurred about once in 10K runs (when a randomly generated password contained an unescaped sequence of two dollar characters).
Check out any production log and you will probably see erratic errors here or there (this is rather scary, trust me).
Preventing expected issues is a good thing, but truly good code should also control and correctly handle the unexpected.
Hardware or network failures are very common. For instance, network micro-cuts can occur (see the 8 Fallacies of Distributed Computing), servers can crash and filesystems can fill up.
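A network micro-cut is typically a transient error that a short retry can absorb. The following sketch shows one possible retry helper with exponential backoff. The function name, its parameters and the default values are assumptions for the example, not a recommendation for any particular library.

```python
import time


def call_with_retries(operation, attempts=3, base_delay=0.1, sleep=time.sleep):
    """Run operation(), retrying transient network errors with exponential backoff.

    Only ConnectionError and TimeoutError are retried; any other exception
    is considered a real bug and propagates immediately.
    """
    for attempt in range(1, attempts + 1):
        try:
            return operation()
        except (ConnectionError, TimeoutError):
            if attempt == attempts:
                raise                      # give up and let the caller decide
            sleep(base_delay * 2 ** (attempt - 1))
```

Retrying is only safe for operations that can be replayed without side effects (reads, or writes designed to be idempotent). The sleep parameter is injectable so that tests don't have to actually wait.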
Don't trust data coming from other modules, even your own. As an example, an integration error can make a module call a deprecated version of your API. You may also get corrupted files, for instance with a wrong encoding or wrong dates. Don't even trust your own database (add as many constraints as possible, like CHECK, ...): corrupted data can appear due to bugs in previous module versions, migrations, administration script issues, stalled transactions, integration errors on encodings or timezones... Let any application run for several years and perform some data sanity checks against your own database: you may be surprised.
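To show what database-level constraints buy you, here is a short sketch using Python's built-in sqlite3 module. The customer table, its columns and the accepted values are invented for the example, and the e-mail check is deliberately naive.

```python
import sqlite3


def create_schema(conn):
    """Create a table that rejects obviously corrupted rows by itself."""
    conn.execute(
        """
        CREATE TABLE customer (
            id     INTEGER PRIMARY KEY,
            email  TEXT NOT NULL UNIQUE CHECK (email LIKE '%_@_%'),
            age    INTEGER CHECK (age BETWEEN 0 AND 150),
            status TEXT NOT NULL CHECK (status IN ('ACTIVE', 'SUSPENDED'))
        )
        """
    )


def try_insert(conn, email, age, status):
    """Return True if the row was accepted, False if a constraint rejected it."""
    try:
        with conn:   # commits on success, rolls back on error
            conn.execute(
                "INSERT INTO customer (email, age, status) VALUES (?, ?, ?)",
                (email, age, status),
            )
        return True
    except sqlite3.IntegrityError:
        return False
```

Whatever bug slips into the domain layer, a migration script or a manual fix, a negative age or an unknown status never reaches the table.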
Users and external batch systems should be treated as monkeys (with all due respect).
Don't rely on human processes; assume instead that users can do anything. For instance, here are two common PEBCAK problems I recently observed on front ends:
- Double submit (some users double-click instead of single-clicking). Some REST calls are hence made twice and concurrency oddities occur in the backend;
- Private navigation: for some reason, users switch to this mode and strange things happen (like local data being lost or browser extensions being disabled).
Most of the time, users will never admit or even figure out this kind of error. They can also use the wrong browser, use a personal machine instead of a professional one, open the web app in several tabs at once and do many other things you would never imagine.
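One common way to make a backend tolerate double submits is an idempotency key: the front end generates a unique key per user action and the backend creates the resource only once per key. The sketch below is an in-memory illustration with hypothetical names (OrderService, submit, idempotency_key). A real backend would rather store the key in the database behind a unique constraint.

```python
import threading


class OrderService:
    """In-memory stand-in for a backend that must tolerate double submits."""

    def __init__(self):
        self._lock = threading.Lock()
        self._orders_by_key = {}

    def submit(self, idempotency_key, payload):
        """Create the order once; replay the first result for a repeated key."""
        with self._lock:
            existing = self._orders_by_key.get(idempotency_key)
            if existing is not None:
                return existing
            order = {"order_id": len(self._orders_by_key) + 1, **payload}
            self._orders_by_key[idempotency_key] = order
            return order

    def order_count(self):
        with self._lock:
            return len(self._orders_by_key)
```

A double-click then sends the same key twice: the second call returns the order created by the first one and nothing is duplicated.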
What can I do?
- Make your code as robust as possible, write anti-corruption layers and normalize strings. When parsing data, validate date and time formats, encodings and data formats (if using a hexagonal architecture, perform these controls as early as possible, in the 'in' adapters).
- Add as many constraint checks in your database as possible. Don’t just rely on the domain layer code.
- When possible, instead of writing your own controls, rely on a shared contract (like a JSON Schema or an XSD).
- Think about retries, robust error handling, double submission, replays from save points in batch jobs, ...
- When writing your tests, think about as many borderline or apparently impossible cases as possible.
- Use chaos-engineering tools (like Simian Army) that generate errors randomly to test the resilience of your code.
- Think about what to do with rejected data.
- To deal with human errors, identify problematic users, book a meeting and observe them using your application before asking any direct question, to avoid leading them.
- Build a realistic testing dataset and maintain it. Add new data as soon as you're aware of a special case you didn't consider before. Manage these datasets like your code (versioning, cleanup, refactoring, documentation...).
- Don't ignore weak signals. When something strange happens in development, it will probably happen in production as well and will be far worse then.
- When fixing an issue, make sure to identify all the places where it can occur and don't only fix it in the place where you found it.
- Add clever logs in your code. A clever log comes with:
  - A canonical identifier in the message (an error code like ERR123 or an event ID like NEW_CLIENT). This greatly eases monitoring by enabling regexp matching;
  - All required debugging context (like a person UUID, the instant of the log...);
  - The right verbosity level;
  - Stack traces when dealing with errors, so developers can easily locate the problem in their code.
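Here is what such a clever log could look like with Python's standard logging module. The format_event helper, the "billing" logger name and the two business functions are hypothetical. The identifiers ERR123 and NEW_CLIENT are the ones used as examples above.

```python
import logging

logger = logging.getLogger("billing")  # hypothetical module name


def format_event(event_id, message, **context):
    """Build 'EVENT_ID message key=value ...' with context keys sorted."""
    details = " ".join(f"{key}={value}" for key, value in sorted(context.items()))
    return f"{event_id} {message} {details}".rstrip()


def register_client(person_uuid):
    # INFO level: a normal business event, easy to match with a regexp
    logger.info(format_event("NEW_CLIENT", "client registered", person_uuid=person_uuid))


def charge(person_uuid, amount):
    try:
        if amount <= 0:
            raise ValueError("amount must be positive")
        return amount
    except ValueError:
        # logger.exception logs at ERROR level and appends the stack trace
        logger.exception(
            format_event("ERR123", "charge rejected", person_uuid=person_uuid, amount=amount)
        )
        raise
```

The instant of the log does not need to be part of the message: it is usually added by the logging configuration, for instance with a format such as "%(asctime)s %(levelname)s %(name)s %(message)s" passed to logging.basicConfig.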
Issues Never Walk Alone
In production, things never ever get better on their own: hope is not a strategy. According to Murphy’s law, anything that can fail will fail.
Worse: issues often occur simultaneously. An initial incident can induce another one, even if they look unrelated at first glance (for instance, an out-of-memory condition can put pressure on the JVM Garbage Collector, which in turn increases CPU usage, which increases the latency of queued work and finally generates timeouts on the client side).
Sometimes, this is even worse: truly unrelated issues may occur simultaneously by misfortune, making the diagnosis much more difficult by leading the post-mortem down the wrong path.
What can I do?
- Don’t leave issues found in production or DEV logs unresolved. Most issues can be detected in development or acceptance environments. Often, we observe problems and ignore them, thinking they are transient or due to some integration issue or intermittent network issue. This kind of event should instead be taken as a chance to reveal a real issue and should not be ignored.
- When you observe something strange, stop immediately and take a few minutes to analyze the issue or to add new test cases. Consider that you may have found an abeyant defect that would take days to diagnose and resolve later in production.
In Production, Everything Is Complicated and Time-Consuming
For some good but also some not so good reasons, every change has to be controlled and traced in a regulated IS. Even a single SQL statement must be tested in several testing or pre-production environments and finally applied by a DBA.
Any simple Unix command has to be documented in a procedure and executed by the Ops team, which is the only one with access to the servers. Most of these operations must be planned, documented in depth, justified and traced in one or several ticket systems. Changing a simple file or a single row in a database can hardly take less than half a man-day when counting all the people involved.
Costs increase exponentially as we get closer to production. See [Capers Jones, 1996] or [Marco M. Morana, 2006]: a bug can cost as little as $25 to fix in DEV and as much as $16K in running production.
Even if modern software engineering promotes CD (Continuous Deployment) and the use of IaC (Infrastructure As Code) tools like Kubernetes, Terraform, or Ansible, deploying to production is still a significant event in most organizations and most DevOps concepts are still theoretical. Releases are not deployed every day but about once a week or even once a month. Any release usually has to be validated by the product owner's acceptance tests (a lot of manual and repetitive operations). Any blocking issue requires a hotfix that comes with a lot of administrative and build work.
What can I do?
- Perform as much unit, integration and system testing as possible before reaching the production environment.
- Add hot-reloading configuration capabilities to your modules (like changing log verbosity using a simple REST call against an administrative endpoint).
- Make sure that all processes involving operations (ticket system, people to contact, way to alert...) are documented and quickly accessible. If not, document them to save a lot of time the next time.
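The log verbosity case can be sketched in a few lines of Python with the standard logging module. The set_log_level function, the ALLOWED_LEVELS whitelist and the shape of the returned dictionary are assumptions for the example. The function is meant to be called by whatever administrative endpoint your framework provides, which is not shown here.

```python
import logging

ALLOWED_LEVELS = {"DEBUG", "INFO", "WARNING", "ERROR"}  # hypothetical whitelist


def set_log_level(logger_name, level_name):
    """Change a logger's verbosity at runtime, without restart or redeployment."""
    level_name = level_name.strip().upper()
    if level_name not in ALLOWED_LEVELS:
        raise ValueError(f"unsupported log level: {level_name!r}")
    logger = logging.getLogger(logger_name)
    previous = logging.getLevelName(logger.getEffectiveLevel())
    logger.setLevel(level_name)   # setLevel accepts level names as strings
    return {"logger": logger_name, "previous": previous, "current": level_name}
```

Returning the previous level makes it easy to trace the change in a ticket and to restore the initial verbosity once the investigation is over. Such an endpoint obviously has to be protected like any other administrative feature.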
Production Is Very Stressful
When an incident occurs in production, your stress level may depend on the kind of industry you're working for, but even if you work for a medium-sized e-commerce company and not a nuclear facility or a hospital, I can guarantee that any problem generates a lot of pressure from customers, management and other teams depending on you. Ops teams are used to it and most are impressively calm when dealing with this kind of event. It's part of their job after all. When the problem comes from your code, you may have to work with them and take on part of the pressure yourself.
What can I do?
- Make sure to be prepared before the incident by writing or learning procedures (read for instance chapter 14 of Google's great SRE Book).
- Make sure you can be confident in your logs and monitoring metrics to help you find the root cause (for instance, prepare insightful dashboards and centralized log queries in advance).
- For any complex issue, begin the investigation by creating a post-mortem document centralizing any note, stack trace, log or graph illustrating your hypothesis.
In Production, You Usually Don't Start from Scratch
In development, when your database structure (DDL) evolves, you simply drop and recreate it. In production, in most cases, data is already there and you have to perform migrations or adaptations (using ETL or other tools). Likewise, if some clients already use your API, you can't simply change its signature without a second thought: you have to think about backward compatibility. If you have to, you can deprecate some code, but then you have to plan the end of service.
What can I do?
- In development, don't just drop the DDL but 'code' changes using incremental change tools like Liquibase. The same tools should be used in production.
- Check that your libraries or APIs are still backward compatible using integration tests.
- Use Semantic Versioning conventions to alert for breaking changes.
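As a reminder of how Semantic Versioning signals breaking changes, here is a deliberately simplified sketch. It only handles plain MAJOR.MINOR.PATCH versions (no pre-release or build metadata), and the function names are hypothetical.

```python
def parse_version(text):
    """Parse 'MAJOR.MINOR.PATCH' into a tuple of ints."""
    parts = text.strip().split(".")
    if len(parts) != 3 or not all(p.isascii() and p.isdigit() for p in parts):
        raise ValueError(f"not a MAJOR.MINOR.PATCH version: {text!r}")
    return tuple(int(p) for p in parts)


def is_breaking_upgrade(current, candidate):
    """True when upgrading from current to candidate may break clients."""
    cur, new = parse_version(current), parse_version(candidate)
    if new < cur:
        raise ValueError("candidate is older than current")
    if new == cur:
        return False
    if cur[0] == 0:
        return True            # 0.y.z: SemVer makes no stability promise
    return new[0] != cur[0]    # only a MAJOR bump announces a breaking change
```

A client pinned on 1.4.2 can therefore accept 1.5.0 but should treat 2.0.0 as a planned migration. Of course, this only helps if the producer of the API really follows the convention, which is precisely what the integration tests are there to check.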
Security Is More Prominent in Production
In any seriously protected production environment, many security systems are set up. They are often absent from the other environments due to their added complexity and costs. For instance, you can find additional layer 3 and 4 firewalls, WAF (Web Application Firewalls operating at layer 7 against HTTP(S) calls), API Gateways, IAM systems (SAML, OIDC...), HTTP(S) proxies or reverse proxies. Internet calls are usually forbidden from servers, which can only use replicated and cached data (like local package repositories). Hence, many differences in security infrastructure can mask issues that will only be discovered in pre-production or even in production.
What can I do?
- Don't use the same values for different credential parameters. This can hide integration issues that only show up in production, where parameters are more likely to differ and where a different password is used for each resource.
- Make sure to understand the security infrastructure limitations before coding related user stories.
- Test security infrastructure using containers.
It's a good thing for developers to be curious and to get information about production by themselves by reading blogs or books, or simply by asking colleagues. As a developer, do you know how many cores a medium-range server has (per socket)? How much RAM per blade? Did you ask yourself where the data centers running your code are located? How many kWh your modules consume every day? How data is stored on a SAN? Are you familiar with fail-over systems like load balancers, RAID, standby databases, virtual infrastructure management, SAN replication...? You don't have to be an expert but it's important and gratifying to know the basics.
I hope I have given developers a first glimpse of production constraints. Production is a world where everything is multiplied: the gravity of issues, costs, time to fix systems. Always keep in mind that your code will eventually run in production and that working code is far from being enough: your code must be production-proof to make the organization's IS run smoothly. Then, everything will be fine and everybody will be home early instead of pulling their hair out until late at night...