Beyond Murphy's Law
This article has also been published at DZone.
Murphy's Law ("Anything that can go wrong will go wrong, and at the worst possible time.") is a well-known adage, especially in engineering circles. However, its implications are often misunderstood, especially by the general public. It's not just about the universe conspiring against our systems; it's about recognizing and preparing for potential failures.
Many view Murphy's Law as a blend of magic and reality. As Site Reliability Engineers (SREs), we often ponder its true nature. Is it merely a psychological bias, where we dwell on failures and forget the many times things went right? Psychology has identified several related biases, including confirmation and selection bias. The human brain tends to focus more on improbable failures than on successes. Moreover, our grasp of probabilities is often flawed: the Law of Truly Large Numbers suggests that coincidences are, ironically, quite common.
However, in any complex system, a multitude of possible states exist, many of which lead to failure. While safety measures make the transition from a functioning state to a failure state less likely, given enough time it is more probable for a system to fail than not.
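To put rough numbers on that intuition (a simplification that assumes independent days and a constant failure probability): if each day carries a small failure probability p, the probability of at least one failure over n days is 1 − (1 − p)^n, which climbs toward 1 as n grows. A mere 0.1% daily risk already gives roughly a 30% chance of at least one failure over a year.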
The real lesson from Murphy's Law isn't just about the omnipresence of misfortune in engineering but also how we respond to it: through redundancies, high availability systems, quality processes, testing, retries, observability, and logging. Murphy's Law makes our job more challenging and interesting!
Today, however, I'd like to discuss complementary or reciprocal aspects of Murphy's Law that I've often observed while working on large systems:
Complementary Observations to Murphy's Law
The Worst Possible Time Complement
Often overlooked, this aspect is what gives Murphy's Law its apparent 'magic'. Complex systems do fail, but not so frequently that we forget those failures. In our experience, a significant share of them (about one-third) occur at the worst possible times, such as during important demos.
For instance, over the past two months, we had a couple of important demos. In the first, the web application failed because of a session expiration issue that rarely occurs. In the second, a regression introduced in a merge request caused a crash right in the middle of the demo. These were the only significant demos we had in that period, and both encountered failures. This phenomenon is often referred to as the 'Demo Effect'.
The Impossibility Complement
Murphy's Law states that everything that can go wrong will indeed go wrong. Truism though that may be, I would add that even what supposedly can't go wrong actually does.
Developers often note that even systems deemed infallible can fail, a sentiment captured by "it works on my machine". The causes are numerous: insufficient testing datasets, unrealistic loads, missing robustness tests, disregard for the dev-prod parity principle, or failure to test in a highly concurrent environment.
This can occur with both functional and technical issues:
- In a recent issue on one of our current projects, we had a serious technical crash when we received an invalid date ('29 February' in a non-leap year) from a partner; the date was typed as a free-text string by end users, without any validation on their side. Our business analysts had explicitly advised against testing date validity, assuming such an error "couldn't happen". (See the date-parsing sketch below.)
- Another incident involved a technical glitch (system out of memory), despite our belief that it was impossible after configuring our Java Virtual Machines to use all available memory at startup. In theory, the JVM caps its own memory usage. Yet our Java application was killed by the Linux kernel's 'oom-killer' to prevent a complete server freeze. This was possible because our program ran on a virtual machine managed by ESXi, which can perform 'ballooning', a mechanism that forces VMs to swap memory to disk. This mechanism was largely unknown to developers, integrators, and most operators, and proved challenging to understand.
The lesson learned: the importance of adopting highly defensive programming and creating robust systems cannot be overstated.
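To illustrate what that defensive stance can look like, here is a minimal sketch (not our actual partner-facing code, and assuming a dd/MM/yyyy-style input for illustration): strict parsing with java.time rejects impossible dates such as 29 February in a non-leap year instead of letting them crash deeper in the system.

```java
import java.time.LocalDate;
import java.time.format.DateTimeFormatter;
import java.time.format.DateTimeParseException;
import java.time.format.ResolverStyle;

public class DateValidator {

    // 'uuuu' (not 'yyyy') is required with ResolverStyle.STRICT,
    // otherwise every year fails to resolve.
    private static final DateTimeFormatter STRICT_FORMAT =
            DateTimeFormatter.ofPattern("dd/MM/uuuu").withResolverStyle(ResolverStyle.STRICT);

    // Returns the parsed date, or throws a clear business error for input
    // like "29/02/2023" instead of letting it blow up several layers away.
    static LocalDate parsePartnerDate(String raw) {
        try {
            return LocalDate.parse(raw, STRICT_FORMAT);
        } catch (DateTimeParseException e) {
            throw new IllegalArgumentException("Invalid date received from partner: " + raw, e);
        }
    }
}
```

With strict resolution, parsePartnerDate("29/02/2023") fails immediately with an explicit message, which is far easier to diagnose than a crash downstream.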
The Conjunction of Events Complement
The combination of events leading to a breakdown can be truly astonishing.
For example, I once inadvertently caused a major breakdown in a large application responsible for sending electronic payrolls to 5 million people, coinciding with its production release day. The day before, I conducted additional benchmarks (using JMeter) on the email sending system within the development environment. Our development servers, like others in the organization, were configured to route emails through a production relay, which then sent them to the final server in the cloud. Several days prior, I had set the development server to use a mock server since my benchmark simulated email traffic peaks of several hundred thousand emails per hour. However, the day after my benchmarking, when I was off work, my boss called to inquire if I had made any special changes to email sending, as the entire system was jammed at the final mail server.
Here’s what had happened:
- An automated Infrastructure as Code (IaC) tool overwrote my development server configuration, causing it to send emails to the actual relay instead of the mock server;
- The relay, recognized by the cloud provider, had its IP address changed a few days earlier;
- The whitelist on the cloud side hadn't been updated, and a throttling system blocked the final server;
- The operations team responsible for this configuration was unavailable to address the issue.
The Squadron Complement
Problems often cluster, complicating resolution efforts. These range from simultaneous issues exacerbating a situation to misleading issues that divert us from the real problem.
I can categorize these issues into two types:
- The Simple Additional Issue: This typically occurs at the worst possible moment, such as during another breakdown, adding work or slowing down repairs. For instance, in a current project of mine, for legacy reasons, certain characters entered into one application can crash another application, requiring a data cleanup. This issue arises roughly once every three or four months, usually triggered by end-user input. Notably, several occurrences have coincided with much more severe system breakdowns.
- The Deceitful Additional Issue: These issues, when combined with others, significantly complicate post-mortem analysis and can mislead the investigation. A recent example was an application bug in a Spring Batch job that remained hidden because of a connection issue with the state-storing database, itself caused by intermittent firewall outages.
The Camouflage Complement
We apply the ITIL framework's problem/incident dichotomy to classify issues: a problem can generate one or more incidents.
When an incident occurs, it's crucial to conduct a thorough analysis, carefully examining logs to determine whether it is merely a new incident of a known problem or an entirely new problem. We often identify incidents that appear similar to others, sometimes occurring on the same day and exhibiting comparable effects, yet stemming from different causes. This is particularly true when sloppy error-catching practices are in place, such as overly broad catch(Exception) blocks in Java, which can trap too many exceptions or, worse, obscure the root cause.
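Here is a minimal Java sketch of that trap (class and method names are hypothetical): the broad catch collapses unrelated failures into one generic log line, while the second version keeps the causes distinct.

```java
import java.sql.SQLException;
import java.util.logging.Level;
import java.util.logging.Logger;

public class PayloadProcessor {

    private static final Logger LOGGER = Logger.getLogger(PayloadProcessor.class.getName());

    // Anti-pattern: a broad catch hides whether the failure was a bad payload,
    // a lost database connection, or a programming error. Every incident looks the same.
    void processBroadCatch(String payload) {
        try {
            parseAndStore(payload);
        } catch (Exception e) {
            LOGGER.warning("Processing failed"); // root cause and stack trace are lost
        }
    }

    // Better: catch specific exceptions, keep the cause, and let unexpected ones surface.
    void processWithDistinctCauses(String payload) {
        try {
            parseAndStore(payload);
        } catch (IllegalArgumentException e) {
            LOGGER.log(Level.WARNING, "Invalid payload rejected", e);
        } catch (SQLException e) {
            LOGGER.log(Level.SEVERE, "Database unavailable while storing payload", e);
        }
    }

    private void parseAndStore(String payload) throws SQLException {
        // Parsing and persistence details are omitted in this sketch.
    }
}
```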
The Over-Accident Complement
Like chain reactions in traffic accidents, one incident in IT can lead to others, sometimes with more severe consequences.
I can recall at least three recent examples illustrating our challenges:
- Maintenance Page Caching Issue: Following a system failure, we activated a maintenance page, redirecting all API and frontend calls to it. Unfortunately, this page lacked proper cache configuration. Consequently, for the few users whose XHR calls happened exactly while the maintenance page was up, the page was cached in their browsers for the entire session. Even after maintenance ended and the web frontend resumed normal operation, their API calls kept retrieving the HTML maintenance page instead of the expected JSON response. (See the header sketch just after this list.)
- Debug Verbosity Issue: To debug data sent by external clients, we store payloads in a database. To keep the database at a reasonable size, we cap the size of stored payloads. During an issue with a partner organization, however, we temporarily raised that cap for analysis purposes. The change was then overlooked, leading to enormous database growth and nearly causing a complete application crash through disk space saturation.
- API Gateway Timeout Handling: Our API gateway was configured to replay POST calls that ended in timeouts caused by network or system issues. This setup inadvertently produced catastrophic duplicate transactions: the gateway reissued requests that had timed out, not realizing those transactions were still being processed and would eventually complete successfully. Robustness and data integrity requirements ended up in direct conflict. (See the idempotency sketch just after this list.)
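Two simplified sketches of mitigations for the first and third items above; both are illustrative, with hypothetical names, rather than our production code. For the maintenance page, explicitly marking the response as non-cacheable keeps browsers from holding on to it after the outage ends (this example assumes a Spring MVC stack):

```java
import org.springframework.http.CacheControl;
import org.springframework.http.MediaType;
import org.springframework.http.ResponseEntity;
import org.springframework.web.bind.annotation.RequestMapping;
import org.springframework.web.bind.annotation.RestController;

// Hypothetical controller serving the maintenance page.
@RestController
public class MaintenanceController {

    @RequestMapping("/maintenance")
    public ResponseEntity<String> maintenancePage() {
        // "no-store" prevents browsers from caching the maintenance HTML,
        // so API calls stop receiving it once the outage is over.
        return ResponseEntity.ok()
                .cacheControl(CacheControl.noStore())
                .contentType(MediaType.TEXT_HTML)
                .body("<html><body>Service temporarily unavailable</body></html>");
    }
}
```

For replayed POSTs, one common mitigation, which our gateway did not implement at the time, is an idempotency key: the client sends a unique key per business operation, and the server remembers the outcome so a replay returns the original result instead of creating a duplicate transaction.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Supplier;

// Minimal in-memory sketch; a real system would persist keys alongside the transaction.
public class IdempotentExecutor {

    private final Map<String, String> resultsByKey = new ConcurrentHashMap<>();

    // Runs the action at most once per idempotency key. A replay with the same
    // key blocks until the first attempt finishes, then receives the stored result.
    public String execute(String idempotencyKey, Supplier<String> action) {
        return resultsByKey.computeIfAbsent(idempotencyKey, key -> action.get());
    }
}
```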
The Heisenbug Complement
A 'heisenbug' is a type of software bug that seems to alter or vanish when one attempts to study it. This term humorously references the Heisenberg Uncertainty Principle in quantum mechanics, which posits that the more precisely a particle's position is determined, the less precisely its momentum can be known, and vice versa.
Heisenbugs commonly arise from race conditions under high loads or other factors that render the bug's behavior unpredictable and difficult to replicate in different conditions or when using debugging tools. Their elusive nature makes them particularly challenging to fix, as the process of debugging or introducing diagnostic code can change the execution environment, causing the bug to disappear.
I've encountered such issues in various scenarios. For instance, while using a profiler, I observed it inadvertently slowing down threads to such an extent that it hid the race conditions.
On another occasion, I demonstrated to a perplexed developer how easy it was to reproduce a race condition on non-thread-safe resources with just two or three threads running simultaneously, something he had been unable to replicate because his tests ran in a single-threaded environment.
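A minimal, self-contained sketch of that kind of demonstration (hypothetical code, not the application in question): a plain counter incremented from two threads loses updates, while a single thread never shows the problem.

```java
public class RaceConditionDemo {

    // Not thread-safe: '++' is a read-modify-write, so concurrent increments can be lost.
    private static int counter = 0;

    public static void main(String[] args) throws InterruptedException {
        Runnable task = () -> {
            for (int i = 0; i < 100_000; i++) {
                counter++;
            }
        };

        Thread t1 = new Thread(task);
        Thread t2 = new Thread(task);
        t1.start();
        t2.start();
        t1.join();
        t2.join();

        // A single thread would reliably print 200000; with two threads the total
        // is usually lower and varies from run to run.
        System.out.println("Counter = " + counter);
    }
}
```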
The UFO Issue Complement
A significant number of issues are neither fixed nor fully understood. I'm not referring to bugs that are understood but deemed too costly to fix in light of their severity or frequency. Rather, I'm talking about those perplexing issues whose occurrence is extremely rare, sometimes happening only once.
Occasionally, we half-jokingly attribute such cases to single-event upsets caused by cosmic particles.
For example, in our current application that generates and sends PDFs to end-users through various components, we encountered a peculiar issue a few months ago. A user reported, with a screenshot as evidence, a PDF where most characters appeared as gibberish symbols instead of letters. Despite thorough investigations, we were stumped and ultimately had to abandon our efforts to resolve it due to a complete lack of clues.
The Non-Existing Issue Complement
One particularly challenging type of issue arises when something seems wrong but there is, in fact, no bug at all. These non-existent bugs are the hardest to resolve! The false impression of a problem can come from various factors: looking in the wrong place (such as the wrong environment or server), misinterpreting functional requirements, or receiving incorrect inputs from end users or partner organizations.
For example, we recently had to address an issue in which our system rejected an uploaded image. The partner organization assured us the image should be accepted, claiming it was in PNG format. After closer examination (which took us several staff-days), we discovered that our system's rejection was justified: the file was not actually a PNG, as a simple signature check would have shown (see the sketch below).
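The check itself is cheap: every PNG starts with the same eight signature bytes, so a small utility like the following sketch (hypothetical and simplified) is enough to tell that a file merely renamed to .png is not a real PNG.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Arrays;

public class PngSignatureChecker {

    // The fixed 8-byte PNG signature: \x89 P N G \r \n \x1a \n
    private static final byte[] PNG_SIGNATURE = {
            (byte) 0x89, 0x50, 0x4E, 0x47, 0x0D, 0x0A, 0x1A, 0x0A
    };

    static boolean looksLikePng(Path file) throws IOException {
        byte[] header = new byte[PNG_SIGNATURE.length];
        try (InputStream in = Files.newInputStream(file)) {
            int read = in.readNBytes(header, 0, header.length);
            return read == header.length && Arrays.equals(header, PNG_SIGNATURE);
        }
    }
}
```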
The False Hope Complement
I often find Murphy's Law to be quite cruel. You spend many hours working on an issue, and everything seems to indicate that it is resolved, with the problem no longer reproducible. However, once the solution is deployed in production, the problem reoccurs. This is especially common with issues related to heavy loads or concurrency.
The Anti-Murphy's Reciprocal
In every organization I've worked for, I've noticed a peculiar phenomenon, which I'd call 'Anti-Murphy's Law'. During an application's development and early maintenance years, Murphy's Law seems to apply in full. After several more years, however, a contrary phenomenon emerges: even subpar software appears not only immune to Murphy's Law but more robust than expected. Many legacy applications run glitch-free for years, often with less monitoring and fewer robustness features, yet they still function effectively. The better an application's design, the sooner it reaches this state, but even poorly designed ones get there eventually.
I have only a few leads to explain this strange phenomenon:
- Over time, users become familiar with the software's weaknesses and learn to avoid them by not using certain features, waiting longer, or using the software during specific hours.
- Legacy applications are often so difficult to update that they experience very few regressions.
- Such applications rarely have their technical environment (like the OS or database) altered, to avoid complications.
- Eventually, everything that could go wrong has already occurred and been either fixed or worked around: it's as if Murphy's Law has given up.
However, don't misunderstand me: I'm not advocating for the retention of such applications. Despite appearing immune to issues, they are challenging to update and increasingly fail to meet end-user requirements over time. Concurrently, they become more vulnerable to security risks.
Conclusion
Rather than adopting a pessimistic view of Murphy's Law, we should be thankful for it. It drives engineers to enhance their craft, compelling them to devise a multitude of solutions to counteract potential issues. These solutions include robustness, high availability, fail-over systems, redundancy, replays, integrity checking systems, anti-fragility, backups and restores, observability, and comprehensive logging.
In conclusion, addressing a final query: can Murphy's Law turn against itself? A recent incident with a partner organization sheds light on this. They mistakenly sent us data, counting on a misconfiguration of their own API gateway to block the erroneous transmission. By sheer coincidence, however, the gateway had been fixed in the meantime, so the safety net they were relying on no longer existed. Thus, the answer appears to be a resounding NO.