Make Your Jobs More Robust with Automatic Safety Switches

Bertrand Florat - Aug 28, 2023

This article has also been published at DZone.

In this article, I'll refer to a 'job' as a batch processing program, as defined in JSR 352. A job can be written in any language but is scheduled periodically to automatically process bulk data, in contrast to interactive processing (CLI or GUI) for end-users. Error handling in jobs differs significantly from interactive processing. For instance, in the latter case, backend calls might not be retried as a human can respond to errors, while jobs need robust error recovery due to their automated nature. Moreover, jobs often possess higher privileges and can potentially damage extensive data.

Consider a scenario: What if a job fails due to a backend or dependency component issue? If a job is scheduled hourly and faces a major downtime just minutes before execution, what should be done?

Based on my experience with various large projects, implementing automatic safety switches for handling technical errors is a best practice.

Enhancing Failure Handling with Automatic Safety Switches

When a technical error occurs (e.g., timeout, storage shortage, database failure), the job should attempt several retries (as per the best practices outlined below) and, if the error persists, halt immediately at the current processing step. It's advisable to record the current step position, allowing for intelligent restarts once the system is operational again.

Only human intervention, after thorough analysis and resolution, should reset the switch. While in a disabled state, any attempt to schedule the job should log that it's inactive and cannot initiate. This is also the opportune moment to create a post-mortem report, valuable for future failure analysis and potential adjustments to code or configuration for improved robustness (e.g., adjusting timeouts, adding retries, or enhancing input controls).

The switch can then be removed, enabling the job to start again or complete outstanding steps (if supported) during the next scheduled run. Alternatively, immediate execution can be forced to avoid prolonging the downtime, especially if the job runs infrequently. Delaying a job's execution excessively causes latency for end users, and the accumulated backlog can eventually exceed the job's capacity.
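
The lifecycle described above (halt on a technical error, record the current step, refuse to start while disabled, manual reset) can be sketched as follows. This is only an illustration, not a reference implementation: the names SafetySwitch, TechnicalError and run_job are hypothetical, the switch is assumed to be stored as a small JSON file per job, and retries are deliberately left out.

```python
import json
import logging
from datetime import datetime, timezone
from pathlib import Path

log = logging.getLogger("jobs")


class TechnicalError(Exception):
    """Hypothetical marker for timeouts, full disks, database outages, etc."""


class SafetySwitch:
    """Persistent per-job stop flag, assumed here to be a small JSON file."""

    def __init__(self, job_name: str, directory: Path):
        self.path = directory / f"{job_name}.stop"

    def is_tripped(self) -> bool:
        return self.path.exists()

    def trip(self, reason: str, step: int) -> None:
        # Record why and where the job stopped, for the post-mortem and restart.
        payload = {
            "reason": reason,
            "step": step,
            "date": datetime.now(timezone.utc).isoformat(),
        }
        self.path.write_text(json.dumps(payload))

    def reset(self) -> dict:
        # Meant to be called by a human operator only, after analysis.
        payload = json.loads(self.path.read_text())
        self.path.unlink()
        return payload


def run_job(switch: SafetySwitch, steps, start_at: int = 0) -> str:
    """Run the steps in order; return 'skipped', 'halted' or 'completed'."""
    if switch.is_tripped():
        log.warning("Job is disabled by its safety switch: %s", switch.path)
        return "skipped"
    for index in range(start_at, len(steps)):
        try:
            steps[index]()
        except TechnicalError as exc:
            # Halt at the current step and leave the switch set.
            switch.trip(reason=str(exc), step=index)
            log.error("Technical error at step %d, job halted: %s", index, exc)
            return "halted"
    return "completed"
```

After analysis, the operator calls reset() (or simply deletes the file) and can pass the recorded step as start_at to resume where the job stopped, provided the steps are safe to re-run.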

Rationale for Automatic Safety Switches

Prevention of Data Corruption: They can avert significant data corruption resulting from bugs by halting activity during unexpected states.
Error Log Management: They help prevent system flooding with repetitive error logs (such as database access error stack traces). Uncontrolled log volumes might also exacerbate issues like filesystems filling.
Facilitating System Repair: A system without an automatic safety switch significantly complicates the diagnostic and fixing process. Human operators cannot make decisions with clarity since the system remains enabled and could potentially jam again as soon as it's scheduled.
Resource Exhaustion Mitigation: Continuing periodic jobs during technical errors caused by resource exhaustion (memory, CPU, storage, network bandwidth, etc.) worsens the situation. Automatic safety switches act as circuit breakers, stopping jobs and freeing up resources. After resolving the root problem, operators can restart jobs sequentially and securely.
Security Enhancement: Many attacks, including brute force attacks, SQL injections, or Server-Side Includes (SSI) injections, involve injecting malicious data into a system. Such data might be processed later by jobs, potentially triggering technical errors. Stopping the job improves security by forcing human or team analysis of the data. Similarly, halting a job after a timeout can help foil a resource exhaustion-type attack, such as a ReDoS (Regular Expression Denial of Service).
Promoting System Analysis: Organizations that overlook job robustness often let failed jobs run again at subsequent schedules, which is a risky approach. Because automatic safety switches necessitate human intervention, every failure gets noticed. This encourages systematic analysis, post-mortem documentation, and long-term improvements.
Preventing Excessive Costs: Implementing a throttling mechanism that pauses operations upon hitting predetermined thresholds, along with an automated safety feature that requires analysis, can protect organizations from incurring significant additional costs due to bugs or intentional attacks when interacting with external systems that incur charges.
Code Reuse: Besides emergency handling, the code written for this purpose can be repurposed to disable a job without altering the scheduling. This is similar to the suspend: true field of Kubernetes CronJobs. In a recent project, we used this functionality to conveniently start job maintenance: the maintenance script sets the stop flag, then waits for all running jobs to complete.

Implementing Effective Safety Switches

Simple Implementation: The most straightforward approach involves each job checking for a persistent stop flag every time it is scheduled. If the flag is present, the job logs the fact and exits. The flag can be implemented, for example, through a file, a database record, or a REST API result. For robustness, a stop file per job is preferable, containing metadata like the reason for stopping and the date. This flag is set on technical errors and removed only on a human operator's initiative (with a command like rm, or a more advanced method such as a shell script).
Coupling with Retrying Mechanism: Safety switches must work alongside a robust retry solution. Jobs shouldn't halt and require human intervention at the first sign of intermittent issues like database connection saturation or occasional timeouts due to backups slowing down the SAN. Effective retry solutions, such as the Spring Retry library, support exponential backoff with jitter. For instance, setting 10 tries, including the initial call, results in retries spaced exponentially apart (1-second interval, then 2 seconds, and so on). With a multiplier of 2, the nine waits alone add up to 511 seconds (about 8.5 minutes); once jitter and the duration of each failing call are added, the entire process spans roughly 10 to 15 minutes before failing if the root cause isn't resolved within that timeframe. Jitter introduces small random intervals to avoid retry storms where all jobs retry simultaneously.
Ensure Exclusive Job Launches: As with any batch processing solution, guarantee that executions of a given job are mutually exclusive, so that a new instance isn't launched while a previous one is still running.
Business Error Handling: Business errors (e.g., poorly formatted data) shouldn't trigger safety switches, unless the code lacks defensive measures and unexpected errors arise. In such cases, it's a code bug and qualifies as a technical error, warranting the safety switch trigger and requiring hotfix deployment or data correction.
Facilitate Smooth Restarts: When possible, allow seamless restarts using batch checkpoints, storing the current step, processing data context, or even the presently processed item.
Monitoring and Alerting: Ensure that monitoring and alerting systems are aware of job stoppage triggered by automatic safety switches. For example, email alerts could be sent or jobs could be highlighted in red within a monitoring system.
Semi-automatic Restarts: While we always advocate for thorough system analysis during production issues, there are moments when having jobs halted for human intervention isn't practical, especially during weekends. A middle-ground solution between routine automatic job restarts and a complete halt is to authorize an automatic restart after a predetermined period. In our case, we set up a mechanism that removes the stop flag after 8 hours. This allows the job to try restarting if no human intervention has addressed the issue by then. This approach keeps some benefits of an automatic safety switch, such as limiting data corruption or log overflow, but it also has drawbacks: it can bypass the systematic analysis and the continuous improvement that results from it. Hence, we believe this solution should be used judiciously.
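
The coupling between retries and the safety switch described in the list above can be sketched as follows. This is a minimal illustration under stated assumptions, not the Spring Retry API: call_with_retry, backoff_delays, TechnicalError and the on_exhausted callback are hypothetical names, the multiplier is assumed to be 2, and jitter is assumed to stretch each wait by up to 10%.

```python
import random
import time


class TechnicalError(Exception):
    """Hypothetical marker for transient failures such as timeouts."""


def backoff_delays(max_attempts=10, base=1.0, factor=2.0):
    """Waits between attempts, before jitter: 1 s, 2 s, 4 s, ..."""
    return [base * factor ** i for i in range(max_attempts - 1)]


def call_with_retry(operation, on_exhausted, max_attempts=10, base=1.0,
                    factor=2.0, jitter=0.1, sleep=time.sleep):
    """Run operation, retrying technical errors with exponential backoff.

    on_exhausted is called once with the last error when every attempt has
    failed; this is where the safety switch would be set.
    """
    delays = backoff_delays(max_attempts, base, factor)
    for attempt in range(max_attempts):
        try:
            return operation()
        except TechnicalError as exc:
            if attempt == max_attempts - 1:
                on_exhausted(exc)
                raise
            # Jitter: stretch each wait by a random 0-10% so that jobs
            # failing together do not all retry at the same instant.
            sleep(delays[attempt] * (1 + random.uniform(0, jitter)))
```

With these defaults, the nine waits add up to 511 seconds before jitter; the time spent in the failing calls themselves comes on top of that. Only TechnicalError is retried here, so any other exception propagates immediately.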

Conclusion

Automatic safety switches prove invaluable in handling unexpected technical errors. They significantly reduce the risk of data corruption, empower operators to address issues thoughtfully, and foster a culture of post-mortems and robustness improvements. However, their effectiveness hinges on not being overly sensitive, as excessive interventions can burden operators. Thus, coupling these switches with well-designed retry mechanisms is crucial.