
My Top 11 Integration Blueprints

[Illustration: an upside-down house]

IT integration involves configuring complex systems within large infrastructures to ensure all components work harmoniously. This challenging task requires a blend of coding skills and unique expertise. The following blueprints are derived from my pragmatic experience gained from the trenches of various large projects and apply to system integrators, DevOps engineers (with a focus on automation), and Site Reliability Engineers (SRE) who prioritize availability.

1. Less is Better

Distributed systems, such as microservice architectures, involve numerous configuration parameters, modules, and infrastructure components. Regular cleanups and refactoring are essential to avoid errors, cognitive fatigue, security risks, and wasted effort:

  • Obsolete parameters can override newer ones, causing unexpected issues.
  • Keeping obsolete and unmaintained modules online poses significant security risks.
  • Refactoring effort (renaming, documenting) is wasted when spent on parameters that should simply be removed; it should focus on active parameters only.

My advice:

  • Like a Boy Scout, leave the workspace cleaner than you found it by removing obsolete parameters.
  • Question every parameter addition: is it really necessary? Can its value be standardized across environments? If so, wouldn't it be simpler to hardcode the same value in every environment from testing to production (e.g., for a login or a database name)?
  • When adding a new integration artifact, immediately consider its future removal. For instance, document a removal date or condition against any new feature flag with a comment like 'remove this FF once the xyz feature is fully deployed'.
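
To keep such removal conditions actionable rather than forgotten, a small script can nag about overdue flags. Below is a minimal Python sketch; the flags.properties layout and the 'remove-by' comment convention are assumptions for illustration, not an existing tool.

  #!/usr/bin/env python3
  """Warn about feature flags whose documented removal date has passed."""
  import re
  import sys
  from datetime import date
  from pathlib import Path

  # Hypothetical convention: each flag is preceded by a comment such as
  #   # remove-by: 2025-06-30 once the xyz feature is fully deployed
  #   feature.xyz.enabled=true
  REMOVE_BY = re.compile(r"#\s*remove-by:\s*(\d{4}-\d{2}-\d{2})(.*)")

  def overdue_flags(path: Path):
      pending = None   # (date, note) from the comment preceding the next flag
      for line in path.read_text(encoding="utf-8").splitlines():
          match = REMOVE_BY.match(line.strip())
          if match:
              pending = (date.fromisoformat(match.group(1)), match.group(2).strip())
          elif pending and "=" in line and not line.lstrip().startswith("#"):
              flag = line.split("=", 1)[0].strip()
              removal_date, note = pending
              if removal_date < date.today():
                  yield flag, removal_date, note
              pending = None

  if __name__ == "__main__":
      overdue = list(overdue_flags(Path(sys.argv[1])))
      for flag, due, note in overdue:
          print(f"OVERDUE: {flag} should have been removed by {due} ({note})")
      sys.exit(1 if overdue else 0)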

2. Factorize

Managing multiple environments (testing, staging, UAT, performance, pre-production, production) requires effective parameter management.

My advice:

  • Consolidate parameters at various levels, such as globally or by environment type (once you have dropped the useless parameters that are identical in every environment, as stated previously). For instance, in my current organization, we have 3 to 5 UAT (User Acceptance Test) environments, but all of them share many parameter values (such as database tuning). We can set parameters globally (for all environments), for an environment kind (like UAT or performance tests), or for a specific environment (like UAT2 or UAT5). Note that, while handy, this layering comes with a minor drawback: it makes consolidated views harder to obtain.
  • Use or develop tools to get a flattened view of all parameters for an environment (a minimal sketch follows this list).
  • Try to find the right parameter 'grain' when dealing with composite parameters like URLs made of a protocol, a domain, and optional ports and context paths. Most of the time, it's better to split such parameters into several independent ones (like myservice.port and myservice.host instead of myservice.url). Parts that never change (like protocol) can be hard-coded to avoid useless parameter proliferation (see the 'less is better' principle before). This strategy allows better reuse and factorization of parameters.
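
As an illustration of the 'flattened view' tooling mentioned above, here is a minimal Python sketch. The file layout (params/global.yaml, params/uat.yaml, params/uat2.yaml) and the flat key naming (myservice.host, myservice.port) are assumptions; the idea is simply that the most specific layer wins and that the origin of each value stays visible.

  #!/usr/bin/env python3
  """Flattened view of layered parameters (minimal sketch)."""
  import sys
  import yaml  # pip install pyyaml

  def flatten(layer_files):
      """Merge layers given from the most generic to the most specific."""
      resolved = {}
      for layer in layer_files:
          with open(layer, encoding="utf-8") as handle:
              values = yaml.safe_load(handle) or {}
          for key, value in values.items():
              resolved[key] = (value, layer)   # later (more specific) layers win
      return resolved

  if __name__ == "__main__":
      # e.g.: flatten_params.py params/global.yaml params/uat.yaml params/uat2.yaml
      for key, (value, origin) in sorted(flatten(sys.argv[1:]).items()):
          print(f"{key} = {value}    (from {origin})")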

3. Enforce Production Parity

Testing environments should allow detecting integration issues as soon as possible. Like with code, integration issues cost exponentially more when discovered near the production stage. Common integration issues include encoding of content or filenames between Windows and Unix systems, tuning differences among component versions (such as a database or application server), timezone issues, libraries or kernel dependencies, permissions issues linked with container isolation, and system tuning like memory swapping.

My advice:

  • In testing environments, avoid using the same value for different parameters (like user=foo, pwd=foo) because doing so prevents detecting errors in code such as an erroneous copy-paste (user=conf('user'); pwd=conf('user') instead of pwd=conf('pwd')).
  • If your production environment uses a vault to store secrets, set up at least one testing environment to do so as well.
  • Know when to use the root user in IaC code or operational scripts. Avoid running processes as root for security reasons; the same applies to simple scripts, which can use regular service accounts. Most of the time, root should only be needed to deploy binaries and application configuration, not at runtime; this also applies to containers. Running as root can also create issues by generating files that other processes can't read or update.
  • First deploy to an iso-production environment, i.e., a pre-production environment as similar to production as possible (including network and security equipment). Think not only about the servers but also about the infrastructure as a whole (proxies, routers, firewalls, API gateways).
  • Always perform at least minimal benchmarks on sensitive new features in an iso-production environment (pre-production is a good match for this purpose).

4. Ratchet Quality

It is common to see problems occurring repeatedly. It is crucial to set up continuous improvement throughout the process.

My advice:

  • Automate sanity checks, for instance, using Git server hooks (a minimal hook sketch follows this list).
  • Detect configuration changes in code by using tools like GitLab approvals. They make it mandatory for developers to inform integrators.
  • When automation is not possible, write down a procedure every time you have to do something more than once. Ensure the procedure is documented as soon as possible so it can be updated each time a new step is added or removed. This will prevent steps from being forgotten. Good procedure candidates include deployment in production or pre/post steps of a starting/ending Scrum iteration.
  • After each deployment, make sure to write a report identifying and suggesting improvements. In such a deployment report, we include for each issue a flag indicating whether the issue was detected before or after deployment in production and a list of mitigation actions and linked tickets to track each action. This report can be used to gather and analyze data, as well as to visualize how the quality evolves over time and thus assess the risk of each new deployment.
  • The dev team sometimes asks to remove some controls at critical moments (for instance, temporarily removing a code coverage quality check in an extreme rush). If everyone agrees, it is okay to disable such systems, but always put in place a way to restore the initial state (by creating a task on a Kanban board, for instance) so that re-enabling the control is not forgotten.
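
As an example of the kind of sanity check a Git server hook can enforce, here is a minimal pre-receive hook sketched in Python. It is only an illustration (not one of our actual hooks): it rejects pushes that leave duplicate keys in a .properties file, since an obsolete key silently overriding a newer one is a classic integration trap (see the 'less is better' section).

  #!/usr/bin/env python3
  """Pre-receive hook sketch: reject pushes leaving duplicate keys in *.properties files."""
  import subprocess
  import sys

  def git(*args):
      return subprocess.run(["git", *args], check=True,
                            capture_output=True, text=True).stdout

  def duplicate_keys(content):
      seen, dupes = set(), set()
      for line in content.splitlines():
          line = line.strip()
          if not line or line.startswith(("#", "!")) or "=" not in line:
              continue   # skip blanks and comments (':' separators ignored in this sketch)
          key = line.split("=", 1)[0].strip()
          if key in seen:
              dupes.add(key)
          seen.add(key)
      return dupes

  errors = []
  for old, new, ref in (line.split() for line in sys.stdin):
      if set(new) == {"0"}:       # branch deletion: nothing to check
          continue
      if set(old) == {"0"}:       # new branch: inspect every file it carries
          files = git("ls-tree", "-r", "--name-only", new).splitlines()
      else:                       # existing branch: only changed (non-deleted) files
          files = git("diff", "--name-only", "--diff-filter=d", old, new).splitlines()
      for path in files:
          if path.endswith(".properties"):
              dupes = duplicate_keys(git("show", f"{new}:{path}"))
              if dupes:
                  errors.append(f"{ref}: {path} has duplicate keys: {', '.join(sorted(dupes))}")

  if errors:
      print("\n".join(errors), file=sys.stderr)
      sys.exit(1)   # a non-zero exit makes the server reject the push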

5. Doing 'On Rails'

In software design, 'on rails' principles (convention over configuration) advocate minimizing the amount of configuration by relying on conventions in code. This is an excellent principle, and its corollary is that we should apply it to the remaining configuration as well. Try to design and enforce conventions whenever possible.

Enforcing the same parameter names and structures makes refactoring easier (it is then straightforward to perform global replacements) and allows for effective tooling (it is simple to write tools using regular expressions to extract values, for instance).

My advice:

  • When applicable, always include the unit in the parameter name itself (and not only in the documentation). This makes the name 'scream' and avoids errors.

Example: Not long ago, we had serious availability issues because we discovered too late that a timeout was expressed in seconds, not milliseconds. This allowed a poorly written transaction to hold a database connection much longer than expected, leading to connection exhaustion and eventually to unavailability of the application. If the parameter had been called timeoutSecs instead of timeout, this costly error would probably have been avoided.

  • When possible, use namespacing like system1_database_name, system1_database_user, etc.
  • Make parameter names self-explanatory (this can be called "screaming" integration naming). If you don't know how to name a parameter, use the rubber duck method: explain its purpose aloud and use the answer as the name. Don't hesitate to use long names when required. The main requirement is clarity and lack of ambiguity.
  • When possible, use strong typing (such as an integer instead of a string for a number).
  • Follow the least astonishment principle. For instance, if all numbers are expressed as integers, don't use a Long for only one of them without reason.

Note: the least astonishment principle applies to the existing codebase but also (and more importantly) to state-of-the-art conventions outside the organization. This should improve the onboarding time for new members.

  • Apply a common scheme for parameter documentation. For this purpose, we use JSON-formatted comments in our .properties or .yaml files, including role, default value, etc. The JSON format allows generating consolidated web pages and validating the completeness of comments using a JSON Schema. For instance:
  {"desc": "Max heap value of the JVM in GiB", "type": "integer", "min": 1, "max": 5, "default": 2}

6. Improve Monitoring

Monitoring is challenging. It must not only alert when issues occur but also serve as a proactive tool to detect problems before the users do.

My advice:

  • For on-premise systems: use both local monitoring systems, which may be affected by the same network or system problems as the observed system, and external uptime-check tools that, being outside the LAN, can detect issues like high latency or Internet outages. Configure fast alert channels like pagers or phone messages so you don't need to check your emails to become aware of an incident.
  • In post-mortems, note whether the issue was detected by users contacting support or proactively by our monitoring systems. Collect statistics to determine whether the tools are effective.
  • Fix or ignore any false positive alerts ASAP. In my experience, any false positive occurring more than once a day makes the whole alerting system useless because it becomes untrusted. If you temporarily silence an alert, make sure to track a task to re-enable it as soon as possible.
  • Configure correct thresholds for alerts (like 10% remaining free disk space), but also alert on unexpectedly low activity by raising an alert when nothing happens during a given time window (we call these 'nologs' probes; a minimal sketch follows this list), as silence may hide a technical issue.
  • Perform regular exploratory checks even on apparently healthy systems. Review system-level indicators (memory usage, CPU peaks) and deviations in business indicators. Even with adequate monitoring, slow memory leaks, for instance, can be hidden by periodic reboots. Sudden surges or drops in business indicators can be caused by undetected technical issues. Also check the logs for rare stack traces and try to identify their root cause; they often hide complex issues like concurrency bugs.
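
Here is a minimal sketch of the 'nologs' probe idea mentioned above. It simply checks how long a local log file has been silent; a real setup would more likely query the central log platform, and the threshold below is an arbitrary example.

  #!/usr/bin/env python3
  """'nologs' probe sketch: alert when a log file has been silent for too long."""
  import sys
  import time
  from pathlib import Path

  MAX_SILENCE_SECONDS = 15 * 60   # arbitrary example threshold: tune per log

  def silence(log_file: Path) -> float:
      """Seconds elapsed since the file was last appended to."""
      return time.time() - log_file.stat().st_mtime

  if __name__ == "__main__":
      log = Path(sys.argv[1])
      quiet_for = silence(log)
      if quiet_for > MAX_SILENCE_SECONDS:
          # Exit code 2 maps to CRITICAL in Nagios-style checks.
          print(f"CRITICAL: nothing written to {log} for {quiet_for / 60:.0f} minutes")
          sys.exit(2)
      print(f"OK: last entry in {log} was {quiet_for:.0f}s ago")
      sys.exit(0)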

7. Apply a Strict and Comprehensive Versioning Scheme

Comprehensive versioning is paramount in complex systems. It allows module traceability and ensures that the correct code is running. Semantic versioning provides additional important metadata about module versions.

My advice:

  • Everything should be versioned automatically as soon as it forms a module or an independent item. This includes modules (an API, a GUI, or a batch mainly) but also independent sets of DDL-SQL statements or a JSON Schema, for instance.
  • Never reuse an existing version once it has been merged into the main branch (in most workflows, this is when the Merge/Pull Request has been merged). Developers often discover a problem just after merging and may be tempted (if the system allows it) to retag a branch with the same version, or to do the same with a set of SQL statements. This invariably leads to issues due to collisions. Once a version has been used, never reuse it; increment it instead, even if that may seem overkill.
  • Always use the semantic versioning scheme for libraries and modules.
  • For applications based on numerous microservices made of many modules, we often still need a global 'logical' version encompassing all constituting modules. This kind of version is not linked with the code versioning itself like we do with independent modules but with end-user deployment visibility and integration. We use this scheme: [MAJOR].[MINOR].[PATCH] with:
    • MAJOR incremented when a new significant feature is actually deployed in production to end users. Note that we increment this even when canarying, but other strategies are possible.
    • MINOR incremented when at least a single module's new version has been deployed in production.
    • PATCH incremented when no new module version has been released, but at least a single commit has been done and deployed in production in IaC code (like a simple configuration change).
  • Expose a version endpoint (like /version) in each API. We developed simple curl-based scripts that download the expected version of each of our ~40 modules from a centralized Git repository (in a text-based AsciiDoc format) and compare that list with the actually deployed modules by calling their version endpoints.
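
A minimal sketch of that comparison follows. It assumes a simplified expected-versions file ('module version-endpoint-URL expected-version', one per line) rather than the AsciiDoc list we actually use, and a /version endpoint returning the version as plain text.

  #!/usr/bin/env python3
  """Compare expected module versions against the actually deployed ones."""
  import sys
  import urllib.request

  def deployed_version(url: str) -> str:
      with urllib.request.urlopen(url, timeout=5) as response:
          return response.read().decode("utf-8").strip()

  def compare(expected_file: str) -> int:
      mismatches = 0
      for line in open(expected_file, encoding="utf-8"):
          if not line.strip() or line.startswith("#"):
              continue
          module, url, expected = line.split()
          try:
              actual = deployed_version(url)
          except OSError as error:
              print(f"KO {module}: version endpoint unreachable ({error})")
              mismatches += 1
              continue
          if actual == expected:
              print(f"OK {module} {actual}")
          else:
              print(f"KO {module}: expected {expected}, deployed {actual}")
              mismatches += 1
      return mismatches

  if __name__ == "__main__":
      # Expected file format (assumed): "<module> <version-endpoint-url> <expected-version>"
      sys.exit(1 if compare(sys.argv[1]) else 0)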

8. Automate Toils

Automating toils increases developers' and integrators' productivity, avoids human errors, and allows them to focus on what's really important.

My advice:

  • Start with long and error-prone procedures. Our best example is the end-of-sprint tagging of our ~40 modules: we wrote a shell script to handle Maven and NPM version bumping, Git tagging, and GitLab pipeline orchestration via API calls (a simplified sketch follows this list). This saved a lead tech an entire day every 3-week sprint and eliminated most errors.
  • Don't automate if there is no pain. Even though automation comes with the benefits stated above, like any code it must be maintained, and you don't want to replace one chore with another.
  • Start small. As they grow, integration scripts often benefit from being rewritten from shell into more advanced scripting languages like Python, Go, or Groovy.
  • Don't overdesign; sometimes manual or low-tech semi-manual solutions can offer a great ROI.
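
For illustration only (this is not our actual script, and the GitLab pipeline orchestration part is left out), a simplified sketch of such end-of-sprint tagging could look like the following, assuming each module lives in its own Git repository and is built with either Maven or NPM.

  #!/usr/bin/env python3
  """Simplified end-of-sprint tagging sketch: bump each module's version and push a tag."""
  import subprocess
  import sys
  from pathlib import Path

  def run(cmd, cwd):
      print(f"[{cwd}] {' '.join(cmd)}")
      subprocess.run(cmd, cwd=cwd, check=True)

  def tag_module(path: str, version: str):
      if (Path(path) / "pom.xml").exists():
          run(["mvn", "versions:set", f"-DnewVersion={version}",
               "-DgenerateBackupPoms=false"], cwd=path)
      elif (Path(path) / "package.json").exists():
          run(["npm", "version", version, "--no-git-tag-version"], cwd=path)
      run(["git", "commit", "-am", f"Release {version}"], cwd=path)
      run(["git", "tag", "-a", f"v{version}", "-m", f"Sprint release {version}"], cwd=path)
      run(["git", "push", "--follow-tags"], cwd=path)

  if __name__ == "__main__":
      # e.g.: tag_sprint.py 7.3.0 ../billing-api ../billing-gui
      version, *modules = sys.argv[1:]
      for module in modules:
          tag_module(module, version)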

9. Learn How to Deal with Incidents and Problems

Finding and fixing production issues is a large part of an SRE's daily work. It leverages rigorous procedures, training, and team cooperation.

My advice:

  • Write down, exercise, and continuously improve your incident procedure so everybody knows exactly what to do in case of an incident.
  • Once it seems that the problem is not caused solely by the infrastructure, involve the dev team ASAP to accelerate issue identification and fixing.
  • When investigating, change only a single parameter at a time and test again. Don't try to chase two rabbits at once.
  • Always begin by reproducing the problem (when possible); nothing is harder to fix than an issue you cannot reproduce.
  • Write down a post-mortem after significant incidents or problems. A simple tracking ticket can be enough most of the time, but it must be created systematically for each occurrence, even if it will probably be closed as a duplicate and regardless of the supposed severity. Document any log, monitoring chart, hypothesis, etc., picked up during the analysis. Don't spend hours reading logs before centralizing into the ticket; write them down immediately like police search evidence. Logs, especially on complex microservices systems, are hard to collect and curate.
  • Tag issues as incidents or problems once you find the root cause or figure out that it is a duplicate. An incident becomes a problem if something can be done about it, and countermeasure tasks should then be created and followed up.
  • Link issues with one another when pertinent.
  • Be stubborn when identifying root causes. Trace the work done in the tracking tickets. Only give up when there are no more hypotheses to test or leads to follow. It is often a good idea to leave a difficult issue for a week or two and reopen the case with new facts, ideas, or points of view.
  • Don't ignore weak signals but always open a ticket, even if you have no time to investigate. It can be used as a base for further investigations if the issue returns.

10. Invest in Documentation

Documentation should not be considered a complementary task but embedded into most integrator actions: post-mortems, tools code documentation, tools manuals, troubleshooting, procedures, etc.

My advice:

  • Communicate troubleshooting via documentation for the dev team, other integrators, or even yourself in a few months when you will have totally forgotten the issue. Always paste stack traces and/or error messages to allow quick searching and identification of issues and - if any - workarounds.
  • How to know if you should document something? My litmus test is: "Can someone external to the very subject guess it by themselves?" If the answer is 'no', document it.
  • As stated in the previous sections, don't over-document self-explanatory and on-rails things. Document all that needs to be documented BUT ONLY what should be documented. Don't forget that useful documentation requires a lot of work and maintenance.
  • As stated before in the 'less is better' section, drop or upgrade any obsolete document ASAP. Most people ignore documentation that failed them several times. If your document is not fully updated, the whole thing will be thrown out.
  • Include living documentation in scripts or applications as much as possible (for instance, by coding a usage() function in scripts that displays all available arguments) and leverage this documentation to guide the user, e.g., by suggesting a workaround when they make a common error (see the sketch after this list).
  • Avoid word processor formats; use plain-text formats like Markdown or, even better, AsciiDoc.
  • Don't ignore spelling and grammar. Due to the broken windows theory, many people will not take your documentation seriously if it contains spelling errors. Write simple sentences that leave no room for confusion or misunderstanding.
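
As an example of living documentation embedded in a script, here is a hypothetical deploy helper sketched in Python: argparse keeps the usage message in sync with the actual arguments, and a common error triggers a suggested workaround instead of a bare stack trace (the generate_conf.py helper named in the hint is made up for the example).

  #!/usr/bin/env python3
  """Living documentation sketch: the script carries its own usage and guides the user on a common error."""
  import argparse
  import sys
  from pathlib import Path

  parser = argparse.ArgumentParser(
      description="Deploy a module configuration to one environment.",
      epilog="Example: deploy_conf.py --env uat2 --conf billing.properties")
  parser.add_argument("--env", required=True,
                      help="target environment, e.g. uat2, preprod, prod")
  parser.add_argument("--conf", required=True, type=Path,
                      help="path to the .properties file to deploy")
  args = parser.parse_args()   # --help prints an always-up-to-date usage message

  if not args.conf.exists():
      # Guide the user instead of failing with a bare stack trace.
      sys.exit(f"{args.conf} not found. Common cause: the file is generated "
               f"by 'generate_conf.py --env {args.env}'; run it first.")

  print(f"Deploying {args.conf} to {args.env}...")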

11. Leverage Communication

Though mainly technical, integration requires a lot of soft skills. Communication with the development team and the operations team is paramount.

My advice:

  • Like anyone working actively on complex systems, integrators can and will make mistakes (like setting an incoherent parameter value). The organization should work in a non-blaming environment so the integrators can be transparent and report issues ASAP.
  • Don't be afraid to ask for help from others.
  • In areas shared with the development teams, like new parameters, I strongly advise setting up operational meetings before each release. To avoid missing important integration-related changes (like a parameter renaming), we leverage Merge Request approvals on sensitive parts of the code (like a .properties file listing all parameters of a module).
  • Don't forget that trust doesn't exclude control. Always double-check important facts.
  • As stated in the previous section, communication comes with documentation. Documentation is a way to transfer important information across time and space to others and often to yourself in the future.

Conclusion

Integration in IT is a complex but vital task requiring a mix of technical and soft skills. By following these blueprints, integrators, DevOps engineers, and SREs can ensure smoother operations, better communication, and more effective problem-solving. Adopting these practices not only improves system reliability but also fosters a culture of continuous improvement and collaboration.

Additional resources