
Hi! This is my personal page and blog. Here you will find articles, projects I'm involved in, and a few thoughts (mainly about IT).

I design, code and integrate large IT projects. I like to work in agile environments to bring as much value as possible to my customers while dealing with budgets and timelines. My main current positions are software and infrastructure architect on one side of the coin and DevOps engineer on the other.

Latest technical articles:

May 24, 2021 - A first glimpse of production constraints for developers

(This article has also been published on DZone)

Sahara and sandbox

In most organizations, developers are not allowed to access the production environment for stability, security, or regulatory reasons. Restricting access to production is a good practice (enforced by many frameworks like COBIT or ITIL), but a major drawback is the mental distance it creates between developers and the real world. Likewise, monitoring is usually managed only by operators, and very little feedback is provided to developers except when they have to fix application bugs (ASAP, of course). As a matter of fact, most developers have very little idea of what a real production environment looks like and, more importantly, of the non-functional requirements that make code production-proof.

Involving developers into resolving production issues is a good thing for two main reasons:

  • It is highly motivating to get tangible evidence of a real system running on a large infrastructure (data centers, clusters, SAN...) and to get insights about performance or business facts about their applications (number of transactions per second, number of concurrent users, and so on). It is also very common for developers to feel overwhelmed when an outage occurs because they are so rarely called in directly.
  • It may substantially improve the quality of the delivered code by helping developers properly design operational aspects like logs, monitoring, performance, and integration.

So, What Do Developers Most Often Misunderstand About Production?

Concurrency Is Omnipresent

Production is highly concurrent while development is mostly single-threaded. Concurrency can happen among threads of a process (of an application server, for instance) but also among different processes running locally or on other machines (e.g., among n instances of an application server running across different nodes). This concurrency can generate various issues like starvation (slowdowns when waiting concurrently for a shared resource), deadlocks, or scope issues (data overriding).

What can I do?

  • Perform minimal stress tests on the DEV environment or even on your own machine using injectors like JMeter or Gatling. When using frameworks like Spring, make sure to understand and correctly apply scoping best practices (for instance, don't use a stateful singleton; see the sketch after this list).
  • Simulate concurrency using break-points or sleeps in your code and check the context of each stalled thread.
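
To make the scoping point above concrete, here is a minimal sketch (the class and field names are hypothetical) of a stateful Spring singleton: it behaves correctly in single-threaded development but mixes up data as soon as two requests run concurrently in production.

import org.springframework.stereotype.Service;

// Spring beans are singletons by default: a single instance is shared by all request threads.
@Service
public class QuoteService {

    // BAD: mutable state in a singleton, silently shared by every concurrent request
    private String currentCustomerId;

    public double quoteWithSharedState(String customerId) {
        this.currentCustomerId = customerId;         // thread A writes its customer id...
        // ...thread B may overwrite currentCustomerId right here...
        return computePrice(this.currentCustomerId); // ...and thread A prices the wrong customer
    }

    // GOOD: keep the state on the stack (parameters and local variables only)
    public double quote(String customerId) {
        return computePrice(customerId);
    }

    private double computePrice(String customerId) {
        return 42.0; // placeholder pricing logic
    }
}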

Volume Is Huge: You Must Add Limits

In production, everything is XXXL (number of log lines written, number of RPC calls, number of database queries...). This has major implications on performance but also on operability. For instance, writing an Entering function x/Leaving function x type of log could help in development but can flood the Ops team with GiB of logs in production. Likewise, when dealing with monitoring, make sure to make alerts usable. If your application generates tens of alerts every day, nobody will notice them anymore after a few days.

Keep in mind this metaphor: If your DEV environment is a sandbox, production is the Sahara

In production, real users or external systems will massively stress your application. If (for instance) you don't set a maximum size for attachment files, you will soon run into network and storage issues (as well as CPU and memory issues as collateral damage). Many limits can be set at the infrastructure level (like circuit breakers in API gateways) but most of them have to be coded into the application itself.

What can I do?

  • Make sure nothing is 'open bar': always paginate results from databases (using OFFSET and LIMIT or the seek method, for instance; see the JDBC sketch after this list), restrict input data sizes, set timeouts on any remote call, ...

  • Think carefully about logs. Perform operability acceptance tests with real operators and Site Reliability Engineers (SRE).
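
As an illustration of these limits, here is a minimal JDBC sketch combining seek-method pagination with a query timeout. The orders table and the LIMIT syntax (PostgreSQL/MySQL style) are assumptions, and resource handling is left out for brevity.

import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;

public final class OrderDao {

    private static final int PAGE_SIZE = 100;     // never return an unbounded result set
    private static final int QUERY_TIMEOUT_S = 5; // never wait forever on the database

    // Seek-method pagination: lastSeenId is the highest id returned by the previous page
    public ResultSet nextPage(Connection cnx, long lastSeenId) throws SQLException {
        PreparedStatement ps = cnx.prepareStatement(
                "SELECT id, label FROM orders WHERE id > ? ORDER BY id LIMIT ?");
        ps.setLong(1, lastSeenId);
        ps.setInt(2, PAGE_SIZE);
        ps.setQueryTimeout(QUERY_TIMEOUT_S); // JDBC-level timeout, in seconds
        return ps.executeQuery();
    }
}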

Production Is Distributed and Redundant

While in DEV most components (like an application server and a database) run inside the same node, in production they are usually distributed (i.e., some network link exists between them). The network is very slow in comparison with local memory (at scale, if a local CPU instruction takes one second, a LAN network call takes a full year).

In DEV, the instantiation factor is 1: every component is instantiated only once. In any production environment with serious high availability, performance, or fail-over requirements, every component is redundant: there are no longer single servers but clusters.

What can I do?

  • Don't hardcode URLs or make assumptions about the colocation of components (I have already seen code where the localhost hostname was hardcoded).
  • If possible, reduce the dev/prod gap by running a locally distributed system on your own workstation, like a local Kubernetes cluster (see K3S for instance).
  • Even if this kind of issue should be detected in the integration testing environment, keep in mind that your code will eventually run concurrently on several threads and even on several nodes. Among other considerations, this has implications on the tuning of the number of data source connections (see the HikariCP sketch after this list).
  • Always favor stateless architectures.
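
To illustrate the data source tuning point above, here is a minimal sketch using HikariCP; the connection budget, node count, and JDBC URL are hypothetical. The key idea is that every instance opens its own pool, so the per-instance pool size must be divided by the number of nodes sharing the same database.

import com.zaxxer.hikari.HikariConfig;
import com.zaxxer.hikari.HikariDataSource;

public final class PoolFactory {

    // Example: the DBA grants the application 100 connections in total.
    // With 10 production nodes, each local pool must stay under 10 connections,
    // even if a single DEV instance could comfortably use all 100 on its own.
    public static HikariDataSource createPool(int nodeCount, int dbMaxConnections) {
        HikariConfig config = new HikariConfig();
        config.setJdbcUrl("jdbc:postgresql://db.example.org:5432/app"); // hypothetical URL
        config.setMaximumPoolSize(dbMaxConnections / nodeCount);
        config.setConnectionTimeout(2_000); // fail fast (in milliseconds) instead of queuing forever
        return new HikariDataSource(config);
    }
}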

Anything Can Happen in Production

One of the most common sentences I have heard from developers dealing with a production issue is "This is impossible, this can't happen". But it actually does happen. Due to the very nature of production (high concurrency, unexpected user behaviors, attacks, hardware failures...), very strange things can and will happen.

Even after serious postmortem studies, the root cause of a significant proportion of production issues will never be diagnosed or solved (from my own experience, in about 10% of the cases). Some abeyant defects can occur only on a combination of exceptional events. Some bugs can happen once in 10 years or even, by chance (or misfortune?), never occur during the entire application lifetime. A short story: I was recently faced with a bug in a Node.js job that occurred about once every 10K executions (when a randomly generated password contained an unescaped double dollar character sequence).

Check out any production log and you will probably see erratic errors here or there (this is rather scary, trust me).

Preventing expected issues is a good thing, but truly good code should also control and correctly handle the unexpected.

Hardware or network failures are very common: network micro-cuts can occur (see the 8 Fallacies of Distributed Computing), servers can crash, and filesystems can fill up.

Don't trust data coming from other modules, even your own. As an example, an integration error can make a module call a deprecated version of your API. You may also get corrupted files, with wrong encodings or wrong dates for instance. Don't even trust your own database (add as many constraints as possible, like NOT NULL or CHECK): corrupted data can appear due to bugs in previous module versions, migrations, administration script issues, stalled transactions, integration errors on encodings or timezones... Let any application run for several years and perform some data sanity checks against your own database: you may be surprised.

Users and external batch systems should be treated as monkeys (with all due respect).

Don't rely on human processes but assume people can do anything. For instance, here are two common PEBCAK problems I recently observed on front-end parts:

  • Double submit (some users double-clicking instead of single-clicking): some REST RPC calls are hence made twice and concurrency oddities occur in the backend;

  • Private browsing: for some reason, users switch to this mode and strange things happen (like local data being lost or browser extensions being disabled).

Most of the time, users will never admit or even figure out this kind of error. They can also use the wrong browser, use a personal machine instead of a professional one, open the webapp in several tabs at once, and do many other things you would never imagine.

What can I do?

  • Make your code as robust as possible, write anti-corruption layers, and normalize strings. When parsing data, check time formats, encodings, and data formats (if using a hexagonal architecture, perform these controls as early as possible in the 'in' adapters).
  • Add as many constraint checks in your database as possible. Don't rely only on the domain layer code.
  • When possible, instead of writing your own controls, rely on a shared contract (like a JSON Schema or an XSD).
  • Think about retries, robust error handling, double submission, replays from save points in batch jobs, ...
  • When writing your tests, think about as many borderline or apparently impossible cases as possible.
  • Use chaos-engineering tools (like Simian Army) that generate errors randomly to test your code resiliency.
  • Think about what to do with rejected data.
  • To deal with human errors, identify problematic users, book a meeting, and observe them using your application before asking any direct question, so as not to lead them.
  • Build a realistic testing dataset and maintain it. Add new data as soon as you become aware of a special case you hadn't considered before. Manage these datasets like your code (versioning, cleanup, refactoring, documentation...).
  • Don't ignore weak signals. When something strange happens in development, it will probably happen in production as well, and it will be far worse there.
  • When fixing an issue, make sure to identify all the places where it can occur and don't fix it only where you first localized it.
  • Add clever logs in your code (a sketch follows this list). A clever log comes with:
    • A canonical identifier in the message (an error code like ERR123 or an event ID like NEW_CLIENT); this greatly eases monitoring by enabling regexp matching;
    • All required debugging context (like a person UUID, the instant of the log...);
    • The right verbosity level;
    • Stack traces when dealing with errors, so developers can easily localize the problem in their code.
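
Here is a minimal SLF4J sketch of such 'clever' logs; the ERR123/NEW_CLIENT identifiers, class and field names are only examples:

import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

public class ClientService {

    private static final Logger LOG = LoggerFactory.getLogger(ClientService.class);

    public void register(String clientUuid) {
        // Canonical event ID + debugging context: easy to grep or to match with a regexp
        LOG.info("NEW_CLIENT client registered clientUuid={} at={}", clientUuid, java.time.Instant.now());
        try {
            persist(clientUuid);
        } catch (RuntimeException e) {
            // Canonical error code + context + full stack trace (throwable passed as last argument)
            LOG.error("ERR123 client registration failed clientUuid={}", clientUuid, e);
            throw e;
        }
    }

    private void persist(String clientUuid) {
        // persistence logic elided
    }
}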

Issues Never Walk Alone

In production, things never ever get better on their own: hope is not a strategy. Due to Murphy's law, anything with the ability to fail will fail.

Worse: issues often occur simultaneously. An initial incident can induce another one, even if they look unrelated at first glance (for instance, an Out Of Memory error can create pressure on the JVM Garbage Collector, which in turn increases CPU usage, which induces queued work latency and finally generates timeouts on the client side).

Sometimes, it is even worse: truly unrelated issues may occur simultaneously by misfortune, making the diagnosis much more difficult by leading the post-mortem down the wrong path.

What can I do?

  • Don't leave issues unresolved in production or DEV logs. Most issues can be detected in development or acceptance environments. Too often, we observe problems and ignore them, thinking they are transient or due to some integration or intermittent network glitch. Such an observation should instead be taken as a chance to reveal a real issue and should not be ignored.
  • When you observe something strange, stop immediately and take a few minutes to analyze the issue or to add new test cases. Consider that you may have found an abeyant defect that would take days to diagnose and resolve later in production.

In Production, Everything Is Complicated and Time-Consuming

For some good, but also some not so good, reasons, every change must be controlled and traced in a regulated IS. Even a single SQL statement must be tested in several testing or pre-production environments and finally applied by a DBA.

Any simple Unix command has to be documented in a procedure and executed by the Ops team, which is the only one with access to the servers. Most of these operations must be planned, documented in depth, justified, and traced in one or several ticketing systems. Changing a simple file or a single row in a database can hardly take less than half a man-day when counting all the persons involved.

The costs increase exponentially as we get closer to production. See [Capers Jones, 1996] or [Marco M. Morana, 2006]: a bug can cost as little as $25 to fix in DEV and as much as $16K in running production.

Even if modern software engineering promotes CD (Continuous Deployment) and the use of IaC (Infrastructure as Code) tools like Kubernetes, Terraform, or Ansible, deploying to production is still a significant event in most organizations and most DevOps concepts remain theoretical there. A release can't be deployed every day but rather about once a week or even once a month. Any release usually has to be validated by the product owner's acceptance tests (a lot of manual and repetitive operations). Any blocking issue requires a hotfix coming with a lot of administrative and build work.

What can I do?

  • Perform as much unit, integration, and system testing as possible before reaching the production environment.
  • Add hot-reloading configuration capabilities to your modules (like changing log verbosity using a simple REST call against an administrative endpoint; see the sketch after this list).
  • Make sure that all processes involving operations (ticketing system, people to contact, ways to alert...) are documented and quickly accessible. If not, document them to save a lot of time the next time.
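
As an illustration of the hot-reloading point above, here is a minimal sketch of such an administrative endpoint. It assumes Logback as the logging backend and Spring MVC for the REST layer; the endpoint path and class name are hypothetical, and Spring Boot users may prefer the built-in /actuator/loggers endpoint.

import ch.qos.logback.classic.Level;
import ch.qos.logback.classic.LoggerContext;
import org.slf4j.LoggerFactory;
import org.springframework.web.bind.annotation.PathVariable;
import org.springframework.web.bind.annotation.PutMapping;
import org.springframework.web.bind.annotation.RequestParam;
import org.springframework.web.bind.annotation.RestController;

@RestController
public class LogLevelController {

    // Example: PUT /admin/loggers/com.example.billing?level=DEBUG
    @PutMapping("/admin/loggers/{loggerName}")
    public String setLevel(@PathVariable String loggerName, @RequestParam String level) {
        // Works when SLF4J is bound to Logback: the level is changed at runtime, no redeployment needed
        LoggerContext context = (LoggerContext) LoggerFactory.getILoggerFactory();
        context.getLogger(loggerName).setLevel(Level.toLevel(level));
        return loggerName + " -> " + level;
    }
}

Of course, such an endpoint must itself be secured and restricted to administrators.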

Production Is Very Stressful

When an incident occurs in production, your stress level may depend on the kind of industry you're working for, but even if you work for a medium-sized e-commerce company and not a nuclear facility or a hospital, I can guarantee that any problem generates a lot of pressure from customers, management, and other teams depending on you. Ops teams are used to it and most are impressively calm when dealing with this kind of event; it's part of their job after all. When the problem comes from your code, you may have to work with them and take part of the pressure on yourself.

What can I do?

  • Make sure to be prepared before the incident by writing or learning procedures (read for instance chapter 14 of the great Google SRE Book).
  • Be confident in your logs and monitoring metrics to help you find the root cause (for instance, prepare insightful dashboards and centralized log queries in advance).
  • For any complex issue, begin the investigation by creating a post-mortem document centralizing every note, stack trace, log, or graph supporting your hypotheses.

In Production, You Usually Don't Start from Scratch

In development, when your database structure (DDL) evolves, you simply drop and recreate it. In production, in most cases, data is already there and you have to perform migrations or adaptations (using ETL or other tools). Likewise, if some clients already use your API, you can't simply change its signature without asking questions: you have to think about backward compatibility. If you have to break it, you can deprecate some code, but then you have to plan the end of service.

What can I do?

  • In development, don't just drop the DDL; 'code' the changes using incremental change tools like Liquibase. The same tools should be used in production.
  • Check that your libraries or APIs are still backward compatible using integration tests.
  • Use Semantic Versioning conventions to alert for breaking changes.

Security Is More Prominent in Production

In any seriously protected production environment, many security systems are set up. They are often absent from the other environments due to their added complexity and cost. For instance, you can find additional layer 3 and 4 firewalls, WAFs (Web Application Firewalls, operating at layer 7 against HTTP(S) calls), API gateways, IAM systems (SAML, OIDC...), HTTP(S) proxies or reverse proxies. Internet calls are usually forbidden from servers, which can only use replicated and cached data (like local package repositories). Hence, many security infrastructure differences can mask issues that will only be discovered in pre-production or even in production.

What can I do?

  • Don't use the same values for different credential parameters. This can hide integration issues that will only show up in production, where parameters are more likely to differ and where a different password is used for each resource.
  • Make sure to understand the security infrastructure limitations before coding related user stories.
  • Test security infrastructure using containers.

Conclusion

It's a good thing for developers to be curious and to learn about production by themselves, by reading blogs and books or simply by asking colleagues. As a developer, do you know how many cores a mid-range server has (per socket)? How much RAM per blade? Did you ever ask yourself where the data centers running your code are located? How many kWh your modules consume every day? How data is stored in a SAN? Are you familiar with fail-over systems like load balancers, RAID, standby databases, virtual infrastructure management, SAN replication...? You don't have to be an expert but it's important and gratifying to know the basics.

I hope I have given developers a first glimpse of production constraints. Production is a world where everything is multiplied: the gravity of issues, the costs, the time to fix systems. Always keep in mind that your code will eventually run in production and that working code is far from enough: your code must be production-proof to make the organization's IS run smoothly. Then, everything will be fine and everybody will be home early instead of pulling their hair out until late at night...

Apr 16, 2021 - Release of the first version of our Project architecture document template

Four years after the first release of a template in French, we release a revisited English version.

This architecture template is applicable to most management IT projects, regardless of the general architecture chosen (monolithic, SOA, micro-service, n-tier, ...). It has already been used on several important projects, including in large organizations, and it is maintained on a regular basis.

Discover it at GitHub.

Dec 22, 2020 - Proper strings normalization for comparison purpose

(This article has also been published on DZone)

Illuminated initials, sixteenth-century

TL;DR

In Java, do:

String normalizedString = Normalizer.normalize(originalString,Normalizer.Form.NFKD)
.replaceAll("[^\\p{ASCII}]", "").toLowerCase().replaceAll("\\s{2,}", " ").trim();

Nowadays, most strings are Unicode-encoded and we are able to work with many different native characters bearing diacritical signs/accents (like ö, é, À) or ligatures (like æ or ʥ). Characters can be stored in UTF-8 (for instance) and the associated glyphs can be displayed properly if the font supports them. This is good news for the respect of cultural specificities.

However, we often observe recurring difficulties when comparing strings issued from different information systems and/or initially typed by humans.

The human brain is a machine for filling gaps. Hence it has absolutely no problem reading or typing 'e' instead of 'ê'.

But what if the word 'tête' ('head' in French) is correctly stored in a UTF-8 encoded database but you have to compare it with text typed by an end user who omitted the accents?

We also often have to deal with legacy systems, or modern ones filled with legacy data, that don't support the Unicode standard.

Another simple illustration of this problem is the use of ligatures. Imagine a product database storing various items with an ID and a description. Some items contain ligatures (a combination of several letters joined together to create a single character, like ’Œuf’ - egg in French). Like most French people, I have no idea how to produce such a character, even using a French keyboard. I would spontaneously search the item descriptions using oeuf. Obviously, our code has to take care of ligatures if we want to return a useful result containing ’Œuf’.

How to fix that mess?

Rule #1: Don't even compare human text if you can

When you can, never compare strings from heterogeneous systems. It is surprisingly tricky to do properly (even if it is possible to handle most cases, as we will see below). Instead, compare sequences, UUIDs or any other ASCII-based strings without spaces or ‘special’ characters. Strings coming from different information systems have a good chance of storing data differently (lower/upper case, with/without diacritics, etc.). On the contrary, good IDs are free from encoding issues, being plain ASCII strings.

Example:

System 1 : {"id":"8b286f72-b366-47a4-9537-59d39411979a","desc":"Œuf brouillé"}

System 2 : {"id":"8b286f72-b366-47a4-9537-59d39411979a","desc":"OEUF BROUILLE"}

If you compare IDs, everything is simple and you can go home early. If you compare descriptions, you'll have to normalize them as a prerequisite or you'll be in big trouble.

Character normalization is the action of computing a canonical form of a string. The basic idea, to avoid spurious mismatches when comparing strings coming from several information systems, is to normalize both strings and to compare the results of their normalization.

In the previous example, we would compare normalize("Œuf brouillé") with normalize("OEUF BROUILLE"). Using a proper normalization function, we would then compare 'oeuf brouille' with 'oeuf brouille', but if the normalization function is buggy or partial, the strings would mismatch. For example, if the normalize() function doesn't handle ligatures properly, you would get a spurious mismatch by comparing 'œuf brouille' with 'oeuf brouille'.
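
Assuming a normalize() helper implementing the rules below (a consolidated sketch is given after Rule #7), the in-memory comparison in Java boils down to:

// true only if normalize() handles casing, diacritics and ligatures properly
boolean sameItem = normalize("Œuf brouillé").equals(normalize("OEUF BROUILLE"));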

Rule #2: Normalize in memory

It is better to compare strings at the last possible moment, in memory, and not to normalize strings at storage time, for at least two reasons:

  1. If you only store a normalized version of your string, you lose information. You may need proper diacritics later for display purposes or other reasons. As an IT professional, one of your duties is to never lose information that humans provided you.

  2. What if some items were stored before the normalization routine was set up? What if the normalization function changed over time?

To avoid these common pitfalls, simply compare normalize(<data system 1>) with normalize(<data system 2>) in memory. The CPU overhead should be negligible unless you compare thousands of items per second...

Rule #3: Always trim externally and internally

Another common trap when dealing with strings typed by humans is the presence of spaces at the beginning, at the end, or in the middle of a sequence of characters.

As an example, look at these strings: ' William' (note the space at the beginning), 'Henry ' (note the space at the end), 'Gates  III' (see the double space in the middle of this family name; did you notice it at first?).

Appropriate solution:

  1. Trim the text to remove spaces at the beginning and at the end of the text.
  2. Remove extra spaces in the middle of the string.

In Java, one way to achieve this is:

s = s.replaceAll("\\s{2,}", " ").trim();

Rule #4: Harmonize letters casing

This is the best-known and most straightforward normalization method: simply put every letter in lower or upper case. AFAIK, there is no preference for one or the other; most developers (me included) use lower case.

In Java, just use toLowerCase():

s = s.toLowerCase();

Rule #5: Transform characters with diacritical signs to ASCII

When typing, diacritical signs are often omitted in favor of their ASCII version. For example, one can type the German word 'schon' instead of 'schön'.

Unicode proposes four normalization forms that may be used for that purpose (NFC, NFD, NFKD and NFKC). Check out this enlightening illustration.

Detailing all these forms would go beyond the scope of this article, but basically, keep in mind that some Unicode characters can be encoded either as a single combined character or in a decomposed form. For instance, 'é' can be encoded as the \u00e9 code point or as the decomposed form '\u0065' (the letter 'e') followed by '\u0301' (the combining acute accent '◌́').

We will perform an NFD ("Canonical Decomposition") normalization on the initial text to make sure that every accented character is converted to its decomposed form. Then, all we have to do is drop the diacritics and keep only the 'base' characters.

In Java, both operations can be done this way:

s = Normalizer.normalize(s, Normalizer.Form.NFD)
	.replaceAll("[^\\p{ASCII}]", "");

Note: even if this code covers this current issue, prefer the NFKD transformation to deal with ligatures as well (see below).

Rule #6: Decompose ligatures to a set of ASCII characters

The other thing to understand is that Unicode maintains a compatibility mapping between about 5000 ‘composite’ characters (like ligatures or precomposed roman numerals) and lists of regular characters. Characters supporting this feature are documented (check the 'decomposition' attribute in the Unicode character documentation).

For instance, the roman numeral Ⅻ (U+216B) can be decomposed with NFKD normalization into an 'X' and two 'I'. Likewise, the ij (U+0133) character (like in 'fijn' - 'nice' in Dutch) can be decomposed into an 'i' and a 'j'.

For these kinds of 'Siamese twin' characters, we have to apply the NFKD ("Compatibility Decomposition") normalization form, which both decomposes the characters (see Rule #5 above) and maps ligatures to several 'base' characters. You can then drop the remaining diacritics.

In Java, use:

s = Normalizer.normalize(s, Normalizer.Form.NFKD)
	.replaceAll("[^\\p{ASCII}]", "");

Now the bad news: for obscure reasons, Unicode doesn't support a decomposition equivalence for some widely used ligatures, like the French 'œ' and 'æ' or the German eszett 'ß'. If you need to handle them, you will have to write your own replacements before applying the NFKD normalization:

	s = s.replaceAll("œ", "oe");
	s = s.replaceAll("æ", "ae");
	s = Normalizer.normalize(s, Normalizer.Form.NFKD)
	.replaceAll("[^\\p{ASCII}]", "");

Rule #7: Beware punctuation

This is a more minor issue, but depending on your context you may want to normalize some special punctuation characters as well.

For example, in a literary context like a text-revision software, it may be a good idea to map the em/long dash ('—') character to the regular ASCII hyphen ('-').

AFAIK, Unicode doesn't provide a mapping for that, so just do it yourself the good old way:

s = s.replaceAll("—", "-");
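
Putting rules #3 to #7 together, a consolidated helper could look like the following sketch; the extra ligature and punctuation replacements (œ, æ, ß, em dash) are the ones discussed above and should be adapted to your own context.

import java.text.Normalizer;

public final class StringNormalizer {

    private StringNormalizer() {
    }

    public static String normalize(String s) {
        String result = s
                .replaceAll("—", "-")                        // rule #7: punctuation mapping
                .replaceAll("œ", "oe").replaceAll("Œ", "OE") // rule #6: ligatures lacking a Unicode decomposition
                .replaceAll("æ", "ae").replaceAll("Æ", "AE")
                .replaceAll("ß", "ss");
        result = Normalizer.normalize(result, Normalizer.Form.NFKD) // rules #5 and #6: decompose characters...
                .replaceAll("[^\\p{ASCII}]", "");                   // ...then drop the diacritics
        return result.toLowerCase()                                 // rule #4: harmonize casing
                .replaceAll("\\s{2,}", " ")                         // rule #3: collapse internal spaces...
                .trim();                                            // ...and trim external ones
    }
}

With such a helper, normalize(" Œuf  brouillé ") and normalize("OEUF BROUILLE") both return 'oeuf brouille'.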

Final word

String normalization is very helpful when comparing strings issued from different systems or performing appropriate comparisons. Even fully English-localized projects can benefit from it, for instance to take care of casing or trailing spaces, or when dealing with foreign words with accents.

This article exposes some of the most important points to take into consideration, but it is far from exhaustive. For instance, we omitted the handling of Asian characters and the cultural normalization of semantically equivalent items (like the abbreviation 'St' for 'Saint'), but I hope it is a good start for most projects.

References

http://www.unicode.org/reports/tr15/

https://docs.oracle.com/javase/tutorial/i18n/text/normalizerapi.html

https://en.wikipedia.org/wiki/Unicode_equivalence#Normalization

https://minaret.info/test/normalize.msp

Nov 5, 2020 - Why did I rewrite my blog using Eleventy?

Reasons to change

This personal home page and blog was previously self-hosted using a great open-source wiki engine: Dokuwiki. It worked great for many years, but a few months ago I felt that it was time to change lanes and embrace the JAM Stack (JavaScript, APIs & Markup).

Issues with traditional wikis

  • Security: a lot of spam in comments, possible PHP vulnerabilities
  • Regular upgrades to be performed against the engine
  • Many plugins required to make something useful; old ones, conflicting ones...
  • Not so easy to customize the rendered pages
  • Slower than a static website
  • Much larger electricity consumption to serve pages
  • Requires PHP modules to be installed and tuned along with the HTTP server
  • Most wiki engines require a database (even if it is not the case of Dokuwiki)
  • Not so easy reversibility. One way is to use Pandoc to translate the wiki syntax to Markdown.

Opportunities with the JAM Stack

  • Ability to write articles using the much more widespread Markdown language rather than one of the numerous wiki syntaxes around
  • No vulnerabilities possible (except in the web server itself) as the produced website is only static HTML
  • Using Git (advanced version control) and associated ecosystem (Merge Requests...)
  • Possibility to use CI/CD tools to deploy new pages
  • Can be deployed on CDN (even if I continue to self-host it)
  • Possibility to use great IDE to write articles (like VSCode and all its extensions)
  • Faster preview of rendered pages: I can now see the result in my browser in less than a second
  • Containers-friendly (using a nginx docker image typically)
  • It's the new trend! (OK, it's a kind of RDD, but it may be useful in the current professional context)

The not-so-good using the JAM Stack

  • You have to rely on external services for some basic features like comments (already disabled in my case, too many spam messages) or full-text search

Eleventy

Well, I finally decided to switch to the JAM Stack. But it is a very crowded space. I already use Antora at work to generate great technical documentation from Asciidoc, but it was not suitable for a blog. I also used Jekyll for a long time with GitHub Pages (see the Jajuk website) but I found it complicated, aging, and too restrictive.

After a quick look at the most popular platform (Hugo), I gave up. Basically, I felt that I had to learn a whole world before being able to make a website, and I didn't have that time.

Then, I heard about a new, simple platform: Eleventy. I loved the Unix-like idea behind it: a very low-level tool leveraging existing templating engines like Liquid or Nunjucks and allowing HTML and Markdown content to be mixed. It also follows a convention-over-configuration principle, enabling results in no time.

Last but not least: it is very fast (nearly as fast as Hugo). It is a JavaScript tool, great for most frontend developers who can use npm, Sass... Look at this page if you want to see sample code using Eleventy.

I finally rewrote my website in raw CSS, HTML, Markdown and Liquid templates thanks to Eleventy. It only took me a single day to grasp the basic Eleventy concepts and port the existing website. I finally have full control over my pages.

Note that another common strategy is to use an existing theme (like a Bootstrap-based theme) and to make its HTML generic using templating. I gave up on this method because I wanted something simple, very light, and something I fully control and understand...

Full tech articles list here