
Hi! This is my personal page and blog. You will find here some articles, projects I'm involved in, and a few thoughts (mainly about IT).

I design, code and integrate large IT projects. I like to work in agile environments to bring as much value as possible to my customers while dealing with budgets and timelines. My main current positions are software and infrastructure architect on one side of the coin and DevOps engineer on the other.

Latest technical articles:

Apr 16, 2021 - Release of the first version of our Project architecture document template

Four years after the first release of the template in French, we are releasing a revised English version.

This architecture template is applicable to most management IT projects, regardless of the general architecture chosen (monolithic, SOA, microservices, n-tier...). It has already been used on several important projects, including in large organizations. It is maintained on a regular basis.

Discover it at GitHub.

Dec 22, 2020 - Proper string normalization for comparison purposes

(This article has also been published at DZone)

Illuminated initials, sixteenth-century

TL;DR

In Java, do:

// requires: import java.text.Normalizer;
String normalizedString = Normalizer.normalize(originalString, Normalizer.Form.NFKD)
.replaceAll("[^\\p{ASCII}]", "").toLowerCase().replaceAll("\\s{2,}", " ").trim();

Nowadays, most strings are Unicode-encoded and we are able to work with many different native characters with diacritical signs/accents (like ö, é, À) or ligatures (like æ or ʥ). Characters can be stored in UTF-8 (for instance) and the associated glyphs can be displayed properly if the font supports them. This is good news for respecting cultural specificities.

However, we often observe recurring difficulties when comparing strings issued from different information systems and/or initially typed by humans.

The human brain is a gap-filling machine, so it has absolutely no problem reading or typing 'e' instead of 'ê'.

But what if the word 'tête' ('head' in French) is correctly stored in a UTF-8-encoded database but you have to compare it with end-user-typed text missing the accents?

We also often have to deal with legacy systems, or modern ones filled with legacy data, that don't support the Unicode standard.

Another simple illustration of this problem is the use of ligatures. Imagine a product database storing various items with an ID and a description. Some descriptions contain ligatures (several letters joined together to create a single character, like the 'Œ' in 'Œuf', 'egg' in French). Like most French people, I have no idea how to produce such a character, even with a French keyboard. I would spontaneously search the item descriptions using 'oeuf'. Obviously, our code has to take care of ligatures if we want such a query to return the item 'Œuf'.

How to fix that mess?

Rule #1: Don't compare human text if you can avoid it

Whenever you can, avoid comparing strings from heterogeneous systems. It is surprisingly tricky to do properly (even if it is possible to handle most cases, as we will see below). Instead, compare sequences, UUIDs or any other ASCII-based identifiers without spaces or 'special' characters. Strings coming from different information systems are quite likely to store the same data differently (lower/upper case, with/without diacritics, etc.). On the contrary, good IDs, being plain ASCII strings, are free from encoding issues.

Example:

System 1: {"id":"8b286f72-b366-47a4-9537-59d39411979a","desc":"Œuf brouillé"}

System 2: {"id":"8b286f72-b366-47a4-9537-59d39411979a","desc":"OEUF BROUILLE"}

If you compare the ids, everything is simple and you can go home early. If you compare the descriptions, you'll have to normalize them as a prerequisite or you'll be in big trouble.

Character normalization is the action of computing a canonical form of a string. The basic idea, when comparing strings coming from several information systems, is to normalize both strings and to compare the results of their normalization, so that strings representing the same text actually match.

In the previous example, we would compare normalize("Œuf brouillé") with normalize("OEUF BROUILLE"). Using a proper normalization function, we would then compare 'oeuf brouille' with 'oeuf brouille', but if the normalization function is buggy or partial, the strings would mismatch. For example, if the normalize() function doesn't handle ligatures properly, you would miss the match by comparing 'œuf brouille' with 'oeuf brouille'.
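As an illustration, here is a minimal sketch of such a normalize() function in Java. It simply combines the rules detailed below (lower-casing, manual replacement of the ligatures discussed in Rule #6, NFKD decomposition and whitespace cleanup); the class and method names and the exact chaining order are my own choices, not the only valid ones:

import java.text.Normalizer;

public class NormalizeDemo {

    // Minimal sketch combining the rules detailed in the rest of this article
    static String normalize(String s) {
        return Normalizer.normalize(
                    // lower-case first, then replace the ligatures NFKD cannot decompose (see Rule #6)
                    s.toLowerCase().replace("œ", "oe").replace("æ", "ae"),
                    Normalizer.Form.NFKD)
                .replaceAll("[^\\p{ASCII}]", "") // drop diacritics and other non-ASCII leftovers
                .replaceAll("\\s{2,}", " ")      // squeeze repeated internal spaces
                .trim();                         // remove leading/trailing spaces
    }

    public static void main(String[] args) {
        // prints true: both sides normalize to "oeuf brouille"
        System.out.println(normalize("Œuf brouillé").equals(normalize("OEUF BROUILLE")));
    }
}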

Rule #2: Normalize in memory

It is better to compare strings at the last possible moment, in memory, rather than normalizing them at storage time, for at least two reasons:

  1. If you only store a normalized version of your string, you lose information. You may need the proper diacritics later for display purposes or other reasons. As an IT professional, one of your duties is to never lose information humans have provided.

  2. What if some items have been stored before the normalization routine has been set up? What if the normalization function changed over time?

To avoid these common pitfalls, simply compare, in memory, normalize(<data from system 1>) with normalize(<data from system 2>). The CPU overhead should be negligible unless you compare thousands of items per second...
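For instance, reusing the normalize() sketch above, an in-memory comparison at query time could look like this (the item list and the query are hypothetical; the stored descriptions keep their original accents and casing):

// requires: import java.util.List;
// Data as it is stored, untouched
List<String> storedDescriptions = List.of("Œuf brouillé", "Tête de veau");

// Text typed by an end user, without accents and with stray spaces
String userQuery = " oeuf  brouille ";

// Normalization happens in memory, at comparison time only
boolean found = storedDescriptions.stream()
        .anyMatch(desc -> normalize(desc).equals(normalize(userQuery))); // true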

Rule #3: Always trim externally and internally

Another common trap when dealing with strings typed by humans is the presence of unwanted spaces at the beginning, at the end, or in the middle of the text.

As an example, look at these strings: ' Wiliam' (note the space at the beginning), 'Henry ' (note the space at the end), 'Gates  III' (see the double space in the middle of this family name, did you notice it at first?).

Appropriate solution:

  1. Trim the text to remove spaces at the beginning and at the end of the text.
  2. Remove extra spaces in the middle of the string.

In Java, one way to achieve this is:

s = s.replaceAll("\\s{2,}", " ").trim();
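For instance, applied to the family name above (variable names are mine):

String familyName = " Gates  III ";  // leading, doubled and trailing spaces
String cleaned = familyName.replaceAll("\\s{2,}", " ").trim();
// cleaned is now "Gates III"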

Rule #4: Harmonize letter casing

This is the best-known and most straightforward normalization method: simply convert every letter to lower or upper case. AFAIK, there is no strong argument for either choice. Most developers (me included) use lower case.

In Java, just use toLowerCase():

s = s.toLowerCase();

Rule #5: Transform characters with diacritical signs to ASCII

When typing, diacritical signs are often omitted in favor of the plain ASCII version. For example, one may type the German word 'schon' instead of 'schön'.

Unicode proposes four normalization forms that may be used for this purpose (NFC, NFD, NFKD and NFKC). Check out this enlightening illustration.

Detailing all these forms would go beyond the scope of this article, but basically, keep in mind that some Unicode characters can be encoded either as a single combined character or in a decomposed form. For instance, 'é' can be encoded as the single \u00e9 code point or as the decomposed form '\u0065' (the letter 'e') followed by '\u0301' (the combining acute accent).
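A quick way to observe this with java.text.Normalizer (a minimal check; variable names are mine):

String composed = "\u00e9";     // 'é' as a single precomposed code point
String decomposed = "e\u0301";  // 'e' followed by the combining acute accent
System.out.println(composed.equals(decomposed)); // false: same glyph, different code points
System.out.println(Normalizer.normalize(composed, Normalizer.Form.NFD)
        .equals(Normalizer.normalize(decomposed, Normalizer.Form.NFD))); // true after NFD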

We will apply the NFD ("Canonical Decomposition") normalization form to the initial text to make sure that every accented character is converted to its decomposed form. Then, all we have to do is drop the diacritics and keep only the 'base' characters.

In Java, both operations can be done this way:

s = Normalizer.normalize(s, Normalizer.Form.NFD)
	.replaceAll("[^\\p{ASCII}]", "");

Note: even if this code covers the current issue, prefer the NFKD transformation, which also deals with ligatures (see below).

Rule #6: Decompose ligatures to a set of ASCII characters

The other thing to understand is that Unicode maintains a compatibility mapping between about 5000 'composite' characters (like ligatures or precomposed Roman numerals) and lists of regular characters. Characters supporting this feature are documented (check the 'decomposition' attribute in the Unicode character documentation).

For instance, the Roman numeral Ⅻ (U+216B) can be decomposed by NFKD normalization into an 'X' and two 'I'. Likewise, the 'ĳ' character (U+0133), as in 'fĳn' ('nice' in Dutch), can be decomposed into an 'i' and a 'j'.

For these kinds of 'Siamese twin' characters, we have to apply the NFKD ("Compatibility Decomposition") normalization form, which both decomposes the characters (see Rule #5 above) and maps ligatures to their 'base' characters. You can then drop the remaining diacritics.

In Java, use:

s = Normalizer.normalize(s, Normalizer.Form.NFKD)
	.replaceAll("[^\\p{ASCII}]", "");
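A quick check of the two examples above (assuming a console able to display these characters):

System.out.println(Normalizer.normalize("\u216B", Normalizer.Form.NFKD));   // Ⅻ   -> "XII"
System.out.println(Normalizer.normalize("f\u0133n", Normalizer.Form.NFKD)); // fĳn -> "fijn"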

Now the bad news: for obscure reasons, Unicode doesn't provide a decomposition equivalence for some widely used ligatures, like the French 'œ' and 'æ' or the German eszett 'ß'. If you need to handle them, you will have to write your own replacements before applying the NFKD normalization:

	s = s.replaceAll("œ", "oe").replaceAll("Œ", "OE");
	s = s.replaceAll("æ", "ae").replaceAll("Æ", "AE");
	s = s.replaceAll("ß", "ss"); // mind the case: replaceAll() is case-sensitive
	s = Normalizer.normalize(s, Normalizer.Form.NFKD)
	.replaceAll("[^\\p{ASCII}]", "");

Rule #7: Beware punctuation

This is a more minor issue, but depending on your context, you may want to normalize some special punctuation characters as well.

For example, in a literary context like a text-revision software, it may be a good idea to map the em/long dash ('—') character to the regular ASCII hyphen ('-').

AFAIK, Unicode doesn't provide a mapping for that, so just do it yourself the good old way:

s = s.replaceAll("—", "-");

Final word

String normalization is very helpful for comparing strings issued from different systems and, more generally, for performing meaningful comparisons. Even fully English-localized projects can benefit from it, for instance to take care of casing or stray spaces, or when dealing with foreign words with accents.

This article covers some of the most important points to take into consideration, but it is far from exhaustive. For instance, we omitted the handling of Asian characters and the cultural normalization of semantically equivalent items (like 'St' as an abbreviation of 'Saint'), but I hope it is a good start for most projects.

References

http://www.unicode.org/reports/tr15/

https://docs.oracle.com/javase/tutorial/i18n/text/normalizerapi.html

https://en.wikipedia.org/wiki/Unicode_equivalence#Normalization

https://minaret.info/test/normalize.msp

Nov 5, 2020 - Why did I rewrite my blog using Eleventy?

Reasons to change

This personal home page and blog was previously self-hosted using a great open source wiki engine: Dokuwiki. It worked great for many years, but a few months ago I felt that it was time to change lanes and embrace the JAM Stack (JavaScript, APIs & Markup).

Issues with traditional wikis

  • Security: a lot of spam in comments, possible PHP vulnerabilities
  • Regular upgrades to be performed on the engine
  • Many plugins required to make something useful. Old ones, conflicting ones...
  • Not so easy to customize the rendered pages
  • Slower than a static website
  • Much higher electricity consumption to serve pages
  • Requires PHP modules to be installed and tuned along with the HTTP server
  • Most wiki engines require a database (even if that's not the case for Dokuwiki)
  • Not so easy reversibility. One way is to use Pandoc to translate wiki syntax to Markdown.

Opportunities with the JAM Stack

  • Ability to write articles using the more widespread Markdown language rather than one of the numerous wiki syntaxes around
  • No vulnerabilities possible (except in the web server itself) as the produced website is only static HTML
  • Using Git (advanced version control) and its associated ecosystem (merge requests...)
  • Possibility to use CI/CD tools to deploy new pages
  • Can be deployed on a CDN (even if I continue to self-host it)
  • Possibility to use a great IDE to write articles (like VSCode and all its extensions)
  • Faster preview of rendered pages: I can now see the result in my browser in less than a second
  • Container-friendly (typically using an nginx Docker image)
  • It's the new trend! (OK, it's a kind of RDD but it may be useful in my current professional context)

The not-so-good sides of the JAM Stack

  • You have to rely on external services for some basic features like comments (already disabled in my case, too much spam) or full-text search

Eleventy

Well, I finally decided to switch to the JAM Stack, but the field is very crowded. I already use Antora at work to generate great technical documentation using Asciidoc, but it was not suitable for a blog. I also used Jekyll for a long time with GitHub Pages (see the Jajuk website), but I find it complicated, aging and too restrictive.

After a quick look at the most popular platform (Hugo), I gave up. Basically, I felt that I would have to learn a whole world before being able to make a website, and I didn't have that time.

Then I heard about a new, simple platform: Eleventy. I loved the Unix-like idea behind it: a very low-level tool leveraging existing templating engines like Liquid or Nunjucks and allowing HTML and Markdown content to be mixed. It also follows a convention-over-configuration principle, enabling results in no time.

Last but not least: it is very fast (nearly as fast as Hugo). It is a JavaScript tool, great for most frontend developers, who can use npm, Sass... Look at this page if you want to see sample code using Eleventy.

I finally rewrote my website in raw CSS, HTML, Markdown and Liquid templates thanks to Eleventy. It only took me a single day to grasp the basic Eleventy concepts and port the existing website. I finally have full control over my pages.

Note that another common strategy is to use an existing theme (like a Bootstrap-based one) and to make the HTML generic using templates. I gave up on this method because I wanted something simple, very light, and that I fully control and understand...

May 6, 2020 - How to write good ADRs (architecture decision records)?

Architecture decision log

An architecture decision log is used to record the important architecture decisions (the ADRs, Architecture Decision Records).

The goal is to make the choices known and understandable after the fact and to share the decisions. The architecture document itself does not go over these deliberations again; it only shows the final decision.

There is only one ADR log per project.

ADR format

Each ADR consists of a single file in Asciidoc format with this name: [XYZ sequence starting at 001]-[decision].adoc.

Decision format: lower case, without spaces, with hyphens as separators. Example: 007-API-devant-bases-existantes-perennes.adoc.

Each ADR ideally contains the following content (adaptable to your needs):

1) History

  • Give the status and the history of status changes
  • Possible statuses are: TODO (to be written), WIP (Work In Progress), PROPOSE (proposed), REJETE (rejected), VALIDE (validated), DEPRECIE (deprecated), REMPLACE (superseded).
  • If the status is VALIDE, detail the date and the decision-makers who validated it.
  • If the status is REMPLACE, give the reference of the ADR to consider instead.
  • Never delete an ADR (set it to the DEPRECIE status) and never reuse the ID of another ADR of the same module.
  • Mention the ADR that supersedes it, if any. Example: Superseded by ADR 002-...

2) Context

Presents the possible choices, the issues, and the forces at play (technical, organizational, regulatory, financial, human...). Give the strengths, weaknesses, opportunities and risks of each solution (see the SWOT method).

Note:

  • If a point is a deal-breaker, say so.
  • Number the solutions to refer to them unambiguously.
  • For the simplest cases, two pros/cons paragraphs for each solution may be enough.
  • In some cases, the ADR may contain only one solution, the goal being to document the reasons behind this architecture.

3) Decision

Give the chosen decision (be assertive and recall the number of the chosen solution). Example: We will perform PDF signatures on the fly (solution 1).

4) Consequences

Give the possible consequences of the decision in terms of implementation. Do not restate the strengths and weaknesses of the solutions, but rather the practical consequences of the decision. Give the actions that mitigate the possible risks induced by the solution.

Examples:

* Specific logs will have to be planned for the processing

* The risk of unavailability will be covered by reinforced on-call duty

Log format

Ideally, an ADR log offers a visual rendering of all ADRs with their respective status and history, so as to give a global view of the situation of each decision. Statuses and histories must never be duplicated, as that implies a double maintenance that has very little chance of being done correctly. In most cases, it is better to keep this information only in each ADR, even if it means opening them one by one. An alternative is to sort the ADRs into subdirectories by status, but that makes browsing the ADRs more difficult.

If you use Asciidoc (which I strongly recommend), there is a trick: tag inclusion. The idea is to keep the status and the history in each ADR but to include them in a table to form the log. Example:

In 001-dedoublonnage-requetes.adoc:

## Status
// tag::statut[]
`VALIDE`
// end::statut[]

## History
// tag::historique[]
Validated on Nov 26, 2019 with xyz
// end::historique[]

and in the log (README.adoc):

.Table List and statuses of the RECE ADRs
[cols="2,1a,4a"]
|===
|ADR |Status |History

|link:001-dedoublonnage-requetes.adoc[001-dedoublonnage-requetes]
|include::001-dedoublonnage-requetes.adoc[tags=statut]
|include::001-dedoublonnage-requetes.adoc[tags=historique]

|link:002-appels-synchrones.adoc[002-appels-synchrones]
|include::002-appels-synchrones.adoc[tags=statut]
|include::002-appels-synchrones.adoc[tags=historique]
...
|===

Complete ADR example

    ## History
    Status: `VALIDE`

    * Validated by xyz on January 28
    * Proposed by z on 02/01/2020

    ## Context

    <General presentation of the issue>

    # Solution 1: <solution description>
    ## Strengths
    - Limits network usage

    ## Weaknesses
    - Less robust

    ## Opportunities

    ## Risks
    - [deal-breaker] Requires the signature to be done synchronously or on the fly

    # Solution 2: <solution description>
    ## Strengths
    ## Weaknesses
    ## Opportunities
    ## Risks

    ## Decision
    Solution 2 is chosen

    ## Consequences
    - Check the JVM configuration to use a proper random number generator

Usage tips

  • Don't hesitate to add images/diagrams... Think of Mermaid and PlantUML.
  • Do not timestamp the modifications of the ADR itself; that is the job of the version control tool (Git). Use explicit commit messages.
  • A good ADR must be:
    • short;
    • clear;
    • relevant (explains well the context, the possible choices and the chosen decision);
    • accessible to everyone (wiki, GitHub..., no office documents);
    • tracked (changelog, Git commits, ...);
    • transparent: if decision elements are missing, mention them.

Other resources

Links: list of common ADR templates

Full tech articles list here