Designing Human-Targeted Random IDs

Bertrand Florat - Apr 10, 2022

Article also published on DZone.

ℹ️ NOTE This article does not cover technical IDs used as primary keys in relational databases. See my previous article here if you are looking for a good way to generate them.

Context

During one of my recent projects, I was asked to design an ID scheme that is highly usable by humans. The main business requirement was to create pseudo-random values that can't be inferred or guessed, to be used as secret tokens printed on official documents for later checks.

Later on, we had a similar requirement with lower security concerns: generating human-readable file numbers that can be printed on the associated documents, spoken over the phone, or typed when searching.

Another well-known example (in France at least) is the ID (a.k.a. “SNCF number”) that the French railway company attaches to each train journey so that travelers can easily open the journey details on their smartphone without being fully authenticated.

Main Criteria

After comparing existing solutions and analyzing the business stakeholders' requirements, these criteria emerged:

These IDs have to be short (no more than six to ten characters) so that a human can easily type them, read them, or say them over the phone.
They have to include mechanisms that prevent and detect typos.
They don't have to be globally unique (and can't be, given their small size and thus limited number of possible values). However, the system has to prevent collisions, either by coupling these IDs with other values (like a person's last name) or by retrying when a generated value already exists (the solution we use). Keep in mind that closed items may carry the same ID as an open one (when searching by ID, for instance, make sure to take the status into account).
When possible, avoid generating offensive terms or acronyms (like F*** in English). We haven't actually searched for a solution so far, but maintaining a dictionary per targeted language seems the best option (thanks to Rumen Dimov for his feedback).
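
The retry-on-collision approach mentioned above can be sketched as follows. This is only an illustration under stated assumptions: the alphabet, the ID length, the function names and the in-memory set standing in for the real datastore are all hypothetical. In a real system the check and the reservation would typically rely on an atomic operation such as a unique constraint.

```python
import secrets
import string

# Hypothetical alphabet and length, chosen only for this illustration.
ALPHABET = string.ascii_uppercase + string.digits
ID_LENGTH = 6


def random_id() -> str:
    """Draw one candidate ID, character by character, from the OS CSPRNG."""
    return "".join(secrets.choice(ALPHABET) for _ in range(ID_LENGTH))


def allocate_id(ids_in_use: set[str], max_attempts: int = 10) -> str:
    """Return an ID that is not currently in use, retrying on collision.

    `ids_in_use` is an in-memory stand-in for the real datastore.
    """
    for _ in range(max_attempts):
        candidate = random_id()
        if candidate not in ids_in_use:
            ids_in_use.add(candidate)  # reserve the ID
            return candidate
    raise RuntimeError("no free ID found; the ID space may be too small")
```

Bounding the number of attempts turns an exhausted ID space into a visible error instead of an endless loop.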

How To Make These Values Truly Usable?

Keep the IDs short by using a larger base than base-10 (decimal): add lowercase and uppercase letters to the digits. Avoid other characters (punctuation marks, diacritics, ...) that are harder to read. Hence, in theory, we can use up to 10 digits + 26 lowercase ASCII letters + 26 uppercase ASCII letters = base-62 numbers.
Make typing and reading as easy as possible: each group should contain no more than four or five characters that are easily memorized as a whole, like aGty3. If the ID is longer, split it using hyphens (avoid underscores, which can be hard to see when the ID is rendered as an underlined hyperlink).
Make sure that the whole value can be pasted in a single operation, even when it is entered into clearly separated text fields.
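
As a minimal sketch of the grouping and pasting points above: one helper splits a raw ID into hyphen-separated groups for display, and another strips hyphens and whitespace so that a value pasted in one go is accepted. The function names and the default group size of four are assumptions made for this example.

```python
def format_id(raw: str, group_size: int = 4) -> str:
    """Split a raw ID into hyphen-separated groups for display."""
    groups = [raw[i:i + group_size] for i in range(0, len(raw), group_size)]
    return "-".join(groups)


def normalize_input(text: str) -> str:
    """Strip hyphens and whitespace from a typed or pasted ID."""
    return "".join(ch for ch in text if ch != "-" and not ch.isspace())


print(format_id("aTy25fTkrp9z"))            # aTy2-5fTk-rp9z
print(normalize_input(" aTy2-5fTk-rp9z\n"))  # aTy25fTkrp9z
```

Storing the raw form and adding the hyphens only at display time keeps searches independent of how the user typed the separators.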

How To Prevent And Detect Typos?

Exclude confusing characters. Keep in mind that similarity also depends on the font: an 'l' is easily distinguished from a '1' in a plain old monospaced font, but less so in a sans-serif one. We advise excluding the most problematic cases: 'O' and '0' (zero), 'Z' and '2', 'l' and '1'. By dropping these six characters, we now deal with base-56 numbers.
Reserve some bits as a CRC or checksum in order to detect most typos early, on the frontend. Banks have used such systems for decades, for instance on IBAN account numbers (using the MOD 97 algorithm). Users will thank you for notifying them early, and this client-side check avoids useless server-side queries and ugly error logs on the backend.

ℹ️ NOTE A lightweight checksum detects most typos, but not all of them.
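
To make the two ideas above concrete, here is a minimal sketch that builds the base-56 alphabet and appends one check character. It uses the Luhn mod N algorithm, which is a choice made for this example and not necessarily what the projects described here used. Function names are hypothetical. With an even alphabet size such as 56, this scheme catches every single-character substitution, but only part of the transpositions.

```python
import string

# Base-56 alphabet: digits and ASCII letters minus the six confusable
# characters named above ('O', '0', 'Z', '2', 'l', '1').
CONFUSING = set("O0Z2l1")
ALPHABET = "".join(c for c in string.digits + string.ascii_letters
                   if c not in CONFUSING)
N = len(ALPHABET)  # 56


def _luhn_sum(text: str, first_factor: int) -> int:
    """Luhn mod N sum, scanning right to left with alternating factors."""
    total = 0
    factor = first_factor
    for ch in reversed(text):
        addend = factor * ALPHABET.index(ch)
        total += addend // N + addend % N  # fold the product back below N
        factor = 1 if factor == 2 else 2
    return total


def check_character(payload: str) -> str:
    """Compute the character to append to `payload`."""
    return ALPHABET[(N - _luhn_sum(payload, 2) % N) % N]


def is_valid(full_id: str) -> bool:
    """Check an ID whose last character is the check character."""
    if not full_id or any(ch not in ALPHABET for ch in full_id):
        return False
    return _luhn_sum(full_id, 1) % N == 0
```

The same validation can be ported to the frontend so that a mistyped ID is rejected before any request is sent.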

What About Security?

If these human-readable IDs are used in serious matters dealing with money, security or official documents, make sure to use a cryptographically secure pseudorandom number generator (CSPRNG) to generate the numbers that you will then convert to your base-56 value. For instance, on a Linux server, read from the kernel CSPRNG (/dev/urandom or the getrandom() system call; on current kernels /dev/random draws from the same generator) rather than from a general-purpose PRNG. This makes the values unpredictable, and its uniform output keeps the risk of collisions (generating the same value twice) as low as the ID length allows.
The ID length should be proportional to how hard the ID must be to guess.
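
The generate-then-convert step described above could look like the following sketch. It relies on Python's standard `secrets` module, which draws from the operating system's CSPRNG. The helper names are assumptions made for this example, and the alphabet is the base-56 one obtained by dropping 'O', '0', 'Z', '2', 'l' and '1'.

```python
import math
import secrets
import string

ALPHABET = "".join(c for c in string.digits + string.ascii_letters
                   if c not in "O0Z2l1")  # 56 symbols
BASE = len(ALPHABET)


def to_base56(value: int, length: int) -> str:
    """Encode a non-negative integer as a fixed-length base-56 string."""
    chars = []
    for _ in range(length):
        value, digit = divmod(value, BASE)
        chars.append(ALPHABET[digit])
    return "".join(reversed(chars))


def generate_id(length: int) -> str:
    """Draw a uniform integer from the OS CSPRNG, then encode it."""
    return to_base56(secrets.randbelow(BASE ** length), length)


def entropy_bits(length: int) -> float:
    """Bits of entropy carried by `length` fully random characters."""
    return length * math.log2(BASE)
```

Each base-56 character carries a little under six bits, so `entropy_bits` gives a quick way to relate an ID length to how hard the value is to guess.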

Some Examples Please

Imagine you only want to avoid '0'/'O' and '1'/'l' confusions and you want to generate IDs with a collision risk as low as 1 in 2.6 × 10¹⁷. You can generate values (using a CSPRNG) like:

aTy2-5fTk-rp9z

bUD5-64kP-hlA4

For less critical use cases, fewer characters may be enough:

aTy2-5fTk

64kP-hlA4

For short-lived and low-risk IDs, see what SNCF does for travel files (only six capital letters):

XSDTGE
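
To compare such schemes, it can help to compute how many distinct values a format allows. The sketch below is only a back-of-the-envelope aid: the function names are hypothetical and it assumes every character is drawn uniformly and independently.

```python
def id_space(alphabet_size: int, length: int) -> int:
    """Number of distinct IDs of `length` characters."""
    return alphabet_size ** length


def min_length(alphabet_size: int, target_space: int) -> int:
    """Smallest length whose ID space reaches `target_space` values."""
    length, space = 0, 1
    while space < target_space:
        space *= alphabet_size
        length += 1
    return length


# Six capital letters, as in the SNCF-style example above.
print(id_space(26, 6))  # 308915776
```

Characters reserved for a checksum do not add to this count, so only the random positions should be passed as `length`.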

Conclusion

Generating readable random IDs for humans can be easily achieved, but a number of requirements must be taken into account. The scheme has to vary according to the targeted usage, but keep in mind that changing an existing scheme is cumbersome and can require maintaining several ID schemes for a long time. I hope that this article will help you think about the not-so-obvious criteria and make it easier to design these IDs right at the first attempt. I would be glad to get feedback if I have forgotten important or obvious points.