Datasets staticity level

Bertrand Florat - Jul 16, 2023

[Article also published on DZone.]

A common challenge when designing applications is determining the most suitable implementation based on the frequency of data changes. Should a status be stored in a table to easily expand the workflow? Should a list of countries be embedded in the code or stored in a table? Should we be able to adjust the thread pool size based on the targeted platform?

In a current large project, we categorize datasets based on their staticity level, ranging from very static to more volatile:

Level 1: Very static datasets

Changes to this type of data always involve business rules and impact the code. A typical example is the list of states in a workflow (STARTED, IN_PROGRESS, WAITING, DONE, etc.). The indicative size of this dataset is usually between 2 and 20 entries.

From a technical perspective, it is often implemented as an enumeration (a finite list of literal values, such as Enumerated Types in PostgreSQL or enums in Java or TypeScript). Alternatively, it can be managed as constants or a list of constants.

You can use the following litmus test: "Does any item from this list need to be included in an 'if' statement in the code?".

Changing this type of data requires a new release and/or a Data Definition Language (DDL) change and is not easily administrable.
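As an illustration of the litmus test, here is a minimal sketch in Python (chosen only for brevity; the same idea applies to Java or TypeScript enums). The `can_be_cancelled` rule is a hypothetical business rule, not one taken from the project described here.

```python
from enum import Enum


class WorkflowState(Enum):
    """Level 1 dataset: adding or removing a member means changing code."""
    STARTED = "STARTED"
    IN_PROGRESS = "IN_PROGRESS"
    WAITING = "WAITING"
    DONE = "DONE"


def can_be_cancelled(state: WorkflowState) -> bool:
    # Hypothetical business rule: the code branches on a specific member,
    # which is exactly what the litmus test looks for.
    if state is WorkflowState.DONE:
        return False
    return True
```

Because `DONE` appears in an `if` statement, adding or renaming a state cannot be done safely without touching the code, which is why this dataset belongs to level 1.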

Level 2: Rarely changing datasets

Think of datasets like a list of countries/states or a list of currencies. These datasets rarely exceed a few dozen entries. We refer to them as "nomenclatures".

From a technical standpoint, they can be managed using a configuration file (JSON/YAML/CSV/properties, etc.) or within a database (a table if using a relational database like PostgreSQL, a document or a list of documents if using a NoSQL Document database like MongoDB, etc.).
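For the configuration file option, a minimal sketch could look like the following. The JSON layout (`code` and `label` fields) and the `load_nomenclature` helper are assumptions made for the example; the content is inlined as a string so the snippet is self-contained, whereas a real application would read it from a file.

```python
import json

# Hypothetical content of a currencies.json file shipped with the application.
CURRENCIES_JSON = """
[
  {"code": "EUR", "label": "Euro"},
  {"code": "USD", "label": "US Dollar"},
  {"code": "JPY", "label": "Yen"}
]
"""


def load_nomenclature(raw: str) -> dict[str, str]:
    """Parse a JSON list of entries into a code -> label mapping."""
    return {entry["code"]: entry["label"] for entry in json.loads(raw)}


currencies = load_nomenclature(CURRENCIES_JSON)
```

Note that the code never tests for a particular currency: it treats all entries generically, which is what distinguishes a nomenclature from a level 1 enumeration.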

It is often a good idea to provide an administration GUI that allows adding, changing, or removing entries of this kind if your budget permits.

These lists are often required before an application can be used at all, even if the data may change later on. Therefore, it is advisable to package the application with a minimal dataset. For example, a Liquibase configuration can be released with the application to create a minimal set of countries in the database if they don't exist yet. However, be careful to use an idempotent "create if not exists" scheme to avoid conflicting with preexisting data.
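The idempotent seeding idea can be sketched as follows. This is not a Liquibase changelog: it uses Python's built-in `sqlite3` module and SQLite's `INSERT OR IGNORE` as a stand-in (the PostgreSQL counterpart would be `INSERT ... ON CONFLICT DO NOTHING`). The `country` table and its columns are assumptions made for the example.

```python
import sqlite3

MINIMAL_COUNTRIES = [("FR", "France"), ("DE", "Germany"), ("JP", "Japan")]


def seed_countries(conn: sqlite3.Connection) -> None:
    """Create the table and the minimal dataset only where missing."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS country ("
        "code TEXT PRIMARY KEY, label TEXT NOT NULL)"
    )
    # INSERT OR IGNORE skips rows whose primary key already exists,
    # so entries edited by an administrator are left untouched.
    conn.executemany(
        "INSERT OR IGNORE INTO country (code, label) VALUES (?, ?)",
        MINIMAL_COUNTRIES,
    )
    conn.commit()


conn = sqlite3.connect(":memory:")
seed_countries(conn)
# Simulate an administrator renaming an entry after the first deployment.
conn.execute("UPDATE country SET label = 'Deutschland' WHERE code = 'DE'")
seed_countries(conn)  # second run, e.g. at the next release
```

Running the seed a second time neither fails nor overwrites the administrator's change, which is the property to look for whatever migration tool is actually used.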

Depending on the packaging and technologies used, a change in this type of data may or may not require a new release. If your application includes a mechanism for embedding a minimal dataset (such as a configuration file or a Liquibase or SQL script executed automatically), it will likely require a new release. While this may initially be seen as a constraint, it ensures that your application is self-contained and always operational from its deployment, which is often worthwhile.

When storing nomenclatures in a database, a common strategy is to create a table for each nomenclature (e.g., a table for currencies, a table for countries). If, like us, your application requires a more flexible approach, you can use a single NOMENCLATURE table for each microservice and differentiate the nomenclatures using a simple column (e.g., a NOMENCLATURE name). All nomenclatures are then consolidated in a single technical table, and it is straightforward to retrieve a specific nomenclature using a WHERE clause on the nomenclature name. If you want to maintain an ordering, you can further enhance this approach by assigning an ordinal value to each nomenclature entry.
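A minimal sketch of such a consolidated table, again using Python's `sqlite3` module for illustration. The column names (`name`, `code`, `label`, `ordinal`) and the `get_nomenclature` helper are assumptions for the example, not the schema of the project described here.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE nomenclature (
        name    TEXT NOT NULL,     -- which nomenclature the entry belongs to
        code    TEXT NOT NULL,
        label   TEXT NOT NULL,
        ordinal INTEGER NOT NULL,  -- display order inside the nomenclature
        PRIMARY KEY (name, code)
    )
    """
)
conn.executemany(
    "INSERT INTO nomenclature (name, code, label, ordinal) VALUES (?, ?, ?, ?)",
    [
        ("CURRENCY", "USD", "US Dollar", 2),
        ("CURRENCY", "EUR", "Euro", 1),
        ("COUNTRY", "FR", "France", 1),
    ],
)


def get_nomenclature(conn: sqlite3.Connection, name: str) -> list[tuple[str, str]]:
    """Return the (code, label) entries of one nomenclature, in ordinal order."""
    return conn.execute(
        "SELECT code, label FROM nomenclature WHERE name = ? ORDER BY ordinal",
        (name,),
    ).fetchall()


currencies = get_nomenclature(conn, "CURRENCY")
```

Adding a new nomenclature then only requires inserting rows with a new `name` value; no DDL change is needed, which is the flexibility this approach trades against the stricter typing of one table per nomenclature.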

Level 3: Volatile datasets

Most applications persist large amounts of data, which we refer to as "volatile data". This type of data can involve an unlimited number of records managed by an application, such as user profiles, addresses, or chat discussions.

A change, addition, or removal of a record in this kind of dataset should never require a new release (although backups are still necessary). The code is generally designed to handle such changes in a generic manner rather than on a case-by-case basis.

This type of data is typically not administrable through code changes but is managed through regular front/back-office GUIs or batch programs.

Summary

Choosing the appropriate level of staticity is crucial to ensure the maintainability and modifiability of an application and can help avoid potential pitfalls. Using an incorrect solution to handle a particular staticity level can lead to unnecessary integration and release tasks or make the application less maintainable.

Level	Change frequency	Indicative size	Administrable?	Change requires a new release?	Technical solution examples
1	low	2-20	no	yes	List of constants, Java enum, Enumerated PostgreSQL type
2	medium	10-100	yes	Depends on chosen solution	Nomenclature table, configuration file
3	high	> 100	no	no	Regular database records