Heterogeneous Collections
A common problem in database design is heterogeneous collections or references. This is solved in many programming languages with sum types, but these are not typically supported in databases, especially relational ones.
The example I will use throughout this article is a cutlery store with a shopping cart. The cart could contain many types items: knifes, forks and spoons. Each item has its own set of attributes which we may want to sort or compare on. These items should be able to be inserted in to our database in a way that does not permit erroneous data, and it should also be possible to display the contents of the cart with a single query returning all attributes for each item. How do we model this?
Contents:
- Tagged Key
- Multiple Tables with Unique IDs
- Multiple Tables with Integrity
- Product Types > Sum Types
- Subclassing
- The Big Bag of Attributes
- Schemaless
Terminology
Referential integrity is what a foreign key constraint provides. It ensures that an identifier references a valid row in another table at all times.
Entity integrity is what a primary key constraint provides. It ensures that an attribute or set of attributes in a table reference a unique row at all times.
Note
SQL snippets will be PostgreSQL specific. The result of the cart query will always look something like this:
item_type | item_id | price | attributes
-----------+---------+-------+--------------------
Fork | 3 | 6.0 | {"prongs" : 2}
Knife | 1 | 5.0 | {"sharpness" : 10}
Spoon | 4 | 3.0 | {"spooniness" : 8}
Fork | 2 | 6.0 | {"prongs" : 1}
Tagged Key
This uses a tag to indicate the type of the ID. It is used in
Django’s GenericForeignKey
.
One way to display the cart in a single query is like this:
Advantages
- It can refer to anything.
Disadvantages
- It can refer to anything.
- No enforced referential integrity.
- An item cannot be referenced by a single ID. To reference an item both the ID and type are needed.
Multiple Tables with Unique IDs
If it can ensured that the ID of an item is unique amongst all items, it is possible to discard the tag. This can be achieved by by using a suitable UUID or sharing a sequence between all item types.
The cart query is similar to the tagged key query, but simplified as the type does not need to be checked:
Advantages
- An item can be referenced by a single ID. There is no need to keep track of an item’s type as in the ‘tagged key’ example.
Disadvantages
- No enforced entity integrity. Primary keys cannot span across tables without additional tricks. A knife, spoon and fork could all have the same ID.
- No enforced referential integrity.
Multiple Tables with Integrity
Some of the disadvantages of separate tables can be corrected by changing the way the cart is modelled:
Then the cart can be displayed with a union of joins:
Advantages
- Entity integrity.
- Referential integrity.
Disadvantages
- Cumbersome to create a table per type every time there is a need to refer to an item.
- An item cannot be referenced by a single ID.
Product Types > Sum Types
This approach adds multiple nullable ID columns and one of them has an ID.
Disadvantages
- Cumbersome to add a field per type.
- An item cannot be referenced by a single ID.
- To avoid accidental sporks, it requires a check that one and only one ID is used. In PostgreSQL it would look something like:
Subclassing
This is the traditional representation, and usually the preferred one. The main reason I see developers avoiding it is because it is not well supported by some ORMs which often have problems querying a set of heterogeneous types (the table definition is the easy part). I recommend building an ORM object on top of the view rather than, or in addition to, the tables themselves.
It’s also fine to use this pattern there are no shared attributes and the base table consists only of a primary key. This ensures entity and referential integrity.
Incidentally, if only the common properties are needed, this is a simple join. For example, for the total price of the items in the cart:
Advantages
- Trivial to enforce entity and referential integrity.
- Attributes that are common to all items can be placed in the base table to avoid repetition.
- An item can be referenced by a single ID.
Disadvantages
- With basic SQL there is no simple way to enforce how many
subclasses are used. In this example the items are complete and
disjoint, i.e. an item can only be a knife, fork or spoon; however,
without further constraints it is possible to have a knife, fork and
spoon with the same
item_id
or even have an item with no subclass at all. There are a few tricks to enforce this constraint, such as using triggers. - Sometimes has poor ORM support.
The Big Bag of Attributes
Advantages
- No joins needed, so it is one of the best performing choices.
- An item can be referenced by a single ID.
- Well suited to simplistic ORMs as it can map directly to a single object.
Disadvantages
- Table bloat.
- Accidental sporks. Awkward to enforce which set of attributes are used for each type and make sure your spoon doesn’t have prongs.
- Probably a maintenance nightmare. Forgoes database design principles.
Schemaless
Advantages
- No joins needed, so it is one of the best performing choices.
- A disadvantage of all the schema-based approaches are that adding or removing new types of items means manipulating the schema via DDL statements. The schemaless approach does not constrain data and so attributes do not have to be defined in advance. This is useful for user defined attributes.
- An item can be referenced by a single ID.
- Well suited to simplistic ORMs as it can map directly to a single object.
- Simplicity. Initially, at least. It moves the costs to the application layer.
Disadvantages
The reasons why I dislike using schemaless data could be a blog post in itself, but I will try to briefly summarise:
- The burden is on the application developer to document and enforce constraints.
- Only a limited number of types are available for the extended attributes. Importantly, the lack of foreign keys sacrifices referential integrity.
- The application developer must carefully migrate old data or handle the old schema if attributes are changed. This can increase application complexity. The database won’t protect you from errors.