On the proper selection of a primary key (Article)

Description :: And of the inconsequential nature of the choice

Terms
A candidate key is any set of attributes (fields) of a relation (table) which, when used together, uniquely identify a tuple (row) of that relation. A unique constraint is insufficient if the attributes are nullable, and the database allows duplicate nulls even in a 'unique' attribute.

A primary key is a designated candidate key which will be used by anyone and anything referring to tuples in the given relation. It is the 'chosen one' among the candidate keys, but by nature, all candidate keys are equal. There is no requirement that a primary key be chosen.

A foreign key constraint is a constraint on attributes in a relation that requires that the values of those attributes, together, always refer to an existing tuple in the referred relation.

A surrogate key is an attribute or set of attributes created for the purpose of having a candidate key, whether or not another one already exists. Surrogate keys are most often a single field, and most likely integers generated as a semi-continuous sequence -- which is why they're often called 'meaningless'. Alternatives include GUID fields (much larger, random, with no particularly obvious 'order'), and two-field surrogate keys, generally comprised of a sequence number and a 'location' identifier for distributed databases. Surrogate keys may also be known as artificial keys to distinguish them from natural keys, that is, candidate keys already present in the data 'naturally'.

Examples
City and county names are (that I know of) required to be unique within a state, and state names are unique within our country. City and county names are duplicated between states, however. When uniquely identifying a city or county, you should use the city or county name only in combination with a state name. By themselves, they are not uniquely identifying. Oddly, this is a rare case of a 'natural' candidate key.

Global positioning systems provide us an almost unique method of identifying locations -- but what of time? Three coordinates for space, and one for time! If you use earth-based coordinate to identify, say, a tree in the middle of a field in south africa, will those coordinates stay true forever? Plate tectonics, if nothing else, will eventually move that tree (and everything around it) by slight amounts. How, then, will you identify the tree? By space and time? At a later date, how will we know which tree of many was originally at the coordinates specified? Time is an important factor. Business names get reused (as I recall, 'unix', as a trademark, has been used by several businesses, and even for a while been without an owner!) What of "the artist formerly known as Prince"?

Social security numbers do not uniquely identify people. They identify retirement accounts. On rare occasions, two people (or rather, their retirement accounts) have been assigned the same social security number. As I recall, at least one such case involved two children born in the same hospital on the same date -- meaning that combining the social security number with the date of birth is not sufficient. One person may also have multiple social security numbers; although the SSA prefers not to do this, they can, particularly in cases of serious identity theft and fraud, or to help you escape an ex who just won't take 'no' for an answer. What could we use to uniquely identify a person? Well, if we assume that all of the cells in a person's body have the exact same DNA, and if we also assume that should parts of the body become separated, only one would be considered 'the person', and again assuming that it were practical to use a transcription of that DNA as a series of letters -- we would still have the problem of identical twins and clones (if we're going to assume that every cell in the body shares the same DNA, we might as well force ourselves to assume that identical twins and clones will also have perfect copies of the DNA.) Luckily, only one body can occupy a given space at a given time, but identifying a person by their location is difficult at best. And there's the issue of 'two people, one body' -- conjoined twins or other clinical cases that aren't necessarily the most pleasant to think about, but do happen. Uniquely identifying a person by some natural external attribute is really difficult.

Maybe we'll have more luck with companies (tax entities)? Is a business name unique? Generally, yes. If you consider the name only in all-caps, it should probably be unique within a certain jurisdiction. Larger companies need their name to be 'theirs' across many jurisdictions (particularly international companies) and there's plenty of room for legal wrangling there. But on some level, yes, names are unique. Names change however -- just as with people, businesses may change their name if they so chose. Aside from this, there's the split/join issue: two or more businesses may merge into one, one business may fork into several. When that happens, the name or names used by an original company may continue to be used, or dropped entirely. If you identify a business by its name (and jurisdiction), such mergers and splits may become a problem.

Several candidate keys?
To continue the above example, consider a database I work with regularly: for businesses, we provide a name field, but also a federal tax id (FEIN), a DEA number for those entities who deal with drug prescriptions, an ID number that comes from another accounting system, etc. Our assumption is that any given business may be identified by any of these ID's, if they have them. The FEIN is the only one we can really expect to be routinely present, though we don't always have it for some one-time vendors. A requirement of candidate keys is that all tuples (rows) should have a non-blank value for the attributes involved. Because any of these might be missing, and our users are too lazy to make sure they get the ones that should always be available, we can't use them as candidate keys. (FEIN's and SSN's must be requested after the 'birth' of the entity or person to which they will refer -- there is a period during which a business, and a child, will not have a government-assigned identifier. If you must track such entities during their 'infancy', a blank identifier may be needed.) But let's assume we could, and that all of them were given to us, all the time. Is it not obvious that any one of them by itself uniquely identifies a tuple, and that none have any priority over the rest? A candidate key is just as good as a primary key. We could refer to tax entities by their FEIN number in one case, and by their DEA number in another, with no harm done. Using the same candidate key everywhere (and therefore, picking a primary key) simply makes things easier -- easier to remember, easier to use also (you can match things up without having to cross-reference from one ID system to another.)

Source of almost-natural candidate keys
In the examples above, the US government (in its various divisions) is providing candidate keys for entities. The SSA assigns social security numbers, the DEA assigns its own set of ID's, the IRS assigns others. We don't necessarily know the intimate details of how these identifiers are created and assigned, and we only vaguely know that they're unique (barring clerical mistakes.) But we often hear that these are to be considered 'natural' candidate keys in that we, as an organization, have not created them. They were provided to us by 'the world'.

Why not roll your own, though? Unique identifiers are strangely rare. When provided to us by outside agencies, can we really trust that their system will be as infallible as they suggest? Social security numbers were unique until someone messed up, should we be so trusting as to ignore this? What of someone changing their unique identifier? A business could request that its FEIN number be changed, invalidating the old one, but remaining one-to-one with the business. Changing the value of a unique identifier is a painful process: all tuples referring to this one must be changed.

There are alternatives to using 'natural' candidate keys. You can avoid trusting 'natural' (yet artificial) candidate keys by creating your own surrogate keys. You should do so carefully, of course.

- Keep some sort of constraints on any 'natural' candidate keys you think you've discovered. SSN's may not be be absolutely unique, but you should probably keep tabs on them, and at least very strongly hint at any issues. You might even require them to be unique, until you discover a valid 'edge' case. But by maintaining your own primary key, you avoid having to deal with changing the constraints on which fields (together) uniquely identify an entity. You won't have to add fields to all tables, change all the foreign key constraints, and change code in various layers of your application.

- Use whatever sequence/generator facility is provided by your database system. If you attempt to use a query to find the "last identifier used" and increment by one, you're likely to run into concurrency issues when two isolated transactions simultaneously request a new identifier for a tuple in a relation. (Sequence) Generators act outside any transactions; you may have gaps in the sequence from unused identifiers, but these gaps are well worth the cost, considering the alternative. Any attempt to use transaction-bound queries to find and recycle unused identifiers will encounter problems as well.

- Be mindful of the 'meaning' of the tuples you're asserting. Our software, for example, uses a function to find the identifier for rows in lookup tables by name, even though names are not the primary key. Why? Because if we were to reload the database, and those lookup table entries were to be assigned different primary key values (which would be consistent within the database), we need our software to still identify the correct rows. But we're still vulnerable in other cases -- what if we translate our software to another language? Suddenly, the text field we're using is no longer safe. The id values, were we to make sure we always used the same ones, would in fact be safer. Try to think about how you'll deal with similar situations.

Surrogate (artificial) keys are pointers!
We hear this so often, it's sad. So let's talk about this. For one thing, you'll probably hear that natural primary keys (as rare as I think we've shown them to be, and more often than not, they're actually artificial surrogate keys provided to you by an outside entity over which you have no control) are required by relational theory. Someone might even suggest an 'information principle' or one of Codd's rules. The closest is Codd's first rule, which requires that all values in a relational database be explicitly represented as such. Nothing may be hidden. This is directly tied to the concept of surrogate keys as pointers, as we'll now see.

The pointer issue is the result of the dreaded CODASYL model of (non-relational) databases. If you wanted to represent a one-to-many relationship between, say, people and addresses, you would have the following:
- a people "table",
- an address "table",
- each person tuple would have a pointer to the first relevant address tuple,
- each address tuple would have a pointer to the next relevant address tuple,
- the last address tuple would have a pointer back to the person tuple, to let you know you'd explored all the addresses.

Optionally, lists could be doubly-linked: addresses would point both to the 'next' and the 'previous' address, the first and last addresses would both point back to the person, and the person would point to both the first and last address.

These systems were fast for particular types of queries, but were both difficult to use and difficult to maintain. The pointers above don't mean anything, just as with surrogate keys. The difference, however, is in the concept of navigation.

To find all addresses for a person, you must first find the person's row. You then call a function to get you the first address. You then continue calling functions to get the next address, until you're told you've reached the end of the addresses. This is not the same as requesting all addresses related to a given person. Order matters, for one thing. In a relational database, using a "person_id" on each address, there is no order implied.

But worse, updates are difficult. What if an address should be moved to another person? You must now fix the pointer chains for both people: on one side, you must 'unhook' the address from the sequence of addresses, and on the other, you must 'hook' the address back into the existing chain, if any. All of these operations are extremely error-prone: if a single pointer gets messed up, the two chains could become linked, so that navigating all addresses from one person may land you on another person's addresses, and eventually that other person. In a relational database, changing an address's owner involves only changing the "person_id" of that address, avoiding touching any other data -- this greatly reduces the risk of error, and increases concurrency (you'll need some form of locking to make sure two transactions, modifying the same data, don't interfere with each other -- the less you touch, the less you need to protect, the more you can do at once.)

Furthermore, to find out, from one address, what other addresses are attached to the same person as the one you already have, you must navigate the chain of pointers to find the person, then loop around to find the rest of the addresses, until you get back to where you started. In a relational database, you can simply request all addresses that have the same "person_id" as the one on the address you already have, without even looking at the person's row.

But most importantly, the primary key of a row, and the values of any foreign key fields, travels with the rest of the row's data. Not only are the values immediately visible, but they are not tied to the physical location of the row in the database. Restoring a database from a backup, reorganizing the structure of the database (a sort of "defragmentation"), and other changes will not affect primary and foreign key values. You can store a primary key outside the system and use it later without fear that a change in the database structure will have invalidated the key. A pointer is a physical concept, tied to storage, a key is a logical concept, tied to identification.

To illustrate this, consider the annoying scenario of a primary key changing in a relational database; if you've been using SSN as a person's identifier, and they change their SSN (say, they've had a lot of fraud and had to ask the SSA to assign them a new number), you now have to not only update their 'person' record, but also update all records attached to that 'person' -- every single 'address' record, 'bill' record, etc. must be modified to reflect the change, as that's how they're attached. Now consider how this would work in a non-relational setting, with pointers: the SSN would never have the option of being the pointer itself, and a change to the SSN would never modify the pointer. All attached records could be left alone. On the other hand, re-organizing the database could change the physical location of the 'person' record, even though nothing about the person changed, and all attached records would need to be updated. That's the difference between a logical identifier and a physical identifier, and the difference between a primary key (even artificial) and a pointer. [Above paragraph added 2007/08/26]

But to further prove that surrogate keys are not pointers, consider non-surrogate candidate keys: are they the same as pointers? As all candidate keys are logically the same, that is, they all behave identically from a relational-theory point of view, what you say of natural candidate keys is also true of surrogate keys. Either they uniquely identify entities, or they don't. Either they are physical and require 'navigation', or they are logical, and can be used in relational queries (particularly in join operations). Opponents of artificial/surrogate keys will likely never rant about the evils of natural candidate keys (as that's exactly what they want you to use), so the argument pretty much dies here.

But I'm not one to stop when I've proven my point -- oh no. So here's some more! My girlfriend's cousin's son has an issue with naming people. He names them all after Disney characters: as I recall, his grandmother is "Minnie", someone else is "Pluto", etc. Nobody's entirely sure how this young child decided on the names, but he's stuck by them for a few years now. His family has started referring to each other by those names on occasion too, because they find it cute. Meaningless primary keys! But they work, right? Right.

In fact, the above example demonstrates one other difference between pointers and keys: an object can only be at one location in memory (only one pointer value truly refers to it), while a row may have several candidate keys, all of which are equally good at identifying it. Physically, there's only one of you, but logically you can be known by several names, and it's possible for those names to be unique. (In the example above, no two people were given the same nickname; that may not be globally true, but for my purposes it suffices.) [Above paragraph added 2008/08/15]

We hope you've enjoyed the ride. Tip your waitress, and come back next week. Maybe I'll be in a political mood by then.