Primary Keys and Distributed Environments

September 22, 2010 14:24 by wjchristenson2

In architecting a new application or enhancing an existing one, important considerations need to be made when choosing how to implement primary keys/IDs.  From my own personal research and experience; unique, clustered, auto-incrementing, integer identity columns as the PK perform the best when it comes to DB performance.  They result in more compact clustered indexes, less IO when the table is queried, and they yield more efficient joins.  However, they have inherent problems when used in a distributed environment.

 

Auto-Increment (Identity Columns)
Auto-Increment columns on a SQL Server have seed and and increment properties.  Because these properties are fixed and can be selected from a finite number of possible values, the probability that you’ll get a PK collision is very high.  Therefore an auto-incrementing integer identity column should only be used in non-distributed/download-only situations.  No other server would be creating PK values and thus, you would not have PK conflicts.

There is one way where you could use this type of PK.  Each node which generates PKs could be assigned an identity range to avoid conflicts.  I can see some negatives to this solution: 
1) Have to manage nodes/ranges
2) Page splitting
3) Increased IO on new record inserts as the next identity value isn’t necessarily the the max
4) Loss of PK value possibilities as ranges may not be fully utilized

 

GUIDs - NEWID()
Using a GUID or UNIQUEIDENTIFIER in SQL Server is attractive because the generation algorithm provides a sufficiently high degree of probability that the same key would never be generated twice on any 2 nodes at any given time.

Although GUIDs meet the uniqueness required of a distributed environment, they have some glaring drawbacks that should be considered:
1) Huge in size (4x larger than an integer / 16 bytes). 
2) Large size negatively affects indexes.
3) Because they are random, they would be inserted into random locations within the clustered index resulting in fragmented clustered indexes which yields poor IO performance when the table is queried.
4) Generation of a new GUID in SQL Server does cost more resources than auto-incrementing an integer.  When bulk inserting, this has some glaring performance problems.
5) NEWID() uses the the server’s network card’s ID number plus a unique number from the CPU clock to generate the GUID.  The server would need to have a NIC to ensure GUID uniqueness.

 

GUIDs – NEWSEQUENTIALID()
The answer to the randomness problems with generating a new GUID is to generate sequential GUIDs.  SQL Server has the NEWSEQUENTIALID() function which will generate a GUID that will be greater in value than the previously generated one.  This solves the problems of fragmented clustered indexes.  It also solves the performance problems with generating a new GUID.  Because the new “sequential” GUID takes the previous GUID and increments it per se, the overhead is minimal and performs almost as well as an auto-incremented integer.

So what are the drawbacks to a sequential GUID?
1) Still Huge in size (4x larger than an integer / 16 bytes).
2) Large size negatively affects indexes.
3) NEWSEQUENTIALID() uses the the server’s network card’s MAC address to assist in the GUID generation process.  If privacy is a concern, then this probably isn’t a viable option for you.  Again, the server would need to have a NIC to ensure GUID uniqueness.

 

Natural Keys
In this strategy, you use business keys to uniquely identify records (e.g. social security number for a person).  The main drawbacks with this approach is:
1) Many columns may be needed to create uniqueness of the record.
2) Inconsistent PK columns/data types.
3) Not all records may have a glaring natural key to setup a PK for.

 

Node Identifiers
For this approach, the PK will combine a value that is unique to the node and a value that is unique to the topology.  For instance, you could have a PK that combines an auto-increment identity column and a unique id for the node/server column.  You could also create a custom system to generate values for each inserted row based on the rowID/nodeID.

 

Summary
In summary, there are many ways to architect a distributed database environment in regard to primary keys.  Personally sequential GUIDs and node identifiers are my top 2 picks.  If I had to choose between the two, I think I’d opt for node identifiers.  GUID size, privacy concerns, and ease of use appear to be a bit too much in my opinion.

Bookmark and Share