SQL Server Comma-Delimited List

April 8, 2011 07:15 by wjchristenson2

I’ve come across a few solutions to creating comma-delimited lists in SQL Server.  The two most common solutions include CURSORS/WHILE loops and COALESCE.  CURSORS and WHILE loops are a performance drag so I won’t even go there.  COALESCE can work but the function requires variable declarations and function encapsulation if you want to return more than one record with delimited results without looping.  The approach I prefer is FOR XML PATH.  Below is an example using the AdventureWorks database.  I show a comma-delimited list of address types for 3 customers.

 

SELECT 
	customer.[CustomerID]
	,customer.[FirstName]
	,customer.[LastName]
	,SUBSTRING((
		SELECT 
			', ' + customer_address.AddressType
		FROM [SalesLT].[CustomerAddress] AS customer_address 
		WHERE customer_address.CustomerID = customer.CustomerID
	FOR XML PATH('')), 3, 200000) AS AddressTypes
FROM [SalesLT].[Customer] AS customer
WHERE customer.CustomerID IN (29544,29545,29559);

Bookmark and Share

SQL Server Triggers and Multiple Rows

March 4, 2011 09:51 by wjchristenson2

A common mistake that I’ve seen is the assumption that a DML trigger is fired once for each record modified.  This is not true.  A good SQL developer will use set-based DML statements to alter multiple records at a time.  Regardless of how many records may have been changed, the table’s DML trigger will fire only once.  The trigger’s INSERTED & DELETED tables will contain all the records affected.  I don’t like triggers, but if you find yourself in a pickle and need one… make sure you code for the possibility that more than one record was changed.

 

BAD trigger:
CREATE TRIGGER trgTableA ON TableA FOR UPDATE
AS
	DECLARE @ID int, @name varchar(50);

	SELECT @ID = i.ID, @name = i.Name FROM inserted i;

	UPDATE TableB SET Name = @name WHERE ID = @ID;
GO

 

GOOD trigger:
CREATE TRIGGER trgTableA ON TableA FOR UPDATE
AS
	UPDATE b SET 
		b.Name = i.Name
	FROM TableB b
		INNER JOIN inserted i ON i.ID = b.ID;
GO
Bookmark and Share

ADO.NET and SQL Server 2008 User-Defined Table Types

February 12, 2011 04:22 by wjchristenson2

I recently came across a situation where I needed to batch insert, update, and delete 1 to 100,000+ records to SQL Server at any given time.  Traditionally I would generate an xml string and parse it to a SQL Server table variable and do batch set operations from there.  However I heard about SQL Server 2008’s new UDTTs (user-defined table types) which I could pass a DataTable to a stored procedure as TVPs (table-valued parameters).  It sounded very attractive as I wouldn’t have to generate & parse xml strings to table variables in SQL Server.  I also thought it would give me huge performance gains.  I quickly found that UDTTs were not my saving grace after all and were less than impressive in my opinion.

Before I show examples on how to use UDTTs/TVPs from ADO.NET, I’ll give you my takeaways as to why I chose not to use them:

1) They do not scale.  If you profile SQL Server you’ll notice that for each record in your DataTable passed via ADO.NET, SQL Server will essentially create a table variable perform an insert for each record.  Imagine 10,000 inserts into a table variable before you can begin batch set operations… not a good thing.

2) “sp_executesql” does not play well with TVPs from ADO.NET.  I was in a situation where I had to dynamically generate SQL based on the schema of the table.  Therefore I wanted to cache the execution plan with “sp_executesql” as the schema would not change much.  However if you try and use “sp_executesql” with a SqlCommand with the CommandType as a stored procedure and are passing in a UDTT as a SqlParameter, it will fail.  Error:  Could not find stored procedure 'sp_executesql'.  This error only occurs if you try and pass a TVP from ADO.NET.  If you want to pass a TVP to “sp_executesql” with ADO.NET, you have to get creative with having the SqlCommand command’s type be text and use “sp_executesql” that way.

3) SQL Server compiles the execution plan every time.  You can see more info here and here.  If you use SQL Server Management Studio and pass TVPs to a stored procedure, no SQL compilations are performed.  However if you pass a TVP from ADO.NET, compilations will occur every time.  I won’t go into great detail over it as you can read the 2 links above.  However if you have a stored procedure which will be called a lot, you may want to consider this fact.

4) When passing a DataTable as a TVP from ADO.NET, the DataTable must have the exact same number of columns in the exact same order and the same compatible data types.  I can understand the data types, however there are times where I have a DataTable that may be used for other programmatic purposes outside of TVP/SQL operations. If I want to pass it as a TVP to SQL Server, I essentially have to make a new DataTable and refill it perfectly so the column count is the same and in the same order.  Very finicky and requires coordination and communication if anyone changes the UDTT as it will break code easily.

There’s my 4 reasons why I was not impressed with UDTTs/TVPs.  One nice thing is they are very simple to pass via ADO.NET and easy to use within a stored procedure once you set things up correctly.  Below is a quick example how to wire things up if you wish to use TVPs with ADO.NET.

 

Create the UDTT

IF EXISTS (SELECT * FROM sys.types st JOIN sys.schemas ss ON st.schema_id = ss.schema_id WHERE st.name = N'udtt_PersonNames' AND ss.name = N'dbo')
DROP TYPE [dbo].[udtt_PersonNames];
GO

CREATE TYPE [dbo].[udtt_PersonNames] AS TABLE(
	ID [int] NOT NULL,
	[Name] [varchar](255) NOT NULL,
	UNIQUE CLUSTERED ([ID] ASC) WITH (IGNORE_DUP_KEY = OFF)
);
GO

 

Create the Stored Procedure accepting the TVP as a UDTT

IF EXISTS (SELECT * FROM sysobjects WHERE type = 'P' AND name = 'usp_InsertPersonNames')
	BEGIN
		DROP PROCEDURE dbo.usp_InsertPersonNames
	END
GO

CREATE PROCEDURE dbo.usp_InsertPersonNames (
	@PersonNamesDT AS dbo.udtt_PersonNames READONLY
)
AS
	SET NOCOUNT ON;
	
	INSERT INTO dbo.PersonNames([ID],[Name])
	SELECT [ID], [Name] FROM @PersonNamesDT;

	SET NOCOUNT OFF;
GO

 

Create an ADO.NET DataTable (same # of columns and order as the UDTT)

        Dim dt As DataTable = New DataTable()
        dt.Columns.Add(New DataColumn("ID", GetType(Integer)))
        dt.Columns.Add(New DataColumn("Name", GetType(String)))

        dt.Rows.Add(New Object() {1, "Joe"})
        dt.Rows.Add(New Object() {2, "Mark"})
        dt.Rows.Add(New Object() {3, "Ted"})

 

Pass the ADO.NET DataTable as a TVP to the Stored Procedure

        Using conn As SqlConnection = New SqlConnection("Data Source=localhost\sqlexpress;Initial Catalog=UDTT_test;Integrated Security=True;")
            conn.Open()

            Dim cmd As SqlCommand = New SqlCommand()
            With cmd
                .Connection = conn
                .CommandType = CommandType.StoredProcedure
                .CommandText = "dbo.usp_InsertPersonNames"

                Dim tvParam As SqlParameter = New SqlParameter("@PersonNamesDT", dt)
                tvParam.SqlDbType = SqlDbType.Structured
                tvParam.TypeName = "udtt_PersonNames"
                .Parameters.Add(tvParam)

                .ExecuteNonQuery()
            End With
        End Using

UserDefinedTableTypes_Soln.zip (82.82 kb)
 

Bookmark and Share

Primary Keys and Distributed Environments

September 22, 2010 14:24 by wjchristenson2

In architecting a new application or enhancing an existing one, important considerations need to be made when choosing how to implement primary keys/IDs.  From my own personal research and experience; unique, clustered, auto-incrementing, integer identity columns as the PK perform the best when it comes to DB performance.  They result in more compact clustered indexes, less IO when the table is queried, and they yield more efficient joins.  However, they have inherent problems when used in a distributed environment.

 

Auto-Increment (Identity Columns)
Auto-Increment columns on a SQL Server have seed and and increment properties.  Because these properties are fixed and can be selected from a finite number of possible values, the probability that you’ll get a PK collision is very high.  Therefore an auto-incrementing integer identity column should only be used in non-distributed/download-only situations.  No other server would be creating PK values and thus, you would not have PK conflicts.

There is one way where you could use this type of PK.  Each node which generates PKs could be assigned an identity range to avoid conflicts.  I can see some negatives to this solution: 
1) Have to manage nodes/ranges
2) Page splitting
3) Increased IO on new record inserts as the next identity value isn’t necessarily the the max
4) Loss of PK value possibilities as ranges may not be fully utilized

 

GUIDs - NEWID()
Using a GUID or UNIQUEIDENTIFIER in SQL Server is attractive because the generation algorithm provides a sufficiently high degree of probability that the same key would never be generated twice on any 2 nodes at any given time.

Although GUIDs meet the uniqueness required of a distributed environment, they have some glaring drawbacks that should be considered:
1) Huge in size (4x larger than an integer / 16 bytes). 
2) Large size negatively affects indexes.
3) Because they are random, they would be inserted into random locations within the clustered index resulting in fragmented clustered indexes which yields poor IO performance when the table is queried.
4) Generation of a new GUID in SQL Server does cost more resources than auto-incrementing an integer.  When bulk inserting, this has some glaring performance problems.
5) NEWID() uses the the server’s network card’s ID number plus a unique number from the CPU clock to generate the GUID.  The server would need to have a NIC to ensure GUID uniqueness.

 

GUIDs – NEWSEQUENTIALID()
The answer to the randomness problems with generating a new GUID is to generate sequential GUIDs.  SQL Server has the NEWSEQUENTIALID() function which will generate a GUID that will be greater in value than the previously generated one.  This solves the problems of fragmented clustered indexes.  It also solves the performance problems with generating a new GUID.  Because the new “sequential” GUID takes the previous GUID and increments it per se, the overhead is minimal and performs almost as well as an auto-incremented integer.

So what are the drawbacks to a sequential GUID?
1) Still Huge in size (4x larger than an integer / 16 bytes).
2) Large size negatively affects indexes.
3) NEWSEQUENTIALID() uses the the server’s network card’s MAC address to assist in the GUID generation process.  If privacy is a concern, then this probably isn’t a viable option for you.  Again, the server would need to have a NIC to ensure GUID uniqueness.

 

Natural Keys
In this strategy, you use business keys to uniquely identify records (e.g. social security number for a person).  The main drawbacks with this approach is:
1) Many columns may be needed to create uniqueness of the record.
2) Inconsistent PK columns/data types.
3) Not all records may have a glaring natural key to setup a PK for.

 

Node Identifiers
For this approach, the PK will combine a value that is unique to the node and a value that is unique to the topology.  For instance, you could have a PK that combines an auto-increment identity column and a unique id for the node/server column.  You could also create a custom system to generate values for each inserted row based on the rowID/nodeID.

 

Summary
In summary, there are many ways to architect a distributed database environment in regard to primary keys.  Personally sequential GUIDs and node identifiers are my top 2 picks.  If I had to choose between the two, I think I’d opt for node identifiers.  GUID size, privacy concerns, and ease of use appear to be a bit too much in my opinion.

Bookmark and Share