In SQL Server, an index is used to speed the retrieval of data from a table. Indexes are built from one or more columns in the table, and SQL Server uses these Indexes to look up and retrieve data quickly and efficiently. In order to effectively use Indexes in SQL Server, it’s important to understand what the different types of Indexes are. In this Index Basics post, we will cover the basics about Clustered Indexes by trying to answer a few common questions.
What is a Clustered Index?
It has been said that a clustered index is the most important index you can put on a table, because a clustered index tells SQL Server how to physically order the table’s data. Let that sink in for a moment: A clustered index determines the physical order of data in a table. If a table does not have a clustered index, its data is stored in an unordered structure called a “heap”. While there are times that a heap can be useful (such as bulk loading huge amounts of data), as a general rule a heap should be avoided unless you have a specific case where you need it. According to Microsoft:
With few exceptions, every table should have a clustered index.
A classic example of a clustered index is a telephone directory. The names in the directory are listed in order of last name, first name. If I asked you look up “John Doe”, you could easily locate that record regardless of how many thousands of names may be in the directory. However, if there were no order specified for the records in a telephone directory, and everything was just stored randomly, it may take quite a while for you to locate “John Doe”. You would have to scan through each of the records individually until you found the one you were looking for. In database terms, this is known as a table scan. The record you’re looking for might be the 1st one, or the 10th one, or the 10,000th one. Since the data is not stored in any specific order there is no way to know. At a basic level, this is the difference between a table with a clustered index (ordered data) and a heap (unordered data).
How do I add a Clustered Index to my table?
Thankfully, SQL Server makes adding a clustered index easy. Since having a clustered index is so important, it will usually do it for you behind the scenes. In fact, if you have a primary key specified on a table you most likely already have a clustered index on that table as well. When you create a primary key and a clustered index is not specified, SQL Server will automatically create a clustered index using the column(s) you specified as the primary key.
However, if you want your primary key and clustered index to be on separate table columns you can do that as well. Let’s look at a few CREATE TABLE statements to see the difference.
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 |
/*Create a table with a primary key, but do not specify a clustered index. The clustered index will be created automatically by SQL Server on the same column as the primary key. */ CREATE TABLE [dbo].[employee]( [ID] [int] IDENTITY(1,1) NOT NULL PRIMARY KEY, [firstName] [varchar](50) NULL, [lastName] [varchar](50) NULL, [email] [varchar](50) NULL ) /*Create a table with a primary key, and also specify that the primary key is the clustered index.*/ CREATE TABLE [dbo].[employee]( [ID] [int] IDENTITY(1,1) NOT NULL PRIMARY KEY CLUSTERED, [firstName] [varchar](50) NULL, [lastName] [varchar](50) NULL, [email] [varchar](50) NULL ) /*Create a table with a clustered index but without a primary key. */ CREATE TABLE [dbo].[employee]( [ID] [int] IDENTITY(1,1) NOT NULL UNIQUE CLUSTERED, [firstName] [varchar](50) NULL, [lastName] [varchar](50) NULL, [email] [varchar](50) NULL ) /*Create a table with a primary key and no clustered index. */ CREATE TABLE [dbo].[employee]( [ID] [int] IDENTITY(1,1) NOT NULL PRIMARY KEY NONCLUSTERED, [firstName] [varchar](50) NULL, [lastName] [varchar](50) NULL, [email] [varchar](50) NULL ) |
At this point it is important to note that you can only have one clustered index on a table. Since the clustered index specifies the physical storage order of the data in the table, a table can contain only one clustered index.
Are a Clustered Index and a Primary Key the same thing?
No. As illustrated in the previous example, a primary key and a clustered index are two separate entities, even though you can have a primary key and a clustered index on the same column(s) in a table. If you chose to do so, you can have your clustered index and primary key on different columns.
What should I use for a Clustered Index on my table?
The clustered index can be a single column, or it can be multiple columns, like the way a telephone directory is organized by last name and first name. Choosing a good clustered index on your table is critical! It affects more than you realize, and a poor choice can cause problems such as wasted disk space, wasted memory, higher execution times, fragmentation, and more.
While using multiple columns (also called a composite index) for a clustered index may sound like a good idea at first, you want to try to create your clustered index with as few columns as possible. All of your nonclustered indexes will contain the clustered index as a pointer back to the data in your table. We’ll discuss this more in a future article, but for now just know this: If you have a large clustered index, it will be replicated across all of your nonclustered indexes, which can eat up some storage and memory pretty quickly if you’re not careful.
As a general rule, your clustered indexes should always be unique, narrow, static and ever-increasing. Kimberly Tripp does an excellent job explaining this here . Another great article with a lot of details about this is Effective Clustered Indexes by Michelle Ufford.
A good example of a clustered index that is unique, narrow, static and ever-increasing would be an identity column on an INT datatype. It is always a unique number that keeps increasing, it is only 4 bytes, and it is static because SQL Server won’t let you modify it easily.
An example of a bad choice of clustered index would be a UNIQUEIDENTIFIER (or GUID, as it is commonly known). While it does meet the criteria of being unique and static, it is not ever-increasing or narrow. A 16 byte GUID is four times the size of an INT datatype, and the randomness of the generated GUID will cause rampant fragmentation in your table. Even if you generate the GUID using SQL Server’s NEWSEQUENTIALID(), which will cut down on some of the fragmentation, you still have a large clustered index which can lead to slower queries, larger tables, wasted memory, etc… For more information, see Kimberly Tripp’s GUIDs as PRIMARY KEYs and/or the clustering key .
The Bottom Line…
Having a clustered index on your table is very important, and is something that you should consider for every table unless you have a specific reason not to. A good choice of clustered index on your table helps SQL Server store and retrieve the data more efficiently, while a poor choice can cause performance and resource problems.
Resources:
For a more detailed explanation of Clustered Indexes, see the below resources:
- Create Clustered Indexes (MSDN)
- Clustered and Nonclustered Indexes Described (MSDN)
- Effective Clustered Indexes
- GUIDs as PRIMARY KEYs and/or the clustering key
- Ever-increasing clustering key – the Clustered Index Debate……….again!
Nice article, Eric. Thanks for taking the time to write it and post it.
A Clustered Index (CI) does not guarantee physical order, though. It only guarantees logical order yet still can’t be relied on to properly return data in the desired sort order without an ORDER BY. Depending on the size of the CI, pages may be a part of a “mixed extent” where the pages of more than one table are interleaved. On larger CI’s with no mixed extents, the physical presence of pages and extents, even in uniform extents, may not actually be in physically correct order either in memory or on disk. A single page split takes care of the notion that the CI is in physical order. Even the rows in a page aren’t guaranteed to be in physical order. One expansion of a single variable size datatype on a given row will demonstrate that.
A CI does, however, maintain the correct logical order base on a combination of the doubly linked list or previous and next page entries and the “Row Offset” values on each page.
Last but not least, I’m not so sure about the “ever-increasing” part of the requirements for CIs. I have to say “It Depends” especially on SANs, which already have a huge amount of random activity going on. Having a CI in “ever-increasing” order can make a rather contentious hot-spot at the logical end of the table especially when there a large number of inserts from even just a couple of concurrent sources.
With that in mind, I’ve been doing some experiments over the last 2 years and am finally wrapping up my findings. To summarize my findings, fragmentation doesn’t actually matter. Only the average fullness of pages on the table/index matter. For batch files, the R/W heads on SANS are jumping all over the place anyway and single lookups aren’t affected by fragmentation at all. I’ve also found that the use of GUIDs isn’t nearly as bad as some would have you believe and can have the advantage of actually reducing single hot-spot contention on heavy write systems by spreading the load into 16 different areas (based on the hexadecimal values in GUIDs). I will agree that I hate typing them though.
p.s. There actually are two places where you can rely on the logical order of a Clustered Index without an ORDER BY but each of those is well beyond the scope of a general knowledge article.
Thanks again for taking time to write this article.
Pingback: Index Basics: Hidden Costs Associated With Indexes - SQL Nuggets
Pingback: Index Basics: Find Tables That Do Not Have A Clustered Index - SQL Nuggets