Understanding SQL Server Indexes

This post follows on from the previous two about execution plans in SQL Server, in those posts I covered what an execution plan is, what they tell us and why we might need to use them. Indexes are actually nothing at all to do with execution plans but when talking about one, you often end up talking about the other. As with execution plans, there is a whole plethora of information already out there on the topic of indexing but as something that will be mentioned in abundance in this blog, I thought it best to write a post about the topic in my own words. The focus of this topic is Microsoft SQL Server so the information is specific to that, though indexes are used in all DMBS products and a lot of what is written here about SQL Server indexing is likely to cross over to other platforms.

What is an index?

So straight into it, what is an index? Simply, an index is a structure in the database that allows data in a table to be accessed in a more efficient manner and therefore (generally) more quickly.

As a classic example, an index in a database works much like an index in a text book. A text book has an index at the back – a list of words, each with a page number next to them, if I am reading a book about cars and I want to find what pages the word “engine” appears on, without an index I would have to look through every page to see if the word appeared. With an index, that work is vastly reduced as I can lookup the page numbers and then go directly to the pages I need.

The B Tree

On a more technical level, the index is a Binary Tree (B Tree) and more specifically, within SQL Server, indexes are stored as B+Trees (other DBMS platforms may differ) Taking our book about cars as an example, let’s say we want to find all the pages with the word “engine” The B Tree may look something like this

To find the word “Engine” we can traverse the tree starting at the root node, through the intermediate node and to the leaf node, the path we take would be:

This is more efficient than scanning a list of the words because the B Tree is ordered by the values we are searching for and structured in a way we can quickly find the value we need.

Types of Index

There are many types of index, below are just some that are available in SQL Server – Clustered, Nonclustered, Filtered, Unique, Columnstore

I will just focus on the concepts of Clustered vs Nonclustered in this post, I am sure the others will come later though again, there is plenty of information already out there that has been written about this.

Clustered Index

A clustered index is an index on a table that logically stores the data in order, a phone book is a good example of this – the data of the phone book is stored in surname, firstname order. Because data is stored in the order of the column(s) on which we create the index (known as the key) there can be no more than one clustered index per table.

While the key of a clustered index can be on one or more columns, the leaf node pages contain the values of all of the columns even those not in the key, meaning we can traverse the B Tree on the key columns but are able to retrieve all columns for the row from the leaf node.

In SQL Server, by default a clustered index is created on a column that has a primary key defined on it.

When data is inserted or updated into a table with a clustered index it must be inserted at the correct point to respect the order of the index. A table without a clustered index is called a heap (data is just thrown on a heap in any order)

Nonclustered Index

This is essentially another copy of the table sorted in the order of the key, which only contains a subset of the columns and can be thought of as similar to the index at the back of a text book. As a Nonclustered index is a separate copy of the table, we can have multiple indexes on a table on various columns but while it’s technically possible to create multiple indexes on the same column (for example with different included columns or filters), this is rarely desirable.

The leaf nodes contain the data from the key column(s) and can also contain included columns. Included columns are not used to traverse the tree but since they are stored on the leaf pages of the index, we can return them in queries.

Each Nonclustered index also contains a “secret” included column which is either the column(s) on which the clustered index is created or if the table is a heap, a pointer to the page and row number.

The “secret” included column is so that the database engine can go back to the underlying table to retrieve any columns that are not in the key, an operation known as a key lookup or RID lookup on Clustered indexes and heaps respectively (the key lookup is sometimes known as a bookmark lookup)

Practical Example

Let’s look at a practical example of index use. I will use the Stack Overflow Database, I am using the small 2010 version provided under cc-by-sa 4.0 license from Stack Exchange Data Dump and running in compatibility level 130 on SQL Server 2022 on a VM with 8 cores and 46GB RAM.

The following query below retrieves some basic information about users called John

SELECT	DisplayName,
		Location,
		LastAccessDate
FROM	dbo.Users
WHERE	DisplayName = 'John';

With no Nonclustered indexes explicitly created on the database yet, lets see how SQL Server accesses this data behind the scenes by looking at the actual execution plan:

We can see SQL Server performed a Clustered Index scan. As the Id Column on the Users table is a primary key, a Clustered Index was created on that column automatically so the data is stored in order of Id. SQL Server scanned that index from start to finish and looked at every row to see if DisplayName = ‘John’, the tooltip confirms this

299399 rows were read (the row count of the table) and 932 records were found.

We can see how much work this entailed in terms of page reads by adding

SET STATISTICS IO ON

to the top of the query. This will output extra information to the messages tab showing page reads. For the above query, we can see that 7788 pages were read (logical reads 7788)

I won’t delve any further into this output as again, it is a separate topic all of its own but just understand that 7788 data pages were read.

We can make our query more efficient by creating a Nonclustered index that is ordered by the column we are filtering:

CREATE INDEX IX_DisplayName ON dbo.Users
(
	DisplayName
);

Here we tell SQL Server that we are creating an index on the table dbo.Users on the column DisplayName. When we run this, SQL Server goes and makes a copy of the table that contains only the DisplayName column, ordered by DisplayName. If we run our SELECT query again, we see something different:

Here, we see a Nonclustered Index seek and we can see it read far less rows than before because it was able to use the B-Tree to jump to the DisplayNames with a value of John and read the values from there rather than reading all the DisplayNames that come before and after John also – much more efficient. This is confirmed in the number of rows read property of the Index Seek operator

And we can see the logical reads have reduced as a result:

You may have noticed there was another operator that appeared however – the Key Lookup.

A key lookup occurs when SQL Server seeks an index to do a more efficient search but the index doesn’t contain all of the columns it needs to fully satisfy the query.

If we look again at our query

SELECT	DisplayName,
		Location,
		LastAccessDate
FROM	dbo.Users
WHERE	DisplayName = 'John';

and our Nonclustered Index

CREATE INDEX IX_DisplayName ON dbo.Users
(
	DisplayName
);

We can see the select list of the query contains some columns which are not in the index – Location and LastAccessDate.

What a key lookup does is retrieve the Data from the leaf node of the Nonclustered index along with the value of the Key of the Clustered index which is stored in the leaf node (Id in this case) and for each row it retrieves, it takes the clustering key and uses that to perform a lookup into the Clustered index and retrieve the value for the two missing columns.

What the execution plan shows us is happening is

Index seek returns first DisplayName = ‘John’ record
The key lookup uses that record’s Id to return the record’s Location and LastAccessDate
Index seek returns next DisplayName = ‘John’ record
The key lookup uses that record’s Id to return this record’s Location and LastAccessDate
So on for every John returned by the index seek

We can see that the key lookup is executed multiple times, one for each John returned from the clustered index and we can see that in the tooltip

As we have previously discussed, SQL Server uses a cost based optimizer but it has determined the seek + the key lookup is still cheaper than the alternative of scanning the entire clustered index, however if a large number of Johns were to come back, the extra work of looking each one up would maybe be more expensive than just scanning through the table so the index would not help us in that case.

Fixing the Key Lookup

To ensure our index is as efficient as it can be, we need to add the missing columns to eliminate the Key Lookups. An index which contains all the columns in the query is known as a covering index. We can’t alter the key / include column layout of an existing index so we need to drop and create it.

DROP INDEX IX_DisplayName ON dbo.Users;
CREATE INDEX IX_DisplayName ON dbo.Users
(
	DisplayName
)
INCLUDE
(
	Location,
	LastAccessDate
);

Now we can see SQL Server performs a perfect index seek:

and the page reads have decreased again

We have gone from 7788 logical reads with no index to 2869 with a partially covering index, to 8 with a covering index and this shows us the power of indexing.

On to the Downsides

Yes, there had to be some!

There are downsides of indexes – each Nonclustered index is a copy of the table which includes a subset of columns, meaning we have to store that data again. This takes up space in the database and it means that all these extra pages must be backed up when the database is backed up, restored when the database is restored, consistency checked when we run a DBCC CHECKDB (with default options) etc

A further downside is the maintenance of the index when we write to the tables. For the index we created on the users table, each INSERT on that table now not only needs to insert into the clustered index (or heap) but the Nonclustered index too and it has to insert the row in the correct position to maintain the order, the same is true for updates. While we don’t need to worry about maintaining order on deletes, again, deletion of a row means it needs to be deleted from the clustered index (or heap) as well as the Nonclustered index. Too many indexes can degrade write performance and increase storage overhead.

Conclusion

So with that, we have spoken about what indexes are and why they are helpful, we also looked at a practical example of an index that helped one of our queries

References / Further Reading

Index Architecture and Design Guide

Wikipedia – B-tree

B-Trees vs B+Trees