The next post in this short introductory series is about statistics in SQL Server. This isn’t an all-encompassing deep dive into the topic; the aim is to outline what statistics are, where and how SQL Server uses them, and to present a quick overview of an example statistic and an example of statistics at work. This post has a SQL Server focus, but statistics are a pivotal mechanism in relational database systems, so a lot of these concepts will likely cross over to other systems too.

I like to think of SQL Server statistics as metadata – data about data. What I mean by this is they describe the data that is in a table, and SQL Server can use this metadata when it compiles execution plans – for example, how many rows are in a table, how many rows in a particular column are equal to a certain value, and how unique the data in a column is.

But Why?

Why does SQL Server need to store such data? Well, as I mentioned in my earlier post, when SQL Server builds an execution plan, it is planning its “route” to get the data from the tables and will choose, from a number of potential execution plans, the one it thinks is the most efficient. Execution plans are chosen based on cost, and costs are estimated from the amount of CPU and IO work SQL Server expects will be required to fulfill an execution plan.

Example

Let’s take the following query in the Stack Overflow database (available under cc-by-sa 4.0 license)

SELECT	DisplayName
FROM	dbo.Users
WHERE	Reputation = 1;

and an index on the filter column:

CREATE INDEX IX_Reputation ON dbo.Users
(
	Reputation
);

If I asked you how you would execute that query, how would you do it?

Your options might be:

  • Seek the IX_Reputation index – jump straight to the users with a reputation of 1 and stop reading once the reputations are no longer 1. Since the DisplayName column in the select list is not in the index, you’ll need a Key Lookup against the clustered index for each matching row to get the DisplayName
  • Scan the entire users table, checking each row for reputation = 1, returning the rows that match

Which is the best option?

What about this query:

SELECT	DisplayName
FROM	dbo.Users
WHERE	Reputation = 2;

Same query, but with a different reputation value – the same possible execution options apply.

Which is the best option?

In true SQL Server fashion, the answer is “it depends” and as always, we should never say “it depends” without saying what it depends on.

So what does it depend on?

Some things we might want to consider when deciding which way would be best to execute the query:

  • How many rows are in the table?
  • How many rows are likely to match the reputation value?

The answers to these questions can make a big difference to which is the better way to execute the query.

If we have a large table with many users of reputation 1, those Key Lookups with the index seek might be quite costly and it may be cheaper to just scan the full table.

Likewise, if not many users have a reputation of 1, scanning the full table to return a few rows is quite inefficient.

Let’s run both and see what SQL Server does. The database is at compatibility level 130 on SQL Server 2022, on a VM with 8 cores and 46GB RAM.

SELECT	DisplayName
FROM	dbo.Users
WHERE	Reputation = 1;
SELECT	DisplayName
FROM	dbo.Users
WHERE	Reputation = 2;

SQL Server did something different for each query.

What Happened?

What happened here is that SQL Server had some data to help it decide which execution method was better for each reputation value, and that data came from statistics. Statistics on the Users table told SQL Server that there are many users with a reputation of 1 and not so many with a reputation of 2, so SQL Server decided that for Reputation = 1, the Key Lookups associated with the index seek were more costly than scanning the full table.

As it happens, we can confirm that the key lookup approach is more costly by forcing SQL Server to use the index for the Reputation = 1 query and looking at the STATISTICS IO output and the execution plan.

SET STATISTICS IO ON;

SELECT	DisplayName
FROM	dbo.Users WITH (INDEX (IX_Reputation))
WHERE	Reputation = 1;

SELECT	DisplayName
FROM	dbo.Users
WHERE	Reputation = 1;

We can see SQL Server costed the index plan much higher (we can see this by looking at the cost relative to the batch), and the logical reads of the key lookup plan were far higher. As an aside, I wouldn’t recommend forcing the optimizer down a certain path under most circumstances; however, these types of hints are useful for understanding why SQL Server made one choice and not another.

Anatomy of Statistics

Let’s now look at how SQL Server was able to use statistics to pick the most efficient plan.

A statistic is an object in the database; we can see statistics in the object explorer under the Statistics folder for the table:

Here we have two statistics objects.

You may notice they have familiar names:

The two statistics we can see are named after the two indexes on the table. Each time an index is created, an associated statistics object is created.

Let’s look at the statistic for the IX_Reputation index. There are multiple ways to view it: we can either right click the object in the object explorer and select Properties, or we can use the command DBCC SHOW_STATISTICS(). We’ll use the Properties menu here.
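If you prefer the command route, a quick sketch of the DBCC approach, using the table and index names from this post, looks like this:

```sql
-- Show the header, density vector and histogram for the statistic
-- that backs the IX_Reputation index on dbo.Users
DBCC SHOW_STATISTICS ('dbo.Users', IX_Reputation);
```

This returns the same three sections we are about to walk through in the Properties window.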

If I right click the IX_Reputation statistics object and select properties, the statistic opens up.

This tells us the name of the statistic and which table / columns the statistic is created on. As this statistic is for the IX_Reputation index, which is only on the Reputation column, the statistic is also only on the reputation column.

Selecting Details on the left pane gives us further information:

We can see all kinds of information here, which is quite overwhelming at first, so let’s look at each section one by one.

Break It Down

The first section is the header:

This section is quite wide and so I’ve spread it across two screenshots.

This section tells us:

Name

The name of the statistic – we saw this in the object explorer

Updated

When the statistic was last updated – more on this later

Rows

The number of rows in the table (correct at the time of the statistic being updated)

Rows Sampled

The number of rows in the table that were sampled to create the statistic (more on sampling later too)

Steps

How many steps in the histogram

Density

A measure of how unique the data in the column is, calculated as 1 / number of distinct values

Average Key length

The average number of bytes a value in the column takes up

String Index

The documentation states:

Yes indicates the statistics object contains string summary statistics to improve the cardinality estimates for query predicates that use the LIKE operator; for example, WHERE ProductName LIKE '%Bike'. String summary statistics are stored separately from the histogram and are created on the first key column of the statistics object when it is of type char, varchar, nchar, nvarchar, varchar(max), nvarchar(max), text, or ntext.

Filter Expression (no label)

It is possible to create filtered statistics – think statistics with a WHERE clause. This shows the filter predicate where one exists. Note that, for some reason, there is no header label for this in the view of the statistic we are looking at.
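As a sketch, a filtered statistic can be created manually with a WHERE clause – the statistic name here is made up for the example:

```sql
-- Create a statistic that only describes users with a high reputation.
-- FS_Users_HighRep is a hypothetical name for illustration.
CREATE STATISTICS FS_Users_HighRep
ON dbo.Users (Reputation)
WHERE Reputation > 10000;
```

The WHERE clause you use here is what appears in the Filter Expression field.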

Unfiltered Rows (no label)

The number of rows in the table before the filter (above) was applied. This is blank if no filter was applied

Persisted Sample Percent (no label)

This is used to specify the amount of data that is sampled when the statistic is auto updated, more on updating statistics later

The next section is the density vector:

The density of the column is a measurement of how unique the data in that column is.

There is an excellent article written about the density vector on SQLServerCentral.com that describes the density vector in very clear terms so I will quote from that as it describes it more clearly than I am able to:

Density measure the uniqueness of column or selectivity of column.Density can have value between 0 and 1. If the column has density value 1, it means all the records has same value in that column and less selectivity. Higher the density lower the selectivity. If the column has density value 0.003, that means there are 1/0.003=333 distinct values in that column.

Note there are two density values in this statistic: one solely for the column of the statistic (Reputation) and one for the column plus the clustering key. There is a lot more to be said about density, and it is well explained in the article above; for the remainder of this article, just know that density is a measure of how unique the data in the column is.
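To make the 1 / distinct values definition concrete, we can compute the density of the Reputation column ourselves and compare it with the first row of the density vector:

```sql
-- Density = 1 / number of distinct values in the column.
-- This should line up (approximately, depending on sampling)
-- with the density vector in the IX_Reputation statistic.
SELECT  density = 1.0 / COUNT(DISTINCT Reputation)
FROM    dbo.Users;
```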

The final section of the statistic is the one that takes up the most space in the window and that is the histogram:

Histogram

The histogram describes the distribution of data in the column. It is broken down into a maximum of 200 steps; each line in the screenshot above is a step (note the window is truncated for the purpose of the screenshot). Each step relates to a range_hi_key, which is a value in the column on which the statistic is created.

As this statistic is on reputation which is an integer column, the values in the left hand column are integers. The remaining four columns describe data in relation to the range_hi_key. Let’s look at the first two steps:

The headings are as follows

RANGE_ROWS

The number of rows in the table that are greater than the previous range_hi_key and less than this range_hi_key

EQ_ROWS

The number of rows in the table that are equal to this range_hi_key

DISTINCT_RANGE_ROWS

The number of distinct column values that are greater than the previous range_hi_key and less than the current range_hi_key

AVG_RANGE_ROWS

The average number of rows per distinct value that is greater than the previous range_hi_key and less than the current range_hi_key (i.e. RANGE_ROWS divided by DISTINCT_RANGE_ROWS)

So let’s walk through the first two steps:

Step 1

RANGE_HI_KEY: 1

RANGE_ROWS: 0

This is always 0 on the first histogram step as there is no previous range_hi_key

EQ_ROWS: 40714

There are 40714 rows in the Users table where Reputation = 1 (1 being the range_hi_key)

DISTINCT_RANGE_ROWS: 0

This is always 0 on the first histogram step as there is no previous range_hi_key

AVG_RANGE_ROWS: 1

This is always 1 on the first histogram step

Step 2

RANGE_HI_KEY: 3

EQ_ROWS: 111

There are 111 rows in the Users table where Reputation = 3

DISTINCT_RANGE_ROWS: 1

There is one distinct reputation value greater than the previous range_hi_key but less than this one (which is the reputation of 2)

AVG_RANGE_ROWS: 111

For every reputation value greater than the previous range_hi_key but less than this range_hi_key there are on average 111 rows in the table with a reputation equal to that value
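If you want to see just these histogram rows rather than the whole statistic, DBCC SHOW_STATISTICS accepts a WITH HISTOGRAM option – a quick sketch using the names from this post:

```sql
-- Return only the histogram section of the statistic, one row per step:
-- RANGE_HI_KEY, RANGE_ROWS, EQ_ROWS, DISTINCT_RANGE_ROWS, AVG_RANGE_ROWS
DBCC SHOW_STATISTICS ('dbo.Users', IX_Reputation) WITH HISTOGRAM;
```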

Using the Histogram

Given what we now know about statistics, you can maybe start to see how SQL Server might use them to estimate how many rows would come back from certain queries. Let’s look at some examples:

Firstly, the query we used earlier:

SELECT	DisplayName
FROM	dbo.Users
WHERE	Reputation = 1;

here is the execution plan:

We can see that SQL Server correctly estimated that 40714 rows would match the predicate – how did it make that guess?

The answer is here:

As the reputation value of 1 has a range_hi_key, SQL Server knows exactly how many rows are equal to that range_hi_key, though don’t forget, that histogram was correct at the time of sampling only.

What about this query?

SELECT	DisplayName
FROM	dbo.Users
WHERE	Reputation = 2;

Here is the plan

How did it make that guess?

There is no range_hi_key for Reputation = 2, so SQL Server falls back on Avg_Range_Rows, which tells us the average number of rows for each reputation value greater than 1 and less than 3 (which can only be the value 2). As we see below, that is 111 – the number SQL Server estimated.

One last example

SELECT	DisplayName
FROM	dbo.Users
WHERE	Reputation = 3600;

Here, we can see SQL Server’s estimate is less accurate (though unlikely to have changed SQL Server’s execution decision had it been correct in this case)

The method was exactly the same as for Reputation = 2, as 3600 doesn’t have its own range_hi_key. In this example, though, the Avg_Range_Rows relates to the values 3531–3898, so any reputation value in that range would get the same estimate. One thing to note is that the estimate is rounded up to the nearest whole number.
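We can sanity-check that step ourselves. Assuming, as above, a histogram step whose previous range_hi_key is 3531 and whose range_hi_key is 3898, counting the rows strictly between the two boundaries reproduces the step’s range figures:

```sql
-- Recompute the histogram step's range figures by hand.
-- The boundaries 3531 and 3898 come from the step discussed above.
SELECT  range_rows          = COUNT(*),
        distinct_range_rows = COUNT(DISTINCT Reputation),
        avg_range_rows      = COUNT(*) * 1.0 / COUNT(DISTINCT Reputation)
FROM    dbo.Users
WHERE   Reputation > 3531
        AND Reputation < 3898;
```

The numbers will only match the statistic exactly while it reflects a full, up-to-date sample of the table.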

_WA_T?

The statistic we have been looking at so far in this example is a statistic for an index – it was created when the index IX_Reputation was created and the data within the statistic relates to the column the index was created on.

This isn’t the only way statistics get created. A statistic is automatically created on a column upon which a predicate is applied (a column in a WHERE clause or in the ON clause of a join), assuming one does not already exist. This behaviour is governed by the AUTO_CREATE_STATISTICS setting, which is on by default (I have never personally known a use case for turning it off, though I believe SharePoint databases require it to be turned off, or did at some point in the past).

These statistics appear in the Statistics folder in the same way as the other one we looked at, and they have a prefix of _WA_Sys_ (which I believe dates all the way back to the Sybase days). The name includes the column number and the object ID in hexadecimal format.

Let’s (implicitly) create one by running the query below:

SELECT	Reputation
FROM	dbo.Users
WHERE	DisplayName = 'John';

And now we can see that it has appeared in the object explorer:

This statistic is on column 5 on object ID 165575628 (0x9DE7BCC in hex)
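Rather than decoding the hex by eye, we can map auto-created statistics back to their columns via the system catalog – a sketch for the Users table:

```sql
-- List each statistic on dbo.Users together with the column(s) it covers,
-- so _WA_Sys_* names can be matched back to real column names
SELECT  s.name AS statistic_name,
        c.name AS column_name,
        s.auto_created
FROM    sys.stats AS s
        JOIN sys.stats_columns AS sc
            ON  sc.object_id = s.object_id
            AND sc.stats_id = s.stats_id
        JOIN sys.columns AS c
            ON  c.object_id = sc.object_id
            AND c.column_id = sc.column_id
WHERE   s.object_id = OBJECT_ID('dbo.Users');
```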

When looking at the statistic it is just the same as the other one we looked at. However, there is one thing that is different which we can see in the header:

The number of rows sampled.

Statistics Maintenance

This leads us onto the sampling and maintenance of statistics.

Statistics are a snapshot of the data at the time they are created. As rows are inserted, updated, and deleted, the statistics are not refreshed with every change and, over time, they become less accurate.

Let’s remind ourselves of the estimate for Reputation = 1

We can see that the estimate is spot on. However, let’s now update a few accounts so that their reputation is now 1:

UPDATE  u
SET     Reputation = 1 
FROM    dbo.Users u
        JOIN 
        (
            SELECT  TOP 5 PERCENT
                    Id
            FROM    dbo.Users
            WHERE   Reputation <> 1
            ORDER BY NEWID()
        ) u1
            ON u1.Id = u.id;

And run our reputation = 1 query again

DBCC FREEPROCCACHE;

SELECT	DisplayName
FROM	dbo.Users
WHERE	Reputation = 1;

Note how the estimated number of rows stays the same but the actual rows have increased. I also freed the plan cache at the beginning just to demonstrate this is a freshly compiled plan, rather than a cached one.

Statistics are updated automatically, assuming AUTO_UPDATE_STATISTICS is on, which it is by default. The threshold at which statistics are updated has changed through SQL Server versions over time. For databases at compatibility level 120 or below, statistics are marked as needing an update once 20% of the table + 500 rows have changed (via INSERTs, UPDATEs and DELETEs); the statistics are then updated the next time they are used. For large tables, this threshold may not be reached frequently enough, so in SQL Server 2016 the threshold was changed for databases at compatibility level 130 onwards (this behaviour can also be enabled on older SQL Server versions by enabling Trace Flag 2371).

The newer threshold is dynamic, according to Microsoft documentation:

Starting with SQL Server 2016 (13.x) and with the database compatibility level 130, the Database Engine also uses a decreasing, dynamic statistics recompilation threshold that adjusts according to the table cardinality at the time statistics were evaluated. With this change, statistics on large tables will be updated more frequently

The formula is MIN ( 500 + (0.20 * n), SQRT(1,000 * n) ), where n is the number of rows in the table.
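A quick sketch of the arithmetic shows why this matters for large tables – the row counts below are made up purely for illustration:

```sql
-- Compare the old static threshold (20% + 500) with the dynamic one
-- (SQRT(1000 * n)) for a few hypothetical table sizes; the effective
-- threshold is the smaller of the two values
SELECT  n,
        old_threshold     = 500 + (0.20 * n),
        dynamic_threshold = SQRT(1000.0 * n)
FROM    (VALUES (1000), (1000000), (100000000)) AS t (n);
```

For a 100 million row table, the dynamic threshold works out at roughly 316,000 changed rows rather than the 20 million or so the old rule would require.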

It is also possible to update statistics at any time by issuing an UPDATE STATISTICS command which can be targeted at a single statistic or an entire table. There is also a system stored procedure, sp_updatestats to update statistics across an entire database.

The statistic contains the updated date in the header as well as how many rows were sampled when the statistic was created or updated, whichever was most recent.

We can see how many modifications a column has had since the last statistics update by using sys.dm_db_stats_properties and observing the modification_counter column.
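A sketch of that, using the names from this post:

```sql
-- Show when each statistic on dbo.Users was last updated, how many rows
-- were sampled, and how many modifications have happened since
SELECT  s.name,
        sp.last_updated,
        sp.rows,
        sp.rows_sampled,
        sp.modification_counter
FROM    sys.stats AS s
        CROSS APPLY sys.dm_db_stats_properties(s.object_id, s.stats_id) AS sp
WHERE   s.object_id = OBJECT_ID('dbo.Users');
```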

Sampling

When SQL Server creates a statistic for an index, all of the data in the table is scanned to build the histogram and the density information. When statistics are created automatically, a smaller sample of the data is used, since a full sample takes longer – all of the data has to be read and processed. Automatic statistics updates are also processed against a sample, even if the statistic is for an indexed column, unless a persisted sample percent has been set. Similar to political polling, sampling doesn’t give a perfect picture of the data, but it should be “good enough” in most cases.
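As a sketch, the sample rate used by subsequent automatic updates can be pinned with the PERSIST_SAMPLE_PERCENT option:

```sql
-- Update the statistic with a full scan and persist that sample rate,
-- so automatic updates keep using 100 percent rather than a smaller sample
UPDATE STATISTICS dbo.Users IX_Reputation
WITH FULLSCAN, PERSIST_SAMPLE_PERCENT = ON;
```

This is the value that surfaces in the Persisted Sample Percent field of the statistics header we looked at earlier.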

Let’s remind ourselves of the header for the IX_Reputation statistic

We can see the information is based on a full sample, because the rows sampled matches the number of rows in the table (the Rows header) – though remember that the number of rows in the table was correct at the time of creation and doesn’t include any rows inserted since.

Let’s update the statistic

UPDATE STATISTICS dbo.Users IX_Reputation;

and look at that header again

Notice how the data for the reputation of 1 has changed in the histogram after we updated the rows and we can see the effect on the estimate, but note this time that the estimate isn’t spot on, though it is very close:

For the sake of completeness, we can update with a full sample using

UPDATE STATISTICS dbo.Users IX_Reputation WITH FULLSCAN;

and then run the reputation = 1 query again, which once again gives us a perfect estimate

Conclusion

We have looked at what statistics are within SQL Server, how SQL Server uses them and observed how they influence estimates. We also looked at how to view the statistics, how to update them and talked about the sampling frequency.

However, this is only scratching the surface of statistics – we have only looked at example queries with hard coded values and equality predicates, we haven’t looked at range predicates, inequalities, the effects of variables etc but the idea was just to give a brief introduction, so those topics are left as further reading.

References / Further Reading

David J. DeWitt – SQL Query Optimization: Why is it so Hard to Get Right?

Erik Darling – Changes to auto update stats thresholds in SQL Server 2016

Microsoft – Statistics

Microsoft – View Statistics Properties

Microsoft – Auto Update Statistics and Auto Create Statistics – ON or OFF for SharePoint 2010 Databases?

Monica Rathbun – Keeping Large Table Statistics Current – TF2371

nelsonaloor – All About SQL Server Statistics: Density Vector

Pinal Dave – Persisting Sample Percentage for Statistics – PERSIST_SAMPLE_PERCENT

SQL Maestros – Auto Update Statistics Trace Flag 2371 Decreasing Threshold
