NULL Values in Histogram
During my cardinality estimator presentation in DBA virtual group there was a question asked about NULL values and histogram. I am quoting the question as is “SQL Server include null values in the index. How are statistics handle for null values?”
I will answer the question in this blog post.
For demo purpose I will be using WideWorldImporters database which you can download and restore from here. In table [Sales].[Orders] column ‘PickingCompletedWhen’ does allow NULL values.
How many rows are NULL and how many are NOT NULL?
USE [WideWorldImporters]; GO SELECT 'NULL' AS [Value], COUNT(*) AS [NumOfRecord] FROM [Sales].[Orders] WHERE PickingCompletedWhen IS NULL UNION ALL SELECT 'NOT NULL' AS [Value], COUNT(*) AS [NumOfRecord] FROM [Sales].[Orders] WHERE PickingCompletedWhen IS NOT NULL;
There is an auto created statistics on column ‘PickingCompletedWhen’ named ‘_WA_Sys_0000000E_44CA3770′.
If we look at the header section of DBCC SHOW_STATISTICS for this statistics we can see all rows were used to create the statistics. Meaning rows with NULL is also taken into consideration.
|_WA_Sys_0000000E_44CA3770||May 5 2017 6:56PM||73595||73595|
In the density_vector section ‘All density’ value for column ‘PickingCompletedWhen’ is 0.0004705882 which was calculated from: 1/(Number of distinct values of column ‘PickingCompletedWhen’).
In this case which is 1/2125. All NULL values were considered as one. If you do a count of distinct values you will get a result of 2124. Reason is explained here. If you do a select of all distinct values NULL will show along with other 2124 values.
|All density||Average Length||Columns|
Let’s look at the histogram now.
|RANGE _HI_KEY||RANGE _ROWS||EQ_ROWS||DISTINCT _RANGE _ROWS||AVG _RANGE _ROWS|
First RANGE_HI_KEY is NULL and EQ_ROWS value is 3085 which matches with count above for ‘PickingCompletedWhen IS NULL’. As this is the first RANGE_HI_KEY for the histogram steps, RANGE_ROWS and DISTINCT_RANGE_ROWS will be zero.
First NON NULL value in RANGE_HI_KEY is ‘2013-01-01 11:00:00.0000000′ which is 2nd step in the histogram. For which ‘AVG_RANGE_ROWS’ value is 1. If there is a query for ‘PickingCompletedWhen’ < ‘2013-01-01 11:00:00.0000000′, for estimated number of rows that value will be used which is 1 in this case.