Saturday, July 23, 2011

Filtered Index


Filtered indexes are probably my favorite feature in SQL Server 2008. That's saying a lot, since there are so many great new features to choose from. In this post, I want to explore a little about how filtered indexes work, how they can be applied, and some of the "gotchas" to be aware of.
First, for those of you who may not yet know about filtered indexes, allow me to enlighten you. In short, filtered indexes allow you to create an index on a subset of data using a filtering predicate. Filters can only be applied to non-clustered indexes. The general syntax of a filtered index is:
Create NonClustered Index [index_name]
On [table_name] ([column_list])
Include ([column_list])
Where [filtered_criteria];
For our purposes, we're going to be working with the Sales.SalesOrderDetail table in the AdventureWorks database. Let's look at a specific example. Suppose we have a query that regularly searches on the [SpecialOfferID] column.
Select SalesOrderID
    , Count(*) As 'CountOfLineItem'
    , Sum(LineTotal) As 'SumOfLineTotal'
From Sales.SalesOrderDetail
Where SpecialOfferID <> 1
Group By SalesOrderID;
We notice that there's no covering index for this query by looking at the actual execution plan:
Query Plan - Clustered Scan
If this is a commonly executed query, then we'd probably want to toss an index on it. Before we get started, let's take a look at what the distribution of values is on that column:
Select SpecialOfferID
    , Count(*) As 'rows'
From Sales.SalesOrderDetail
Group By SpecialOfferID
Order By Count(*) Desc;
Our distribution of values is:
SpecialOfferID rows
-------------- -----------
1              115884
2              3428
3              606
13             524
14             244
16             169
7              137
8              98
11             84
4              80
9              61
5              2
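The share of each value can also be computed in the same query. This is a minimal sketch, assuming the same AdventureWorks table; the 'pct_of_rows' alias is just an illustrative name:

```sql
-- Row count and percentage of the table for each SpecialOfferID.
-- Sum(Count(*)) Over () is the grand total across all groups.
Select SpecialOfferID
    , Count(*) As 'rows'
    , Cast(100.0 * Count(*) / Sum(Count(*)) Over () As Decimal(5, 2)) As 'pct_of_rows'
From Sales.SalesOrderDetail
Group By SpecialOfferID
Order By Count(*) Desc;
```

With the counts shown above (115884 of 121317 rows), SpecialOfferID = 1 works out to roughly 95.5% of the table.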
As you can see, [SpecialOfferID] = 1 accounts for about 96% of our rows. In SQL Server 2005, we'd create an index that may look something like this:
Create NonClustered Index IX_Sales_SalesOrderDetail_SpecialOfferID
    On Sales.SalesOrderDetail(SpecialOfferID)
    Include (SalesOrderID, LineTotal);
Now if we re-run our original query, this is what we see:
Indexed Query Plan
So we're now performing a non-clustered index seek instead of a clustered index scan. Already this results in some pretty significant performance improvements. To see this, we're going to use the INDEX table hint to force a scan of the clustered index for comparison. We're also going to use the DBCC command DROPCLEANBUFFERS, which clears the buffer cache so we can better examine what's happening with our IO.
Set Statistics IO On;
 
DBCC DropCleanBuffers;
 
Select SalesOrderID
    , Count(*) As 'CountOfLineItem'
    , Sum(LineTotal) As 'SumOfLineTotal'
From Sales.SalesOrderDetail With 
    (Index(PK_SalesOrderDetail_SalesOrderID_SalesOrderDetailID))
Where SpecialOfferID <> 1
Group By SalesOrderID;
 
DBCC DropCleanBuffers;
 
Select SalesOrderID
    , Count(*) As 'CountOfLineItem'
    , Sum(LineTotal) As 'SumOfLineTotal'
From Sales.SalesOrderDetail
Where SpecialOfferID <> 1
Group By SalesOrderID;
 
Set Statistics IO Off;
Clustered Index Scan:
Table 'SalesOrderDetail'. Scan count 1, logical reads 1240, physical reads 17, read-ahead reads 1242...
 
NonClustered Index Seek:
Table 'SalesOrderDetail'. Scan count 2, logical reads 30, physical reads 4, read-ahead reads 480...
As you can see, the non-clustered (NC) index seek performs quite a bit better. Now let's create a filtered index and explore what happens:
Create NonClustered Index FIX_Sales_SalesOrderDetail_SpecialOfferID_Filtered
    On Sales.SalesOrderDetail(SalesOrderID)
    Include (LineTotal)
    Where SpecialOfferID <> 1;
First, let's look at the pages consumed by each index:
SELECT i.name, ddips.index_depth, ddips.index_level
    , ddips.page_count, ddips.record_count
FROM sys.indexes AS i
Join sys.dm_db_index_physical_stats(DB_ID(), 
    OBJECT_ID(N'Sales.SalesOrderDetail'), Null, Null, N'Detailed') AS ddips
    ON i.object_id = ddips.object_id
    And i.index_id = ddips.index_id
WHERE i.name In ('IX_Sales_SalesOrderDetail_SpecialOfferID'
    , 'FIX_Sales_SalesOrderDetail_SpecialOfferID_Filtered'
    , 'PK_SalesOrderDetail_SalesOrderID_SalesOrderDetailID')
    AND ddips.index_level = 0;
name                                                       index_depth index_level page_count  record_count
---------------------------------------------------------- ----------- ----------- ----------- --------------------
PK_SalesOrderDetail_SalesOrderID_SalesOrderDetailID        3           0           1234        121317
IX_Sales_SalesOrderDetail_SpecialOfferID                   3           0           480         121317
FIX_Sales_SalesOrderDetail_SpecialOfferID_Filtered         2           0           19          5433
As the page counts show, the clustered index consumes the most pages, naturally. The non-filtered NC index consumes fewer pages than the clustered index because it's narrower; however, it still consumes more pages than the filtered index because it's storing an entry for every data row. The filtered index, with only 5433 rows stored, is by far our smallest index, consuming 96% less space than our non-filtered NC index (19 pages versus 480).
Because we're using less space to store this index, we should also see a corresponding reduction in IO. Let's verify that this is the case:
Set Statistics IO On;
 
DBCC DropCleanBuffers;
 
Select SalesOrderID
    , Count(*) As 'CountOfLineItem'
    , Sum(LineTotal) As 'SumOfLineTotal'
From Sales.SalesOrderDetail With (Index(IX_Sales_SalesOrderDetail_SpecialOfferID))
Where SpecialOfferID <> 1
Group By SalesOrderID;
 
DBCC DropCleanBuffers;
 
Select SalesOrderID
    , Count(*) As 'CountOfLineItem'
    , Sum(LineTotal) As 'SumOfLineTotal'
From Sales.SalesOrderDetail
Where SpecialOfferID <> 1
Group By SalesOrderID;
 
Set Statistics IO Off;
NonClustered Index Seek:
Table 'SalesOrderDetail'. Scan count 2, logical reads 30, physical reads 4, read-ahead reads 480
 
Filtered Index Scan:
Table 'SalesOrderDetail'. Scan count 1, logical reads 24, physical reads 2, read-ahead reads 22
Filtered Query Plan
As expected, we get the best results with our filtered index scan.
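One caveat about the IO tests in this post: DBCC DROPCLEANBUFFERS only removes clean pages from the buffer pool, so any dirty pages stay cached. A minimal sketch of a colder-cache version of the test, assuming you're on a test server (the command affects the whole instance), issues a CHECKPOINT first:

```sql
-- Assumption: a non-production instance, since this empties the buffer cache.
Checkpoint;              -- write the current database's dirty pages to disk
DBCC DropCleanBuffers;   -- then drop the (now clean) pages from the buffer pool

Set Statistics IO On;

Select SalesOrderID
    , Count(*) As 'CountOfLineItem'
    , Sum(LineTotal) As 'SumOfLineTotal'
From Sales.SalesOrderDetail
Where SpecialOfferID <> 1
Group By SalesOrderID;

Set Statistics IO Off;
```

On a freshly restored AdventureWorks database with no pending changes, the CHECKPOINT probably won't change the numbers, but it makes the test repeatable after any data modifications.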
You'll notice that I did *not* create the index on the [SpecialOfferID] column like I did in [IX_Sales_SalesOrderDetail_SpecialOfferID]. This is because my query doesn't care what my [SpecialOfferID] value is, just as long as it's not equal to 1. My non-filtered NC index was created on [SpecialOfferID] because it needed to navigate the B-tree to find the records where [SpecialOfferID] <> 1. With my filtered index, the query optimizer knows that all of my records already meet the criteria, so it doesn't need to navigate through the index to find the matching results.
We could choose to include the [SpecialOfferID] data in our filtered index, but we'd most likely want to make it an included column rather than part of the index key. In fact, it's important to note that, if I don't add [SpecialOfferID] as an included column and I want to return it in the results, i.e.
Select SalesOrderID
    , SpecialOfferID
    , Count(*) As 'CountOfLineItem'
    , Sum(LineTotal) As 'SumOfLineTotal'
From Sales.SalesOrderDetail
Where SpecialOfferID <> 1
Group By SalesOrderID
    , SpecialOfferID;
my filtered index will not be used and I will instead scan the clustered index once more (assuming [IX_Sales_SalesOrderDetail_SpecialOfferID] does not exist). This is because the filter column is not stored anywhere on the actual index page. This is actually good news, in my opinion, since it allows you to create even leaner indexes. And like I already mentioned, if you do need the data returned, you can always add the filter column as an included column.
What if you're trying to find out whether or not an index is filtered, and what it's filtered on? The sys.indexes catalog view has been updated in SQL Server 2008 to include this information:
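For the case where the query does need to return [SpecialOfferID], a filtered index that carries the filter column might look something like the sketch below. The index name is hypothetical; everything else reuses the table and predicate from the examples in this post:

```sql
-- Same key and filter as before, but SpecialOfferID is stored at the leaf level
-- as an included column so the index can cover queries that return it.
Create NonClustered Index FIX_Sales_SalesOrderDetail_SpecialOfferID_Filtered_Incl
    On Sales.SalesOrderDetail(SalesOrderID)
    Include (LineTotal, SpecialOfferID)
    Where SpecialOfferID <> 1;
```

The trade-off is a slightly wider index row, which is usually a small price on an index that only holds about 4% of the table.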
Select name, has_filter, filter_definition
From sys.indexes 
Where name In ('IX_Sales_SalesOrderDetail_SpecialOfferID'
    , 'FIX_Sales_SalesOrderDetail_SpecialOfferID_Filtered'
    , 'PK_SalesOrderDetail_SalesOrderID_SalesOrderDetailID');
name                                                   has_filter filter_definition
------------------------------------------------------ ---------- -------------------------
FIX_Sales_SalesOrderDetail_SpecialOfferID_Filtered     1          ([SpecialOfferID]<>(1))
IX_Sales_SalesOrderDetail_SpecialOfferID               0          NULL
PK_SalesOrderDetail_SalesOrderID_SalesOrderDetailID    0          NULL
I personally recommend Kimberly Tripp's system stored proc, sp_helpindex2. It returns a lot of good information about your indexes, such as included columns and filtering criteria.
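If you don't have that proc installed, a rough approximation can be put together from the built-in catalog views. This is only a sketch of the idea, not a reproduction of what sp_helpindex2 returns; it lists each index column on the table along with whether it's a key or included column and the filter, if any:

```sql
-- One row per index column; is_included_column = 1 marks Include() columns.
Select i.name As index_name
    , c.name As column_name
    , ic.key_ordinal
    , ic.is_included_column
    , i.has_filter
    , i.filter_definition
From sys.indexes As i
Join sys.index_columns As ic
    On ic.object_id = i.object_id
    And ic.index_id = i.index_id
Join sys.columns As c
    On c.object_id = ic.object_id
    And c.column_id = ic.column_id
Where i.object_id = Object_Id(N'Sales.SalesOrderDetail')
Order By i.name, ic.is_included_column, ic.key_ordinal;
```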
That's all I have for today. Hopefully, you now understand how powerful filtered indexes can be. When used properly, filtered indexes can use less space, consume less IO, and improve overall query performance.
