Pages

Monday 25 March 2013

SQL Server : Part 8 : Explaining The Covering Index or Included Columns

In our earlier discussion about non clustered index ,we have seen that, the leaf level of a non clustered index contain only the non clustered index key column and clustered index key (if the table is a clustered table). To fetch the remaining column from the clustered index structure or heap structure, SQL server has to do a bookmark/key look up operation.Many time the bookmark or key look up operation might be costly affair. Let us see an example.

USE mydb
GO
DROP TABLE dbo.SalesOrderDetail                               
GO                               
SELECT * INTO dbo.SalesOrderDetail FROM AdventureWorks2008.Sales.SalesOrderDetail
GO
CREATE UNIQUE CLUSTERED INDEX ix_SalesOrderDetail ON dbo.SalesOrderDetail(SalesOrderDetailID)
GO
CREATE UNIQUE NONCLUSTERED INDEX ix_Productid ON dbo.SalesOrderDetail(ProductId,SalesOrderId)
GO 
SET STATISTICS IO ON
GO
SELECT SalesOrderDetailid,productid,salesorderid,orderqty,unitprice FROM SalesOrderDetail WHERE productid=707 AND 
SalesOrderID=43680

The execution plan and the out put of IO statistics of the select statement are given below.















In the execution plan, you can see that ,50 percent of the query cost is contributed by the Key Lookup operation.In the output of the IO statistics , it clearly says SQL server performed 5 IO operation to fetch the single record. 


Note that, the existing non clustered index have 229 pages and depth is 2( levels in the b tree structure).Seek operation on this index need to perform only 2 IO operation to complete the task.You can verify this using the  DBCC IND  command or refer the earlier post.


Let us assume that, this query(with different parameters ) is used very frequently from the application and you need to optimize it further.How we can do that ? The only way that we can optimize this query is by avoiding the Key lookup operation. For that we can modify our non clustered index and add the remaining two column (OrderQty and UnitPrice) which are not part of clustered index key or non clustered index key. 

DROP INDEX ix_Productid ON dbo.SalesOrderDetail
GO
CREATE UNIQUE NONCLUSTERED INDEX ix_Productid ON dbo.SalesOrderDetail(ProductId,SalesOrderId,OrderQty ,UnitPrice)
GO
SELECT SalesOrderDetailid,productid,salesorderid,orderqty,unitprice FROM SalesOrderDetail WHERE productid=707 AND SalesOrderID=43680














Now we were  able to get rid of the Key lookup operation from the execution plan and to reduce the IO from 5 to 3.But if we  look into the out put of DBCC IND of the modified non clustered index, we can see that , depth of the b tree is increased by one due to this change. As the index level is increased , the non clustered index has to to perform 3 IO to complete the operation. This will be worst, if we have more column in the select list and we added all those columns into the non clustered index key to avoid the key lookup operation.

Here comes the covering index to help us.Covering index help us to add non key column to leaf level of the non clustered index with very minimal possibility of increasing the depth of the b-tree structure. This can be achieved by adding include column in the CREATE INDEX statement.
An index that contains all information required to resolve the query is known as a Covering Index.When we create a nonclustered index to cover a query, we can include nonkey columns in the index definition to cover the columns in the query that are not used as primary search columns. Performance gains are achieved because the query optimizer can locate all the required column data within the index; the table or clustered index is not accessed.

DROP INDEX ix_Productid ON dbo.SalesOrderDetail
GO
CREATE UNIQUE NONCLUSTERED INDEX ix_Productid ON dbo.SalesOrderDetail(ProductId,SalesOrderId
include(OrderQty ,UnitPrice)
GO
SELECT SalesOrderDetailid,productid,salesorderid,orderqty,unitprice FROM SalesOrderDetail 
WHERE productid=707 AND SalesOrderID=43680













With this also, we were able to get rid of the key lookup operation and to reduce the IO operation to 2. The IO operation clearly says the the depth of the clustered index is two.
Let us see the output of the DBCC IND

SELECT index_id FROM sys.indexes WHERE name='ix_Productid' AND OBJECT_ID= OBJECT_ID('SalesOrderDetail')
GO
DBCC ind('mydb','SalesOrderDetail',2)

This returns 378 records and the root page is 7456 (Value of pagepid column of the record having max value for indexlevel column)


Let us see the root page and one leaf level page

DBCC traceon(3604)
GODBCC page ('mydb',1,7456,3)GODBCC page ('mydb',1,7328,3)


























From the output we can see that, columns mentioned in the include clause are added into the leaf level pages with out making any changes in the non leaf level pages.


Include column are useful because we can refer the column that has a data type which can not be used in the index key.More over include columns are not counted in the 900 bytes or 16 key column limitation of index keys.We can include with any data types except text,ntext and image.Included column also support the computed column.

If you liked this post, do like my page on FaceBook 

Sunday 24 March 2013

SQL Server : Part 7 : Non clustered index on non unique column

In our earlier  post, we have discussed about the non clustered index, but there we were always discussing  about unique non clustered index to make the discussion simple.As we understood the general structure of the non clustered index, let us discuss the storage structure of a non clustered index on a non unique column.

In our last post, we have discussed how SQL server manage clustered key on a non unique column.In that post we learned that SQL server add 4 bytes value to all duplicate occurrence of the clustered key.In the same way, non clustered index add the  cluster key in all level of the b tree to uniquely identify the records in the next level.In the case of clustered index, the uniquifier is added only to the duplicate occurrence.In the case of non clustered index , clustered key is added to all records if the uniqueness is not defined while creating the index. If the non clustered index is defined as unique, SQL server adds the clustered key only to the leaf level for the bookmark look up operation .

Let us see a sample . 

SELECT * INTO dbo.SalesOrderDetail FROM AdventureWorks2008.Sales.SalesOrderDetail
GO
CREATE UNIQUE CLUSTERED INDEX ix_SalesOrderDetail ON dbo.SalesOrderDetail(SalesOrderDetailID)
GO
CREATE INDEX Ix_ProductId ON SalesOrderDetail(ProductId,Salesorderid)

A copy  of Salesorderdetail table is created with unique clustered index on SalesOrderDetailId and a non clustered index on ProductId and SalesOrderId column.Note that, while  creating the non clustered index , I have purposefully avoided the unique keywords even if the non clustered index key is unique.

DBCC ind('mydb','SalesOrderDetail',4)

DBCC IND returns 229 records and root page id is 8320. Let us see the output of DBCC page for the root page.

DBCC page ('mydb',1,8320,3)


In the below figure, I have combined the output of the root page and one of the leaf level page. If you look into the root page (the first part), you can see that the cluster key (SalesOrderDetailid) is added in the root page.If you go back to our earlier discussion on non clustered index on clustered table , you will not find the clustered key in the root level. it will be there only in the leaf level. There is no change in the leaf level page structure while defining non clustered index as  unique or non unique.

Let us see what will happen if the  table is on heap.

SELECT * INTO dbo.SalesOrderDetailHeap FROM AdventureWorks2008.Sales.SalesOrderDetail
GO
CREATE INDEX Ix_ProductId ON SalesOrderDetailHeap (ProductId,Salesorderid)
GO
SELECT index_id FROM sys.indexes WHERE name='Ix_ProductId' AND 
OBJECT_NAME(OBJECT_ID)='SalesOrderDetailHeap'
GO
DBCC ind('mydb','SalesOrderDetailHeap',2)
GO
DBCC page ('mydb',1,5352,3)


In the below figure, I have combined the output of the root page and one leaf level page.In the root page you can notice that Heap RID is added. If you go back our earlier discussion on non clustered index on heap, we can see the Heap RID only on the leaf level not in the root page. There is no change in the leaf level page structure while defining non clustered index as unique or non unique.


































You might have noticed that ,in our above example, even if the non clustered key is unique , SQL server considered it as non unique as we did not mentioned the uniqueness while creating the non clustered index. Adding the clustered key (or HEAP RID) in to all level of index might lead to increase the level of index (more IO) depends on the size of the clustered key.So it is important to consider the uniqueness of column while defining the non clustered index key in all possible situation.

If you liked this post, do like my page on FaceBook


Thursday 21 March 2013

SQL Server : Part 6 : Design consideration of Clustered Index

In our earlier post, we have discussed about the structure and storage of clustered index. In this post we will discuss about the design consideration of clustered index.There are couple of points that need to be considered while selecting clustered index key.There is no hard rule in selecting the clustering key . It is best practices and guidelines and the internal knowledge will help us to select right clustering key.

Uniqueness 

SQL server will allow you to create a clustered index on non unique column, but uniqueness is one of the most desirable attribute for any indexes especially for a clustered index.Even if SQL server allow to create clustered index on a non unique column , internally SQL server add 4 bytes value for all duplicate instance of clustered key and this 4 bytes variable length column is known as uniquiifier .In this case SQL server consider the clustered key as the combination of non unique column on which clustered index is defined and internally generated uniquifier column.This value will be stored where ever the clustered key to be stored. For example in the leaf level of a non clustered index defined on a clustered table.

Let us see an example.I am going to create a copy of SalesOrderDetail table and define clustered index on the productid column which has duplicate values.

Use MyDb
GO
SELECT * INTO dbo.SalesOrderDetailDupCI FROM AdventureWorks2008.Sales.SalesOrderDetail
GO
CREATE CLUSTERED INDEX ix_SalesOrderDetailDupCI ON dbo.SalesOrderDetailDupCI(ProductId)

Now let us run the DBCC IND command

DBCC IND('mydb','SalesOrderDetailDupCI',1)

This returns me 1608 pages and  the root page number is 3570(The value of PagePID column of the record having highest value for indexlevel column ) .  Below is the output of DBCC Page command for the root page and one intermediate page


In the output, we can see one additional column called UNIQUIFIER which we did not seen when we created a unique clustered index in earlier post. When a clustered index created on a non unique column , SQL server add 4 bytes random value for all duplicate occurrence of clustered key. It will not generate the uniquifier value for the first occurrence. That is the reason there are 0 for uniquifier column for some records. So clustered key defined on the non unique column has an overhead of generating the uniquifier value and also widen the cluster index key.In our example we defined clustered index on a 4 bytes column but actual clustered key size become 8 bytes due to uniquifier.This combination has to replicated to leaf level of all non clustered index .The magnitude of the issue will increase when there is non clustered index  defined on non unique column where this clustered index value need to be stored in the non leaf level pages also.(We will discuss about non clustered index on non unique column in later post).

If your table does not have a single unique key to define clustered index, try to make it unique by adding one or two narrow column to the clustering key. That will avoid the overhead of adding the uniquifier and reduce the bookmark look up operation as the non leaf level of non clustered index has more columns (The additional column added to clustered key to make it unique)

Static
Another desired property of the clustered index key is to be a Static. when we define clustered index on a non static column , that makes the update statement(updating the clustered index key)  more costly as it has to move the record into different page  to make sure the data is stored in the logical order of clustered index and leaf level of all non clustered index need to be updated. Let us see a sample 

set statistics io on 
GO
--Unique clustered index on SalesOrderDetailIdUPDATE SalesOrderDetail_StaticIndex SET ProductID =99 WHERE ProductID=707
GO
--Nonunique clustered index on ProductIdUPDATE SalesOrderDetail SET ProductID =99 WHERE ProductID=707
                              
    

In the below output you can see a huge change in IO for the second update statement.










Even in small table (Customer_id,Firstname.lastname) on which clustered index defined on a non static column (Lastname) and which has one non clustered index, any change in the clustered index key need to make changes in minimum of two pages.One data page and one leaf level page of non clustered index.

Size of the clustered index key

Size of the clustered index refers to the number of bytes requires to store the clustered index key.As the size of the clustered index increases, more IO required to fetch the records.This is happening because an index page can store lesser index row if the clustered index is wider. That increase number of pages in the intermediate level and depth of the indexes (Levels in the b tree structure) .For example, a table which contain millions of records might need only 3 level in b tree structure if the clustered index in defined on integer column. If we defined clustered index on wider column (say a uniqueidentifier column which need 16 bytes), the depth of index might increase to 4 (Level of index). Any clustered index seek requires 4 IO operation compared to 3( if the clustered index is narrow).

This issue propagate to non clustered index also as the clustered index keys are stored in the leaf level of all non clustered index as a pointer to the clustered index. If the non clustered index is defined on the non unique column , the clustered key need to be stored in non leaf level pages of  the non clustered index . Which again might cause for more page in intermediate level and to increase the depth of the in non clustered index which in turn increase the IO operation in non clustered index seek/scan also. As the clustered index depth is increased to 4, each bookmark look up operation also need to do 4 IO operation.

Sequential 

It is a best practice to define clustered index on a ever increasing(sequential) column. That is the reason we can commonly seen clustered index defined on identity column. Clustered index defined on a  non sequential column cause for fragmentation . To read more on fragmentation refer this posts . A non sequential clustered index key force the SQL server to insert record in between to maintain the logical ordering of the data.This lead to page split which cause for external fragmentation and internal fragmentation.

Conclusion 

We have discussed about the desired qualities of the clustered index key and the reason behind that.The above points that we discussed are general best practices to be considered while deciding the clustered index key.Apart from this, data access pattern also influence in deciding the clustered index key.If we do not have complete understand of the data access pattern , we might need to test the performance with different strategies.                        



If you liked this post, do like my page on FaceBook

Sunday 17 March 2013

SQL Server : Part 5 : Explaining Non clustered Index on Heap

In the last post, we have discussed about non clustered index on a clustered table.In this post we will discuss about the non clustered index on a heap table. 

Non clustered index can be created on clustered table as well as heap table.While creating a non clustered index on clustered table , clustered index key will act as a row pointer. In heap table ,the combination of file id , page number and slot number will act as a row pointer in the non  clustered index.

Let us see a hands on example. We will create a copy of salesorderdetail table with a non clustered index on productid and salesorderid columns

SELECT * INTO dbo.SalesOrderDetailHeap FROM AdventureWorks2008.Sales.SalesOrderDetail
GO
CREATE UNIQUE INDEX Ix_ProductId ON SalesOrderDetailHeap(ProductId,Salesorderid)

A pictorial representation of the b tree structure is given below
Non clustered Index on heap Structure





















Let us run the DBCC IND command to find the root page of the index.

DBCC ind('mydb','SalesOrderDetailHeap',2)


This statement returns 289 records with 1 as maximum value in the indexlevel column. That means the index structure required one page in root node , 287 pages in leaf level and one page for IAM chain.The page number of record, having maximum value for indexlevel column (root page) is 1656. Let us see the root page.


DBCC traceon(3604)

GO
DBCC PAGE ('mydb',1,1656,3)

Below we can see a part of the output of DBCC page command. The structure of the output is same as the structure of root page of non clustered index on clustered table.




Fig 1
Now let us move to the page 1497 which is a leaf level page.

DBCC traceon(3604)
GO
DBCC PAGE ('mydb',1,1497,3)


Fig 2

 In the leaf level of non clustered index on clustered table , we have noticed that , clustered key is added to leaf level pages along with nonclustered key.Here we do not have a clustered index. So SQL server added row identifier (8 bytes in size) which is a combination of fileid (2 bytes) ,pageid(4 bytes) and slot number(2 bytes). The row identifier can be found in the  HeapRID column of Fig 2. From the above figure, it is clear that, the complete information of the record with productid=707 and salesorderid =49464 can be found at the location HeapRID 0x6813000001000600. Below query will help us to split the RID into FileId:PageId:SlotNo format


DECLARE @HeapRid BINARY(8)
SET @HeapRid = 0x6813000001000600SELECT
    
      
CONVERT (VARCHAR(5),
                    
CONVERT(INT, SUBSTRING(@HeapRid, 6, 1)
                               +
SUBSTRING(@HeapRid, 5, 1)))
     +
':'
    
+ CONVERT(VARCHAR(10),
                    
CONVERT(INT, SUBSTRING(@HeapRid, 4, 1)
                               +
SUBSTRING(@HeapRid, 3, 1)
                               +
SUBSTRING(@HeapRid, 2, 1)
                               +
SUBSTRING(@HeapRid, 1, 1)))
     +
':'
          
+ CONVERT(VARCHAR(5),
                    
CONVERT(INT, SUBSTRING(@HeapRid, 8, 1)
                               +
SUBSTRING(@HeapRid, 7, 1)))
                              
AS 'Fileid:Pageid:slot'


The output of the query is 1:4968:6 which says File id =1, Pageid =4968 and slot no=6. Let us see the page number 4968


DBCC traceon(3604)
GO
DBCC PAGE ('mydb',1,4968,3)










































In the output, you can see all columns  of record having productid=707 and  salesordeid = 49464 at slot number 6. 

Let us consider  the below query and analyse how SQL sever fetch the data.

SET STATISTICS IO ON
GO
SELECT *  FROM SalesOrderDetail WHERE productid=707 AND SalesOrderid=49464

SQL Server  has to do two I/O operation to reach the leaf level of non clustered index and one I/O operation to fetch the remaining data using Heap RID from the heap structure.The execution plan of the query is given below


Even if we make changes in the query to fetch only the ProductId,SalesOrderid and SalesorderDetaildId columns, still SQL sever need to do Key lookup operation. This is because SalesOrderDetailid is not declared as clustered key and not available in the leaf level of the nonclustered index. To avoid the key lookup operation, we need to limit the select columns only to the non clustered key(ProductKey ,salesorderid).


If you liked this post, do like my page on FaceBook

Thursday 14 March 2013

SQL Server : Part 4 :Explaining the Non Clustered Index Structure

A table can have only one clustered index as the data rows are stored in the order of the clustered key, but a table can have multiple non clustered indexes.We have discussed about clustered index structure in our earlier post . In this post let us try to understand the non clustered index structure.

Logical Representation of Non Clustered index

In a simple word , a non clustered index is a subset of a table. When we define a non clustered index, SQL server store the set of non clustered key in a different pages.Let us consider a table with four columns (PersonId(PK),PersonType,FirstName,LastName) and a non clustered index on that table.The actual table is stored in the order of personid column (Cluster index key).In the below figure will give you a pictorial representation of non clustered index. The non clustered index key column are stored  apart from the actual table.If you look into the Non clustered index ,you can notice that, records are stored in the order of Firstname and lastname (Non Clustered index key) not as in the order of actual table.To understand easily , non clustered index can be considered as subset table of actual one. 

Fig 1

Now let us assume that we have a task to find out all records whose first name is 'Michael'.If you tried to find out from the Actual table , we need to go through each record from top to bottom to check whether first name matches with our search string 'Michael' as the records are not ordered based on the firstname column. It will be more tedious task if we have thousands of records in that table. The task will be much easier if we look into the the Non Clustered index side as the first name and last name are alphabetically ordered.We can easily locate the records which has firstname as 'Michael' and we do not need go beyond that as we are sure that there will not be any more records with firstname as 'Michael'.

Now we know Firstname and lastname of the record. How do we get the values for other two column ? Let us make a change in the Non clustered index part by associating the PersonId column along with the non clustered index.

Fig 2




















Now, once we locate the records , we can go back to the Actual Table using the PersonId (Cluster index key) to find the values of other columns and this operation is called bookmark lookups or RID lookup. 

Clustered index v/s Non clustered index

Non clustered indexes have the same B-tree structure as clustered indexes.The non clustered  index key will not make any change in the sort order of underlying table where as clustered index force the SQL server to store the underlying table in the order of cluster index key. The leaf level of clustered index are made up of data pages which contain the actual data of table where as the leaf level of non clustered index are made up of index pages.

Non clustered index can be defined on a heap table or clustered table.In the leaf level of nonclusterd index, each index row contain the nonclustered key value and a row locator.This locator point to a the data row in the clustered index or heap.The row locator in nonclusterd index rows are either a pointer to a row or a clustered index key for a row. If the table is a heap, which means it does not have a clustered index,the row locator is a pointer to the row.The pointer is built from the file identifier ,page number and slot number of the row on the page. The whole pointer is known as a Row ID(RID). If the table has a clustered index, the row locator is the clustered index key for the row.

Hand on

Let us create a unique non clustered index on the salesorderdetails table that we were working on the clustered index post. The table has a unique clustered index on salesorderdetailid column. Now we are defining a non clustered index using the below statement

CREATE UNIQUE INDEX Ix_ProductId ON SalesOrderDetail(ProductId,Salesorderid)

A pictorial representation of the b-tree structure is given below
Clustered Index Structure





















Now let us see how SQL server store the index. Let us see the output of DBCC IND command

DBCC IND('mydb','SalesOrderDetail',2)


The last parameter, 2 is the index id of the index Ix_ProductId. For the usage of DBCC IND refer this post.

The output returns me 229 records which include one IAM page and 228 index page.From this we can find the root page id by locating the record having highest value for  IndexLevel column. Remember that index level increase from leaf level to root. In this case, the root level page id is 4688 with index level 1. That means, the b-tree structure of this index has only root and leaf level  .There is no  intermediate level  in the b tree structure of this non clustered index. Let us see the page 4688.

DBCC traceon(3604)
GO
DBCC page('mydb',1,4688,3)

This returns the 228 records (228 leaf level index pages) .Part of the output is shown below.The output looks same as the clustered index root/intermediate page structure. The output can be read as , all the records where the combination of  productid,salesorderid is less than or equal to (707,51151) can be found in child page 2288. All the records where the combination of  productid,salesorderid is between (707,51151)  and (707,55920) can be found in the child page 2289 and so on.

Fig 3




















Let us see the page 2289

DBCC traceon(3604)
GO
DBCC page('mydb',1,2289,3)

This returns 539 records, all with product id 707.As this index use only two level, this is the leaf level of the b tree structure. Here you can notice that, there is no child page id but we have the value of the salesorderdetailid column (Clustered index key) which SQL server use do key or bookmark look up operation.

Fig 4

Let us see, how SQL server perform a select operation using this index. 

SET STATISTICS IO ON
GO
SELECT *  FROM SalesOrderDetail WHERE productid=707 AND SalesOrderid=51192

The execution plan will looks like below.You can see the Key lookup operation in the execution plan.


Fig 5















As the where condition exactly match with our non clustered index definition, SQL sever use that index to execute this query.As a first step SQL server reads the root page of b tree structure.The combination of our search criteria (707,51192) falls in to the second records (Refer Fig 3). So SQL server goes to the child page (Page id 2289).In that page, we can locate the record with the exact combination of (707,51192) with salesorderdetailid 37793. From here, SQL server do a key look up operation using the salesorderdetailid value. From the earlier post about clustered index, we know that , SQL server need to perform 3 I/O operation while searching on any clustered index key. So to complete this query, SQL server needs to do 5  I/O operation (2 in non clustered index and 3 for bookmark/key lookup operation in the clustered index)  and that is the same you can find in the output of IO statistics. 

To understand it better,let us consider non clustered index as subset table ( let us say table name Saleorderdetail_NC) of salesorderdetail table with columns productid,salesorderid and SalesorderDetailid and having combination ProductId and salesorderid  as clustered index.Then the result of the above query can  be obtained by combining the output of below two query.
SELECT *  FROM SalesOrderDetail_nc WHERE productid=707 AND SalesOrderid=51192
GO
SELECT *  FROM SalesOrderDetail WHERE SalesOrderDetailid=37793


As a last part let us see one more query

SELECT *  FROM SalesOrderDetail WHERE productid=707

This query returns 3083 records and the search condition is matching with the non clustered index first column. Still SQL server will not use the non clustered index to execute the query. Please find below the execution plan of this query.

Fig 6



The reason behind that, if it use the non clustered index, it has to do a key lookup operation for 3083 records . That itself requires  (3083X3)  9249 I/O operation . Instead of that, SQL sever can fetch the output by performing clustered index scan which need only 1501(Total number of pages required for clustered index b-tree structure) I/O operation.If we make slight changes in the query to select only Productid ,SalesOrderDetailid and SalesOrderId, SQL server use the non clustered index as it does not need to do the key lookup operation. The leaf level of the non clustered index have the values of these three column . Please find below the execution plan.

SELECT productid,salesorderdetailid,salesorderid  FROM SalesOrderDetail WHERE productid=707

FIg 7









I know this was a bit lengthy post, but still I feel it will give you the clear picture of the non clustered index structure.

If you liked this post, do like my page on FaceBook