Thursday 12 February 2009

Clustered and non-clustered indexes in MSSQL server

Columns to include in clustered index

. Those that are often accessed sequentially
. Those that contain a large number of distinct values
. Those that are used in range queries that use operators such as BETWEEN, >
<= in the WHERE clause . Those that are frequently used by queries to join or group the result set When to use non-clustered index . Queries that do not return large result sets . Columns that are frequently used in the WHERE clause that return exact match . Columns that have many distinct values (that is, high cardinality) . All columns referenced in a critical query (A special nonclustered index called a covering index that eliminates the need to go to the underlying data pages.) (based on the book SQL Server 2005 Unleashed, Sams 2006). Note: if there are columns usually required in the results but not in the query conditions, we can add them as "included columns" (for non-clustered index), which means they will exist in the leaf nodes of the index for faster retrieval. This is similar to a covering index (same?) To improve index performance, we can "REBUILD" them (drop and recreate) or just "REORGANIZE" them (like defrag, without drop). When we reorganise we have the option to compact large objects. Index can be disabled, in this case if we rebuild it we don't need to have extra temp space. Index can be applied on a view, but there are several conditions that must be met before this can be done. See online documentation. A composite index will only be used optimally (i.e. index seek instead of index scan) if its first column is part of the search argument or join clause. When deciding on an index another parameter to consider is the selectivity of the index i.e. the number of distinct rows returned on the index keys / number of rows in the table. The higher this number the better. e.g.
   1:  select count(distinct surname) from sometable / select count(*) from sometable 

as a rule of thumb we require this to be > 0.85

Example sql code to determine the statistical distribution of values for a specific column


   1:  select state, count(*) as numrows, count(*)/b.totalrows * 100 as percentage 
   2:  from authors a, (select convert(numeric(6,2), count(*)) as totalrows from  authors) as b group by state, b.totalrows 
   3:  having count(*) > 1
   4:  order by 2 desc
   5:  go

No comments: