Monday, November 30, 2009

Temp table vs Table variable



Ok. Another done to death topic. I will try my best to say something that has not been said.( or atleast said little ).

Before we get started , couple of popular myths to be clarified.

* Both temp table and table variable use tempdb.
The difference is table variable's data is initally stored in memory
and when the data grows bigger it is pushed to tempdb.
The object definition of table variable when it is created is stored in tempdb.

* Performance wise both have their advantages and disadvantages. There are a few
scenarios where temp table outperforms a table variable which we will attempt to
see.

Few pointers about them are given below:

* Table variable can have constraints but not indexes. This means that one can
create primary key( creates cluster index ) on table variable or a unique
constraint ( creates non cluster ) but not non unique clustered or non
clustered indexes.However, temp table can have constraints and indexes.
* Statistics are not maintained for table variable but yes for temp table
* Table variables are not recompiled but temp table's are recompiled
* Scope of a table variable is limited to the batch/section but temp table is
available thro out the session.
* changes made to a table variable cant be rollbacked but on a temp table
rollback is possible.

The points 1,2,3 have serious implications on the performance of your code and becomes a major factor in deciding your temp object type.

This post is just a starter placing a few facts on this topic. Next few posts,
I will come up withn few scenarios and examples to compare the performance of
both the objects.So please hang on for few more days for in depth analysis :)

Sunday, November 22, 2009

Covering Index - Include NonKey columns


To help my impatient readers reaching here thro google search, let me come to the point right away.

What is Include Non Key columns in SQL 2005?

Include Non Key Columns allows one to add columns only to the leaf of a non clustered index. Include Non Key columns option is applicable only for Non Clustered indexes.

Whats the use?

* SQL Server has a restriction that the Key columns of an index ( the columns that are present in the non leaf levels of an index) can have a maximum size
of 900 bytes. Also, Number of columns within an index cant exceed 16. Non key
columns are not counted for these restrictions. So if you are likely to execeed
900 bytes Include non key columns.
* Including Non key columns increases your chances of covering your query and hence boosting performance.

How to include non key columns?

SSMS->User Table->Indexes->Index properties->Included Columns.
Refer Screenshot below.


/*****************************************************************************************/

Now for my usual lengthy write up.

Covered indexes have a few drawbacks as stated above. 'Include Non Key columns' is wonderful especially in the following scenario.


DECLARE  @dt DATETIME


SET @dt = Getdate()


SELECT supplier_xid,

       customer_xid,

       product_xid,

       trans_city,

       trans_date,

       full_payment_receieved,

       payment_mode

FROM   dbo.supplier_product_customer

WHERE  trans_city = 'Moscow'

       AND payment_mode = 'Amex'


SELECT Datediff(ms,@dt,Getdate())


Supplier_xid,Customer_xid,Product_xid,Trans_city,Trans_date Columns from the Primary key.Trans_city column has a nonclustered index.The stats obtained by running the query are provided below.The table contains 112155 rows.

Time taken : 343 ms

IO stats : Table 'Supplier_Product_Customer'. Scan count 1, logical
reads 402, physical reads 3, read-ahead reads 3, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.


Overall query cost : 0.32

Query Plan is shown below:


It shows Key lookup which means that the query engine refered to Non Clustered index
key and jumped to the table to get some data.This doesnt come as a surprise as the Clustered index and Non clustered index doesnt contain the columns full_payment_receieved, Payment_mode columns, as they are not covered by the Non clustered index.
So ,How can we improve it?
One way is to add these two columns to the Nonclustered index on the trans_city column. Thats not a great idea because

* Already there are 5 key columns that are part of the index ( 4 Cluster
+ 1 Non cluster ). Adding two more would make the index's non leaf
pages huge.

* Payment_mode,full_payment_receieved dont have great selectivity. In
other words,Payment_mode column can have a maximum of 6 disitnct
values. ie.Amex / master/cheque /Cash/Visa/Nets. So its not going to
filter out most of the rows.Trans_city column which has reasonable
selectivity ( greater filtering ability/column not having many
duplicate values ) filters most of the rows and only the rest are
filtered by Payment_mode column(filters 1/6th). This makes Payment_mode
a bad candidate for indexing.'full_payment_receieved' has only Yes/No
values and is not part of the filtering criteria itself, so it
doesnt make great sense to store them at non leaf levels of
the index.

Just to stress a basic point in indexes, its the non leaf levels ( higher/upper
levels of the B+ tree ) which direct the way to search for the required row. More
selective ( filering out more rows) the columns in non leaf, shorter the distance to reach the required row(s) and hence better the performance.

So whats the next option? Including these two columns at leaf level alone using
SQL 2k5's Include non key columns.


CREATE NONCLUSTERED INDEX [NCIX_Supplier_Product_Customer_Trans_City] ON [dbo].[Supplier_Product_Customer] (

      [Trans_city] ASC)

INCLUDE ([Payment_mode],[full_payment_receieved])

WITH (pad_index = OFF,

      statistics_norecompute = OFF,

      sort_in_tempdb = OFF,

      ignore_dup_key = OFF,

      drop_existing = OFF,

      online = OFF,

      allow_row_locks = ON,

      allow_page_locks = ON)

ON [PRIMARY]


So, now let us run the same query.


DBCC FREEPROCCACHE


DBCC DROPCLEANBUFFERS



       --To Clear the cache


SET STATISTICS  io  ON

--To capture IO stats


DECLARE  @dt DATETIME


SET @dt = Getdate()


SELECT supplier_xid,

       customer_xid,

       product_xid,

       trans_city,

       trans_date,

       full_payment_receieved,

       payment_mode

FROM   dbo.supplier_product_customer

WHERE  trans_city = 'Moscow'

       AND payment_mode = 'Amex'


SELECT Datediff(ms,@dt,Getdate())



Query plan screen shot is provided above. Query plan doesnt have any Key lookup and Shows NonClustered index seek which implies the query was fully covered with in the index.


Other stats:

Time taken :170 ms

IO stats : Table 'Supplier_Product_Customer'. Scan count 1,
logical reads 5, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.


Query cost :0.003

You can see the difference. 170 ms against 340 ms which is almost 50% improvement.
IO stats indicate a tremendous improvement. 5 reads against 402 reads.
Query cost has improved 100 times from 0.32 to 0.003.

So to sum up, Include Nonkey column is ideal when

* Most of the columns have been covered and 1/2 columns have been left out.

* Left out columns are not a part of filtering criteria.

* Even if the left out column is a part of filtering criteria, if it is not
a major filtering condition and has a bad selectivity factor.

* Like for any indexed column, included columns shouldnt be frequently
updated.

Sunday, November 15, 2009

Covering Index


Before we get started with this post please read the last one here.Even if you dont, just know that the leaf of a non clustered index contains Clustered index key.

So what are we upto now?
Covered indexes.

Covered indexes arguablly gets the best out of an index. When all the columns that
are required by the query ( Both Select columns - Projection and where clause columns ( selection criteria ) )are present in an index, then the index is said to be covered index.Covered index performs much better than Non Clustered index + Row look up as the query need not go( rather jump ) to the Data page to fetch the data.
All data access starts and ends within the index itself. The number of data pages in a table are always greater ( theortically greater than equal to ) than the leaf pages of an index, as data pages need to hold all the columns in the table whereas the index leaf pages hold only the indexed columns. Let us see an example..

Consider the following table


CREATE TABLE [dbo].[Supplier_Product_Customer] (

  [Supplier_xid]  [INT]    NOT NULL,

  [Customer_xid]  [INT]    NOT NULL,

  [Product_xid]   [INT]    NOT NULL,

  [trans_date]    [DATETIME]    NOT NULL,

  [Qty]           [INT]    NULL,

  [Trans_city]    [VARCHAR](100)    NULL,

  [Note]          [VARCHAR](500)    NULL,

  [Rate_per_unit] FLOAT,

  [comments]      VARCHAR(500),

  [balance_left]  INT,

  [Payment_mode]  VARCHAR(50)

  CONSTRAINT [PK_Supplier_Product_Customer] PRIMARY KEY CLUSTERED ( [Supplier_xid] ASC,[Customer_xid] ASC,[Product_xid] ASC,[trans_date] ASC ) WITH ( pad_index = OFF,statistics_norecompute = OFF,ignore_dup_key = OFF,allow_row_locks = on,allow_page_locks = on,FILLFACTOR = 90 ) ON [PRIMARY])

ON [PRIMARY]


A typical transaction table with a primary key on Supplier id ( Supplier_xid) , Customer id ([Customer_xid] ),Product ID ([Product_xid]) and transaction date ( [trans_date] ). So We have clustered index key on our primary key columns.

The table has 112055 rows. Total number of pages is 1020. Consider the following query.


SET STATISTICS  IO  ON

--To get IO stats



SELECT supplier_xid,

       customer_xid,

       product_xid,

       trans_city,

       trans_date

FROM   dbo.supplier_product_customer

WHERE  trans_city = 'Perth'


The query returns 22316 rows.

IO Statistics are provided below.

Table 'Supplier_Product_Customer'. Scan count 1, logical reads 1029, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.

Grab a look at the query plan.




Query Plan shows a clustered index scan.

Points to note are

* Query plan uses Clustered index scan
* Total Logical Reads : 1029
* Query cost: 0.881357

From IO stats and query plan its obvious that the entire table was read/scanned.

To make this query use a covered index, let us create a nonclustered index on Trans_city column.


CREATE NONCLUSTERED INDEX [NCIX_Supplier_Product_Customer_Trans_City] ON [dbo].[Supplier_Product_Customer] (

      [Trans_city] ASC)

WITH (pad_index = OFF,

      statistics_norecompute = OFF,

      sort_in_tempdb = OFF,

      ignore_dup_key = OFF,

      drop_existing = OFF,

      online = OFF,

      allow_row_locks = ON,

      allow_page_locks = ON)

ON [PRIMARY]


Let us run the same query again:


SET STATISTICS  IO  ON

--To get IO stats



SELECT supplier_xid,

       customer_xid,

       product_xid,

       trans_city,

       trans_date

FROM   dbo.supplier_product_customer

WHERE  trans_city = 'Perth'


The query returns the same 22316 rows.

Query plan provided below.



IO Stats are provided below.

Table 'Supplier_Product_Customer'. Scan count 1, logical reads 100, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.

Points to note are

* Query plan uses Non Clustered index seek
* Total Logical Reads : 100
* Query cost: 0.105275

Query plan indicates a non clustered index seek using the new created NCIX_Supplier_Product_Customer_Trans_City index instead of the clustered index scan.
The number of IOs are also very very less and its down to 100 compared to 1029 in clustered index scan. The cost of the query shows great improvement from 0.88 to 0.105. Hence, there is a significant improvement in the performance as well.

The reason as explained, the query engine accesses only NCIX_Supplier_Product_Customer_Trans_City index and not the table at all as all the columns are present with in the index itself. The out of the columns selected Supplier_xid,Customer_xid,Product_xid,Trans_city,Trans_date make the clustered index which is present in the leaf nodes of Non clustered index. Trans_city column is the non clustered index key and so its also present in the non clustered index. Non Clustered index is definitely smaller than the entire table and hence less number of reads.. So better performance.Had it been any version before SQL 2k then using non clustered index would have needed 22316 IOS + cost of traversing index non leaf pages , which lot more expensive than table scan and hence it would have perfomed a table scan again.But with SQL 2k and after, the unwanted table scan was prevented.

Covered index in SQL 2k has a drawback.SQL 2005 has gone one more step ahead and improved this performance even more by making an enhancement on covered indexes.We will look into the drawback and the improvement done on SQL 2k5 in the next
post.

Sunday, November 8, 2009

Leaf of a Non Clustered Index



After a small break I am back with indexes again...

Agreed that this post provides a very basic info on indexes, but this one is just a starter for the posts I plan to write in the next 2 weeks.

An entry on a non clustered index would have two parts.

1> Non Clustered index key
2> Row Locator.

If the non clustered index is on a table with no clustered index then the Row locator
is the physical address of the entry or simply called as Row ID.If the non clustered index is on a table with a clustered index then the Row locator is the clustered index key.



Grab a look at above picture. It should give a clear idea on the structure of non clustered index on a table with a clustered index.

In case if the clustered index is not unique, then non clustered index adds a unique
indentifier ( 4 bytes length ) to the clustered index key to uniquely identify the row. Before SQL 2000 non clustered indexes always had the Physical address/Row Pointer/Row id and not clustered index key. From SQL 2k onwards it was a welcome change.Let us see the reason why clustered index keys are part of Non clustered indexes.


> Physical address of a row changes due to page spilts. If non clustered indexes contained Row ID then they should be updated every time a page split happens in the table. Even if there is Re Index on clustered index then non clustered index should be rebuilt as well.This can be serious pain. Thankfully this was changed and it brought many many good things to query tuning. We will see that in the next post.

Just to Summarize, the leaf of a non clustered index on a clustered table contains a clustered index key.

Picture courtesy : http://blogs.developpeur.org/photos/raptorxp/images/43172/original.aspx

Tuesday, November 3, 2009

NOT IN VERSUS NOT EXISTS



Logical difference between NOT IN and NOT EXISTS:

Though 'Not in' and 'Not exists' serve the same purpose, there is a logical difference
in their operation other than 'Not Exists' being correlated.

Before I explain the difference let us see an example.


CREATE TABLE #capital (

  capital_id INT,

  city_name  VARCHAR(100))


CREATE TABLE #city (

  city_name VARCHAR(100),

  country   VARCHAR(100))



INSERT INTO #capital

SELECT 1,

       'New Delhi'

UNION

SELECT 2,

       'Singapore'

UNION

SELECT 3,

       'London'


INSERT INTO #city

SELECT 'New Delhi',

       'India'

UNION

SELECT 'Singapore',

       'Singapore'

UNION

SELECT NULL,

       NULL


Two tables are created.. namely #capital,#city. #capital table contains three cities
.. New Delhi,singapore, London. The #city table contains New Delhi,Singapore and a NULL
row.

We would like to write a query to find the city that is present in Capital and not in
city table. Expected result is London.

First let us take 'NOT IN':


SELECT *

FROM   #capital

WHERE  city_name NOT IN (SELECT city_name

                         FROM   #city)


Straight forward query..
To our surprise, the query returns Zero rows.

Now for NOT EXISTS.


SELECT *

FROM   #capital cp

WHERE  NOT EXISTS (SELECT cy.city_name

                   FROM   #city cy

                   WHERE  cy.city_name = cp.city_name)


Hey presto..it works!!! London is returned.

Why this behaviour with NOT in?

The reason lies in the way in which 'NOT IN' handles NULL.
First point to note is 'NOT IN' condition can return three possible values.
They are True,False,NULL. But 'NOT EXISTS' always returns either True/False.

Closer look at how NOT IN works:-

NOT IN condition returns TRUE when condition check is successful
EX: 'London' NOT IN ( 'New Delhi','Singapore').
NOT IN condition returns FALSE when condition check FAILS
EX: 'London' NOT IN ( 'London','Singapore').

NOT IN condition returns NULL when condition check is made against NULL.
EX:
'London' NOT IN
( 'New Delhi','Singapore',NULL)


Returns NULL because 'London' not in ( NULL) will always return NULL and not TRUE as expected. As for the row to be successfully returned the result of the condition checked ( in this case 'not in') should be TRUE. But because of NULL returned by NOT in ( instead of TRUE ) it is discarded. So if the sub query contains NULL then for NOT IN condition no rows will be returned for any value.

Though 'IN' also has a similar behavior the anomaly is realized only in NOT IN because

'London' IN
( 'New Delhi','Singapore',NULL)

Returns NULL and not TRUE, so 'London' is not returned.


'London' IN
( 'New Delhi','Singapore','London',NULL)

Returns True as 'London' in ( 'London' ) is TRUE though 'London' in ( NULL ) is NULL.
Note that 'IN' needs just one TRUE.

But NULL comparison using '=' returns FALSE always ( assuming you haven't changed the default behavior using SET ANSI_DEFAULTS option). ie NULL = 'London' is always False. So, Not Exists always works.

As I have stressed before , another reason to correlate queries.
Performance wise as well NOT EXISTS performs better than NOT IN.