Saturday, December 3, 2016

Data Channel - Interview 01 - Andreas Wolter on Columnstore

Dear all,

Welcome to the first ever interview of DataChannel. I am honored to have Mr.Andreas Wolter, a German MVP, MCM, MCSM for the first interview. Detailed profile of Andreas provided below.

Andreas Wolter has over 16 years of experience with SQL Server and earned the Master-Certification both on SQL Server 2008 (MCM) and SQL Server 2012 (MCSM)– as one of only 2 experts worldwide. Besides that he has also been awarded with the MVP for SQL Server since 2014.

He is the founder of Sarpedon Quality Lab, originally from Germany, specializing in Development and Optimization of SQL Server Database- and Datawarehouse-architectures with focus on performance, scalability and security.


Website: http://andreas-wolter.com/
Facebook: https://www.facebook.com/andreas.wolter.01
LinkedIn:https://www.linkedin.com/in/andreaswolter

Topic of discussion is Columnstore Index. 

Interview video recording provided below:




Excerpts from the Interview provided below:


 What is Columnstore Index, how is it different from traditional Index?

ColumnStore Index is supposed to be optimized for access of large volume of data in the range of millions of rows. Columnstore index is optimized for single column access queries which are common in datawarehouse style applications. Columnstore Indexes design combined with underlying compression helps in fetching large volume of data at a much faster pace from OLAP style queries. Columnstore index has improved so much more in SQL Server 2016 to make it much more flexible.


What are the standard gains for applications using columnstore index?


First, application workload needs to be suitable to columnstore index.  The major gains would be

Compression: If the application is a typical datawarehouse application with "star schema" design, with 20 million or more rows, one is likely to receive 50% data compression by using columnstore index. Effectiveness of compression also depends on the data type of columns

Speed: On the speed front, it can be double or triple times or sometimes even 100 times faster, depending upon how the query has been designed and how the query was performing previously.

 How does SQL Server make such a improvement possible? What is the architecture of columnstore that makes it possible? 

It is a fundamental change in the data storage making it possible. Columnstore is hugely compressed resulting in massive reduction of I/O. Even today, in the era of SSDs, disk IO is the slowest part of the system. Compression in columnstore does provide half of the improvement in columnstore. And of course, there is more to it too.

How does one pick the tables that would benefit from columnstore? How does one design the application to leverage upon the benefits of columnstore?

Ideally datawarehouse style applications would benefit. Tables running into several million rows and wide tables with many columns (for ex 50 columns) would benefit from columnstore index
The width and length table need to be huge to benefit via columnstore index. The type of queries commonly executed by the application is also important. Columnstore would help queries fetching larger number of rows and wouldn't make a difference on queries reading 1 or 2 rows. Especially queries which involve aggregation functions with group by clause reading lots of rows would benefit the most.

How are the columnstore indexes stored on the disk? Are they differently stored compared to traditional Indexes?

Totally different. Only thing that is common is 8KB Pages for storing data. Unlike traditional Index pages which contain data of multiple fields and multiple data types, columnstore stores only one type of column data on a single page (One data type). These pages are organized as segments and each segment contains data from just one single column. This makes large compression possible.
Second part is, when fetching the data, if the query is selecting only 5 columns, column store reads the only the segments that contain the 5 columns. The traditional index would be forced to read all the columns as each page contains all the columns of the table or index.

Columnstore Index is one of those technologies which has improved a lot since 2012. Can you drive us thro the lifecycle of improvements it has had till 2016?

Columnstore was actually first introduced in SQL Server 2008 in the Parallel Data warehouse edition of SQL Server. 
In SQL Server 2012, once the nonclustered columnstore was put on the table, the table was write protected and supported only read only queries. Due to this limitation it very few people adapted or started using it.

In SQL Server 2014, clustered columnstore was introduced. Clustered columnstore by design contains all columns of the table. Clustered columnstore was updatable. However, Non clustered columnstore was still read only.

In SQL Server 2016, both clustered and non clustered column store indexes are updatable. However, one can't mix columns on clustered and non clustered indexes. Meaning, one can either have a clustered columnstore index or a non clustered columnstore index on the same table.

Andreas has prepared a Index Cheat Sheet to showcase the various combinations with Indexes. Please check it out over here


We understand that clustered columnstore index would contain all the columns of the table. Do I still need traditional indexes?

Yes. Good question. You will still need them if you have queries which pick smaller set of data. Traditional non clustered indexes are required for single row or fewer row lookups. Fetching lesser number of rows or specific rows from compressed segments would be harder and traditional indexes would help those requests.

What is Real Time Operational Analytics?

Let’s split this into 2 parts - Analytics is what we have discussed using all along using columnstore. Real time operational data refers to the data that is constantly changed via inserts / updates / Deletes. 
Real time operational analytics refers to the same table being used for two kinds of workload - both regular operations like insert / update /deletes and also datawarehouse style analytic queries. This has been made possible in SQL Server 2016 via the "Delta" store in columnstore indexes. "Delta Store" is an area of columnstore, which contains the hot data or recently changed data in an uncompressed format. This combination of static data on compressed segments and hot data on "Delta" Store makes real time operational analytics possible in SQL Server 2016.

The other buzzword is In Memory OLTP. Can it be combined with columnstore? How would applications benefit out of these two technologies?

The other much improved technology of SQL Server 2016 is in "In Memory". Yes, In SQL Server 2016, one can combine the columnstore technology and In memory table, which gives faster access to real time data. "Real Time Operational Analytics" with "in memory OLTP", makes the access to operational data lock free, latch free and also let analytic workload benefit via columnstore technology

What are the warning signs in using columnstore? What are the "Dont's" while using columnstore?

The queries/ workload used should suit the design of columnstore index. 
1) Application with most of the queries using "select * " or many columns may not benefit via columnstore index
2) Database schema design also matters. Shouldn't be a too complex schema design like snowflake / galaxy schema design. Datawarehouse applications using star schema design are most suitable
3) Functions or queries used also matters - For ex: - functions like min / max are suitable while commands like "distinct" are unsuitable for columnstore

No comments: