
Friday, March 14, 2014

Snowflake Schemas Vs. Star Schemas

Today we will briefly discuss the difference between star and snowflake schemas, and what you should consider when making design decisions for a data warehouse.

What are star schemas and snowflake schemas?

In dimensional data warehousing, a star schema is a dimensional model composed of a central fact table surrounded by multiple denormalized dimension tables. Dimension hierarchies are flattened into a single table. For example, in the AdventureWorks sample data warehouse, the customer dimension may hold attributes about customers such as title, name, date of birth, and address in one table. For a data warehouse in healthcare, the provider dimension would contain attributes about providers and also about the practices those providers belong to. Thus, there is some redundancy in a star schema's dimension tables.
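As a rough sketch (the table and column names below are illustrative, not the actual AdventureWorks definitions), a denormalized customer dimension might look like this:

-- Star schema: geography attributes are flattened into the single customer dimension
CREATE TABLE dbo.DimCustomer (
    CustomerKey       INT IDENTITY(1,1) PRIMARY KEY,
    Title             VARCHAR(10)  NULL,
    FirstName         VARCHAR(50)  NOT NULL,
    LastName          VARCHAR(50)  NOT NULL,
    BirthDate         DATE         NULL,
    AddressLine1      VARCHAR(100) NULL,
    City              VARCHAR(50)  NULL,
    StateProvinceName VARCHAR(50)  NULL,
    CountryRegionName VARCHAR(50)  NULL  -- repeated for every customer in the same country
)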

On the other hand, a snowflake schema can be regarded as a special type of star schema. It still has a central fact table, but uses multiple normalized tables to represent a dimension. For example, the Customer dimension can be represented by a Customer table and a Geography table, as shown below.

[Figure: Customer dimension snowflaked into Customer and Geography tables]

In the provider dimension example, there may be one Practice table containing all the practices' attributes and another table containing only the providers' attributes plus a foreign key pointing to the Practice table.
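Continuing the sketch above, the snowflaked version of the customer dimension moves the geography attributes into their own normalized table and references them by key (again, hypothetical names):

-- Snowflaked alternative to the star-schema DimCustomer above
CREATE TABLE dbo.DimGeography (
    GeographyKey      INT IDENTITY(1,1) PRIMARY KEY,
    City              VARCHAR(50) NULL,
    StateProvinceName VARCHAR(50) NULL,
    CountryRegionName VARCHAR(50) NULL
)

CREATE TABLE dbo.DimCustomer (
    CustomerKey  INT IDENTITY(1,1) PRIMARY KEY,
    Title        VARCHAR(10) NULL,
    FirstName    VARCHAR(50) NOT NULL,
    LastName     VARCHAR(50) NOT NULL,
    BirthDate    DATE        NULL,
    GeographyKey INT NULL REFERENCES dbo.DimGeography (GeographyKey)
)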

Differences or trade-offs

In terms of disk space, star schemas cost more, because snowflake schemas normalize the dimension tables and thus remove some of the attribute redundancy. (Compared to the size of the fact tables, however, the dimension tables are usually negligible.)

In terms of performance, however, star schemas usually do better, because a query against a snowflake schema may need many more joins to perform an aggregation.
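For instance, using the hypothetical customer tables sketched above plus a hypothetical FactSales fact table, a sales-by-country aggregation needs one dimension join in the star schema but two in the snowflake version:

-- Star schema: country lives directly on the denormalized customer dimension
SELECT c.CountryRegionName, SUM(f.SalesAmount) AS TotalSales
FROM dbo.FactSales f
JOIN dbo.DimCustomer c ON c.CustomerKey = f.CustomerKey
GROUP BY c.CountryRegionName

-- Snowflake schema: the same aggregation needs an extra hop through DimGeography
SELECT g.CountryRegionName, SUM(f.SalesAmount) AS TotalSales
FROM dbo.FactSales f
JOIN dbo.DimCustomer c ON c.CustomerKey = f.CustomerKey
JOIN dbo.DimGeography g ON g.GeographyKey = c.GeographyKey
GROUP BY g.CountryRegionName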

Also, a dimension table in a snowflake schema may have one or more parent tables, so the dimensions in a snowflake schema must be tightly controlled and managed to avoid update and insert anomalies.

Thus, the Kimball design tips recommend star schemas for both understandability and performance. Even when snowflake schemas are used, it is suggested to place views on top of the normalized tables so that each dimension appears to the user as a single table.
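As a sketch of that suggestion, a view over the hypothetical snowflaked tables above can re-flatten the customer dimension so that report writers still see a single table:

-- Present the snowflaked customer/geography pair as one logical dimension
CREATE VIEW dbo.vDimCustomer
AS
SELECT c.CustomerKey,
       c.Title,
       c.FirstName,
       c.LastName,
       c.BirthDate,
       g.City,
       g.StateProvinceName,
       g.CountryRegionName
FROM dbo.DimCustomer c
LEFT JOIN dbo.DimGeography g ON g.GeographyKey = c.GeographyKey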

Cases to choose snowflake schemas

There are some use cases where a snowflake schema may be the better choice:

  • Query frequency: large customer dimensions such as Amazon's or Facebook's, where, for example, 90 percent of the fact table measurements involve anonymous visitors about whom you collect little detail, and 10 percent involve reliably registered customers about whom you collect much more detail across many dimensions. Moving the 10 percent registered customers into a separate dimension can improve performance.
  • Sparsely populated attributes, where most dimension records are NULL, are moved to a sub-dimension.
  • Low-cardinality attributes that are queried independently. For example, a product dimension may contain thousands of products but only a handful of product types, so if the product types are often queried independently, it is better to have a separate product-type dimension. Similarly, a provider dimension may carry a large number of demographic attributes that really belong at the practice level. Rather than repeat them for every provider, we might create a special normalized practice table, called a "dimension outrigger", that the provider dimension references (see the sketch after this list).
  • Attributes that are part of a hierarchy and are queried independently. Examples include the year, quarter, and month attributes of a date hierarchy, and multiple enterprise calendar dimensions.
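A minimal sketch of the "dimension outrigger" mentioned above, with hypothetical healthcare table names: the practice-level attributes are stored once in an outrigger table and referenced from the provider dimension instead of being repeated for every provider.

-- Hypothetical dimension outrigger for practice-level attributes
CREATE TABLE dbo.DimPractice (
    PracticeKey  INT IDENTITY(1,1) PRIMARY KEY,
    PracticeName VARCHAR(100) NOT NULL,
    PracticeType VARCHAR(50)  NULL,
    CountyName   VARCHAR(50)  NULL
    -- ...plus the many other practice-level demographic attributes
)

CREATE TABLE dbo.DimProvider (
    ProviderKey  INT IDENTITY(1,1) PRIMARY KEY,
    ProviderName VARCHAR(100) NOT NULL,
    Specialty    VARCHAR(50)  NULL,
    PracticeKey  INT NULL REFERENCES dbo.DimPractice (PracticeKey)  -- outrigger reference
)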

In summary, before deciding which schema to use, you should have business requirements defined for how the data warehouse will be used. The design should reflect how users want to aggregate and "slice and dice" historical fact data, and its primary purpose should be enhancing business value. Work closely with your business users, uncover their needs, and then design a system that meets them!

Monday, February 24, 2014

How to: Do Incremental Loading with the Hashbytes Function and a Stored Proc

Using hash functions in SQL Server for incremental data loading has a big performance advantage when you have millions of rows to load, or several dozen columns to compare when deciding whether to update, insert, or expire a row, as mentioned in Andy Leonard's Anatomy of an Incremental Load. Brett Flippin has shown how to calculate hash columns with SSIS Script Components in his article Speeding Up Incremental ETL Processes in SSIS by Using MD5 Hashes. Today I would like to demonstrate how to calculate hash columns with a stored procedure. The stored procedure automatically concatenates all columns except the hash columns for a given table and then generates a hashKey for each row. All you need to provide is the schema and table name, which saves the time and effort of dragging and dropping columns for tables with different structures.

First let's start with the script to create our test database and destination table:

-- Create test database
use master
go
if exists(select name from sys.databases where name = 'DemoSQL2012')
drop database DemoSQL2012
go
create database DemoSQL2012
go
-- Create Contact Destination table
use DemoSQL2012
go
if exists(select name from sys.tables where name = 'hashTest')
drop table dbo.hashTest
go

CREATE TABLE [dbo].[hashTest](
    [personId] [int] IDENTITY(1,1) NOT NULL,
    [firstName] [varchar](10) NULL,
    [lastName] [varchar](10) NULL,
    [hashKey] [varbinary](256) NULL,
    CONSTRAINT [PK_hashtest] PRIMARY KEY CLUSTERED ([personId] ASC)
)

INSERT INTO [dbo].[hashTest] ([firstName], [lastName])
VALUES ('David', 'Smith'),
       ('Bob', 'Stones')

If you check the table dbo.hashTest, you will see that it has four columns and two records, with the hashKey column still NULL.


[Screenshot: dbo.hashTest with two rows and NULL hashKey values]


Create Stored Proc dbo.GetHashKey


Next, run the following script to create the stored procedure that generates the hashKey.

USE [DemoSQL2012]
GO

CREATE PROCEDURE [dbo].[GetHashKey] ( @pSchemaName sysname, @pTableName sysname, @pDebug INT = NULL )
/************************************************************************************************************
Description: Populates the hashKey column of the target table indicated by the @pSchemaName and @pTableName
parms, using the SHA1 algorithm. Here the hash column is called hashKey.
The concatenated string passed to the HASHBYTES function has the format col1$col2$col3$.
Examples: exec [dbo].[GetHashKey] 'dbo','hashTest', 1 -- returns the concatenated strings fed to HASHBYTES
exec [dbo].[GetHashKey] 'dbo','hashTest', 2 -- returns the generated SQL that updates the hashKey column
*************************************************************************************************************/
AS

BEGIN
    SET NOCOUNT ON;

    DECLARE @SchemaName VARCHAR(MAX)
    DECLARE @TableName  VARCHAR(MAX)
    DECLARE @HashSql    VARCHAR(MAX)
    DECLARE @Sql        VARCHAR(MAX) = ''
    DECLARE @CRLF       CHAR(2) = CHAR(13) + CHAR(10)

    SET @SchemaName = @pSchemaName
    SET @TableName  = @pTableName

    -- Build the concatenation expression for HASHBYTES from INFORMATION_SCHEMA,
    -- skipping the hash columns themselves
    SELECT @HashSql = COALESCE(@HashSql + '+ ', '')
         + 'ISNULL(CAST(' + QUOTENAME(COLUMN_NAME) + ' as varchar),''~'')+''$''' + @CRLF
    FROM INFORMATION_SCHEMA.COLUMNS
    WHERE TABLE_NAME = @TableName
      AND TABLE_SCHEMA = @SchemaName
      AND COLUMN_NAME NOT LIKE 'hash%'

    -- Build the update script; rows are updated in chunks to keep transactions manageable
    SET @Sql = @Sql + 'DECLARE @updateChunk int = 4000000;
WHILE 1=1
BEGIN
'
    SET @Sql = @Sql + 'UPDATE TOP(@updateChunk) ' + @SchemaName + '.' + @TableName + ' ' + @CRLF
    SET @Sql = @Sql + 'SET hashKey = ' + @CRLF
    SET @Sql = @Sql + 'HashBytes(''SHA1'',' + @HashSql + ')' + @CRLF
    SET @Sql = @Sql + 'WHERE hashKey is null;
IF @@RowCount=0
BREAK;
END'

    IF (@pDebug = 1)
    BEGIN
        -- Debug 1: show the concatenated string for each row instead of updating
        DECLARE @sqlStr2Hash VARCHAR(MAX) = 'SELECT ' + @HashSql + ' as hashCol FROM ' + @SchemaName + '.' + @TableName;
        EXEC (@sqlStr2Hash)
    END
    ELSE
    BEGIN
        IF (@pDebug = 2)
            -- Debug 2: return the generated update script without running it
            SELECT @Sql;
        ELSE
            EXEC (@Sql);
    END
END

GO

Here we use the SHA1 algorithm with the HASHBYTES function. According to the article "Exploring Hash Functions in SQL Server", SHA1 is one of the reliable algorithms with the "best spread over the hash space". We use the INFORMATION_SCHEMA.COLUMNS metadata view to obtain all the columns of the given table.
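To see both pieces in isolation, here is a minimal sketch against the demo table (the literal string is just an example of the concatenated format):

-- HASHBYTES with the SHA1 algorithm returns a varbinary hash of the input string
SELECT HASHBYTES('SHA1', '1$David$Smith$') AS hashKey

-- The metadata query the proc relies on: every column of dbo.hashTest
-- except those whose name starts with 'hash'
SELECT COLUMN_NAME
FROM INFORMATION_SCHEMA.COLUMNS
WHERE TABLE_SCHEMA = 'dbo'
  AND TABLE_NAME = 'hashTest'
  AND COLUMN_NAME NOT LIKE 'hash%'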


How to use Stored Proc dbo.GetHashKey?


To call this stored procedure we would execute it as follows:

exec [dbo].[GetHashKey] 'dbo','hashTest', 1

It will show you the concatenated string for each row.


[Screenshot: concatenated hash strings returned by @pDebug = 1]
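For the two demo rows, the output should look roughly like this (assuming the columns come back in ordinal order; the trailing '$' comes from the per-column delimiter):

hashCol
--------------
1$David$Smith$
2$Bob$Stones$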


If you call the stored procedure with @pDebug set to 2, you will get the SQL script that updates the hashKey column.

exec [dbo].[GetHashKey] 'dbo','hashTest', 2

[Screenshot: generated update script returned by @pDebug = 2]
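For the demo table, the generated script should look roughly like this:

DECLARE @updateChunk int = 4000000;
WHILE 1=1
BEGIN
    UPDATE TOP(@updateChunk) dbo.hashTest
    SET hashKey =
        HashBytes('SHA1',
            ISNULL(CAST([personId] as varchar),'~')+'$'
          + ISNULL(CAST([firstName] as varchar),'~')+'$'
          + ISNULL(CAST([lastName] as varchar),'~')+'$')
    WHERE hashKey is null;
    IF @@ROWCOUNT = 0
        BREAK;
END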


If you want to update the hashKey column, call the stored procedure as follows:

exec [dbo].[GetHashKey] 'dbo','hashTest'

Now you can see that the table has been updated with hashKey values:


[Screenshot: dbo.hashTest with populated hashKey values]


If you would like to use this stored procedure in SSIS packages, you can call it directly from an Execute SQL Task, with the schema name and table name passed in as parameters.
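For example, with an OLE DB connection the Execute SQL Task's SQLStatement could be as simple as the call below, with the two ? parameter markers mapped to SSIS string variables that hold the schema and table names (a sketch, not tied to a particular package):

-- Execute SQL Task statement; the ? markers are mapped on the Parameter Mapping page
EXEC dbo.GetHashKey ?, ?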


Wrap up


You can call the stored procedure above to speed up the calculation of hashKey columns for different tables when loading a data warehouse. The inputs are simply the schema name and table name. With the @pDebug parameter, it is easy to inspect your hash strings or the generated SQL script.

Also, please keep in mind that the HASHBYTES function has an input limit of 8000 bytes. For extremely wide tables, you can lean on SQL Server's CLR integration and perform the hashing in a .NET assembly.


Reference: HASHBYTES (Transact-SQL)