
Wednesday, February 26, 2014

2014 Gartner Magic Quadrant Report for BI and Analytics Platforms: Leaders & Trends

The 2014 Gartner Magic Quadrant report for business intelligence and analytics platforms was released last week. The report rates the strengths and weaknesses of 27 vendors in the market according to 17 defined capabilities across three categories: information delivery, analysis and integration. Below is a snapshot of the vendors' performance charted on a graphical quadrant.


[Figure: gartner-mq-2014, the 2014 Magic Quadrant for BI and Analytics Platforms]

Who are the leaders?

The two main axes remain the same as in the 2013 report: completeness of vision and ability to execute. The 2014 report added capabilities such as Geospatial and location intelligence, Embedded advanced analytics, Business user data mashup and modeling, Embeddable analytics, and Support for big data sources. Tableau leads in ability to execute, while IBM and SAS lead in completeness of vision. QlikView, Microsoft, SAP, TIBCO, Oracle, and MicroStrategy remain in the Leaders quadrant.

What are the trends for 2014?

The answer is "business-user-driven data discovery and analysis" and "Cloud BI". "The BI and analytics platform market is in the middle of an accelerated transformation from BI systems used primarily for measurement and reporting to those that also support analysis, prediction, forecasting and optimization," according to Gartner's 2014 report. The data-discovery category was pioneered by Tableau, QlikView, Microsoft, IBM and SAS, which have done a better job than the others of integrating their enterprise and new data discovery capabilities.

I highly recommend reading a more detailed analysis of the report here, or downloading it from http://www.qlik.com/mq2014.

Monday, February 24, 2014

How to: Do Incremental Loading with the HASHBYTES Function and a Stored Proc

Using hash functions in SQL Server for incremental data loading has a big performance advantage when you have millions of rows to load, or several dozen columns to compare before deciding whether to update, insert, or expire a row, as mentioned in Andy Leonard's Anatomy of an Incremental Load. Brett Flippin has shown how to calculate hash columns with SSIS Script Components in his article Speeding Up Incremental ETL Processes in SSIS by Using MD5 Hashes. Today I would like to demonstrate how to calculate hash columns with a stored procedure. The stored procedure automatically concatenates all columns except the hash columns for a given table, and then generates a hashKey for each row. All you need to provide is the name of the table, which saves the time and effort of dragging and dropping columns for tables with different structures.

First let's start with the script to create our test database and destination table:

-- Create test database
USE master
GO
IF EXISTS (SELECT name FROM sys.databases WHERE name = 'DemoSQL2012')
    DROP DATABASE DemoSQL2012
GO
CREATE DATABASE DemoSQL2012
GO
-- Create destination table
USE DemoSQL2012
GO
IF EXISTS (SELECT name FROM sys.tables WHERE name = 'hashTest')
    DROP TABLE dbo.hashTest
GO

CREATE TABLE [dbo].[hashTest](
    [personId] [int] IDENTITY(1,1) NOT NULL,
    [firstName] [varchar](10) NULL,
    [lastName] [varchar](10) NULL,
    [hashKey] [varbinary](256) NULL,
    CONSTRAINT [PK_hashtest] PRIMARY KEY CLUSTERED
    (
        [personId] ASC
    ))

INSERT INTO [dbo].[hashTest] ([firstName], [lastName])
VALUES ('David', 'Smith'),
       ('Bob', 'Stones')

If you check the table dbo.hashTest, you will see that it has four columns and two records, with the hashKey column set to NULL.
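You can verify this with a quick query:

SELECT personId, firstName, lastName, hashKey
FROM dbo.hashTest;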


[Figure: hashtableStart, dbo.hashTest with hashKey still NULL]


Create Stored Proc dbo.GetHashKey


Then run the following script to create the stored procedure that generates the hashKey.

USE [DemoSQL2012]
GO

CREATE PROCEDURE [dbo].[GetHashKey] ( @pSchemaName sysname,@pTableName sysname,@pDebug INT = NULL )
/************************************************************************************************************
Description: Populates the hashKey column of the target table indicated by @pTableName, using the SHA1 algorithm.
Here the hash column is assumed to be called hashKey.
The concatenated string passed to the HASHBYTES function has the format col1$col2$col3$.
Examples: exec [dbo].[GetHashKey] 'dbo','hashTest', 1 to see the concatenated string fed to HASHBYTES.
exec [dbo].[GetHashKey] 'dbo','hashTest', 2 to see the generated SQL that updates the hashKey column.
*************************************************************************************************************/
AS

BEGIN
SET NOCOUNT ON
DECLARE @SchemaName VARCHAR(MAX)
DECLARE @TableName VARCHAR(MAX)
DECLARE @HashSql VARCHAR(MAX)
DECLARE @Sql VARCHAR(MAX) = ''
DECLARE @CRLF char(2) = CHAR(13)+CHAR(10)

SET @SchemaName = @pSchemaName
SET @TableName = @pTableName

--- Build the concatenated-string expression for HASHBYTES from INFORMATION_SCHEMA metadata
SELECT @HashSql = COALESCE(@HashSql + '+ ', '')
+ 'ISNULL(CAST('+ QUOTENAME(COLUMN_NAME)+' as varchar),''~'')+''$'''+@CRLF
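-- Note: CAST(col AS varchar) without a length defaults to varchar(30),
-- so values longer than 30 characters are truncated before hashing;
-- widen the cast if that could cause missed changes (hash collisions).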
FROM INFORMATION_SCHEMA.COLUMNS
WHERE TABLE_NAME = @TableName
AND TABLE_SCHEMA = @SchemaName
AND column_name NOT LIKE 'hash%'


SET @sql = @sql + 'DECLARE @updateChunk int = 4000000;
WHILE 1=1
BEGIN
'

SET @sql = @sql + 'UPDATE Top(@updateChunk) '+ @SchemaName + '.' + @TableName + ' ' + @CRLF

SET @sql = @sql + 'SET hashKey = ' + @CRLF
SET @sql = @sql + 'HashBytes(''SHA1'','+ @HashSql + ')'+@CRLF

SET @sql = @sql + 'WHERE hashKey is null;
IF @@RowCount=0
BREAK;
END'

IF (@pDebug = 1)
BEGIN
DECLARE @sqlStr2Hash VARCHAR(MAX) = 'SELECT ' + @HashSql + ' as hashCol FROM ' + @SchemaName+'.' + @TableName;
EXEC (@sqlStr2Hash)
END
ELSE
BEGIN
IF (@pDebug = 2)
SELECT @sql;
ELSE
EXEC (@sql);
END

END

GO

Here we use the SHA1 algorithm for the HASHBYTES function. According to the article "Exploring Hash Functions in SQL Server", SHA1 is one of the reliable algorithms with the "best spread over the hash space". We use the metadata view INFORMATION_SCHEMA.COLUMNS to obtain all columns of the given table.
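For dbo.hashTest, for example, the script this generates boils down to the update below (with the chunking loop omitted); this is just the proc's own output written out, not extra code you need to run:

DECLARE @updateChunk int = 4000000;
UPDATE TOP (@updateChunk) dbo.hashTest
SET hashKey = HashBytes('SHA1',
      ISNULL(CAST([personId] as varchar),'~')+'$'
    + ISNULL(CAST([firstName] as varchar),'~')+'$'
    + ISNULL(CAST([lastName] as varchar),'~')+'$')
WHERE hashKey IS NULL;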


How to use Stored Proc dbo.GetHashKey?


To call this stored procedure we would execute it as follows:

exec [dbo].[GetHashKey] 'dbo','hashTest', 1

It will show you the concatenated string for each row.


[Figure: hashString, the concatenated string for each row]


If you call the stored procedure with @pDebug set to 2, you will get the SQL script that updates the hashKey column.

exec [dbo].[GetHashKey] 'dbo','hashTest', 2

[Figure: hashSql, the generated update script]


If you want to update the hashKey column, call the stored procedure as follows:

exec [dbo].[GetHashKey] 'dbo','hashTest'

Now you can see the table has its hashKey values populated:


[Figure: hashColUpdated, dbo.hashTest with hashKey populated]


If you would like to use this stored proc in SSIS packages, you can call it directly from an Execute SQL Task, passing the schema name and table name as parameters.
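For example, with an OLE DB connection manager, the task's SQLStatement can be set to the line below, with the two ? markers mapped to package variables that hold the schema and table names:

exec [dbo].[GetHashKey] ?, ?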


Wrap up


You can call the stored procedure above to speed up the calculation of hashKey columns for different tables when loading a data warehouse. The inputs are simply the schema name and table name. With the @pDebug parameter, it is easy to inspect your hash strings or the generated SQL scripts.
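As a rough sketch of where the hashKey fits into the incremental load itself, suppose a staging table dbo.hashTest_stage with the same columns has its hashKey already populated (the staging table name is made up for illustration, and personId is treated here as the business key shared by both tables):

-- Update only rows whose hash changed; unchanged rows are skipped entirely.
UPDATE t
SET t.firstName = s.firstName,
    t.lastName = s.lastName,
    t.hashKey = s.hashKey
FROM dbo.hashTest t
JOIN dbo.hashTest_stage s
    ON s.personId = t.personId
WHERE t.hashKey <> s.hashKey;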

Also, please keep in mind that the HASHBYTES function has an input limit of 8,000 bytes. For extremely wide tables, you can lean on SQL Server's CLR integration and perform the hashing in a .NET assembly.
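If you are unsure whether a table is at risk of hitting that limit, a rough metadata check like the sketch below can help. It assumes the casts in the stored proc have been widened beyond the varchar(30) default, treats (max) columns as 8,000 bytes, and approximates non-character columns at 30 characters:

SELECT TABLE_SCHEMA, TABLE_NAME,
    SUM(CASE WHEN CHARACTER_MAXIMUM_LENGTH = -1 THEN 8000 -- (max) types
             ELSE ISNULL(CHARACTER_MAXIMUM_LENGTH, 30)    -- non-character types, approximated
        END + 1) AS approxMaxBytes                        -- +1 per column for the '$' delimiter
FROM INFORMATION_SCHEMA.COLUMNS
WHERE TABLE_SCHEMA = 'dbo' AND TABLE_NAME = 'hashTest'
    AND COLUMN_NAME NOT LIKE 'hash%'
GROUP BY TABLE_SCHEMA, TABLE_NAME;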


Reference: HASHBYTES (Transact-SQL)

Monday, February 17, 2014

Conversion between SSIS Integers and SQL Server Strings

When you carry out a database migration, one of the main headaches is the conversion of data types between sources and destinations. If you are using SSIS, be careful about conversions between strings and integers. Sometimes SSIS will fail because the integer type is not big enough to hold the original data, which is known as an overflow problem. So here is a summary of which SSIS integer data type to use for a string column in the source table. Hope it helps!

SSIS     SQL Server   Range                                                                      Longest string that always converts
DT_UI1   tinyint      0 to 255                                                                   varchar(2)
DT_I2    smallint     -2^15 (-32,768) to 2^15-1 (32,767)                                         varchar(4)
DT_I4    int          -2^31 (-2,147,483,648) to 2^31-1 (2,147,483,647)                           varchar(9)
DT_I8    bigint       -2^63 (-9,223,372,036,854,775,808) to 2^63-1 (9,223,372,036,854,775,807)   varchar(18)

(Note that SQL Server's tinyint is unsigned, so it maps to DT_UI1 rather than DT_I1.)
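For example, int maps to varchar(9) rather than varchar(10) because a ten-character string can exceed int's maximum:

SELECT CAST('9999999999' AS int);
-- Fails with an overflow error: 9,999,999,999 exceeds int's maximum of 2,147,483,647.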
Note that for empty strings, you can use a Derived Column transformation in SSIS with the expression below, which converts an empty string into a NULL integer.
LEN(strCol) == 0 ? NULL(DT_I4) : (DT_I4)(RTRIM(strCol))

You can also refer to BOL: int, bigint, smallint, and tinyint (Transact-SQL)

Monday, February 10, 2014

Intro of Data Analysis in Healthcare for Triple Aim

Have you ever heard of the Triple Aim in healthcare? Even if not, you may have noticed that private and public entities around the world are investigating how to maximize the power of analytics to improve quality of care, operational efficiency, and billing reimbursement at a lower cost. The United States spent about 17.9% of its GDP on healthcare in 2011, according to a World Health Organization report.

With such a great opportunity, millions upon millions of healthcare records are waiting for analysts to turn them into useful insights and predictions for the Triple Aim. This presentation paints a big picture of what you can do with such huge volumes of complicated data. What can be turned into reports, predictions, and actions? Analyses of enrollment and emergency department (ED) visits demonstrate the aspects you can dig into.

Moreover, what is the structure of a claim? How do you use quality measures? The ED visits example gives you a peek at how to dig ED visits out of claims data.

Lastly, it explains common medical billing and coding standards, including ICD, CPT, and HCPCS, to help you get a good grasp of them.

Saturday, February 8, 2014

Recordings for 24 Hours of PASS: Business Analytics - 2014

If you would like to take a peek at the upcoming PASS Business Analytics Conference in San Jose, CA, May 7-9, 2014;

If you would like to learn more about how to use Power Query, PowerPivot, Power View, and Power Map to discover and demonstrate business insights;

If you would like to learn something about big data or predictive analysis/modeling;

here is a valuable resource that you should not miss:

Recordings for 24 Hours of PASS: Business Analytics - 2014

Thursday, February 6, 2014

Hidden Gem for Data Analysis: Data Profiling Task

Data analysis and cleansing are the most time-consuming parts of loading source data or preparing sample data for data mining. You can write T-SQL to find the distinct values or lengths of a column in the test data. However, when you have millions of rows or more than thirty columns to analyze, T-SQL scripts might take a whole day to get the answer. Since SQL Server 2008, SSIS has included the Data Profiling task, which conveniently helps analyze the content and structure of data and uncover patterns, inconsistencies, anomalies and even redundancies. Today we will give a brief demonstration of this hidden gem for data analysis.
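For comparison, below are the kinds of ad hoc T-SQL profiling queries the task replaces; the table and column names here are hypothetical:

-- Distinct lengths of a column
SELECT LEN(typeCode) AS valueLength, COUNT(*) AS rowCnt
FROM dbo.SourceTable
GROUP BY LEN(typeCode);

-- Value distribution of a column
SELECT typeCode, COUNT(*) AS rowCnt
FROM dbo.SourceTable
GROUP BY typeCode
ORDER BY rowCnt DESC;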

How to view the result with the Data Profile Viewer?

The Data Profile Viewer is used to view the profiles generated by the Data Profiling task in SSIS. Create a new package and drag the Data Profiling Task to the control flow pane as shown.

[Figure: dataProfilerTask2012, the Data Profiling Task in the control flow]

Double-click the task; a new Data Profiling Task Editor window pops up. Then click the "Open Profile Viewer" button.

[Figure: dataProfilerTask2012_openViewer, the Open Profile Viewer button]

Click the Open button in the upper left corner and locate the output profile file, which is in XML format.

[Figure: dataProfilerTask2012_openFile, opening the output profile XML file]

Types of profiles and statistics for analysis

Then you can explore the profile. The screenshot below shows only some of them, but there are eight kinds of profiles you can request:

  1. Column Length Distribution
  2. Column Null Ratio
  3. Column Pattern
  4. Column Statistics: min, max, mean, and standard deviation for each column.
  5. Column Value Distribution: values, counts, and percentages.
  6. Functional Dependency: whether a column is fully or partially dependent on other columns, to check for redundancy.
  7. Candidate Key: which columns might be good candidates for a primary key or business key.
  8. Value Inclusion: whether all values in a column exist in another table, such as a dimension or lookup table.

[Figure: dataProfiler_item, the available profile request types]

Demo

Let us take a look at the Column Value Distribution profile first. Click a column in the top pane, and the lower pane will display the Value, Count, and Percentage for that column. You can sort by value, count, or percentage by clicking the dropdown arrow at the right side of each item. For this column, you can easily tell that the top three type codes are M, P, and R.

[Figure: DistinctValue, the Column Value Distribution profile]

Note that it is not an exhaustive list of all the values. Here is an example for a drug code column: only three values are listed in the lower pane, but the number of distinct values is 48,720. So when a column has a large number of distinct values, and some values account for only a tiny percentage of records, the profile will not display those low-percentage values.

[Figure: DistinctValue_list, the drug code value distribution]

In summary, the data profiles generated by the SSIS Data Profiling task are a very convenient way to understand data, find patterns, derive data rules, and detect column outliers for data cleansing. You can even use data profiling to test foreign key relationships. If you would like to find out how to set up each kind of profile request, Jamie Thomson's SSIS: Data Profiling Task series is an awesome reference.

Reference: TechNet’s Data Profile Viewer

Monday, February 3, 2014

How to: Use Check Constraints to Enforce Business Rules

As a developer, you may often find that it is essential to ensure data integrity before the data gets into the database. Otherwise, you will face cleanup work, or a "garbage in, garbage out" situation, afterwards. Today we will address how to enforce business rules by using check constraints on columns.

Here's an example of a Member table with name and phone number columns. Let us add a check constraint on the phone number column that requires phone numbers to follow the format xxx-xxx-xxxx.

IF OBJECT_ID('[dbo].[Member]') IS NOT NULL
    DROP TABLE [dbo].[Member]
CREATE TABLE [dbo].[Member](
    MEMBERID INT IDENTITY NOT NULL,
    [Name] varchar(25) NOT NULL,
    [PhoneNumber] varchar(12) NULL
)
---- Insert records
INSERT INTO [dbo].[Member] VALUES
('Susan', '697-555-0142'),
('Kevin', '819-555-0175')

If you check the table, you can see it already has two records.

SELECT *
FROM [dbo].[Member]


Add a CHECK CONSTRAINT


Then we add a CHECK CONSTRAINT to help check the format of phone numbers.

ALTER TABLE [dbo].[Member]
ADD CONSTRAINT CK_Member_Phone
CHECK ([PhoneNumber] LIKE '[0-9][0-9][0-9]-[0-9][0-9][0-9]-[0-9][0-9][0-9][0-9]')
GO

If any existing records in the table violate the constraint being added, you will get a message like this:
Msg 547, Level 16, State 0, Line 1
The ALTER TABLE statement conflicted with the CHECK constraint "CK_Member_Phone". The conflict occurred in database "xxxx", table "dbo.Member", column 'PhoneNumber'.
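If you still need the constraint and plan to clean up the offending rows afterwards, one option is to add it WITH NOCHECK so that existing data is not validated. Use this with care: the constraint is then marked as not trusted and only checks new or updated data.

ALTER TABLE [dbo].[Member] WITH NOCHECK
ADD CONSTRAINT CK_Member_Phone
CHECK ([PhoneNumber] LIKE '[0-9][0-9][0-9]-[0-9][0-9][0-9]-[0-9][0-9][0-9][0-9]')
GO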




Test a CHECK CONSTRAINT


Next, you can test the check constraint by trying to insert a record whose phone number does not follow the format defined in the constraint CK_Member_Phone.

INSERT INTO [dbo].[Member] VALUES
('Charles', '212-555-0187'),
('Chris', '612555-0100')

An error message like the following will be returned. Note that because both rows belong to a single INSERT statement, the valid row for Charles is rejected along with the invalid one:
Msg 547, Level 16, State 0, Line 1
The INSERT statement conflicted with the CHECK constraint "CK_Member_Phone". The conflict occurred in database "DemoSQL2012", table "dbo.Member", column 'PhoneNumber'.

The statement has been terminated.