Pages

Friday, November 29, 2013

Quotes about Making Predictions / Data Mining

As the saying “prediction is very difficult – especially of the future” is one of golden rules among economists, below are some of the quotes about making predictions with data:  crystal-ball-300x240

  • “The art of prophecy is very difficult, especially about the future.”Mark Twain
  • “Any time there's a 50% chance of getting something right, there's a 90% chance you'll get it wrong.”Yogi Berra
  • “No one can possibly know what is about to happen: it is happening, each time, for the first time, for the only time.”James Baldwin
  • “Occurrences in this domain are beyond the reach of exact prediction because of the variety of factors in operation, not because of any lack of order in nature.”Albert Einstein
  • “The best way to predict your future is to create it.”Abraham Lincoln
  • “Prediction is very difficult, especially about the future.”Niels Bohr
  • “To write it, it took three months; to conceive it three minutes; to collect the data in it all my life. ”F. Scott Fitzgerald, American author
  • “Where is the Life we have lost in living?
    Where is the wisdom we have lost in knowledge?
    Where is the knowledge we have lost in information?”
    Thomas Stearns Eliot, poet, 1888-1965
  • “Errors using inadequate data are much less than those using no data at all.”Charles Babbage, British Mathematician and Inventor, 1791-1871
  • “The goal is to transform data into information, and information into insight.” — Carly Fiorni, former president of Hewlett Packard
  • “Big data is mostly about taking numbers and using those numbers to make predictions about the future. The bigger the data set you have, the more accurate the predictions about the future will be.”Anthony Goldbloom, the founder and CEO of Kaggle
  • “The world is now awash in data and we can see consumers in a lot clearer ways.”Max Levchin, PayPal co-founder.
  • “Data is a precious thing and will last longer than the systems themselves.”Tim Berners-Lee, inventor of the World Wide Web.
  • Drowning in Data yet Starving for Knowledge” — Anonymous
Image courtesy of KATIE DRUMMOND.

Sunday, November 24, 2013

Script Task: Find the Day of Week in SSIS

Problem

How to find out the day of week in SSIS to achieve the same result as DATENAME( dw, datecolumn ) as in TSQL? For example, we need to find out “2013-11-24” is Sunday or Saturday.

Solution

Suppose FROM_SERVICE_DATE is the date column to derive the info about day of week. Below are two options. Using “Script Task” seems more complicate than directly using “Derived Column”. But it is more readable, isn’t it? Moreover, from this example, you can learn to use “Script Task” to handle multiple CASE situations.

Using “Derived Column”

Below is the expression to input in “Derived Column” task by using nested Conditional Operator in SSIS.

DATEPART("dw",FROM_SERVICE_DATE)==1?"Sunday":DATEPART("dw",FROM_SERVICE_DATE)==2?"Monday":DATEPART("dw",FROM_SERVICE_DATE)==3?"Tuesday":DATEPART("dw",FROM_SERVICE_DATE)==4?"Wednesday":DATEPART("dw",FROM_SERVICE_DATE)==5?"Thursday":DATEPART("dw",FROM_SERVICE_DATE)==6?"Friday":DATEPART("dw",FROM_SERVICE_DATE)==7?"Saturday":""
 
Using “Script Task”

Step 1: Derive a new column with "dayNoOfWeek” as Name for the numeric number of day of week, such as 1st or 6th of the week.
SSIS_ScriptTask_WeekofDay_1_InputColumn
Step 2: Drag and drop Script Component task to the Data Flow Task (DFT) pane, you will be prompted to select script component type as shown below. Check “Transformation”. Then connect it with the above Derived Column Task.


SSIS_ScriptTask_WeekofDay_0_scriptTrans
Step 3: Under “Input Column” page, check “dayNoOfWeek” as shown below.

SSIS_ScriptTask_WeekofDay_2_InputColumn
Step 4:
Under “Inputs and Outputs” page, expand “Output0” and select “Output Columns”. Click “Add Column” button. A new column will appear. You can rename it to “NameOfWeekDay”.
Next, modify the column properties on the right. Set the new column to be a string with length of 10.


SSIS_ScriptTask_WeekOfDay_3a
Now go back to “Script” page on the top, select the “ScriptLanguage” to be C# related. Then click “Edit Script” button at the low right corner.


SSIS_ScriptTask_WeekOfDay_Script2
Enter the following codes in the Sub function Input0_ProcessInputRow.

public override void Input0_ProcessInputRow(Input0Buffer Row)
    {

        switch (Row.dayNoOfWeek) {             
            case 1: Row.NameOfWeekDay = "Sunday";       break;
            case 2: Row.NameOfWeekDay = "Monday";       break;
            case 3: Row.NameOfWeekDay = "Tuesday";      break;
            case 4: Row.NameOfWeekDay = "Wednesday";    break;
            case 5: Row.NameOfWeekDay = "Thursday";     break;
            case 6: Row.NameOfWeekDay = "Friday";       break;
            case 7: Row.NameOfWeekDay = "Saturday";     break;
        }
   }

If you prefer to Visual Basic as your script language, add the following code in the sub function of Input0_ProcessInputRow.       

SELECT CASE Row.dayNoOfWeek
Case 1
Row.NameOfWeekDay = "Sunday"
Case 2
Row.NameOfWeekDay = "Monday"
Case 3
Row.NameOfWeekDay = "Tuesday"
Case 4
Row.NameOfWeekDay = "Wednesday"
Case 5
Row.NameOfWeekDay = "Thursday"
Case 6
Row.NameOfWeekDay = "Friday"
Case 7
Row.NameOfWeekDay = "Saturday"
END SELECT
Next, build and close the Script Edit window.

Tuesday, November 19, 2013

Is Truncate DDL or DML?

Recently as I referred to the book of Microsoft SQL Server 2012 T-SQL Fundamentals by Ben-Gan, Itzik for DML definitions, there is a conclusion about TRUNCATE statement that surprised me at first:
"Another common misunderstanding is that TRUNCATE is a DDL statement, but in fact it is a DML statement."
Yes. I had always had that “misunderstanding” to regard TRUNCATE as DDL because TRUNCATE statement removes the data by deallocating pages used to store the table and records only the page deallocations in the transaction log. It is not simply row removal as DELETE statement.

After further discussion with the author (I really appreciated that Itzik spent time to help me understand his conclusion.), it turned out TRUNCATE is listed under Data Manipulation Language (DML) in the standard document ISO_9075_02_Foundation_2011_E. Also, the Standard ISO/ANSI SQL deals with only the logical aspects of the code, and logically TRUNCATE statement deletes all rows from a table.

At the same time, Microsoft listed TRUNCATE TABLE as Data Definition Language (DDL) Statements together with ALTER, CREATE, DISABLE/ENABLE TRIGGER, DROP, and UPDATE STATISTICS.

Isn’t it interesting to know about that?

Wednesday, November 13, 2013

How to: Set Row Versioning-based Transaction Isolation Levels


Row versioning-based transaction isolation levels were introduced since SQL Server 2005. They use row versioning for concurrency controls instead of locks. Therefore they are very attractive because no more blocking issues with usage of row versioning. However, since row versioning-based isolation levels are related to both database-level settings and transaction-level settings, it is very easy to get confused about how to set it right. I hope you will get a clear idea after reading this.


Two types, three commands?


There are two types of transaction isolation levels based on row versioning. One is READ COMMITTED using row versioning. (It is also known as READ COMMITTED SNAPSHOT isolation or RCSI. ) The other is SNAPSHOT isolation. The difference between the two is the "cutoff" time for the latest row versions they will use.  RCSI uses latest row versions when the statement starts while SNAPSHOT uses the latest committed row when the transaction starts.
On the other hand, there are three commands involved in order to set RCSI and SNAPSHOT isolation levels:
  1. ALTER DATABASE <current_database> SET READ_COMMITTED_SNAPSHOT ON;
  2. ALTER DATABASE <current_database>  SET ALLOW_SNAPSHOT_ISOLATION ON;
  3. SET TRANSACTION ISOLATION LEVEL SNAPSHOT
The first two commands are database-level settings and can be turned on at the same time. So the question is:

 

What is the right combination?


First of all, let us do some experiments. Create a table with two rows:

create table tranTest (id int primary key, price int)
insert tranTest values (1, 10), (2, 20) 

Suppose both ALLOW_SNAPSHOT_ISOLATION and READ_COMMITTED_SNAPSHOT are set OFF. You can double check the isolation levels by following the steps in How to Check Transaction Isolation Levels.

Experiment 1: ALLOW_SNAPSHOT_ISOLATION is OFF with READ_COMMITTED_SNAPSHOT is ON;

Run the following code in session 1:
ALTER DATABASE DemoSQL2012 SET ALLOW_SNAPSHOT_ISOLATION ON;
Next, run the following code in session 2:
SET TRANSACTION ISOLATION LEVEL SNAPSHOT

Ok, it seems fine with no error message appears. 
However, when you begin a new transaction in the same session that you have tried setting SNAPSHOT isolation, it will give the following error:


Msg 3952, Level 16, State 1, Line 3
Snapshot isolation transaction failed accessing database 'xxx' because snapshot isolation is not allowed in this database. Use ALTER DATABASE to allow snapshot isolation.

 

That means that in order to start a SNAPSHOT transaction, you must have ALLOW_SNAPSHOT_ISOLATION to be turned on. Only when ALLOW_SNAPSHOT_ISOLATION database option is set ON, the support for snapshot isolation transaction is activated. Moreover, the database engine does not generate row versions for modified data until all active transactions that have data modified complete. Interestingly, it does not matter whether READ_COMMITTED_SNAPSHOT is ON or OFF.
After you set ALLOW_SNAPSHOT_ISOLATION option on, run the following codes:

---- EXPERIMENT for SNAPSHOT ISOLATION
-- Connection 1
BEGIN TRAN;

  SELECT id, price
  FROM dbo.tranTest
  WHERE id = 2;
  UPDATE dbo.tranTest
    SET price += 1.00
  WHERE id = 2;

  SELECT id, price
  FROM dbo.tranTest
  WHERE id = 2;

-- Connection 2
SET TRANSACTION ISOLATION LEVEL SNAPSHOT;

BEGIN TRAN;

  SELECT id, price
  FROM dbo.tranTest
  WHERE id = 2;
-- 20
-- Connection 1
COMMIT TRAN;

-- Connection 2
  SELECT id, price
  FROM dbo.tranTest
  WHERE id = 2;
--- 20
COMMIT TRAN;

BEGIN TRAN;

  SELECT id, price
  FROM dbo.tranTest
  WHERE id = 2;
--- 21
COMMIT TRAN;

-- Connection 1
-- Cleanup
UPDATE dbo.tranTest
  SET price = 20.00
WHERE id = 2;


Notice that in session 2, we got same results for reading in the first transaction but a different result for the later transaction. That proves that SNAPSHOT isolation level always read the last committed row version when the transaction starts. The cutoff for the row to be read is when transaction starts.


Experiment 2: ALLOW_SNAPSHOT_ISOLATION is ON with READ_COMMITTED_SNAPSHOT is ON;


In order to enable the database option for READ_COMMITTED_SNAPSHOT, please only have one active session connected to the related database. Then run the command:

ALTER DATABASE DemoSQL2012 SET READ_COMMITTED_SNAPSHOT ON;


Then test the isolation level by running these codes:
--- EXPERIMENT FOR READ_COMMITTED_SNAPSHOT
-- Connection 1
BEGIN TRAN;

  SELECT id, price
  FROM dbo.tranTest
  WHERE id = 2;

  UPDATE dbo.tranTest
    SET price += 1.00
  WHERE id = 2;

  SELECT id, price
  FROM dbo.tranTest
  WHERE id = 2;

-- Connection 2
BEGIN TRAN;

  SELECT id, price
  FROM dbo.tranTest
  WHERE id = 2;
-- 20
-- Connection 1
COMMIT TRAN;

-- Connection 2
  SELECT id, price
  FROM dbo.tranTest
  WHERE id = 2;
--- 21
COMMIT TRAN;

-- Connection 1
-- Cleanup
UPDATE dbo.tranTest
  SET price = 20.00
WHERE id = 2;


If you get the same results in session 2 as shown above, congratulations! The difference in the results demonstrates that READ_COMMITTED_SNAPSHOT always get the last committed row version right before the statement starts.

Conclusion


In order set SNAPSHOT transaction isolation, you need to first turn on the ALLOW_SNAPSHOT_ISOLATION option at database level. Then at the related session, run "SET TRANSACTION ISOLATION LEVEL SNAPSHOT" to set isolation levels at transaction level.

On the other hand, for RCSI, you only need to have READ_COMMITTED_SNAPSHOT turned on at database level.