Tuesday, October 29, 2013

Best Practice on Bias and Variance in Machine Learning

As the CEO of Growth Science, Thomas Thurston, comments: “I think of data science as a way of thinking about the world in terms of hypotheses, testing, confidence, and error margins.” Machine learning involves selecting algorithms, sampling data, and setting parameters to optimize learning accuracy. Getting the most fruitful results means taking the right actions at the earliest stage. Here I will cover the relationships among bias, variance, regularization parameters, and the number of training samples.

More data means better results?

Not always. A learning curve plots training and cross validation error as a function of training set size. As the learning curve below shows, even when we increase the number of training samples, both the training error and the cross validation error remain very large.

Here we split the dataset into a 60% training set, a 20% cross validation set, and a 20% test set, as is typical practice. The learning algorithm finds parameters {theta} that minimize the training set error, and we then use these parameters {theta} to estimate the errors on the cross validation and test sets.

[Figure 1: training and cross validation error vs. the number of training examples m]
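
To see this pattern yourself, here is a minimal sketch in Python (assuming scikit-learn and matplotlib are installed; the noisy toy dataset is made up for illustration) that plots training and cross validation error against the number of training examples m:

import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import learning_curve

# Toy data: a noisy nonlinear target that a straight line cannot capture.
rng = np.random.RandomState(0)
X = np.sort(rng.uniform(-3, 3, 200)).reshape(-1, 1)
y = np.sin(X).ravel() + 0.3 * rng.randn(200)

# learning_curve refits the model on growing subsets and cross validates each fit.
sizes, train_scores, cv_scores = learning_curve(
    LinearRegression(), X, y,
    train_sizes=np.linspace(0.1, 1.0, 10),
    cv=5, scoring="neg_mean_squared_error")

plt.plot(sizes, -train_scores.mean(axis=1), label="training error")
plt.plot(sizes, -cv_scores.mean(axis=1), label="cross validation error")
plt.xlabel("number of training examples m")
plt.ylabel("mean squared error")
plt.legend()
plt.show()

With an underpowered model like this, both curves flatten out at a high error no matter how large m gets.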

Best Practice for High Bias

The poor performance on both the training and test sets suggests a high bias problem. Sometimes it is also called “underfitting the training data”, meaning that the learning model has not captured the information in the training set. Thus, adding more data won’t help. In order to improve both training and test set performance, here are actions to take:

  • Add more complex features (more degrees of freedom) to increase the complexity of the hypothesis.
  • Increase the degree d of the polynomial. This allows the hypothesis to fit the data more closely, improving both training and test set performance.
  • Decrease the regularization parameter lambda to allow the hypothesis to fit the data more closely (see the sketch after this list).
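
As a rough sketch of these fixes in Python (scikit-learn again, where Ridge's alpha plays the role of lambda; the toy data is made up), raising the polynomial degree while keeping the regularization small lets the hypothesis bend to the data:

import numpy as np
from sklearn.linear_model import Ridge
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

rng = np.random.RandomState(0)
X = np.sort(rng.uniform(-3, 3, 60)).reshape(-1, 1)
y = np.sin(X).ravel() + 0.3 * rng.randn(60)

# More polynomial degrees of freedom plus a small alpha (lambda)
# let the hypothesis follow the training data more closely.
for degree in (1, 4):
    model = make_pipeline(PolynomialFeatures(degree), Ridge(alpha=1e-3))
    model.fit(X, y)
    print(degree, model.score(X, y))  # training R^2 rises with the degree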

Figure 2 demonstrates that when we increase the degree of the polynomial in the regression, the model fits the training data better and the training set error is almost zero. However, the cross validation error is still very large. You may also notice the big jumps at the two ends of the fitted curve. This comes from the nature of high-degree polynomials, and it is a sign of high variance.

[Figure 2: degree-16 polynomial fit to the training data]

[Figure 3: training and cross validation error vs. m for the degree-16 fit]
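
You can reproduce the symptom with a short Python sketch along the same lines (degree 16, no regularization; the toy data is again made up): the training error comes out near zero while the cross validation error stays large.

import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

rng = np.random.RandomState(0)
X = np.sort(rng.uniform(-3, 3, 40)).reshape(-1, 1)
y = np.sin(X).ravel() + 0.3 * rng.randn(40)
X_train, X_cv, y_train, y_cv = train_test_split(X, y, test_size=0.4, random_state=0)

# Degree 16 with no regularization: the fit chases the training points.
model = make_pipeline(PolynomialFeatures(16), LinearRegression())
model.fit(X_train, y_train)
print("train MSE:", mean_squared_error(y_train, model.predict(X_train)))
print("cv MSE:   ", mean_squared_error(y_cv, model.predict(X_cv)))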

Best Practice for High Variance

High variance means the learning model is overfitting the training data. One sign of overfitting is a large gap between the training and test errors. Here are the actions that can be taken to help:

  1. Use fewer features or decrease the degree d of the polynomial. A feature selection technique may be useful here and can reduce the over-fitting of the estimator.
  2. Add more training data to reduce the effect of over-fitting.
  3. Increase regularization. Regularization is designed to prevent over-fitting; in a high-variance model, increasing it properly can lead to better results (see the sketch after this list).
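
Continuing the degree-16 sketch in Python (Ridge's alpha standing in for lambda; the candidate values are arbitrary), turning the regularization up shrinks the gap between training and cross validation error:

import numpy as np
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

rng = np.random.RandomState(0)
X = np.sort(rng.uniform(-3, 3, 40)).reshape(-1, 1)
y = np.sin(X).ravel() + 0.3 * rng.randn(40)
X_train, X_cv, y_train, y_cv = train_test_split(X, y, test_size=0.4, random_state=0)

# A larger alpha (lambda) penalizes large coefficients more heavily,
# trading a little training error for better generalization.
for alpha in (1e-6, 1e-2, 1.0, 100.0):
    model = make_pipeline(PolynomialFeatures(16), Ridge(alpha=alpha))
    model.fit(X_train, y_train)
    print(alpha,
          mean_squared_error(y_train, model.predict(X_train)),
          mean_squared_error(y_cv, model.predict(X_cv)))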

Bias-variance Tradeoff

As you can see, finding a model that balances bias and variance is one of the most important goals in machine learning. As Stanford professor Andrew Ng suggested in his Machine Learning course:

In practice, especially for small training sets, when you plot learning curves
to debug your algorithms, it is often helpful to average across multiple sets
of randomly selected examples to determine the training error and cross
validation error.

In other words, here are the two main steps:

  1. Learn parameters {theta} using the randomly chosen training set.
  2. Evaluate the parameters and choose the optimal regularization parameter lambda using the cross validation set (a sketch follows below).
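
Here is a minimal Python sketch of these two steps under the 60/20/20 split described earlier (scikit-learn's Ridge alpha stands in for lambda; the candidate values and toy data are made up):

import numpy as np
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

rng = np.random.RandomState(0)
X = rng.uniform(-3, 3, (200, 5))
y = X @ np.array([1.0, -2.0, 0.5, 0.0, 3.0]) + rng.randn(200)

# 60% training, 20% cross validation, 20% test.
X_train, X_rest, y_train, y_rest = train_test_split(X, y, test_size=0.4, random_state=0)
X_cv, X_test, y_cv, y_test = train_test_split(X_rest, y_rest, test_size=0.5, random_state=0)

best_alpha, best_cv_mse = None, float("inf")
for alpha in (0.01, 0.1, 1.0, 10.0, 100.0):
    model = Ridge(alpha=alpha).fit(X_train, y_train)        # step 1: learn theta
    cv_mse = mean_squared_error(y_cv, model.predict(X_cv))  # step 2: score on the CV set
    if cv_mse < best_cv_mse:
        best_alpha, best_cv_mse = alpha, cv_mse

# Estimate generalization error once, on the held-out test set.
final = Ridge(alpha=best_alpha).fit(X_train, y_train)
print(best_alpha, mean_squared_error(y_test, final.predict(X_test)))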

After that, evaluate the training and cross validation errors while trying out the following ways to find the bias-variance tradeoff:

  • Get more training examples
  • Reduce or increase the number of features or degrees of freedom
  • Decrease or increase the regularization parameter lambda

Hopefully this discussion of bias and variance gives you some sense of how to adjust the features and parameters of learning algorithms to get the best training results.

Tuesday, October 22, 2013

PASS Summit 2013: Connect, Share, Learn

Charlotte, NC -- October 15-18, 2013 was an unforgettable week in my life. There I joined the other 5,000 people at the world's largest, most focused, and most intensive conference for Microsoft SQL Server and BI professionals: PASS Summit 2013.

The weather in Charlotte was very nice, with the typical beautiful days of a harvest season. Inside the convention center there were welcoming smiles at every corner and enthusiastic people waiting in line to get into the next sessions.



What's my takeaway at PASS Summit?


It was truly my "Connect, Share, Learn" week with the SQL Server community. 

Besides the sessions that expanded my knowledge, networking with SQL Server professionals was the main takeaway for me. I met many well-known SQL Server experts and industry leaders I am always learning from, whether through their talks or their articles. I also made lots of friends who are just as enthusiastic about data and BI. Every day you could learn something new from the people you talked to: at lunch, at hands-on labs, at sponsor events, and in the Exhibitor Hall. Lunch on each conference day was a great chance to connect with your peers, your local chapters, and SQL Server professionals with similar backgrounds.

I truly appreciate the organizers, speakers, volunteers, exhibitors, and all other people behind the scenes for making PASS Summit 2013 an amazing experience! Also great thanks to my boss for providing this awesome opportunity for me to reflect and recharge!

What's going on at PASS Summit?


For those who have never been to any PASS Summit, below is what I'd like to share with you. Here are the numbers:
  • 3 packed days with 190+ technical sessions and 6 deep-dive Half-Day Sessions shared by 150+ experts.
  • 2 days of pre-conferences with 14 sessions.
  • 2 days of keynotes to keep you informed about PASS news and the "Next Big Thing", including SQL Server 2014 In-Memory OLTP and Power BI.
  • 3 luncheons to suit different needs for learning and networking, including:
    • PASS Chapter luncheon
    • Women In Technology Luncheon
    • Birds of a Feather Luncheon
  • SQL Server Clinic for help from Microsoft experts.
  • Instructor-led hands-on workshops and Microsoft on-site certification testing.
  • 6 tracks that covered: 
    • Application & Database Development
    • BI Information Delivery
    • BI Platform Architecture
    • Cloud Application Development & Deployment 
    • Enterprise Database Administration & Deployment
    • Professional Development

What's next?


If you would like to catch any part of PASS Summit 2013, go to PASSTV. The highlights are the Day 1 and Day 2 keynotes. (To me, the demos were very impressive and the presentations amazing.) Also, my dear SQL/BI ladies, please don't miss Women in Technology: Panel Discussion.

Next, continue learning with top technical sessions at PASSTV. You can take a peek at the list of sessions (in PDF). Then you can decide whether to get the complete set of PASS Summit 2013 session recordings to review anytime you want.

Last but not least, don't forget to check out the PASS homepage for events near you that match your interests, including SQL Saturdays and local and Virtual PASS Chapters.

Now I am looking forward to the next PASS Summit in Seattle. How about you?

Tuesday, October 15, 2013

How to: Calculate Current Age of Members

Age is an important measure in healthcare. We need to calculate the right age to differentiate elderly from non-elderly enrolled members. Moreover, age is an essential part of the conditions in HEDIS measures. However, it is very easy to get it skewed by birthdays or special cases such as leap years. Here you can find solutions using TSQL or SSIS.


Before we get to the right answers, what do you think of the following two solutions?
  • DATEDIFF(“YYYY”, BirthDate, GETDATE())
  • DATEDIFF(“DD”,BirthDate, GETDATE()) / 365

At first glance, they may look right. Now suppose a child was born on 2012-11-15. When you run the following TSQL to get his age:

select DATEDIFF(YY, '2012-11-15',  '2013-10-15')

You will get 1 as the result. The age has been rounded up to one even though the first birthday has not arrived yet, because DATEDIFF(YY, ...) simply counts the calendar-year boundaries crossed between the two dates.

TSQL Solution 

In the case that you need to derive a column for the members' age, use the following:
-- Subtract one year when this year's birthday has not arrived yet
Case
    when MONTH([MEMBER_BIRTH_DATE]) > MONTH(CURRENT_TIMESTAMP)
      or (MONTH([MEMBER_BIRTH_DATE]) = MONTH(CURRENT_TIMESTAMP)
          and DAY([MEMBER_BIRTH_DATE]) > DAY(CURRENT_TIMESTAMP))
    then DATEDIFF(YY, [MEMBER_BIRTH_DATE], CURRENT_TIMESTAMP) - 1
    else DATEDIFF(YY, [MEMBER_BIRTH_DATE], CURRENT_TIMESTAMP)
end as MEMBER_AGE
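
If you want to sanity-check the logic outside the database, here is a quick sketch of the same rule in Python (the sample dates are made up); it also shows where both naive formulas go wrong:

from datetime import date

def age(birth, today):
    # Subtract one year when this year's birthday has not arrived yet,
    # mirroring the CASE expression above.
    before_birthday = (today.month, today.day) < (birth.month, birth.day)
    return today.year - birth.year - before_birthday

birth, today = date(2012, 11, 15), date(2013, 10, 15)
print(age(birth, today))            # 0 -- the first birthday has not arrived
print(today.year - birth.year)      # 1 -- the DATEDIFF(YY, ...) mistake
print((date(2003, 12, 31) - date(2000, 1, 1)).days // 365)  # 4, but the true age is 3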

SSIS Solution

The SSIS expression for a Derived Column task is:
DATEDIFF("Year", MEMBER_BIRTH_DATE, GETDATE())
    - ((MONTH(MEMBER_BIRTH_DATE) == MONTH(GETDATE()) && DAY(MEMBER_BIRTH_DATE) > DAY(GETDATE()))
       || MONTH(MEMBER_BIRTH_DATE) > MONTH(GETDATE()) ? 1 : 0)

Isn't it easy?

Tuesday, October 8, 2013

How to: Check Transaction Isolation Levels

Setting the right transaction isolation level for SQL Server is essential to maintaining consistency while optimizing concurrency. Below are two quick tips on how to check transaction isolation levels.


How to check optimistic concurrency control level?


Since SQL Server 2005, optimistic concurrency control has avoided blocking through row-versioning techniques. There are two kinds of optimistic isolation: SNAPSHOT and READ COMMITTED SNAPSHOT. They can only be set at the database level, using the ALTER DATABASE command. The query below shows whether either of them is turned on.
---- To check SNAPSHOT_ISOLATION_STATE and READ_COMMITTED_SNAPSHOT for the database 
SELECT name as database_name
,snapshot_isolation_state
,snapshot_isolation_state_desc
,is_read_committed_snapshot_on
FROM sys.databases
WHERE name = N'DATABASE_NAME';
GO

How to check transaction isolation levels for sessions?


There are two ways to check transaction isolation levels for sessions.

1) Run the DBCC USEROPTIONS command to get the transaction isolation level for the current session. Make sure that you have the right database selected for the session first.
    The possible isolation values for this command are:
  • read uncommitted 
  • read committed 
  • repeatable read 
  • serializable
  • read committed snapshot
  • snapshot 

2) Use sys.dm_exec_sessions to get the transaction isolation level not only for the current session but also for other sessions, by session ID.
--- To check isolation levels for other connections
SELECT CASE transaction_isolation_level
WHEN 0 THEN 'Unspecified'
WHEN 1 THEN 'ReadUncommitted'
WHEN 2 THEN 'ReadCommitted'
WHEN 3 THEN 'Repeatable'
WHEN 4 THEN 'Serializable'
WHEN 5 THEN 'Snapshot' END AS TRANSACTION_ISOLATION_LEVEL
FROM sys.dm_exec_sessions
where session_id = @@SPID;
---- or substitute any other session_id to inspect a different connection
Here @@SPID returns the session ID of the current user process.
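
If you ever need this check from application code rather than SSMS, here is a rough sketch using pyodbc (the connection string is just a placeholder to replace with your own server and credentials):

import pyodbc

# Placeholder connection string -- substitute your own server and database.
conn = pyodbc.connect(
    "DRIVER={SQL Server};SERVER=localhost;DATABASE=master;Trusted_Connection=yes;")
cursor = conn.cursor()

# The same decode of transaction_isolation_level as the query above.
cursor.execute("""
    SELECT CASE transaction_isolation_level
           WHEN 0 THEN 'Unspecified'
           WHEN 1 THEN 'ReadUncommitted'
           WHEN 2 THEN 'ReadCommitted'
           WHEN 3 THEN 'Repeatable'
           WHEN 4 THEN 'Serializable'
           WHEN 5 THEN 'Snapshot' END
    FROM sys.dm_exec_sessions
    WHERE session_id = @@SPID;""")
print(cursor.fetchone()[0])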

Look out for READ_COMMITTED_SNAPSHOT!



The above two methods give the same results for all transaction isolation levels EXCEPT READ_COMMITTED_SNAPSHOT: sys.dm_exec_sessions will give you "ReadCommitted", while the DBCC USEROPTIONS command will report "read committed snapshot" as the isolation level. Here is the screenshot:

DBCC useroptions

Set Option                  Value
--------------------------- --------------
textsize                    2147483647
language                    us_english
dateformat                  mdy
datefirst                   7
lock_timeout                -1
quoted_identifier           SET
arithabort                  SET
ansi_null_dflt_on           SET
ansi_warnings               SET
ansi_padding                SET
ansi_nulls                  SET
concat_null_yields_null     SET
isolation level             read committed snapshot

At the bottom you can see the isolation level is "read committed snapshot".