Showing posts with label Data Mining.

Sunday, March 16, 2014

SSAS error msg: Unable to retrieve data mining algorithms

Problem

When you try to create a new mining structure in SSAS, you may see a warning message in the window, as shown below:

ErrorMsgSetup_thumb1

“Unable to retrieve a list of supported data mining algorithms. Make sure you are connected to the correct Analysis Services server instance and the Analysis Services server named localhost is running and configured properly. You can continue with a default list of data mining algorithms.”

Solution:

Step 1: Check that the Analysis Services service has been started in SQL Server Configuration Manager.

SSAS_serviceStarted_thumb2

Step 2: Check the project's deployment properties to make sure the target server has been set correctly.

SSAS_deployServer_thumb2

Monday, February 10, 2014

Intro to Data Analysis in Healthcare for the Triple Aim

Have you ever heard of the Triple Aim in healthcare? Even if not, you may have noticed that private and public entities around the world are investigating how to maximize the power of analytics to improve quality of care, operational efficiency, and billing reimbursement at a lower cost. The United States spent about 17.9% of GDP on healthcare in 2011, according to the World Health Organization.

With such a great opportunity, vast volumes of healthcare data are waiting for analysts to turn them into useful insights and predictions for the Triple Aim. This presentation paints a big picture of what you can do with this huge volume of complicated data. What can be turned into reports, predictions, and actions? Analyses of enrollment and emergency department (ED) visits demonstrate the aspects you can dig into.

Moreover, what is the structure of a claim? How do you use quality measures? An example of ED visits gives you a peek at how to dig ED visits out of claims.

Lastly, it explains common medical billing and coding systems, including ICD, CPT, and HCPCS, to help you get a good grasp of these coding standards.

Thursday, February 6, 2014

Hidden Gem for Data Analysis: Data Profiling Task

Data analysis and cleansing are the most time-consuming parts of loading source data or preparing sample data for data mining. You can write T-SQL to find the distinct values or lengths of a column that exist in the test data. However, when you have millions of rows or more than thirty columns to analyze, T-SQL scripts might take a whole day to get the answer. Since SQL Server 2008, SSIS has included the Data Profiling task, which conveniently helps analyze the content and structure of data and uncover patterns, inconsistencies, anomalies, and even redundancies. Today we will give a brief demonstration of this hidden gem for data analysis.

How to view the results with the Data Profile Viewer

The Data Profile Viewer is used to view the profiles generated by the Data Profiling task in SSIS. Create a new package and drag the Data Profiling Task onto the control flow pane, as shown.

dataProfilerTask2012

Double-click the task to open the Data Profiling Task Editor window, and then click the “Open Profile Viewer” button.

dataProfilerTask2012_openViewer

Click the Open button in the upper-left corner and locate the output profile file, which is in XML format.

dataProfilerTask2012_openFile

Types of profiles and statistics for analysis

Then you can explore the profiles. Only four kinds are shown below, but there are eight kinds of profiles you can request (a rough pandas sketch after the list illustrates what several of them compute):

  1. Column Length Distribution
  2. Column Null Ratio
  3. Column Pattern
  4. Column Statistics: min, max, mean, and standard deviation for each column.
  5. Column Value Distribution: values, counts, and percentages.
  6. Functional Dependency: whether a column is fully or partially dependent on other columns, to check for redundancy.
  7. Candidate Key: checks which column might be a good candidate for a primary key or business key.
  8. Value Inclusion: checks whether all values in a column exist in another table, such as a dimension or lookup table.
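
Under the hood these profiles boil down to familiar summary statistics. As a rough, hedged illustration of what several of them compute (using pandas on a made-up sample table outside of SSIS; the column names and values are assumptions purely for illustration):

import pandas as pd

# Hypothetical sample table, purely for illustration
df = pd.DataFrame({
    "type_code": ["M", "P", "R", "M", "M", None],
    "drug_code": ["00093", "00781", "00093", "55111", "00093", "00781"],
})

# Column Null Ratio: share of NULLs in each column
print(df.isna().mean())

# Column Length Distribution: distinct string lengths in a column
print(df["drug_code"].dropna().str.len().value_counts())

# Column Value Distribution: values and their percentages
print(df["type_code"].value_counts(normalize=True))

# Candidate Key: ratio of distinct values to rows (close to 1 suggests a key)
print(df["drug_code"].nunique() / len(df))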

dataProfiler_item

Demo

Let us take a look at the Column Value Distribution profile first. Click a column in the top pane, and the lower pane will display the Value, Count, and Percentage for that column. You can sort by value, count, or percentage by clicking the dropdown arrow at the right side of each column header. For this column, you can easily tell that the top three type codes are M, P, and R.

DistinctValue

Note that it won’t be an exhaustive list of all the values. Here is an example for a drug code column: only three values are listed in the lower pane, but the number of distinct values is 48,720. So when a column has many distinct values and some of them appear in only a tiny percentage of records, the profile won’t display those small-percentage values.

DistinctValue_list

In summary, the data profiles generated by the SSIS Data Profiling task are a very convenient way to understand data, find patterns, derive data rules, and detect column outliers for data cleansing. You can even perform data profiling to test foreign key relationships. If you would like to find out how to set up each kind of profile request, Jamie Thomson’s SSIS: Data Profiling Task series is an awesome reference.

Reference: TechNet’s Data Profile Viewer

Friday, November 29, 2013

Quotes about Making Predictions / Data Mining

Since the saying “prediction is very difficult, especially about the future” is one of the golden rules among economists, below are some quotes about making predictions with data:

crystal-ball-300x240

  • “The art of prophecy is very difficult, especially about the future.” — Mark Twain
  • “Any time there's a 50% chance of getting something right, there's a 90% chance you'll get it wrong.” — Yogi Berra
  • “No one can possibly know what is about to happen: it is happening, each time, for the first time, for the only time.” — James Baldwin
  • “Occurrences in this domain are beyond the reach of exact prediction because of the variety of factors in operation, not because of any lack of order in nature.” — Albert Einstein
  • “The best way to predict your future is to create it.” — Abraham Lincoln
  • “Prediction is very difficult, especially about the future.” — Niels Bohr
  • “To write it, it took three months; to conceive it, three minutes; to collect the data in it, all my life.” — F. Scott Fitzgerald, American author
  • “Where is the Life we have lost in living?
    Where is the wisdom we have lost in knowledge?
    Where is the knowledge we have lost in information?”
    — Thomas Stearns Eliot, poet, 1888–1965
  • “Errors using inadequate data are much less than those using no data at all.” — Charles Babbage, British mathematician and inventor, 1791–1871
  • “The goal is to transform data into information, and information into insight.” — Carly Fiorina, former president of Hewlett-Packard
  • “Big data is mostly about taking numbers and using those numbers to make predictions about the future. The bigger the data set you have, the more accurate the predictions about the future will be.” — Anthony Goldbloom, founder and CEO of Kaggle
  • “The world is now awash in data and we can see consumers in a lot clearer ways.” — Max Levchin, PayPal co-founder
  • “Data is a precious thing and will last longer than the systems themselves.” — Tim Berners-Lee, inventor of the World Wide Web
  • “Drowning in Data yet Starving for Knowledge.” — Anonymous
Image courtesy of KATIE DRUMMOND.

Tuesday, October 29, 2013

Best Practice on Bias and Variance in Machine Learning

As Thomas Thurston, CEO of Growth Science, comments: “I think of data science as a way of thinking about the world in terms of hypotheses, testing, confidence, and error margins.” Machine learning involves selecting algorithms, sampling data, and setting parameters to optimize learning accuracy. Getting the most fruitful results means taking the right actions at the earliest stage. Here I will cover the relationship among bias, variance, regularization parameters, and the number of training samples.

Does more data mean better results?

Not always. As the learning curve below shows, even as we increase the number of training samples, both the training error and the cross validation error remain very large. A learning curve plots training and cross validation error as a function of training set size.

Here we split the dataset into 60% training, 20% validation, and 20% test sets, as is typical practice. The learning algorithm finds parameters {theta} that minimize the training set error, and these parameters {theta} are then used to estimate the error on the cross validation and test sets.

 error Vs m
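
A hedged sketch of how a learning curve like this could be produced with scikit-learn; the linear model and the synthetic data below are assumptions purely for illustration, not the exact setup behind the figure:

import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Synthetic data, purely for illustration
rng = np.random.RandomState(0)
X = rng.uniform(-3, 3, size=(200, 1))
y = np.sin(X).ravel() + rng.normal(scale=0.3, size=200)

# 60% training, 20% cross validation, 20% test
X_train, X_rest, y_train, y_rest = train_test_split(X, y, test_size=0.4, random_state=0)
X_cv, X_test, y_cv, y_test = train_test_split(X_rest, y_rest, test_size=0.5, random_state=0)

# Train on the first m examples and record training and cross validation error
sizes = list(range(2, len(X_train) + 1, 10))
train_err, cv_err = [], []
for m in sizes:
    model = LinearRegression().fit(X_train[:m], y_train[:m])
    train_err.append(mean_squared_error(y_train[:m], model.predict(X_train[:m])))
    cv_err.append(mean_squared_error(y_cv, model.predict(X_cv)))

plt.plot(sizes, train_err, label="training error")
plt.plot(sizes, cv_err, label="cross validation error")
plt.xlabel("number of training examples m")
plt.ylabel("mean squared error")
plt.legend()
plt.show()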

Best practice for high bias

The poor performance on both the training and test sets suggests a high bias problem. This is sometimes called “underfitting the training data”, meaning that the learning model has not captured the information in the training set. Thus, adding more data won’t help. To improve both training and test set performance, here are actions to take (a rough sketch follows the list):

  • Add more complex features (more degrees of freedom) to increase the complexity of the hypothesis.
  • Increase the degree d of the polynomial. This allows the hypothesis to fit the data more closely, improving both training and test set performance.
  • Decrease the regularization parameter lambda to allow the hypothesis to fit the data more closely.
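
As a rough illustration of the last two points, here is a hedged scikit-learn sketch on made-up data; alpha below stands in for the regularization parameter lambda:

import numpy as np
from sklearn.linear_model import Ridge
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

# Synthetic data, purely for illustration
rng = np.random.RandomState(0)
X = rng.uniform(-3, 3, size=(100, 1))
y = np.sin(X).ravel() + rng.normal(scale=0.3, size=100)

# Raising the polynomial degree d and lowering the regularization strength
# both make the hypothesis more flexible, so the training fit improves
for degree, alpha in [(1, 1.0), (4, 1.0), (4, 0.01)]:
    model = make_pipeline(PolynomialFeatures(degree), Ridge(alpha=alpha)).fit(X, y)
    print(degree, alpha, model.score(X, y))  # training R^2 rises with flexibility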

Figure 2 demonstrates that when we increase the degree of the polynomial in the regression, the model fits the training data better and the training error is almost zero. However, the cross validation error is still very large. Also, you may notice the big jumps at the two ends of the fitted curve. This comes from the nature of high-degree polynomials and is a sign of high variance.

fitPoly16

errorVsMPoly16

Best Practice for High Variance

High variance means the learning model is overfitting the training data. One sign of overfitting is a large gap between the training and test errors. Here are actions that can help:

  1. Use fewer features or decrease the degree d of the polynomial. A feature selection technique may be useful and can decrease the overfitting of the estimator.
  2. Add more training data to reduce the effect of overfitting.
  3. Increase regularization. Regularization is designed to prevent overfitting, and in a high-variance model, increasing it appropriately can lead to better results.

Bias-variance Tradeoff

So as you can see, finding a model that balances bias and variance is one of the most important goals in machine learning. As Stanford professor Andrew Ng suggests in his Machine Learning course:

In practice, especially for small training sets, when you plot learning curves
to debug your algorithms, it is often helpful to average across multiple sets
of randomly selected examples to determine the training error and cross
validation error.

In other words, here are the two main steps:

  1. Learn parameters {theta} using the randomly chosen training set.
  2. Evaluate the parameters and choose the optimal regularization parameter lambda using the cross validation dataset.

After that, evaluate the training and cross validation error while trying out the following ways to find the bias-variance tradeoff (a sketch of sweeping the regularization parameter follows the list):

  • Get more training examples
  • Reduce or increase the number of features or degrees of freedom
  • Decrease or increase the regularization parameter lambda
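
For instance, here is a hedged sketch of step 2 above: sweep the regularization parameter and keep the value with the lowest cross validation error. It uses scikit-learn's Ridge, with alpha standing in for lambda, and made-up data purely for illustration:

import numpy as np
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

# Synthetic data, purely for illustration
rng = np.random.RandomState(1)
X = rng.uniform(-3, 3, size=(200, 1))
y = np.sin(X).ravel() + rng.normal(scale=0.3, size=200)
X_train, X_cv, y_train, y_cv = train_test_split(X, y, test_size=0.4, random_state=1)

# Learn parameters on the training set, then pick lambda by cross validation error
best_alpha, best_err = None, float("inf")
for alpha in [0.001, 0.01, 0.1, 1.0, 10.0, 100.0]:
    model = make_pipeline(PolynomialFeatures(8), Ridge(alpha=alpha)).fit(X_train, y_train)
    err = mean_squared_error(y_cv, model.predict(X_cv))
    if err < best_err:
        best_alpha, best_err = alpha, err
print("best lambda:", best_alpha, "cv error:", best_err)
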
Hopefully this discussion of bias and variance will give you some sense of how to adjust features and parameters of learning algorithms to get the best training results.

Monday, September 9, 2013

How to: Install PowerPivot and Data Mining Add-ins for Excel 2010

For data analysts and BI developers, there are two add-ins for Excel 2010 that might be worth installing for a tryout:
  • PowerPivot add-in for Excel 2010
  • Data mining add-in for Excel 2010

Where to download?

Microsoft SQL Server 2012 SP1 PowerPivot for Microsoft Excel 2010 can be downloaded at
http://www.microsoft.com/en-us/download/details.aspx?id=29074

Moreover, SQL Server 2012 Data Mining Add-ins for Office 2010 can be downloaded at http://www.microsoft.com/en-us/download/details.aspx?id=29061

For both add-ins, you need to decide which version you need, since both 32-bit and 64-bit versions are available. If you have a 32-bit version of Excel, you need to choose the 32-bit installer (for example, ENU\x86\SQL_AS_DMAddin.msi for the Data Mining add-in); don't confuse this with the machine's architecture. If you don't know which version of Excel you have, go to the File tab and click Help. On the right of the screen you will find detailed information about your Excel installation, including the version.

Memory Considerations

Note that for the 32-bit version of Excel, the limit on memory usage is 2 GB. If you need more than 2 GB of memory to work with your mining model, you must use the 64-bit version of Excel with the 64-bit version of the PowerPivot add-in. For more information, refer to the TechNet article: How to Use the SQL Server Data Mining Add-ins with PowerPivot for Excel

Monday, August 5, 2013

Concepts in Microsoft Association Rules: Lift, Support, Importance, and Probability

You must have heard of the famous diapers-and-beer story that illustrates correlations in shoppers' baskets. Market basket analysis with association rule mining has been widely used by retailers since the 1990s to adjust store layouts and to develop cross-promotion plans and catalogs. Nowadays, instant recommendations based on association rules have become a hot research topic. The Microsoft Association Rules algorithm is a common algorithm for creating association rules, which can be used in market basket analysis. It supports several parameters that affect the behavior, performance, and accuracy of the resulting mining model. Therefore, it is important to have a clear understanding of the following concepts.


LIFT


In data mining and association rule learning, lift is a measure of the performance of a model (association rule) at predicting or classifying cases as having an enhanced response (with respect to the population as a whole), measured against a random-choice targeting model. For example, suppose that 5% of customers mailed a catalog without using the model would make a purchase, but a certain model (or rule) identifies a segment with a response rate of 15%. That segment would then have a lift of 3.0 (15% / 5%). Lift indicates how much the model improves predictions over a random selection, given actual results.

SUPPORT


Support is the probability that a transaction contains the targeted item or itemset. The larger the support, the more cases containing the targeted item or combination of items the model has. You can use the MINIMUM_SUPPORT and MAXIMUM_SUPPORT parameters to define thresholds. By default, MINIMUM_SUPPORT is 0.0 and MAXIMUM_SUPPORT is 1.0.

RULES


The Rules tab in the Microsoft Association Rules Viewer displays the Probability, Importance, and Rule columns for the rules that the mining algorithm finds.

Rule: A rule is a description of the presence of an item in a transaction based on the presence of other items.

Probability: The likelihood of a rule, defined as the probability of the right-hand item given the left-hand item. By default, MINIMUM_PROBABILITY is 0.4. However, probability is sometimes misleading. For example, if every transaction contains a gift bag (perhaps the gift bag is added to each customer's cart automatically as a promotion), a rule predicting the gift bag has a probability of 1. It is accurate but not very useful. To gauge the usefulness of a rule, importance is the right measure to use.

Importance: A measure of the usefulness of a rule; a greater value means a better rule. The importance of a rule is calculated as the log likelihood of the right-hand side of the rule given the left-hand side of the rule. For example, for the rule If {A} Then {B}, the importance is Log( P(B|A) / P(B|not A) ).
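
To make these measures concrete, here is a hedged Python sketch that computes support, probability (confidence), lift, and importance for a single rule on a made-up set of transactions. The item names and baskets are assumptions for illustration only; the Microsoft algorithm computes these measures internally, and the importance here follows the log-likelihood-ratio form described above:

import math

# Hypothetical shopping baskets, purely for illustration
baskets = [
    {"diapers", "beer"},
    {"diapers", "beer", "chips"},
    {"diapers", "milk"},
    {"beer", "chips"},
    {"milk"},
    {"diapers", "beer", "milk"},
]
n = len(baskets)

A, B = "diapers", "beer"                                 # rule: If {A} Then {B}
with_A = [b for b in baskets if A in b]
with_A_and_B = [b for b in with_A if B in b]
without_A_with_B = [b for b in baskets if A not in b and B in b]

support = len(with_A_and_B) / n                          # P(A and B)
probability = len(with_A_and_B) / len(with_A)            # P(B | A), a.k.a. confidence
p_B = sum(B in b for b in baskets) / n
lift = probability / p_B                                 # improvement over the baseline response
p_B_given_not_A = len(without_A_with_B) / (n - len(with_A))
importance = math.log10(probability / p_B_given_not_A)   # log likelihood ratio

print(support, probability, lift, importance)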