Today we will briefly discuss the difference between star and snowflake schemas, and what you should consider when you make decisions for a data warehouse design.
What are star schemas and snowflake schemas?
In dimensional data warehousing, star schemas represent a dimensional model which is composed of a central fact table and multiple “de-nomalized” dimension tables around the fact table. The hierarchy dimension are flattened into a single table. For example, in the sample data warehouse of AdventureWorks, the customer dimension may contain attributes about customers such as title, name, date of birth, address in one table. Or for a data warehouse in healthcare, the provider dimension would contain attributes about providers and also practices that providers belong to. Thus, there are some redundancy for star schemas.
On the other hand, a snowflake schema can be regarded as a special type of a star schema. It contains a central fact table and multiple normalized tables to represent one dimension. For example, the Customer dimension can be represented by Customer table and Geography table as shown below.
In the example of the provider dimension, it may have one Practice table containing all practices’ attributes and another table contains only providers’ attributes and a foreign key pointing to the Practice table.
Differences or trade-offs
In terms of disk space, star schemas cost more space because snowflake schemas contain more normalized dimension tables, and thus removes some of the attribute redundancy in dimension tables. (But compared to the size of fact tables, the dimension tables usually are negligible.)
However, in terms of performance, star schemas usually give better performance in most cases because if a query in snowflake schemas may need many joins to do a data aggregation.
Also, a dimension table in snowflake schema may have one or more parent tables. So the dimensions in a snowflake schema must be highly controlled and managed to avoid update and insert anomalies.
Thus, Kimball design tips recommended star schemas in regards of understandability and performance. Even when using snowflake schemas, views are suggested to be placed on top of the dimensions so that they appear to the user as a single dimension.
Cases to choose snowflake schemas
There are some user cases that snowflake schemas may be a better choices:
- Based on query frequency: Large customer dimensions such as Amazon or Facebook, where, for example, 90 percent of the fact table measurements involve anonymous visitors about whom you collect little detail, and 10 percent involve reliably registered customers about whom you collect much detailed data by tracking many dimensions. Moving the 10 percent registered customers as a separate dimension can improve performance.
- Sparsely populated attributes, where most dimension records are NULL, are moved to a sub-dimension.
- Low cardinality attributes that are queried independently. For example, a product dimension may contain thousands of products, but only a handful of product types. So if the product types are often queried independently, it is better to have a separate product type dimension. Another example could be the provider dimension may contain a very large number of demographic attributes at the practice level. Rather than repeat a large number of attributes in the provider dimension, we might create a special type of normalized practice dimension called a “dimension outrigger” where the attributes are stored in the provider dimension.
- Attributes that are part of a hierarchy and are queried independently. Examples include the year, quarter, and month attributes of a date hierarchy; and multiple enterprise calendar dimensions.
In summary, before making decision on which schema to use, you should have business requirements defined about how this data warehouse is used. You should reflect how users want to aggregate and “slice and dice” historical fact table data. You must embrace the goal of enhancing business value as its primary purpose. Work closely with business users, uncover the needs of your business users, then design a system that meet their needs!
No comments:
Post a Comment