Data Warehouse Architecture Best Practices

A data warehouse should consist of at least three layers, and each of these layers can be designed with several individual sub-layers.

Those three layers could be named Staging, Core, and Access, though several other names are in common use: Landing, Sourcing, Integrated, Enterprise, Analytic, Application, and so on. After all, they are just names.

In addition to the three layers, other components can be useful: a data lab, “big data” storage, and support for advanced analytics.

Staging Layer

The Staging layer is where you “dump” source data, keeping the original formatting to ease reconciliation between the source systems and the data warehouse. It is usually optimal to get a full dump of the source data every time it is loaded and to derive changes by comparing against the previous full dump. The exception is a modern source system that publishes every change as a transaction to be consumed by any “interested” target system, including the data warehouse. Just be absolutely sure that every change really is published, no matter how insignificant it may seem.
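As a rough illustration of deriving changes from consecutive full dumps, here is a minimal Python sketch. The CSV layout and the unique business key column are assumptions made for the example, not a prescribed format:

```python
import csv

def derive_changes(previous_dump: str, current_dump: str, key: str):
    """Compare two full dumps of the same source table and classify
    each record as an insert, update, or delete.

    Assumes CSV files with a header row and a unique business key
    column named by `key` (a hypothetical layout for illustration).
    """
    def load(path):
        with open(path, newline="") as f:
            return {row[key]: row for row in csv.DictReader(f)}

    old, new = load(previous_dump), load(current_dump)
    inserts = [new[k] for k in new.keys() - old.keys()]
    deletes = [old[k] for k in old.keys() - new.keys()]
    updates = [new[k] for k in new.keys() & old.keys() if new[k] != old[k]]
    return inserts, updates, deletes
```

A real loader would also stamp each change with load metadata, but the heart of the technique is exactly this set comparison between two full snapshots.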

Core Layer

The Core layer is where all data is held in an integrated format, preferably in a highly abstracted and generalized third normal form (3NF). This ensures that data from any potential source can be loaded and integrated into the data model, and it makes cost-efficient use of storage space. Some will argue that the physical data model in such a core is not very user-friendly, but users should never need to see, or even access, this physical data model.
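To make “abstracted and generalized 3NF” a little more concrete, here is a minimal sketch of the party/role style of generalization often used in such core models. The names and fields are illustrative only, not a recommended design:

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional

# Instead of one physical table per source concept (Customer, Supplier,
# Employee, ...), a generalized core keeps one abstract Party table plus
# a PartyRole table, so a new source concept becomes new rows rather
# than a change to the physical model.

@dataclass
class Party:
    party_id: int
    name: str

@dataclass
class PartyRole:
    party_id: int
    role: str                        # e.g. "customer", "supplier"
    valid_from: date
    valid_to: Optional[date] = None  # open-ended: the role still applies
```

The price of this generality is that the physical model reads poorly for end users, which is exactly why it stays hidden behind the access layer.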

Access Layer

You should only expose this layer to users. Design it to mimic the way end users know their data; the layer will therefore typically consist of many different parts, each targeted at a specific user population or application. The access layer should be a virtual layer that does not actually store information.

For performance reasons it may not always be possible to get away with a purely virtual access layer. Most relevant data warehouse tools support at least one way of materializing some of the virtual structures, keeping data redundancy to a minimum and, most importantly, leaving the management of that redundancy to the data management engine instead of bespoke code produced by you or expensive consultants.
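As a loose analogy in Python/pandas (a real implementation would use database views and the engine’s materialized views), the difference between a virtual structure and its materialization might look like this:

```python
import pandas as pd

# Hypothetical core tables, loaded here as DataFrames for illustration.
parties = pd.DataFrame({"party_id": [1, 2], "name": ["Acme", "Bolt"]})
orders = pd.DataFrame({"order_id": [10, 11], "party_id": [1, 2],
                       "amount": [120.0, 80.5]})

def customer_orders_view() -> pd.DataFrame:
    """A 'virtual' access structure: computed on demand from the core
    tables and storing nothing itself."""
    return orders.merge(parties, on="party_id")

# Only if performance demands it is the same definition materialized,
# and the refresh is then owned by the engine, not by bespoke load code.
customer_orders_view().to_csv("customer_orders.csv", index=False)
```

The key point is that the definition lives in one place; whether it is computed on the fly or persisted is a tuning decision, not a design change.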

Data Lab

A data lab is a temporary space within the data warehouse environment where end users can load data they want to combine with existing data in the data warehouse. A data lab is very useful for evaluating whether a new data source is worth including in the data warehouse; just remember that you should never use the data lab for production reporting. The most difficult task related to data labs is building a user-friendly and highly automated system to administer the creation, use, and subsequent disposal of each lab, as sketched below.
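Here is a minimal sketch of the kind of automation involved, assuming each lab is implemented as a per-user schema with a fixed lifetime (both are illustrative assumptions):

```python
from datetime import date, timedelta

class DataLab:
    """Hypothetical administered data lab: an isolated schema with a
    fixed expiry date, after which it is dropped automatically."""

    def __init__(self, owner: str, lifetime_days: int = 90):
        self.owner = owner
        self.schema = f"lab_{owner}"
        self.expires = date.today() + timedelta(days=lifetime_days)

    def is_expired(self) -> bool:
        return date.today() >= self.expires

    def drop_if_expired(self) -> None:
        if self.is_expired():
            # In practice this would be handed off to the database engine.
            print(f"DROP SCHEMA {self.schema}")
```

Enforced expiry is what keeps labs temporary in practice rather than just in policy.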

Big Data

Data that will not feasibly fit within the data warehouse, but which you suspect can add valuable insight for your company, should be handled with big data tools. These provide very cheap storage of data of any size and format and, most importantly, act as a preprocessor that makes the relevant data available to advanced analytics. The big data part of the data environment is gradually changing from an optional to a mandatory component, and it will soon be a critical one.
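To illustrate the preprocessor role, here is a small sketch. It assumes raw events land in cheap storage as newline-delimited JSON; a real pipeline would typically run the same filtering and projection on a distributed engine:

```python
import json

def preprocess(raw_path: str, out_path: str) -> None:
    """Hypothetical preprocessor: scan cheap raw storage and keep only
    the records and fields relevant for downstream analytics."""
    with open(raw_path) as src, open(out_path, "w") as dst:
        for line in src:
            event = json.loads(line)
            if event.get("type") == "purchase":    # keep relevant records
                dst.write(json.dumps({             # keep relevant fields
                    "customer_id": event["customer_id"],
                    "amount": event["amount"],
                    "ts": event["ts"],
                }) + "\n")
```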

Advanced Analytics

When the tables and pie charts available from any BI/reporting tool are not sufficient to answer your business questions, you may need to turn to advanced analytics, typically supplemented with advanced visualization tools. These tools can disclose hidden information in your data environment and present it in an easily understandable graphical format, and they should be able to work with both the data warehouse and the big data tools. Just be aware that advanced analytics requires specialists of a rare type, often called data scientists. Without those specialists, the tools are of no use and can even be quite misleading.
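As one small, hedged example of what advanced analytics can mean in practice (the post names no specific tool; scikit-learn is used here purely for illustration), a clustering step can surface segments that tables and pie charts would hide:

```python
import numpy as np
from sklearn.cluster import KMeans

# Stand-in features, e.g. spend and order count per customer.
rng = np.random.default_rng(42)
features = rng.normal(size=(200, 2))

# Group customers into three segments and report the segment sizes.
labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(features)
print(np.bincount(labels))
```

Judging whether those segments actually mean anything is exactly the work that requires a data scientist.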

Historical Arguments

This way of constructing a data warehouse has not always been feasible, but many things have changed. Hardware prices have declined dramatically while capacity has increased even more dramatically. The number of potential source systems has grown sharply, and their nature has shifted from largely homegrown systems with supposedly good control over data structures to more and more off-the-shelf and cloud-based systems with little or no control over data structures. The cost of the resources needed to build and maintain a data warehouse has risen substantially. The pace of business change is increasing across all industries, creating demand for more flexible solutions and faster cycle times in all business processes, including building the data warehouse solutions that support them. And finally, the need for traceability and data security has increased and is increasingly subject to regulatory control by various authorities, depending on the industry. All these factors work in favor of the best practices described here.

Your data warehouse may not look like this, and for historical reasons that may be justifiable, but you should consider moving towards these best practices, either by migrating the existing data warehouse or by designing the next-generation data warehouse as a replacement for the old solution. The latter option is usually the easiest and cheapest.
