Data warehouse
A Data warehouse is a computer database that collects, integrates and stores an organization's computer data with the aim of producing accurate and timely management information and supporting data analysis.
Definition of a Data Warehouse
A data warehouse is a collection of computerised data that is organised to most optimally support reporting and analysis activity.
Bill Inmon's formal systems definition of a data warehouse is a computer database and its supporting components that is:
- Subject oriented, meaning that the data in the database is organised so that all the data elements relating to the same real-world event or object are linked together;
- Time variant, meaning that the changes to the data in the database are tracked and recorded so that reports can be produced showing changes over time;
- Non-volatile, meaning that data in the database is never over-written or deleted, but retained for future reporting; and,
- Integrated, meaning that the database contains data from most or all of an organisation's operational applications, and that this data is made consistent.
History of data warehousing
Data Warehouses became a distinct type of computer database during the late 1980's and early 1990's. They developed to meet a growing demand for management information and analysis that could not be met by operational systems. Operational systems were unable to meet this need for a range of reasons:
- The processing load of reporting reduced the response time of the operational systems,
- The database designs of operational systems were not optimised for information analysis and reporting,
- Most organisations had more than one operational system, so company-wide reporting could not be supported from a single system, and
- Development of reports in operational systems often required writing specific computer programs which was slow and expensive
As a result, separate computer databases began to be built that were specifically designed to support management information and analysis purposes. These data warehouses were able to bring in data from a range of different data sources, such as mainframe computers, minicomputers, as well as personal computers and office automation software such as spreadsheets and integrate this information in a single place. This capability, coupled with user-friendly reporting tools, and freedom from operational impacts has led to a growth of this type of computer system.
As technology improved (lower cost for more performance) and user requirements increased (faster data load cycle times and more functionality), data warehouses have evolved through several fundamental stages:
- Offline Operational Databases - Data warehouses in this initial stage are developed by simply copying the database of an operational system to an off-line server where the processing load of reporting does not impact on the operational system's performance.
- Offline Data Warehouse - Data warehouses in this stage of evolution are updated on a regular time cycle (usually daily, weekly or monthly) from the operational systems and the data is stored in an integrated reporting-oriented data structure
- Real Time Data Warehouse - Data warehouses at this stage are updated on a transaction or event basis, every time an operational system performs a transaction (e.g. an order or a delivery or a booking etc.)
- Integrated Data Warehouse - Data warehouses at this stage are used to generate activity or transactions that are passed back into the operational systems for use in the daily activity of the organisation.
Components of a data warehouse
The primary components of the majority of data warehouses are shown in the attached diagram and described in more detail below:
Data Sources
Data sources refers to any electronic repository of information that contains data of interest for management use or analytics. This definition covers mainframe databases (eg IBM DB2, ISAM, Adabas etc.), client-server databases (eg Oracle database, Informix etc.), PC databases (eg Microsoft Access), spreadsheets (eg Microsoft Excel) and any other electronic store of data. Data needs to be passed from these systems to the data warehouse either on a transaction-by-transaction basis for real-time data warehouses or on a regular cycle (e.g. daily or weekly) for offline data warehouses.
Data Transformation
The Data Transformation layer receives data from the data sources, cleans and standardises it, and loads it into the data repository. This is often called "staging" data as data often passes through a temporary database whilst it is being transformed. This activity of transforming data can be performed either by manually created code or a specific type of software could be used called an ETL tool. Regardless of the nature of the software used, the following types of activities occur during data transformation:
- comparing data from different systems to improve data quality (e.g. Date of birth for a customer may be blank in one system but contain valid data in a second system. In this instance, the data warehouse would retain the date of birth field from the second system)
- standardising data and codes (e.g. If one system refers to "Male" and "Female", but a second refers to only "M" and "F", these codes sets would need to be standardised)
- integrating data from different systems (e.g. if one system keeps orders and another stores customers, these data elements need to be linked)
- performing other system housekeeping functions such as determining change (or "delta") files to reduce data load times, generating or finding surrogate keys for data etc.
Data Warehouse
The data warehouse is a relational database organised to hold information in a structure that best supports reporting and analysis. Most data warehouses hold information for at least 1 year. As a result these databases can become very large.
Reporting
The data in the data warehouse must be available to the organisation's staff if the data warehouse is to be useful. There are a very large number of software applications that perform this function, or reporting can be custom-developed. Examples of types of reporting tools include:
- Business intelligence tools: These are software applications that simplify the process of development and production of business reports based on data warehouse data.
- Executive information systems: These are software applications that are used to display complex business metrics and information in a graphical way to allow rapid understanding.
- OLAP Tools: OLAP tools form data into logical multi-dimensional structures and allow users to select which dimensions to view data by.
- Data Mining: Data mining tools are software that allows users to perform detailed mathematical and statistical calculations on detailed data warehouse data to detect trends, identify patterns and analyse data.
Metadata
Metadata, or "data about data", is used to inform operators and users of the data warehouse about its status and the information held within the data warehouse. Examples of data warehouse metadata include the most recent data load date, the business meaning of a data item and the number of users that are logged in currently.
Operations
Data warehouse operations comprises of the processes of loading, manipulating and extracting data from the data warehouse. Operations also covers user management, security, capacity management and related functions.
Optional Components
In addition, the following components also exist in some data warehouses:
- Dependent Data Marts: A dependent data mart is a physical database (either on the same hardware as the data warehouse or on a separate hardware platform) that receives all its information from the data warehouse. The purpose of a Data Mart is to provide a sub-set of the data warehouse's data for a specific purpose or to a specific sub-group of the organisation.
- Logical Data Marts: A logical data mart is a filtered view of the main data warehouse but does not physically exist as a separate data copy. This approach to data marts delivers the same benefits but has the additional advantages of not requiring additional (costly) disk space and it is always as current with data as the main data warehouse.
- Operational Data Store: An ODS is an integrated database of operational data. Its sources include legacy systems and it contains current or near term data. An ODS may contain 30 to 60 days of information, while a data warehouse typically contains years of data. ODS's are used in some data warehouse architectures to provide near real time reporting capability in the event that the Data Warehouse's loading time or architecture prevents it being able to provide near real time reporting capability.
Different methods of storing data in a data warehouse
All data warehouses store their data grouped together by subject areas that reflect the general usage of the data (Customer, Product, Finance etc.). The general principle used in the majority of data warehouses is that data is stored at its most elemental level for use in reporting and information analysis.
Within this generic intent, there are two primary approaches to organising the data in a data warehouse.
The first is using a "dimensional" approach. In this style, information is stored as "facts" which are numeric or text data that capture specific data about a single transaction or event, and "dimensions" which contain reference information that allows each transaction or event to be classified in various ways. As an example, a sales transaction would be broken up into facts such as the number of products ordered, and the price paid, and dimensions such as date, customer, product, geographical location and sales person. The main advantages of a dimensional approach is that the Data Warehouse is easy for business staff with limited information technology experience to understand and use. Also, because the data is pre-processed into the dimensional form, the Data Warehouse tends to operate very quickly. The main disadvantage of the dimensional approach is that it is quite difficult to add or change later if the company changes the way in which it does business.
The second approach uses database normalisation. In this style, the data in the data warehouse is stored in third normal form. The main advantage of this approach is that it is quite straightforward to add new information into the database, whilst the primary disadvantage of this approach is that it can be quite slow to produce information and reports.
Advantages of using data warehouse
There are many advantages to using a data warehouse, some of them are:
- Enhances end-user access to a wide variety of data.
- Increases data consistency.
- Increase productivity and decrease computing costs.
- Is able to combine data from different sources, in one place.
- It provides an infrastructure that could support changes to data and replication of the changed data back into the operational systems.
Concerns in using data warehouse
- Extracting, cleaning and loading data could be time consuming.
- Data warehousing project scope might increase.
- Problems with compatibility with systems already in place e.g. transaction processing system.
- Providing training to end-users, who end up not using the data warehouse.
- Security could develop into a serious issue, especially if the data warehouse is web accessible.
- A data warehouse is a HIGH maintenance system.
See also
- Business intelligence
- Business performance management
- Chief Performance Officer
- Data mart
- Data mining
- Data integration
- Database management system
- Executive information system
- Extract, transform, load
- Intelligent document
- Master Data Management
- OLAP
- OLTP
References
- William H. Inmon, Richard D. Hackathorn: Using the Data Warehouse, John Wiley & Sons, ISBN 0-471-05966-8
- Pyle, Dorian. Business Modeling and Data Mining. Morgan Kaufmann, 2003. ISBN 155860653X
- Ralph Kimball, Margy Ross: The Data Warehouse Toolkit: The Complete Guide to Dimensional Modeling (Second Edition), John Wiley & Sons, ISBN 0471200247
- Stephen Haag, Maeve Cummings, Donald J. McCubbery, Alain Pinsonneault,Richard Donvan:Managments Information System for the Information Age, Third Canadian Edition, McGraw-Hill Ryerson, ISBN 0-07-095569-7
- Data Management And Warehousing knowledge base
- BeyeNetwork – A vendor neutral resource for data management professionals
- Data Warehousing Development Process on gantthead.com
- Information Tecnology Data Warehouse
- Data Warehousing Gotchas