Data warehouse is a large database system designed for the purpose of data analysis. It is the design and implementation of processes, tools, and facilities to manage and deliver complete, timely, accurate, and understandable information for decision making.
Data warehousing provides an excellent approach for transforming the vast amounts of data that exist in large-scale organizations into useful and reliable information for getting answers to their questions and to support the decision making process. From a business perspective, data warehouse provides a single and consistent data source. It makes the data collection process much easier and faster for the users. Potentially, better business decisions can be made in a short period of time.
This document presents an overview to data warehouse in terms of the architecture, the data model, and the user interface.
The concept of data warehousing has evolved out of the need for easy access to a structured store of quality data that can be used for decision making. It is globally accepted that information is a very powerful asset that can provide significant benefits to any organization and a competitive advantage in the business world. Organizations have vast amounts of data but have found it increasingly difficult to access it and make use of it. This is because it is in many different formats, exists on many different platforms, and resides in many different files and database structures developed by different vendors. Thus organizations have had to write and maintain perhaps hundreds of programs that are used to extract, prepare, and consolidate data for use by many different applications for analysis and reporting. Also, decision makers often want to dig deeper into the data once initial findings are made. This would typically require modification of the extract programs or development of new ones. This process is costly, inefficient, and very time consuming.
Data warehousing offers a better approach. Data warehousing implements the process to access heterogeneous data sources; clean, filter, and transform the data; and store the data in a structure
that is easy to access, understand, and use. The data is then used for query, reporting, and data analysis. As such, the access, use, technology, and performance requirements are completely different from those in a transaction-oriented operational environment. The volume of data in data warehousing can be very high, particularly when considering the requirements for historical data analysis.
2.0 Business Driving Force for Data Warehouse:
With the increase in business competition, there is a need to obtain and analyze business data faster. A lot of important business data is in operational database systems. However, these systems are not designed for business data analysis due to the following reasons:
• Data model is normalized for speed and not for data analysis.
• Data model is not grouped into subject areas for analysis.
• Data model is not dimensional.
• Operational database can't afford the resources to perform computation intensive query during analysis.
• There's no cross reference information between data from the different operational databases.
• Historical data may not be found in operational database for trend analysis.
• Data warehouse has data description and data browsing facilities.
• Operational data changes over time.
Data warehouse can provide easy information access for business people to increase revenue, profit, customer satisfaction, savings, and market share. The system can be used for different departments in the organization.
3.0 Development Steps:
The development steps for data warehouse project are similar to those of any other information system. The following outlines some key steps during developing. The outline is divided into three sections and they are planning & design, building & testing, roll out & maintenance.
Planning and Design
• Business drivers
• User needs
• Application orientation
• Data sources and Quality
• To build data warehouse or data mart
• Project risk
• Budget plan
• Time frame
• Logical and physical data model (depends on access & usage)
At the planning & design phase, the data model is just in preliminary design.
Building and Testing
• HW, SW, transformation SW, middle ware, OLAP SW, system management SW
• Network infrastructure and management
• Connect to source databases (flat file, ongoing connection, direct access)
• Summarize or aggregate data
• Data mining to find out data patterns.
Some data extraction & transformation software are very useful to develop data transformation routines. These tools are very useful for both the construction and the maintenance phase. In addition, system management software can control the data extraction processes to extract data from other database systems to the data warehouse.
Roll out & Maintenance
• System growth
• Performance management
• System maintenance
• Backup, recovery
• Update data
Data warehouse reads source data from different database systems in the organization. The source databases are usually operational databases. The following is one of the data warehouse logical architecture:
Data warehouse reads data from multiple operational databases. The data is clean, transformed, or aggregated. The data is either updated or inserted into the data warehouse depending on the trend analysis requirement.
Scalability is an important consideration in choosing software, hardware, and system architecture for the data warehouse. Both the database size and the number of users for the data warehouse can increase substantially over time. The software and hardware must be scalable to support the new requirements.
There are different types of database management system such as relational database system, object oriented database system, hierarchical database system, etc. Relational database is usually the choice for implementing data warehouse because the following reasons:
• Relational database is the most commonly used database system in the commercial environment. Many developers already have experience with relational database products. This reduces the learning curve for the developers.
• Most of the operational database system is constructed with relational database. If the data warehouse is also constructed with relational database, the data conversion process between the operational database and the data warehouse can be simpler. Also, there are many database products that enable direct data transfer between relational database systems.
• Relational database is more mature than other types of database system in terms of its scalability, stability, and efficiency.
• Relational database has less proprietary functions than other database systems. This increases the degree of platform independence.
5.0 Data Source and Data Extraction:
Data warehouse reads data from multiple data sources. These data sources are usually operational databases such as accounting information database, financial information database, facility information database, ERP (Enterprise Resources Planning, i.e. SAP), operational information database, research & engineering database, GIS (Geographical Information System), etc.
Depending on the business requirements and the types of data, the data loading frequency can be just once, once a day, once a week, or once a month. Loading data once a day is the most often. There will be on going data and system administrative work required to maintain the data warehouse.
Data loading process can have errors. The problems can be data referential integrity error, data format error, data range error, or other data quality errors. In these situations, the source data has to be modified before it can be loaded into the data warehouse.
6.0 Data Modeling:
Data modeling is one of the most important steps in building a data warehouse. Data warehouse uses dimensional modeling in a relational database environment. There are two types of tables and they are dimension table and fact table. Dimension table contains information that is relatively static over time. Fact table contains transactional type of information that changes over time. Fact table contains multiple foreign keys to dimension tables and has some of its own attributes.
Data modeling is a creative process and there can be different modeling solutions for the same set of data. The purpose of data modeling is to organize data to meet business objectives and to provide good performance for database operation.
6.1 Star Schema:
• Star schema is a relational data model. Each schema has one fact table associated with multiple dimension tables. Each data warehouse has many Star schema. Star schema organizes data for the purpose of end-user analysis. Star schema is easy to understand for end-user. Also, there are many OLAP tools that support star schema analysis.
6.2 Historical Data and Trend Analysis:
Historical data is stored in the data warehouse for trend analysis. Trend analysis is a very important feature for the data warehouse. Fact table contains transaction type information. Data is inserted into the fact table without overwriting existing data. Therefore, fact table already captures historical information.
6.3 Information Grouping for Analysis:
Information can be grouped for data analysis. The grouping information can come from the original data source or from the end-user.
Storage space and user interface are needed for the end-user to maintain this type of grouping information.
6.4 Summary Information:
Some types of information are summarized before loading into the data warehouse. This depends on the level of detail of the information required by the user. For example, a supermarket may have a few thousands of transaction each day. This transaction can be summarized by each product before loading into the data warehouse.
6.5 Cross Reference Information:
Cross reference information between information from different databases is very important for data analysis. For example, financial information and operational information can come from two different database systems. The financial database contains cost and revenue information for each facility. The operational database contains operation information for each facility. Cross reference information between these two database systems can enable cost analysis on operation activity.
6.6 Data Model Prototype:
Prototyping is a good way to analyze the data model in early development stage. It can demonstrate the benefit of the data warehouse strategy. In helping to present the data model, the data model can be divided into two views. One is the business view and the other is the developer view. End user can use the business view to understand the system functionality.
There are some limitations to prototyping. Prototyping may not show:
• Data migration processes
• System with all the data
• Performance issues
• Security features
6.7 Stage Older Data:
A terabyte data warehouse requires 500 to 1000 physical disk drives and plus 100 more disk controllers. Some old data should be archived as new data is loaded into the warehouse. Fact table old data can be archived. For example, sales data from 7 years ago may be required by end user for analysis.
7.0 Data Transformation and Loading:
Data transformation and loading process puts source data into the data warehouse. The programming logic is usually simple. Depending on the data quality of the source data, the transformation and loading process can be time consuming. For example, if the source data is manually maintained, a lot of effort may be needed to clean the source data. Some records may have bad data and require to be corrected before loading into the data warehouse.
If data comes from two different database systems and the data warehouse is required to build cross reference information between them, there can be data referential integrity problem. For example, one database contains operational data for each facility and the other database contains financial data for each facility. The operational data may refer to a facility that does not exist in the financial data.
For every time there is change in the source data definition or the target data model, the associated data transformation and loading process has to be modified accordingly. If there are many changes, a lot of time is required to modify the processes. There are some visual development tools that are specialized in developing these data transformation and loading processes. These tools have a GUI interface that allows developer to specify the data transformation logic. It makes the data transformation and loading processes easier to develop and to maintain.
There are some other ways to implement the data transformation and loading processes. These processes can be implemented in conventional languages such as C and COBOL. Using C, we can achieve a fast execution speed and this is necessary for some computation intensive data warehouse processes.
8.0 Process Control and Scheduling:
There are many data transformation and loading processes in the data warehouse for data population. There are sequences and dependencies for these processes to execute. A control process is required to control these data transformation and loading processes. It may be required to access multiple computer systems. For example, it can start a data extraction process on another database system and transfer the data to the data warehouse server for data transformation and loading.
9.0 User Interfaces:
There are a few types of user interface for the data warehouse system. These user interfaces can be used by the system administrator or the end-user. With these user interfaces, system administrator can:
• Maintain user accounts
• Monitor and control data loading and transformation processes
• Analyze data (OLAP tool)
• Create and generate reports
• Create and execute query
• Maintain user input data (i.e. group information for data analysis)
9.1 System Administration and Maintenance:
Some data are manually maintained in the data warehouse. These can be system related data for the data warehouse to operate. This data is usually maintained by the system administrator. For example, the data warehouse has information about all the data loading processes. Scheduling program be can based on these information to execute the data loading processes and the execution status can be stored in the data warehouse for process tracking. Also, system administrator can maintain information about user account and access privilege. A user interface can be developed for the administrator to maintain the information.
Some lookup data and grouping data are also manually maintained. These data are for data analysis purposes.
Data warehouse is a good solution for storing and analyzing large amount of data. It reads data from multiple operational databases on an ongoing basis. Cross reference information is generated between the data from different databases. The data model is designed to provide good browsing performance to the end user. Data warehouse can be seen as a centralized data repository to provide both current and historical data to the end user.
Akmal B. Chaudhri, Mary Loomis. Object Databases in Practice. Hewlett-Packard Company, Prentice-Hall.
Kimball. Data Warehouse Toolkit. John Wiley & Sons, Inc.