1.What is a data warehousing?
Data Warehouse is a repository of integrated information, available for queries and analysis. Data and information are extracted from heterogeneous sources as they are generated....This makes it much easier and more efficient to run queries over data that originally came from different sources.
Typical relational databases are designed for on-line transactional processing (OLTP) and do not meet the requirements for effective on-line analytical processing (OLAP). As a result, data warehouses are designed differently than traditional relational databases.
2.What is real time data-warehousing?
Real-time data warehousing is a combination of two things: 1) real-time activity and 2) data warehousing. Real-time activity is activity that is happening right now. The activity could be anything such as the sale of widgets. Once the activity is complete, there is data about it.
Data warehousing captures business activity data. Real-time data warehousing captures business activity data as it occurs. As soon as the business activity is complete and there is data about it, the completed activity data flows into the data warehouse and becomes available instantly. In other words, real-time data warehousing is a framework for deriving information from data as the data becomes available.
3.What are Data Marts?
Data Marts are designed to help manager make strategic decisions about their business.
Data Marts are subset of the corporate-wide data that is of value to a specific group of users.
A data mart is a focused subset of a data warehouse that deals with a single area (like different department) of data and is organized for quick analysis.
There are two types of Data Marts:
Independent data marts – sources from data captured form OLTP system, external providers or from data generated locally within a particular department or geographic area.
Dependent data mart – sources directly form enterprise data warehouses.
4.What is data mining?
Data mining is a process of extracting hidden trends within a data warehouse. For example an insurance data warehouse can be used to mine data for the most high risk people to insure in a certain geographical area.
5.What is a Star Schema?
Star schema is a type of organizing the tables such that we can retrieve the result from the database easily and fastly in the warehouse environment. Usually a star schema consists of one or more dimension tables around a fact table which looks like a star, so that it got its name.
6.What Snowflake Schema?
Snowflake Schema, each dimension has a primary dimension table, to which one or more additional dimensions can join. The primary dimension table is the only table that can join to the fact table.
7.Differences between Star and Snowflake schemas?
Star schema - A single fact table with N number of Dimension
Snowflake schema - Any dimensions with extended dimensions are know as snowflake schema
8.Difference between Star and Snowflake Schema. What are situations where Star Schema is better than Snowflake Schema to use and when the opposite is true?
• Contains the dimension tables mapped around one or more fact tables.
• It is a denormalised model.
• No need to use complicated joins.
• Queries results fastly.
• It is the normalized form of Star schema.
• Contains in-depth joins, because the tables are splitted into many pieces. We can easily do modification directly in the tables.
• We have to use complicated joins, since we have more tables.
• There will be some delay in processing the Query.
9.What are slowly changing dimensions?
SCD stands for slowly changing dimensions. Slowly changing dimensions are of three types
SCD1: only maintained updated values.
Ex: a customer address modified we update existing record with new address.
SCD2: maintaining historical information and current information by using
A) Effective Date
Or combination of these
SCD3: adding new columns to target table we maintain historical information and current information
10.What is a Fact table?
Fact Table contains the measurements or metrics or facts of business process. If your business process is "Sales”, then a measurement of this business process such as "monthly sales number" is captured in the Fact table. Fact table also contains the foreign keys for the dimension tables.
11.What is a Dimension table?
A dimensional table is a collection of hierarchies and categories along which the user can drill down and drill up. It contains only the textual attributes.
12.What is a Lookup table?
A lookup table is the one which is used when updating a warehouse. When the lookup is placed on the target table (fact table / warehouse) based upon the primary key of the target, it just updates the table by allowing only new records or updated records based on the lookup condition.
Lookup tables can be considered as a reference table where in we refer and confirm for any updates / new addition in our dimension/ fact.
13.What is conformed fact?
A fact (measurement) that is used in more than one Data Mart. When developing multiple data marts, some standards across marts with respect to facts are useful. Warehouse users may be confused when a fact has the same meaning but different names, or the same name is used with inconsistent meaning across data marts. Ralph Kimball refers to the practice of consistently using common facts throughout the warehousing environment as "conformed" facts.
Conformed dimensions are the dimensions which can be used across multiple Data Marts in combination with multiple facts tables accordingly.
14.What are conformed dimensions?
A Dimension that is reused by more than one Data Mart. When modeling multiple data marts, standards across marts with respect to dimensions are useful. Warehouse users may be confused when a dimension has the similar meaning but different names, structures, levels, or characteristics among multiple marts. Ralph Kimball refers to the practice of using standard dimensions throughout the warehousing environment as "conformed" dimensions.
Conformed dimensions are dimensions which are common to the cubes. (Cubes are the schemas contains facts and dimension tables)
Consider Cube-1 contains F1, D1, D2, D3 and Cube-2 contains F2, D1, D2, D4 are the Facts and Dimensions here D1, D2 are the Conformed Dimensions
15.What are Aggregate tables?
Aggregate table contains the summary of existing warehouse data which is grouped to certain levels of dimensions. Retrieving the required data from the actual table, which have millions of records will take more time and also affects the server performance. To avoid this we can aggregate the table to certain required level and can use it. This table reduces the load in the database server and increases the performance of the query and can retrieve the result very fastly.
16.How can you improve the performance of Aggregate transformation?
We can improve the aggregator performance in the following ways.
• Send sorted input.
• Increase aggregator cache size. I.e. Index cache and data cache.
• Give input/output what you need in the transformation. I.e. reduce number of input and output ports.
17.What are non-additive facts?
Non-Additive: Non-additive facts are facts that cannot be summed up for any of the dimensions present in the fact table.
18.What is a CUBE in data warehousing concept?
Cubes are logical representation of multidimensional data. The edge of the cube contains dimension members and the body of the cube contains data values.
19.What is junk dimension? What is the difference between junk dimension and degenerated dimension?
Junk dimension: Grouping of Random flags and text Attributes in a dimension and moving them to a separate sub dimension.
Degenerate Dimension: Keeping the control information on Fact table ex: Consider a Dimension table with fields like order number and order line number and have 1:1 relationship with Fact table, In this case this dimension is removed and the order information will be directly stored in a Fact table in order eliminate unnecessary joins while retrieving order information..
20.What is a three tier data warehouse?
A data warehouse can be thought of as a three-tier system in which a middle system provides usable data in a secure way to end users. On either side of this middle system are the end users and the back-end data stores.
21.What is Full load & Incremental or Refresh load?
Full Load: completely erasing the contents of one or more tables and reloading with fresh data.
Incremental Load: applying ongoing changes to one or more tables based on a predefined schedule.
22.What is Business Intelligence?
Business intelligence (BI) is a broad category of applications and technologies for gathering, storing, analyzing, and providing access to data to help enterprise users make better business decisions.
23.What is OLAP?
OLAP (Online Analytical Processing) spans multiple versions of database schema due to the evolutionary process of an organization; Integration from many organizational locations and data stores.
Online Analytical Processing, a category of software tools that provides analysis of data stored in a database. OLAP tools enable users to analyze different dimensions of multidimensional data. For example, it provides time series and trend analysis views. The chief component of OLAP is the OLAP server, which sits between a client and a database management system (DBMS). The OLAP server understands how data is organized in the database and has special functions analyzing the data.
24.What are OLAP, MOLAP, ROLAP, DOLAP, and HOLAP? Examples?
OLAP - On-Line Analytical Processing - Designates a category of applications and technologies that allow the collection, storage, manipulation and reproduction of multidimensional data, with the goal of analysis.
MOLAP - Multidimensional OLAP - This term designates a Cartesian data structure more specifically. In effect, MOLAP contrasts with ROLAP. In the former, joins between tables are already suitable, which enhances performances. In the latter, joins are computed during the request. Targeted at groups of users because it's a shared environment. Data is stored in an exclusive server-based format. It performs more complex analysis of data.
DOLAP - Desktop OLAP. - Small OLAP products for local multidimensional analysis Desktop OLAP. There can be a mini multidimensional database (using Personal Express), or extraction of a data cube (using Business Objects). Designed for low-end, single, departmental user. Data is stored in cubes on the desktop. It's like having your own spreadsheet. Since the data is local, end users don't have to worry about performance hits against the server.
ROLAP - Relational OLAP. - Designates one or several star schemas stored in relational databases. This technology permits multidimensional analysis with data stored in relational databases. Used for large departments or groups because it supports large amounts of data and users.
HOLAP: Hybridization of OLAP, which can include any of the above.
25.What is a Universe?
A "universe" is a "Business object" terminology. Business objects also happen to be the name of the company. The universe is the interfacing layer between the client and the data warehouse. The universe defines the relationship among the various tables in the data warehouse.
Universe is a semantic layer between the database and the user interface (reports).
A mapping of the data structure found in databases: tables, columns, joins, etc. A universe, made up of classes, objects, and conditions, can represent any specific application, system, or group of users.
26.What is surrogate key? Where we use it explain with examples.
Surrogate key is a substitution for the natural primary key.
It is just a unique identifier or number for each row that can be used for the primary key to the table. The only requirement for a surrogate primary key is that it is unique for each row in the table.
Data warehouses typically use a surrogate, (also known as artificial or identity key), key for the dimension tables primary keys. They can use Info sequence generator, or Oracle sequence, or SQL Server Identity values for the surrogate key.
It is useful because the natural primary key (i.e. Customer Number in Customer table) can change and this makes updates more difficult.
Some tables have columns such as AIRPORT_NAME or CITY_NAME which are stated as the primary keys (according to the business users) but ,not only can these change, indexing on a numerical value is probably better and you could consider creating a surrogate key called, say, AIRPORT_ID. This would be internal to the system and as far as the client is concerned you may display only the AIRPORT_NAME.
1.What is a data warehousing?
Formatting Your Resource Will lead in good points and Higher Readability.
Please Keep in mind in future.
Thanks and Regards
Agree with Meetu. Your contents are really worth. A bit of formatting can make them better. Thanks for sharing this. Its really nice to have many things (nearly everything) at one place.
This is really superb and outstanding resource.
Thanks for sharing your knowledge with us.
Hey seems good, nice one yar, this will be helpful to those while in search for good and simple definition.
Really helpful stuff