Monday, 9 February 2015

Data Warehousing Concepts



Here are a few basic questions we must understand and learn in order to Maintain a descent level understanding regarding basic data warehousing concepts.

1.Define data warehouse
Data warehouse is a relational database that is designed for query and analysis rather than for transaction processing.It is termed as corporate memory, though it usually contains historical data that is derived from transaction data, but can include data from other sources as well. It separates analysis workload from transaction workload and enables a business to consolidate data from several sources.
2.What are Different types of data warehouses?

According To William Inmon There Are 4 Types Of Data Warehouse
Subject Oriented: Data warehouse are designed to help you analyze data. For example, to learn more about your company sales data, you can build a warehouse that concentrates on sales. Using this warehouse, you can answer the questions like “who was our best customer for this item last year?” This ability to define data warehouse by subject matter, sales in this case makes the data warehouse subject oriented.
Integrated: Integration is closely related to subject orientation. Data warehouse must put data from disparate sources into a consistent format. They must resolve such problems as naming conflicts and inconsistencies among the units of measures. When they achieve this they are said to be integrated.
Nonvolatile : Nonvolatile means that once entered into the warehouse, data should not be change. This is logical because the purpose of a warehouse is to enable you to analyze what has occurred and whatever once happened never changes.
Time variant: In order to discover trends in business, analysts need large amounts of data. This is very much in contrast to online transaction processing (OLTP) systems, where performance requirements demands that historical data be moved to an archive. A data warehouse’s focus on change over time is what is meant by the term time variant.
3.Define Fact-table

Fact Table: Fact table contains the measurements or metrics or facts of business processes. If your business process is Sales, then a measurement of this business process such as "monthly sales number" is captured in the fact table. In addition to the measurements, the only other things a fact table contains are foreign keys for the dimension tables.
Fact table contains the actual business process measurements or metrics called facts. Usually these facts are numeric.
These facts are generally Additive.
Sometimes the facts are semi additive such as balances
Sometimes they are non-additive such as unit price

4.Define Dimension-table

Dimension Table
Dimension table contains the textual description of a business. It means context of the measurements are represented in dimension tables. You can also think of the context of a measurement as the characteristics such as who, what, where, when, how of a measurement (subject). In your business process Sales, the characteristics of the 'monthly sales number' measurement can be a Location (Where), Time (When), Product Sold (What).

5.What do you mean by Detail in a table?

Details are the dimension beyond which an analyst is not interested to analyze. A detail is always a dimension. If we drill down region dimension table as region-sub region-country-state-city in that case city is detail being the last dimension.

6.What is Measures?

Special dimension that contains measurable quantities upon which the business is measured. They are always numeric quantities. They are the ones on which we perform aggregations. It includes cost, profit, or taxes.

7.What is a critical column?
The huge collection of data within a warehouse is useful for analysis only if the data in it is consistent. However there exists some information which when updated in a warehouse makes the data in a warehouse inconsistent. The column that contains such vital information is identified as critical columns.

9.What is Surrogate-key?

A surrogate key to a data warehouse is what a primary key is for an OLTP source. As a possible solution for a critical column we use surrogate keys by inserting a new record as and when a critical column gets updated at an OLTP source. However it violates the primary key constraint that might have been specified for the column Customer ID. Hence we no longer maintain the Customer ID as a primary key but we have one more new column acting as the so called surrogate keys which typically contains a value generated from a sequence.
So whenever a customer changes city, we insert a new record into our warehouse with the same Customer ID but with a different surrogate key. Customer’s latest city can be identified by the largest surrogate key value for that customer.

8.Define Primary-key

Each row in a dimension table is identified by a unique value of an attribute designated as the primary key of the dimension. In a product dimension table, the primary key identifies each product uniquely.

10.Define Foreign-key

Each dimension table is in a one-to-many relationship with the central fact table. So the
Primary key of each dimension table must be a foreign key in the fact table. If there are
Four dimension tables of product, date, customer, and sales representative, then the primary key of each of these four tables must be present in the orders fact table as foreign keys.


11.What Is A Star Schema?

Star schema is highly-denormalised. As it looks like a STAR. Hence it is also known as Star Schema.




Advantages of Star Schema

Star Schema is very easy to understand, even for non-technical business managers
Star Schema provides better performance and smaller query times
Star schema is easily extensible and will handle future changes easily

12.What is a snowflake schema?

Snowflake-Schema-Design: this looks like a snow flake, hence this type of schema is called Snowflake Schema.  General rule of thumb is keep away from snow flake schemas as even though they may save you some space, they will cost a lot in terms of query times. It is highly-normalized

 

 

13.What is the major difference between star and snowflake schema and on what basis the preference is made between the two?

The major difference between the two is a star schema is highly de-normalized whereas snowflake schema is partially normalized.
Snow flake schema is preferred when a data warehouse is most of the time used as a source for one more rather high end data warehouse whereas star schema is preferred when a data warehouse is used for direct analysis.

14.What is metadata
Meta data is to warehouse what a card catalogue is to traditional library. We can say it is Data about data means data that describes data and other structures, such as objects, business rules, and processes. For example, the schema design of a data warehouse is typically stored in a repository as metadata, which is used to generate scripts used to build and populate the data warehouse. A repository contains metadata.
15.What are various types of of metadata

There are three main types of metadata:Descriptive metadata describes a resource for purposes such as discovery and identification. It can include elements such as title, abstract, author, and keywords.
Structural metadata indicates how compound objects are put together, for example, how pages are ordered to form chapters.Administrative metadata provides information to help manage a resource, such as when and how it was created, file type and other technical information, and who can access it.

There are several subsets of administrative data; two that are sometimes listed as separate metadata types are:
Rights management metadata, which deals with intellectual property rights,
and
Preservation metadata, which contains information needed to archive and preserve a resource.

16.What is the difference between ODS and OLTP
ODS is an Operational Data System, which contains the snapshot of currently loaded data for clerical day to day analysis and includes a little history, 30-60 days of historical data in addition to the current data. It is nothing but a collection of tables created in the Data warehouse that maintains only current data. Having data with Data warehouse that will be stand alone. No further transaction will take place for current data which is part of the data ware house. Current data will be change once you upload through ETL on schedule basis. 
OLTP is Online Transaction Processing System, which contains only the current data and keeps changing, not suitable for analysis due to temporal inconsistency unlike ODS which contains consistent data. It maintains the data only for transactions; these are designed for recording daily operations and transactions of a business. Having data with on line system which connected to network and all update on transaction happened in seconds. Every second data summarized value will get changed.

17.What is snapshot?

You can disconnect the report from the catalog to which it is attached by saving the report with a snapshot of the data. However, you must reconnect to the catalog if you want to refresh the data.
It’s a permanent local copy of the data in a report which can be used to create the reports. Snapshot is static data source, it is permanent local copy or picture of a report, and it is suitable for disconnected networks. We can't add any columns to snapshot. We can sort, grouping and aggregations.it is mainly use for analyze the historical data

18.What is the difference between datawarehouse and BI?
DATAWAREHOUSE: To make Business Analysis effective and efficient we require specialized form of storage. This special form of storage of data is called Data Warehouse and the process Data Warehousing. Data warehouse is integrated, time-variant, subject oriented and non-volatile collection data in support of management decision making process. Data warehouse is a centralized database. Extract data from desperate systems; transform the data into one single format and load this transformed data into target/data warehouse.
BUSINESS INTELLIGENCE: Business Intelligence is the process of extracting the data, converting it into information and then into knowledge base is known as Business Intelligence BI is the mechanism of using data according to type of industry for predictive analysis, fault findings, process improvement etc. business intelligence is a group application which is used to gather information, providing access to, analyze the information and proving this information to end user to take timely decisions.
19.What is the difference between OLAP and datawarehouse

Don't get confused; OLAP and Data warehouse are the terms that can be used interchangeably in some places. Online Analytical Processing is a concept whereas Data warehouse is the collection of data in denormalized form so that it can help in fast and frequent analysis.

DW:
  • Data from different source systems is stored in a relational database for end use analysis.
  • Data is organized in summarized, aggregated, subject-oriented, non-volatile pattern.
  • Data in a DW is consolidated, flexible collection of data, which supports analysis of data, but not online analysis of data.
OLAP:
  • A tool to evaluate & analyze data in a DW using analytical queries.
  • A tool which helps organizes data in a DW using multidimensional models.
  • Supports analysis in real time & online analysis with speed and flexibility.

20.What do you mean by data granularity?

Granularity or Grain of Fact Table
The level of detail of the fact table is known as the grain of the fact table. In information management, granularity refers to the degree of detail or precision contained in data. In another example the grain of your sales fact table is monthly sales number per location per product.


21.What is Multidimensional database

Multidimensional databases consolidate and calculate data to provide different views. With Multidimensional databases users can pivot the data to see the information from different view point. It contains Outlines that defines all the elements of the database.

22.What is active data warehousing?
Active Data ware house means. Every user can access the database any time 24/7 that is called Active dwh
An active data warehouse provides information that enables decision-makers within an organization to manage customer relationships nimbly, efficiently and proactively. Active data warehousing is all about integrating advanced decision support with day-to-day-even minute-to-minute-decision making in a way that increases quality of those customer touches which encourages customer loyalty and thus secure an organization's bottom line.

23.What do you mean by data marts?

Data Mart is the subset of data warehouse. You can also consider datamart holds the data of one subject area. For an example, you consider an organization that has HR, Finance, Communications and Corporate Service divisions. For each division you can create a datamart. The historical data will be stored into datamarts first and then exported to data warehouse finally.

24.Is OLAP databases are called decision support system ?

Yes, OLAP databases are called decision support systems because it helps in analyzing the historical and present data of the organizations and thereby helping in taking intelligent business decisions

25.What are additive/non-additive/semi-additive facts?
There are basically three types of Facts:
1) Additive Facts
2) Non- Additive Facts
3) Semi-Additive facts
Additive facts -Additive facts are those facts which we can aggregate or we can add. For e.g. 2+7=9.Dollar value is additive fact. If we want to find out the amount for a particular place for a particular period of time, we can add the dollar amounts and come up with the total amount.

Non-additive facts-A fact may be measure, metric or a dollar value. Measure and metric are non-additive facts. Non-additive Fact tables in which data in columns data cannot be aggregated or calculations can’t be possible for producing the expected results or known as non-additive facts. Examples are Ratios, Averages & Variance facts
A non-additive fact, for e.g. measure height(s) for 'citizens by geographical location' , when we rollup 'city' data to 'state' level data we should not add heights of the citizens rather we may want to use it to derive 'count'

Semi-additive facts are those facts which we can aggregate according to the context. If you want to sum up a measure for some dimensions but not for others then it became your semi-additive. And while creating a cube you take care of this thing. Like "balances".

26.What is fact less fact table? Where you have used it in your project?

Fact less table means only the key available in the Fact there is no measures available Used only to put relation between the elements of various dimensions. It contains nothing but dimensional keys. It is used to support negative analysis report. For example a Store that did not sell a product for a given period.

27.Explain in detail about
type 1,
type 2(SCD),
type 3 ?

Type1: Rows containing changes to existing dimensions are updated in the target by overwriting the existing dimension. In the Type 1 Dimension mapping, all rows contain current dimension data. Use the Type 1 Dimension mapping to update a slowly changing dimension table when you do not need to keep any previous versions of dimensions in the table.

Type 2: The Type 2 Dimension Data mapping inserts both new and changed dimensions into the target. Changes are tracked in the target table by versioning the primary key and creating a version number for each dimension in the table. Use the Type 2 Dimension/Version Data mapping to update a slowly changing dimension table when you want to keep a full history of dimension data in the table. Version numbers and versioned primary keys track the order of changes to each dimension.

Type 3: The Type 3 Dimension mapping filters source rows based on user-defined comparisons and inserts only those found to be new dimensions to the target. Rows containing changes to existing dimensions are updated in the target.

28.What is ER-Modeling and Dimensional modeling?
E-R modeling is the relation between entities in the form of normalization whereas dimensional modeling is the relation between dimensions in the form of de-normalization. So E-R modeling is entity relation between two tables.
Dimension modeling is contain three table Aggregate, Fact table, Dimension table

ER modeling:
  - focused how data will be efficient for processing (insert, update, delete)
  - Minimalize (limit to zero) data redundancies

Dimensional modeling:
  - focused how data will be efficient for retrieving
    (example, by report and analysis tools).
  - Many data redundancies
  - Consist of Fact and Dimension table

29.How many types of facts and what are they?

There are 6 types of facts theses are as follows: Fact less Facts:Facts without any measures.Additive Facts:Fact data that can be additive/aggregative.Non-Additive facts: Facts that are result of non-additionSemi-Additive Facts: Only few columns data can be added.Periodic Facts: That stores only one row per transaction that happened over a period of time.Accumulating Fact: stores row for entire lifetime of event.

30.What is the main difference between Inmon and Kimball philosophies of data warehousing?
Both differed in the concept of building the data warehouse...i.e.
Kimball--First DataMarts--Combined way ---Data warehouse
Inmon---First Data warehouse--Later----Datamarts
Kimball views data warehousing as a constituency of data marts. Data marts are focused on delivering business objectives for departments in the organization. And the data warehouse is a conformed dimension of the data marts. Hence a unified view of the enterprise can be obtained from the dimension modeling on a local departmental level. He follows bottom-up approach i.e., first creates individual Data Marts from the existing sources and then create Data Warehouse.
Inmon believes in creating a data warehouse on a subject-by-subject area basis. Hence the development of the data warehouse can start with data from the online store. Other subject areas can be added to the data warehouse as their needs arise. Point-of-sale (POS) data can be added later if management decides it is necessary. He follows top-down approach i.e., first creates Data Warehouse from the existing sources and then create individual Data Marts.  
31.What is materialized view
A pre-computed table comprising aggregated or joined data from fact and possibly dimension tables is known as materialized view. It is also known as a summary or aggregate table.
32.What is normalization
In a relational database, the process of removing redundancy in data by separating the data into multiple tables.
33.What do you mean by parallelism?
Breaking down of a task so that several processes do part of the work. When multiple CPUs each do their portion simultaneously, very large performance gains are possible.
34. What is a Staging Area?
We need to clean and process our operational data before putting it into the warehouse. These operations are done at a special dedicated area wherein all the data from all the sources are first dumped off; this area is known as staging area.
35. Guys At the end we must understand and will be able to draw the architecture of a warehouse as needed. So just go through it…

Data Warehouse Architecture (Basic)

Figure below shows a simple architecture for a data warehouse. End users directly access data derived from several source systems through the data warehouse.
Here in this figure the metadata and raw data and summary data of a traditional OLTP system is present. Summaries are very valuable in data warehouses because they pre-compute long operations in advance. For example, a typical data warehouse query is to retrieve something like August sales.

Architecture of a Data Warehouse (Basic)



Data Warehouse Architecture (with a Staging Area)

In this figure staging area is included as we know we need to clean and process our operational data before putting it into the warehouse. These operations are done at a special dedicated area wherein all the data from all the sources are first dumped off, well known as staging area. A staging area simplifies building summaries and general warehouse management.

Figure1. Architecture of a Data Warehouse with a Staging Area

Data Warehouse Architecture (with a Staging Area and Data Marts)

You may want to customize your warehouse's architecture for different groups within your organization. You can do this by adding data marts, which are systems designed for a particular line of business. Figure below illustrates an example where purchasing, sales, and inventories are separated. In this example, a financial analyst might want to analyze historical data for purchases and sales.

Figure2. Architecture of a Data Warehouse with a Staging Area and Data Marts


No comments:

Post a Comment