Here
are a few basic questions we must understand and learn in order to
Maintain a descent level understanding regarding basic data
warehousing concepts.
1.Define
data warehouse
Data
warehouse is a relational database that is designed for query and
analysis rather than for transaction processing.It is termed as
corporate memory, though it usually contains historical data that is
derived from transaction data, but can include data from other
sources as well. It separates analysis workload from transaction
workload and enables a business to consolidate data from several
sources.
2.What
are Different types of data warehouses?
According
To William Inmon There Are 4 Types Of Data Warehouse
Subject
Oriented:
Data warehouse are designed to help you analyze data. For example, to
learn more about your company sales data, you can build a warehouse
that concentrates on sales. Using this warehouse, you can answer the
questions like “who was our best customer for this item last year?”
This ability to define data warehouse by subject matter, sales in
this case makes the data warehouse subject oriented.
Integrated:
Integration is closely related to subject orientation. Data warehouse
must put data from disparate sources into a consistent format. They
must resolve such problems as naming conflicts and inconsistencies
among the units of measures. When they achieve this they are said to
be integrated.
Nonvolatile
: Nonvolatile
means that once entered into the warehouse, data should not be
change. This is logical because the purpose of a warehouse is to
enable you to analyze what has occurred and whatever once happened
never changes.
Time
variant:
In order to discover trends in business, analysts need large amounts
of data. This is very much in contrast to online transaction
processing (OLTP) systems, where performance requirements demands
that historical data be moved to an archive. A data warehouse’s
focus on change over time is what is meant by the term time variant.
3.Define
Fact-table
Fact
Table: Fact
table contains the measurements or metrics or facts of business
processes. If your business process is Sales, then a measurement
of this business process such as "monthly sales number"
is captured in the fact table. In addition to the measurements, the
only other things a fact table contains are foreign keys for the
dimension tables.
Fact
table contains the actual business process measurements or metrics
called facts. Usually these facts are numeric.
These
facts are generally Additive.
Sometimes
the facts are semi additive such as balances
Sometimes
they are non-additive such as unit price
4.Define
Dimension-table
Dimension
Table
Dimension
table contains the textual description of a business. It means
context of the measurements are represented in dimension tables. You
can also think of the context of a measurement as the characteristics
such as who, what, where, when, how of a measurement (subject).
In your business process Sales, the characteristics of the 'monthly
sales number' measurement can be a Location (Where), Time (When),
Product Sold (What).
5.What
do you mean by Detail in a table?
Details
are the dimension beyond which an analyst is not interested to
analyze. A detail is always a dimension. If we drill down region
dimension table as region-sub region-country-state-city in that case
city is detail being the last dimension.
6.What
is Measures?
Special
dimension that contains measurable quantities upon which the business
is measured. They are always numeric quantities. They are the ones on
which we perform aggregations. It includes cost, profit, or taxes.
7.What
is a critical column?
The
huge collection of data within a warehouse is useful for analysis
only if the data in it is consistent. However there exists some
information which when updated in a warehouse makes the data in a
warehouse inconsistent. The column that contains such vital
information is identified as critical columns.
9.What
is Surrogate-key?
A
surrogate key to a data warehouse is what a primary key is for an
OLTP source. As a possible solution for a critical column we use
surrogate keys by inserting a new record as and when a critical
column gets updated at an OLTP source. However it violates the
primary key constraint that might have been specified for the column
Customer ID. Hence we no longer maintain the Customer ID as a primary
key but we have one more new column acting as the so called surrogate
keys which typically contains a value generated from a sequence.
So
whenever a customer changes city, we insert a new record into our
warehouse with the same Customer ID but with a different surrogate
key. Customer’s latest city can be identified by the largest
surrogate key value for that customer.
8.Define
Primary-key
Each
row in a dimension table is identified by a unique value of an
attribute designated as the primary key of the dimension. In a
product dimension table, the primary key identifies each product
uniquely.
10.Define
Foreign-key
Each
dimension table is in a one-to-many relationship with the central
fact table. So the
Primary
key of each dimension table must be a foreign key in the fact table.
If there are
Four
dimension tables of product, date, customer, and sales
representative, then the primary key of each of these four tables
must be present in the orders fact table as foreign keys.
11.What
Is A Star Schema?
Star
schema is highly-denormalised. As it looks like a STAR. Hence it is
also known as Star Schema.
Advantages of Star Schema
Star
Schema is very easy to understand, even for non-technical business
managers
Star
Schema provides better performance and smaller query times
Star
schema is easily extensible and will handle future changes easily
12.What
is a snowflake schema?
Snowflake-Schema-Design:
this looks like a snow flake, hence this type of schema is called
Snowflake Schema. General rule of thumb is keep away from snow
flake schemas as even though they may save you some space, they will
cost a lot in terms of query times. It is highly-normalized
13.What
is the major difference between star and snowflake schema and on what
basis the preference is made between the two?
The
major difference between the two is a star schema is highly
de-normalized whereas snowflake schema is partially normalized.
Snow
flake schema is preferred when a data warehouse is most of the time
used as a source for one more rather high end data warehouse whereas
star schema is preferred when a data warehouse is used for direct
analysis.
14.What
is metadata
Meta
data is to warehouse what a card catalogue is to traditional library.
We can say it is Data about data means data that describes data and
other structures, such as objects, business rules, and processes. For
example, the schema design of a data warehouse is typically stored in
a repository as metadata, which is used to generate scripts used to
build and populate the data warehouse. A repository contains
metadata.
15.What
are various types of of metadata
There
are three main types of metadata:•
Descriptive
metadata
describes a resource for purposes such as discovery and
identification. It can include elements such as title, abstract,
author, and keywords.
• Structural
metadata
indicates how compound objects are put together, for example, how
pages are ordered to form chapters.•
Administrative
metadata
provides information to help manage a resource, such as when and how
it was created, file type and other technical information, and who
can access it.
There
are several subsets of administrative data; two that are sometimes
listed as separate metadata types are:
− Rights
management metadata,
which deals with intellectual property rights,
and
− Preservation metadata, which contains information needed to archive and preserve a resource.
and
− Preservation metadata, which contains information needed to archive and preserve a resource.
16.What
is the difference between ODS and OLTP
ODS
is an Operational Data System, which contains the snapshot of
currently loaded data for clerical day to day analysis and includes a
little history, 30-60 days of historical data in addition to the
current data. It is nothing but a collection of tables created in the
Data warehouse that maintains only current data. Having data with
Data warehouse that will be stand alone. No further transaction
will take place for current data which is part of the data ware
house. Current data will be change once you upload through ETL on
schedule basis.
OLTP
is Online Transaction Processing System, which contains only the
current data and keeps changing, not suitable for analysis due to
temporal inconsistency unlike ODS which contains consistent data. It
maintains the data only for transactions; these are designed for
recording daily operations and transactions of a business. Having
data with on line system which connected to network and all update on
transaction happened in seconds. Every second data summarized value
will get changed.
17.What
is snapshot?
You
can disconnect the report from the catalog to which it is attached by
saving the
report
with a snapshot of the data. However, you must reconnect to the
catalog if you want to refresh the data.
It’s
a permanent local copy of the data in a report which can be used to
create the reports. Snapshot is static data source, it is permanent
local copy or picture of a report,
and it is suitable for disconnected networks. We can't add any
columns to snapshot. We can sort, grouping and aggregations.it is
mainly use for analyze the historical data
18.What
is the difference between datawarehouse and BI?
DATAWAREHOUSE:
To make Business Analysis effective and efficient we require
specialized form of storage. This special form of storage of data is
called Data Warehouse and the process Data Warehousing. Data
warehouse is integrated, time-variant, subject oriented and
non-volatile collection data in support of management decision making
process. Data
warehouse is a centralized database. Extract data from desperate
systems; transform the data into one single format and load this
transformed data into target/data warehouse.
BUSINESS
INTELLIGENCE:
Business Intelligence is the process of extracting the data,
converting it into information and then into knowledge base is known
as Business Intelligence
BI is the mechanism of using data according to type of industry for
predictive analysis, fault findings, process improvement etc.
business intelligence is a group application which is used to gather
information, providing access to, analyze the information and proving
this information to end user to take timely decisions.
19.What
is the difference between OLAP and datawarehouse
Don't
get confused; OLAP and Data warehouse are the terms that can be used
interchangeably in some places. Online Analytical Processing is a
concept whereas Data warehouse is the collection of data in
denormalized form so that it can help in fast and frequent analysis.
DW:
-
Data from different source systems is stored in a relational database for end use analysis.
-
Data is organized in summarized, aggregated, subject-oriented, non-volatile pattern.
-
Data in a DW is consolidated, flexible collection of data, which supports analysis of data, but not online analysis of data.
OLAP:
-
A tool to evaluate & analyze data in a DW using analytical queries.
-
A tool which helps organizes data in a DW using multidimensional models.
-
Supports analysis in real time & online analysis with speed and flexibility.
20.What
do you mean by data granularity?
Granularity
or Grain of Fact Table
The
level of detail of the fact table is known as the grain of the fact
table. In information management, granularity refers to the degree of
detail or precision contained in data. In another example the grain
of your sales fact table is monthly sales number per location
per product.
21.What
is Multidimensional database
Multidimensional
databases consolidate and calculate data to provide different views.
With Multidimensional databases users can pivot the data to see the
information from different view point. It contains Outlines that
defines all the elements of the database.
22.What
is active data warehousing?
Active
Data ware house means. Every user can access the database any time
24/7 that is called Active dwh
An
active
data warehouse
provides information that enables decision-makers within an
organization to manage customer relationships nimbly, efficiently and
proactively. Active data warehousing is all about integrating
advanced decision support with day-to-day-even
minute-to-minute-decision making in a way that increases quality of
those customer touches which encourages customer loyalty and thus
secure an organization's bottom line.
23.What
do you mean by data marts?
Data
Mart is the subset of data warehouse. You can also consider datamart
holds the data of one subject area. For an example, you consider an
organization that has HR, Finance, Communications and Corporate
Service divisions. For each division you can create a datamart. The
historical data will be stored into datamarts first and then exported
to data warehouse finally.
24.Is
OLAP databases are called decision support system ?
Yes,
OLAP databases are called decision support systems because it helps
in analyzing the historical and present data of the organizations and
thereby helping in taking intelligent business decisions
25.What
are additive/non-additive/semi-additive facts?
There
are basically three types of Facts:
1) Additive Facts
2) Non- Additive Facts
3) Semi-Additive facts
Additive
facts
-Additive facts are those facts which we can aggregate or we can add.
For e.g. 2+7=9.Dollar value is additive fact. If we want to find out
the amount for a particular place for a particular period of time, we
can add the dollar amounts and come up with the total amount.1) Additive Facts
2) Non- Additive Facts
3) Semi-Additive facts
Non-additive facts-A fact may be measure, metric or a dollar value. Measure and metric are non-additive facts. Non-additive Fact tables in which data in columns data cannot be aggregated or calculations can’t be possible for producing the expected results or known as non-additive facts. Examples are Ratios, Averages & Variance facts
A non-additive fact, for e.g. measure height(s) for 'citizens by geographical location' , when we rollup 'city' data to 'state' level data we should not add heights of the citizens rather we may want to use it to derive 'count'
Semi-additive facts are those facts which we can aggregate according to the context. If you want to sum up a measure for some dimensions but not for others then it became your semi-additive. And while creating a cube you take care of this thing. Like "balances".
26.What
is fact less fact table? Where you have used it in your project?
Fact
less table means only the key available in the Fact there is no
measures available Used only to put relation between the elements of
various dimensions. It contains nothing but dimensional keys. It is
used to support negative analysis report. For example a Store that
did not sell a product for a given period.
27.Explain
in detail about
type 1,
type 2(SCD),
type 3 ?
type 1,
type 2(SCD),
type 3 ?
Type1:
Rows containing changes to existing dimensions are updated in the
target by overwriting the existing dimension. In the Type 1 Dimension
mapping, all rows contain current dimension data. Use the Type 1
Dimension mapping to update a slowly changing dimension table when
you do not need to keep any previous versions of dimensions in the
table.
Type
2:
The Type 2 Dimension Data mapping inserts both new and changed
dimensions into the target. Changes are tracked in the target table
by versioning the primary key and creating a version number for each
dimension in the table. Use the Type 2 Dimension/Version Data mapping
to update a slowly changing dimension table when you want to keep a
full history of dimension data in the table. Version numbers and
versioned primary keys track the order of changes to each dimension.
Type
3:
The Type 3 Dimension mapping filters source rows based on
user-defined comparisons and inserts only those found to be new
dimensions to the target. Rows containing changes to existing
dimensions are updated in the target.
28.What
is ER-Modeling and Dimensional modeling?
E-R
modeling is the relation between entities in the form of
normalization whereas dimensional modeling is the relation between
dimensions in the form of de-normalization. So E-R
modeling is entity relation between two tables.
Dimension modeling is contain three table Aggregate, Fact table, Dimension table
Dimension modeling is contain three table Aggregate, Fact table, Dimension table
ER
modeling:
- focused how data will be efficient for processing (insert, update, delete)
- Minimalize (limit to zero) data redundancies
- focused how data will be efficient for processing (insert, update, delete)
- Minimalize (limit to zero) data redundancies
Dimensional
modeling:
- focused how data will be efficient for retrieving
(example, by report and analysis tools).
- Many data redundancies
- Consist of Fact and Dimension table
- focused how data will be efficient for retrieving
(example, by report and analysis tools).
- Many data redundancies
- Consist of Fact and Dimension table
29.How
many types of facts and what are they?
There
are 6 types of facts theses are as follows: Fact
less Facts:Facts
without any measures.Additive
Facts:Fact
data that can be additive/aggregative.Non-Additive
facts:
Facts that are result of non-additionSemi-Additive
Facts:
Only few columns data can be added.Periodic
Facts:
That stores only one row per transaction that happened over a
period of time.Accumulating
Fact:
stores row for entire lifetime of event.
30.What
is the main difference between Inmon and Kimball philosophies of data
warehousing?
Both
differed in the concept of building the data warehouse...i.e.
Kimball--First
DataMarts--Combined way ---Data warehouse
Inmon---First
Data warehouse--Later----Datamarts
Kimball
views data warehousing as a constituency of data marts. Data marts
are focused on delivering business objectives for departments in the
organization. And the data warehouse is a conformed dimension of the
data marts. Hence a unified view of the enterprise can be obtained
from the dimension modeling on a local departmental level. He follows
bottom-up approach i.e., first creates individual Data Marts from the
existing sources and then create Data Warehouse.
Inmon
believes in creating a data warehouse on a subject-by-subject area
basis. Hence the development of the data warehouse can start with
data from the online store. Other subject areas can be added to the
data warehouse as their needs arise. Point-of-sale (POS) data can be
added later if management decides it is necessary.
He follows top-down approach i.e., first creates Data
Warehouse from the existing sources and then create individual
Data Marts.
31.What
is materialized view
A
pre-computed table comprising aggregated or joined data from fact and
possibly dimension tables is known as materialized view. It is also
known as a summary or aggregate table.
In
a relational database, the process of removing redundancy in data by
separating the data into multiple tables.
33.What
do you mean by parallelism?
Breaking
down of a task so that several processes do part of the work. When
multiple CPUs each do their portion simultaneously, very large
performance gains are possible.
34.
What is a Staging Area?
We
need to clean and process our operational data before putting it into
the warehouse. These operations are done at a special dedicated area
wherein all the data from all the sources are first dumped off; this
area is known as staging area.
35.
Guys At the end we must understand and will be able to draw the
architecture of a warehouse as needed. So just go through it…
Data Warehouse Architecture (Basic)
Figure
below shows a simple architecture for a data warehouse. End users
directly access data derived from several source systems through the
data warehouse.
Here
in this figure the metadata and raw data and summary data of a
traditional OLTP system is present. Summaries are very valuable in
data warehouses because they pre-compute long operations in advance.
For example, a typical data warehouse query is to retrieve something
like August sales.
Architecture of a Data Warehouse (Basic)
Data Warehouse Architecture (with a Staging Area)
In
this figure staging area is included as we know we need to clean and
process our operational data before putting it into the warehouse.
These operations are done at a special dedicated area wherein all the
data from all the sources are first dumped off, well known as staging
area. A staging area simplifies building summaries and general
warehouse management.
Figure1. Architecture of a Data Warehouse with a Staging Area
Data Warehouse Architecture (with a Staging Area and Data Marts)
You
may want to customize your warehouse's architecture for different
groups within your organization. You can do this by adding data
marts,
which are systems designed for a particular line of business. Figure
below illustrates an example where purchasing, sales, and inventories
are separated. In this example, a financial analyst might want to
analyze historical data for purchases and sales.
No comments:
Post a Comment