Data Warehouse FAQ’s
What is a Data Warehouse?
A Data Warehouse is the "corporate memory". Academics will say it is a subject oriented, point-in-time, inquiry only collection of operational data.
Typical relational databases are designed for on-line transactional processing (OLTP) and do not meet the requirements for effective on-line analytical processing (OLAP). As a result, data warehouses are designed differently than traditional relational databases. What is ETL/ How does Oracle support the ETL process?
ETL is the Data Warehouse acquisition processes of Extracting, Transforming (or Transporting) and Loading (ETL) data from source systems into the data warehouse.
Oracle supports the ETL process with their "Oracle Warehouse Builder" product. Many new features in the Oracle9i database will also make ETL processing easier. For example: - New MERGE command (also called UPSERT, Insert and update information in one step);
- External Tables allows users to run SELECT statements on external data files (with pipelining support).
What is the difference between a data warehouse and a data mart?
This is a heavily debated issue. There are inherent similarities between the basic constructs used to design a data warehouse and a data mart. In general a Data Warehouse is used on an enterprise level, while Data Marts is used on a business division/department level. A data mart only contains the required subject specific data for local analysis.
What is the difference between a W/H and an OLTP application?
Typical relational databases are designed for on-line transactional processing (OLTP) and do not meet the requirements for effective on-line analytical processing (OLAP). As a result, data warehouses are designed differently than traditional relational databases.
Warehouses are Time Referenced, Subject-Oriented, Non-volatile (read only) and Integrated. OLTP databases are designed to maintain atomicity, consistency and integrity (the "ACID" tests). Since a data warehouse is not updated, these constraints are relaxed.
What is the difference between OLAP, ROLAP, MOLAP and HOLAP?
ROLAP, MOLAP and HOLAP are specialized OLAP (Online Analytical Analysis) applications.
ROLAP stands for Relational OLAP. Users see their data organized in cubes with dimensions, but the data is really stored in a Relational Database (RDBMS) like Oracle. The RDBMS will store data at a fine grain level, response times are usually slow. MOLAP stands for Multidimensional OLAP. Users see their data organized in cubes with dimensions, but the data is store in a Multi-dimensional database (MDBMS) like Oracle Express Server. In a MOLAP system lot of queries have a finite answer and performance is usually critical and fast.
HOLAP stands for Hybrid OLAP, it is a combination of both worlds. Seagate Software's Holos is an example HOLAP environment. In a HOLAP system one will find queries on aggregated data as well as on detailed data.
What is the difference between an ODS and a W/H?
An ODS (Operational Data Store) is an integrated database of operational data. Its sources include legacy systems and it contains current or near term data. An ODS may contain 30 to 90 days of information.
A warehouse typically contains years of data (Time Referenced). Data warehouses group data by subject rather than by activity (subject-oriented). Other properties are: Non-volatile (read only) and Integrated. What Oracle tools can be used to design and build a W/H?
Data Warehouse Builder (or Oracle Data Mart builder), Oracle Designer, Oracle Express, Express Objects, etc.
When should one use an MD-database (multi-dimensional database) and not a relational one?
Data in a multi-dimensional database is stored as business people views it, allowing them to slice and dice the data to answer business questions. When designed correctly, an OLAP database will provide must faster response times for analytical queries.
Normal relational databases store data in two-dimensional tables and analytical queries against them are normally very slow. What is a star schema? Why does one design this way?
A single "fact table" containing a compound primary key, with one segment for each "dimension," and additional columns of additive, numeric facts.
Why? It allows for the highest level of flexibility of metadata
Low maintenance as the data warehouse matures
Best possible performance
When should you use a STAR and when a SNOW-FLAKE schema?
The star schema is the simplest data warehouse schema. Snow flake schema is similar to the star schema. It normalizes dimension table to save data storage space. It can be used to represent hierarchies of information.
What is the difference between Oracle Express and Oracle Discoverer?
Express is an MD database and development environment. Discoverer is an ad-hoc end-user query tool.
How can Oracle Materialized Views be used to speed up data warehouse queries?
With "Query Rewrite" (QUERY_REWRITE_ENABLED=TRUE in INIT.ORA) Oracle can direct queries to use pre-aggregated tables instead of scanning large tables to answer complex queries.
Materialized views in a W/H environments is typically referred to as summaries, because they store summarized data. What Oracle features can be used to optimize my Warehouse system?
The following Oracle features can be used to compliment your Warehouse system/database:
- From Oracle8i One can transport tablespaces between Oracle databases. Using this feature one can easily "detach" a tablespace for archiving purposes. One can also use this feature to quickly move data from an OLTP database to a Warehouse database.
- Data partitioning allows one to split big tables into smaller more manageable sub-tables (partitions). Data is automatically directed to the correct partition based on data ranges or hash values.
- Oracle Materialized Views can be used to pre-aggregate data. The Query Optimizer can direct queries to summary/ roll-up tables instead of the detail data tables (query rewrite). This will dramatically speed-up warehouse queries and saves valuable machine resources.
- Oracle Parallel Query can be used to speed up data retrieval by using multiple processes (and CPUs) to process a single task.
A: This is widely used approach for physically modeling a data warehouse. On an Entity Relationship Diagram (ERD) a Star Schema has a single fact table in the center surrounded by multiple dimension reference tables. Together they form the shape of a multi-pointed star.
2. Q: What is a Snowflake Schema?
A: A lot like a Star Schema a Snowflake Entity Relationship Diagram keeps the fact table in the middle but normalizes the dimensional reference tables. An example would be to break down the Time dimension and create tables for each level; years, quarters, months; weeks, days… These additional branches on the ERD create more of a Snowflake shape then Star.
3. Q: Which is better, Star or Snowflake?
A: Strict data warehousing rules would have you use a Star schema but in reality most designs tend to become Snowflakes. They each have their pros and cons but both are far better then trying to use a transactional system third-normal form design.
4. Q: Why can’t I use a copy of my transactional system for my data warehouse?
A: This is one of the absolute worst things you can do. A lot of people initially go down this road because a tool vendor will support the idea when making their sales pitch. Many of these attempts will even experience success for a short period of time. It’s not until your data sets grow and your business questions begin to be complex that this design mistake will really come out to bite you.
5. Q: What is OLAP?
A: OLAP or Online Analytical Processing is a special breed of software tools designed to report and analyze your data.
6. Q: What is a cube?
A: Cubes are proprietary data file structures used to hold the data which is been analyzed. With tools such as Cognos the cubes reside on a special cube server. Business Objects actually builds a local cube structure on the client machine when a report is run.
7. Q Why do some OLAP tools (Cognos, Business Objects) use cubes?
A: The primary advantages offered by cubes are;
- Cubes make it a lot easier to use a bad data model. Problems and challenges in your model can be essentially ‘hidden’ and corrected within the code that generates the cube and since tools like Cognos build their cubes on server typically off-business hours the performance issues these data models present can be also be hidden from your users.
- Server-side cubes provide extremely good report performance. You’ve essentially built all the various report permutations a user might ask for and little to know processing has to take place when a request for one is initiated.
The disadvantages of a cube are;
- Server-side cubes (Cognos) add an additional layer of complexity, administration, maintenance and point of failure for your Business Intelligence system. And because your essentially creating one more redundant place your data must be stored the size of your data-sets is limited.
- Client-side cubes (Business Objects) severely limit the size of data that can be accessed and analyzed in a single report. For each report a developer must identify the ‘scope of analysis’ and include the entire data-set within the client-side cube. This means if a user wants to run a quarterly report and then possibly drill-down into the weeks or even days of detail, an entire quarter of daily level data must be replicated down onto the clients machine into a cube file.
- Allows for bad data model design. This is really as much of a disadvantage as it is an advantage. Bad data modeling will eventually catch-up to you and create performance problems, analysis limitations and ultimately failure.
A: MicroStrategy’s approach called ROLAP (relational online analytical processing) skips the entire cube layer and generates highly optimized and database specific SQL for each user request. This means if a user asks for Q4 2001 sales data the tool will create the SQL and return those requested results and nothing more. If the user then decides to drill-down and investigate a specific week or day of interesting activity the tool will generate the appropriate optimized SQL to satisfy the new request. There is no need to create and maintain an additional redundant layer of cube data, nor is there any requirement to pre-determine the reports full ‘scope-of-analysis’ and fetch a large amount of useless detail data.
No comments:
Post a Comment