D A T A W A R E H O U S I N G D E S I G N T R A N S P A R E N C I E S
Objectives
2
Issues associated with designing a data
warehouse.
Technique for designing the database
component of a data warehouse called dimensionality modeling.
How a dimensional model (DM) differs from an
ER model.
A step-by-step methodology for designing a
data warehouse.
Criteria for assessing degree of dimensionality
Designing Data Warehouses
3
To begin a data warehouse project, need to find
answers for questions such as:
Kebutuhan user mana yang sangat mendesak
Data apa saja yang pertama kali harus disediakan Projek yang mana yang seharusnya bisa diperkecil
dan lebih mudah dikelola
Infrastruktur mana yang dapat digunakan untuk
awal yag nantinya dapat digunakan untuk dataware house sekala besar
Designing Data Warehouses
4
Banyak perusahaan menghindari kompleksitas
memulainya dengna satu atau beberapa data mart
Data mart memperbolehkan didesain lebih
simple dan dapat digunakan untuk user yang spesifik
Designing Data Warehouses
5
Hanya sedikit desainer yang bersedia
berkomitmen untuk desain enterprise-wide yang harus memenuhi semua kebutuhan pengguna
pada satu waktu.
Meskipun solusi sementara membangun data
mart, tujuan akhir tetap sama yaitu, penciptaan akhir dari sebuah gudang data sesuai dengan kebutuhan perusahaan.
Designing Data Warehouses
6
Pengumpulan persyaratan dan tahap analisis
dari proyek data warehouse melibatkan
stakeholder perusahaan dengan mewawancarai staf (seperti pengguna pemasaran, keuangan pengguna, dan pengguna penjualan) untuk
memungkinkan identifikasi sekumpulan prioritas persyaratan yang harus dipenuhi oleh
Designing Data Warehouses
7
Pada saat yang sama, wawancara dilakukan
dengan anggota staf yang bertanggung jawab
untuk sistem operasional, untuk mengidentifikasi sumber data yang bersih, data yang valid, dan
konsisten yang akan tetap dapat digunakan selama beberapa tahun ke depan.
Designing Data Warehouses
8
Wawancara akan memberikan informasi yang
diperlukan untuk bentuk top-down (kebutuhan pengguna) dan pandangan bottom-up (sumber data yang tersedia) dari data warehouse.
Komponen database dari data warehouse
dijelaskan menggunakan teknik pemodelan dimensi.
Dimensionality modeling
9
Teknik desain logis bertujuan untuk menyajikan
data dalam format standar, bentuk intuitif yang memungkinkan untuk mengakses dengan kinerja yang cukup bagus
Menggunakan konsep pemodelan ER dengan
beberapa pembatasan
Setiap model dimensi (DM) terdiri dari sebua table
dengan kunci komposit primer, yang disebut tabel fakta, dan sebuah table yang disebut tabel dimensi.
Dimensionality modeling
10
Setiap tabel dimensi memiliki kunci
(non-komposit) primer sederhana yang sesuai dengan salah satu komponen kunci komposit dalam tabel fakta.
Dimensionality modeling
11
Semua kunci natural diganti dengan kunci
pengganti (surrogate). Berarti setiap penghubung antara fakta dan tabel dimensi didasarkan pada kunci pengganti, bukan kunci natural.
Kunci pengganti memungkinkan data di
warehouse memiliki beberapa kebebasan dari data yang digunakan pada sistem OLTP.
Star schema for property sales of
Dimensionality modeling
13
Star schema is a logical structure that
has a fact table containing factual data in the center, surrounded by dimension
tables containing reference data, which can be denormalized.
Facts are generated by events that occurred in
the past, and are unlikely to change, regardless of how they are analyzed.
Dimensionality modeling
14
Bulk of data in data warehouse is in fact tables,
which can be extremely large.
Important to treat fact data as read-only
reference data that will not change over time.
Most useful fact tables contain one or more
numerical measures, or ‘facts’ that occur for each record and are numeric and additive.
Dimensionality modeling
15
Dimension tables usually contain descriptive
textual information.
Dimension attributes are used as the constraints
in data warehouse queries.
Star schemas can be used to speed up query
performance by denormalizing reference information into a single dimension table.
Dimensionality modeling
16
Snowflake schema is a variant of the star
schema where dimension tables do not contain denormalized data.
Starflake schema is a hybrid structure that
contains a mixture of star (denormalized) and snowflake (normalized) schemas.
Allows dimensions to be present in both forms to cater for different query
Property sales with normalized version of
Branch
dimension table
Dimensionality modeling
18
Predictable and standard form of the
underlying dimensional model offers important advantages:
Efficiency
Ability to handle changing requirements Extensibility
Ability to model common business situations Predictable query processing.
Comparison of DM and ER models
19
A single ER model normally decomposes
into multiple DMs.
Multiple DMs are then associated through
20
Langkah-langkah dalam merancang DW : Memilih proses
Memilih item data yang diperlukan
Mengidentifikasi dan membuat dimensi yang sesuai Memilih fakta
Menyimpan pra-perhitungan dalam tabel fakta Melengkapi urutan dari tabel dimensi
Memilih jangka waktu database
Memungkinkan mengubah dimensi secara perlahan Menentukan prioritas query dan mode query.
Step 1: Choosing the process
21
Memilih Proses (fungsi) yang mengacu pada
subyek data mart tertentu.
Data mart dibangun pertama kali harus menjadi
salah satu yang paling mungkin tepat waktu, sesuai anggaran, dan menjawab pertanyaan-pertanyaan bisnis yang paling penting secara komersial.
ER model of an extended version of
DreamHome
ER model of property sales business process of
Step 2: Choosing the grain
24
Putuskan record/kolom apa yang mewakili tabel
fakta
Identifikasi dimensi dari table fakta. Keputusan
detail tabel fakta juga menentukan field setiap tabel dimensi.
Juga meliputi waktu sebagai inti dari table dimensi
Step 3: Identifying and conforming the
dimensions
25
Dimensions set the context for asking questions
about the facts in the fact table.
If any dimension occurs in two data marts, they
must be exactly the same dimension, or one must be a mathematical subset of the other.
A dimension used in more than one data mart is
Star schemas for property sales and
property advertising
Step 4: Choosing the facts
27
The grain of the fact table determines which facts
can be used in the data mart.
Facts should be numeric and additive. Unusable facts include:
non-numeric facts non-additive facts
Property rentals with a badly
structured fact table
28Property rentals with fact table corrected
Step 5: Storing pre-calculations in the fact
table
30
Once the facts have been selected each should be
re-examined to determine whether there are opportunities to use pre-calculations.
Step 6: Rounding out the dimension tables
31
Text descriptions are added to the dimension tables. Text descriptions should be as intuitive and
understandable to the users as possible.
Usefulness of a data mart is determined by the scope
Step 7: Choosing the duration of the
database
32 Duration measures how far back in time the fact
table goes.
Very large fact tables raise at least two very
significant data warehouse design issues.
Often difficult to source increasing old data.
It is mandatory that the old versions of the important
dimensions be used, not the most current versions. Known as the ‘Slowly Changing Dimension’ problem.
Step 8: Tracking slowly changing
dimensions
33 Slowly changing dimension problem means that
the proper description of the old dimension data must be used with old fact data.
Often, a generalized key must be assigned to
important dimensions in order to distinguish
multiple snapshots of dimensions over a period of time.
Step 8: Tracking slowly changing
dimensions
34 Three basic types of slowly changing dimensions: Type 1, where a changed dimension attribute is
overwritten
Type 2, where a changed dimension attribute causes a
new dimension record to be created
Type 3, where a changed dimension attribute causes an
alternate attribute to be created so that both the old and new values of the attribute are simultaneously accessible in the same dimension record
Step 9: Deciding the query priorities and
the query modes
35
Most critical physical design issues
affecting the end-user’s perception includes:
physical sort order of the fact table on disk presence of pre-stored summaries or
aggregations.
Additional physical design issues include
administration, backup, indexing performance, and security.
Database Design Methodology for Data
Warehouses
36
Methodology designs a data mart that
supports requirements of particular business process and allows the easy
integration with other related data marts to form the enterprise-wide data
warehouse.
A dimensional model, which contains
more than one fact table sharing one or more conformed dimension tables, is referred to as a fact constellation.
Fact and dimension tables for each
business process of
DreamHome
Dimensional model (fact constellation) for
the
DreamHome
38data warehouse
Criteria for assessing the dimensionality
of a data warehouse
39
Criteria proposed by Ralph Kimball to
measure the extent to which a system supports the dimensional view of data warehousing.
Twenty criteria divided into three broad
groups: architecture, administration, and expression.
Criteria for assessing the
Criteria for assessing the
dimensionality of a data warehouse
41 Architectural criteria describes way the
entire system is organized.
Administration criteria are considered to
be essential to the ‘smooth running’ of a dimensionally-oriented data warehouse.
Expression criteria are mostly analytic
capabilities that are needed in real-life situations.