ISQS 6339 Data Management and Business Intelligence

 

Project Instructions

 

(Created 2/13/2012, subject to further modifications)

 

Instructor: Zhangxi Lin

 

Final Project Report Due: May 12, Wednesday

 

In this project up to 3 students will form a team to fulfill a data mart project. The completed project must address the following topics:

1.     Data mart planning

  1. Data mart design (must contain techniques of SCD, surrogate keys, and hierarchies)
  2. ETL system (must show how to use SAS programming in data cleansing and transformation)
  3. Data visualization (Use SSAS for KPIs and calculations; use SAS Enterprise Guide for OLAP and reporting)
  4. Reporting
  5. A data mining example using the functions in SQL Server

 

The project includes deliverables in five stages:

 

-----------------

Stage 1 (10%): Project proposal. The selection of the project mainly relies on the availability of the dataset and your understanding of the business story about the data. The deliverable is a project proposal in 2-4 pages, containing (1) Project topic, (2) The availability of the dataset, (3) analytic themes, (4) A conceptual dimensional model, and (5) A list of references. Due Mar 7, Wednesday.

 

FAQ:

What does “The availability of the data set” mean? You need to confirm the data set used for the project is available

What is “analytic themes”? They are business needs or requirements for the project. That is, what the purposes of the data mart are from the business perspective. You can find the cases in BI-04-DimensionalModel.ppt and BI-05-MMM.ppt. If you have the optional textbook “Delivering Business Intelligence” or “The Microsoft Data Warehouse Toolkit”, you can find relevant information in the book to get the clue. 

What is “A conceptual dimensional model”? It is the draft of the dimensional model in a diagram format to be implemented later.

 

-----------------

Stage 2 (20%): Data mart development. Refer to your data mart exercises, the data warehousing textbook, or the online materials for the tasks in this stage: (1) dimensional model design (at least four dimensions), and (2) data mart creation. The description of the data mart will be 3-5 pages including figures or screenshots of the designed model. Due Mar 26, Monday.

 

Contents of the deliverables:

1.     A dimensional model diagram, either PowerPoint slide or the screenshot of the data source view. This must be the refined version from the “conceptual dimensional model” in Stage 1.

2.     Screenshots of design pages for each table from SQL Server Management Studio.

3.     A brief description of the considerations in the dimensional model design.

a.     How time dimension is defined? In what granularity?

b.     The hierarchy within a dimension, if any. (For example, State, county, city, and street are in a hierarchical structure).

c.     The aggregate dimensions, if any. (For example, in MaxMinManufacturingDM, DimMachineType is an aggregate from DimMachine).

d.     Indicate the primary key and foreign keys in each table.

e.     Others.

 

-----------------

Stage 3 (20%): Data mart populating. Refer to your ETL exercises and the textbook for the tasks in this stage. The deliverables (Either electronic version or hard copy):

 

Part 1: The contents based on the proposal - (1) Project topic, (2) The availability of the dataset, (3) analytic themes, (4) A conceptual dimensional model, and (5) A list of references.

Part 2: (1) dimensional model design (at least four dimensions), and (2) data mart creation. Detailed variables in each table must be presented

Part 3 (new progresses):  (1) The screenshot of ETL system logic design, (2) A brief description of ETL system implementation, (3) Notes about data quality and other issues during data preprocessing.  It will be 2-3 pages including figures and/or screenshots.

 

Note: OLAP Cube design is optional at this stage. It is a must in Stage 4. This will allow you have enough time to refine the dimensional model, data cleansing, and ETL.

 

Hard copy of complete report at this stage, due: Apr 11, Wednesday.

 

-----------------

Stage 4: Presentations (20%), May 2 & May 7

 

Each team will have 20 minutes to present including 5 minutes for questions/answers.

 

The main contents in the presentation will include:

1) The explanation of analytic theme and application requirements, such as who are users, what functions they need, etc.

2) Multi-dimensional modeling with key considerations in the relationship between tables

3) Key issues in ETL: Data preprocessing, transformation, and loading

4) Analytical results, including demonstration of OLAP cube, sample tables and charts, and a SQL Server data mining example.

 

---------------------

Stage 5 (30%): The compilation of the previous deliverables with modifications and enhancements. Hardcopy and electronic version are both needed, due: May 12, Wednesday.

 

(1)   The report will contain the following parts

a.     Cover page. Including the team’s name, team members, title of the report, course number and name, and the date the report is submitted.

b.     Table of contents

c.     One-page executive summary

d.     Introduction section. Mainly based on the contents of Deliverable #1 but with extensions:

                                                  i.    The background of the project, such as what are the businesses supported by this project.

                                                 ii.    Analytic themes, i.e. business requirement to this project. Relevant business processes that generate data and use the data are expected

                                                iii.    The availability of the dataset, including the quality of the data. You have process the data for ETL. So you should have pretty much to say about this.

e.     Data mart development section, mainly deliverable #2 with necessary modifications

f.      Data mart populating section, mainly deliverable #3 with necessary modifications

g.     Analysis report section, mainly deliverable #4 with necessary modifications (highly suggested to do so)

                                                  i.    4-6 findings from the data analysis.

                                                 ii.    2-4 diagram/charts/tables generated with SAS Enterprise Guide.

                                                iii.    An optional choice of the data analysis is to do OLAP with the cube created from your data mart. Try to work out the OLAP application outcomes using SAS Enterprise Guide or SSAS. You may “drill down” or “drill up” the OLAP cube. Both SAS EG and SSAS Browser allow you to expand the dimension and to explore the data in tables or graphs. 

                                                iv.    A demonstrative data mining case using the functions provided by SQL Server (optional).

h.     Summary section of the project, 1-2 page. It may contain: what you have learned, the top three comments on this project, the progress of the project regarding the time schedule.

i.      Reference list. If a reference is from the Internet, you need to list the title of the page in addition to the url address.

(2)   A hardcopy of the report will be submitted for grading, plus an electronic copy emailed to isqs6347@gmail.com with a subject title: “ISQS 6339 Term Project – Team #”.

 

-----------------------------------------------

 

 

Project: Any dataset containing transaction logs, snapshot data, and other relevant information. Examples: Banking system, online C2C sales, student academic record management, stock portfolio management, etc.

 

Here are the cases that were done in spring 2011

 

1)   Online P2P Lending

 

“Online People-to-people (P2P) lending (also called Peer-to-Peer or social lending), allows individuals to lend and borrow directly among each other without the mediation of a creditor bank institution[1]. Online P2P lending achieves the reallocation of small funds between people, and satisfies the needs of society. It appeared in 2005 and has had a very rapid development during past several years. At present, there are about 30 Online P2P lending markets in more than 10 countries in the world wide, such as Zopa in UK and Japan, Prosper and LendingClub in the US, CommunityLend in Canada, LoanLand in Sweden Loanland, and PaiPaiDai, YiXin and QiFang in China.”

This project is to create a data mart with fact table of listings with dimensions of borrowers, lenders, transactions, and other information. The data mart will be used for research and business promotion as well.

 

2)   IMW property listing warehouse

 

A sample dataset COMMREX.COM-ACCESS_LOG-9711.rar is available at the shared directory. Three successive monthly datasets will be provided later. You will need to transform the data in different ways to make the data suitable to data warehousing.

 

3) Stock portfolio data mart

 

Choose up to 100 stocks in 3-5 sectors, and collect their time series data with price, volumes, and other useful measures. Build a data mart allowing quick retrieval of the information regarding their perfromance dynamics of sector, date, etc.

 

 

4) Cotton/cottonseed data warehousing

 

If you know the cotton agriculture, this project may interest you. The Excel files \\TechShare\coba\d\ISQS3358\Repository\ COTTON.XLS containing cotton/cottonseed dataset. The following are tasks to be fulfilled:

1)     Study the dataset “Cotton.xls” carefully. The main dataset is in “data” table. You also need to study other worksheet. You will notice that the main dataset has missed some values but this does not have major effect on this project.

2)     Define 5-8 analytic themes for the data warehouse.

3)     High level dimensional model design. This is based on the paperwork

a.     Identify 2-4 business processes

b.     Declaring the grain for the data warehouse

c.     Define dimensions

d.     Identifying the facts (according to the bus matrix)

4)     Detailed dimensional model development. This will be based on computer work.

a.     You must specify the primary key for each table.

b.     The model must contain at least one Type 2 SCD, one conformed dimension, and any other specific treatment of modeling.

c.     Indicate the components of junk dimension, heterogeneous facts, etc.

5)     Dimensional model review and validation.

6)     Populate the data warehouse with the cotton/cottonseed data (use SSIS and Management Studio for this ETL task).

7)     Develop a few applications, such as queries, OLAP applications (use SSAS and Management Studio for this).