ISQS 6339 Data Management and Business Intelligence (2015)

 

Project Instructions

 

(subject to further modifications)

 

Instructor: Zhangxi Lin

 

Final Project Report Due: May 12, Tuesday

 

There will be two data warehousing projects:

A.    Data warehousing using SQL Server 2008

B.    Data warehousing using Hadoop

Two data ware houses can share the same data source.

 

In this project there are Seven teams each with up to 6 students to fulfill three big data warehousing projects. These three projects include two data warehousing projects (one with SQL Server, another with Hadoop), and one Hadoop collaborative study.

A completed data warehousing project must address the following topics:

1.     Data mart planning

  1. Data mart design (must contain techniques of SCD, surrogate keys, and hierarchies)
  2. ETL system (must show how to use SAS programming in data cleansing and transformation)
  3. Data visualization (Use SSAS for KPIs and calculations; use SAS Enterprise Guide for OLAP and reporting)
  4. Reporting
  5. A data mining example using the functions in SQL Server

The following are Hadoop collaborative study topics

No:

Topic

Components

Team#

Presentation

1

Data warehousing

Focus: Hadoop Data warehouse design

HDFS, HBase, HIVE, Kylin NoSQL/NewSQL, Solr

DW1

4/7

2

Publicly available big data services

Focus: tools and free resources

Hortonworks, CloudEraHaaS, EC2

DW2

4/14

3

MapReduce & Data mining

Focus: Efficiency of distributed data/text mining

Mahout, H2O, R, Python

DW3

4/16

4

Big data ETL

Focus on applications: Heterogeneous data processing across platforms

Kettle, Flume, Sqoop, Impala, Chakwa. Dremel, Pig

DW4

4/21

5

Big data platform management

Focus: System management 

OozieZooKeeper, Ambari, Loom, Ganglia

DW5

4/23

6

Application development platform

Focus:

1) Algorithms and innovative development environments

2) Load balancing

Tomcat, Neo4J, Pig, Hue

DW6

4/30

7

Tools & Visualizations

Focus: Features for big data visualization and data utilization.

Pentaho, Tableau

Saiku, Mondrian, Gephi,

DW7

5/5

8

Streaming data processing

Focus: Efficiency and effectiveness of real-time data processing

 Spark, Storm, Kafka, Avro

 

Online

 

Tasks at proposal stage:

1) Define the scope of the topic to include more detailed functions the products will perform

2) Identify the latest available products and describe their functions. Add them into the list to up to 10 products and order their popularity from high to low.

3) Suggest the products in your list that could be used by other teams.

 

The project assessment will be based on several factors:

Completeness of the project, timeliness of submission, difficulty of the topic and data, amount of effort made, workload in data preprocessing, good understanding of course knowledge.

 

The project includes deliverables in five stages:

 

-----------------

Stage 1 (10%): SQL Server data warehousing project proposal. The selection of the project mainly relies on the availability of the dataset and your understanding of the business story about the data. The deliverable is a project proposal in 2-4 pages, containing (1) Project topic, (2) The availability of the dataset, (3) analytic themes, (4) A conceptual dimensional model, and (5) A list of references. Due Mar 3, Tuesday.

 

FAQ:

What does “The availability of the data set” mean? You need to confirm the data set used for the project is available

What is “analytic themes”? They are business needs or requirements for the project. That is, what the purposes of the data mart are from the business perspective. You can find the cases in BI-04-DimensionalModel.ppt and BI-05-MMM.ppt. If you have the optional textbook “Delivering Business Intelligence” or “The Microsoft Data Warehouse Toolkit”, you can find relevant information in the book to get the clue. 

What is “A conceptual dimensional model”? It is the draft of the dimensional model in a diagram format to be implemented later.

 

-----------------

Stage 2 (25%): SQL Server data mart implementation. Refer to your data mart exercises, the data warehousing textbook, or the online materials for the tasks in this stage: (1) dimensional model design (at least four dimensions), and (2) data mart creation. The description of the data mart will be 3-5 pages including figures or screenshots of the designed model. Due Mar 26, Thursday.

 

Contents of the deliverables:

1.     A dimensional model diagram. This must be the refined version from the “conceptual dimensional model” in Stage 1.

2.     Screenshots of design pages for each table from SQL Server Management Studio.

3.     A brief description of the considerations in the dimensional model design.

a.     How time dimension is defined? In what granularity?

b.     The hierarchy within a dimension, if any. (For example, State, county, city, and street are in a hierarchical structure).

c.     The aggregate dimensions, if any. (For example, in MaxMinManufacturingDM, DimMachineType is an aggregate from DimMachine).

d.     Indicate the primary key and foreign keys in each table.

e.     Others.

4.     Populated data mart

5.     SSIS applications: ETL

6.     SSAS applications: OLAP Cube and drilling results

 

-----------------

Stage 3 (10%): Hadoop data warehousing proposal. The deliverable is a project proposal in 3-5 pages, containing (1) Project topic, (2) refined analytic themes, (3) the corrected/improved dimensional model from the one in last stage, and (4) work plan: Hadoop account, data preprocessing and conversion, etc. Hard copy is required. Due April 14, Tuesday.

 

 

-----------------

Stage 4 (10%): Hadoop data mart implementation.

 

2-3 pages of implementation plan

 

Hard copy of complete report at this stage, due: April 30, Thursday.

 

---------------------

Stage 5 (45%): The compilation of the previous deliverables with modifications and enhancements. Hardcopy and electronic version are both needed, due: May 12, Tuesday.

 

The report is mainly for the Hadoop part, with SQL Server’s latest version in Del#3 as the attachment. 

(1)   The report will contain the following parts

a.     Cover page. Including the team’s name, team members, title of the report, course number and name, and the date the report is submitted.

b.    Table of contents

c.     One-page executive summary

d.    Introduction section. Mainly based on the contents of Deliverable #1 but with extensions:

                                                  i.    The background of the project, such as what are the businesses supported by this project.

                                                 ii.    Analytic themes, i.e. business requirement to this project. Relevant business processes that generate data and use the data are expected

                                                iii.    The availability of the dataset, including the quality of the data. You have process the data for ETL. So you should have pretty much to say about this.

e.     Data mart development section. This part only focus on the Hadoop part.

f.     Data mart populating section. Need to explain how SQL Server data was transferred to Hadoop.

g.    Analysis report section (you may use some of results from SQL Server project, but the Hadoop part is mandatory)

                                                  i.    4-6 findings from the data analysis.

                                                 ii.    2-4 diagram/charts/tables generated with SAS Enterprise Guide.

                                                iii.    An optional choice of the data analysis is to do OLAP with the cube created from your data mart. Try to work out the OLAP application outcomes using SAS Enterprise Guide or SSAS. You may “drill down” or “drill up” the OLAP cube. Both SAS EG and SSAS Browser allow you to expand the dimension and to explore the data in tables or graphs. 

                                                iv.    A demonstrative data mining case using the functions provided by SQL Server (optional).

h.     Summary section of the project, 1-2 page. It may contain: what you have learned, the top three comments on this project, the progress of the project regarding the time schedule.

i.      Reference list. If a reference is from the Internet, you need to list the title of the page in addition to the url address.

(2)   A hardcopy of the report will be submitted for grading, plus an electronic copy emailed to isqs6347@gmail.com with a subject title: “ISQS 6339 Term Project – Team #”. The file name must be: “ISQS6339 Team # - term project report”.

 

-----------------------------------------------