ISQS 6347 Data & Text Mining Project (Spring 2009)
(Check the Example)
This project allows students to practice the data mining methods and SAS EM skills learned in class. The project is completed in the following stages:
Stage 1 (15%):
Identify a project topic and determine the objectives of the data mining project. Find an available dataset for the project. You can use one of the datasets you found for homework 1. Study and understand the dataset by exploring it. Pay attention to the following (the contents are the same as before but reformatted):
1) the quality of the data (any missing values),
2) the meaningful attributes (variables),
3) the distributions of the attributes (variables), and
4) the types of variable values.
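The four checks above can also be sketched outside SAS EM. The following is a minimal illustration in standard-library Python, not part of the course toolchain: the sample data, the column names, and the set of missing-value markers are all made up for the example and would need to be adjusted for a real dataset.

```python
import csv
import io
import statistics
from collections import Counter

# Assumed missing-value markers; adjust these for the dataset at hand.
MISSING = {"", "NA", "N/A", "?", "."}


def is_number(text):
    """Return True if the text parses as a float."""
    try:
        float(text)
        return True
    except ValueError:
        return False


def profile(rows):
    """Summarize each column: missing count, inferred type, distribution."""
    report = {}
    columns = rows[0].keys() if rows else []
    for col in columns:
        values = [(row[col] or "").strip() for row in rows]
        present = [v for v in values if v not in MISSING]
        info = {"missing": len(values) - len(present)}
        if present and all(is_number(v) for v in present):
            nums = [float(v) for v in present]
            info["type"] = "numeric"
            info["min"], info["max"] = min(nums), max(nums)
            info["mean"] = statistics.mean(nums)
        else:
            info["type"] = "categorical"
            info["top"] = Counter(present).most_common(3)
        report[col] = info
    return report


# Tiny made-up sample standing in for a real project dataset.
SAMPLE = """age,income,region
34,52000,West
,61000,East
45,NA,West
29,48000,?
"""

rows = list(csv.DictReader(io.StringIO(SAMPLE)))
summary = profile(rows)
for name, info in summary.items():
    print(name, info)
```

A per-column summary of this kind covers data quality, variable types, and distributions in one pass, and could feed the data quality part of the proposal.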
A proposal of 2-4 pages is required, covering:
1) motivations,
2) objectives,
3) business background description,
4) dataset availability,
5) data quality, and
6) a description of the data preprocessing tasks, if any.
Due on Friday, March 13.
Stage 2 (15%):
Perform the necessary data cleansing and conversion tasks. A data cleansing/preparation report of 2-3 pages is due on Friday, March 27.
1) The proposal must be appended to this report, with the necessary modifications made in response to the instructor's feedback.
2) You can report the data quality status (missing values, coding conditions, format problems, etc.) and the data preprocessing tasks you have done (data format conversion, value recoding, etc.). You can contact the instructor for help with data processing.
3) If the data is already clean enough, you can report the outcomes of data exploration instead. The reported information can include, but is not restricted to, the size of the data, the distributions of key variables, and any interesting preliminary findings.
4) In the submitted report, you need to highlight the following information on the cover page:
   a. The type of this project, choosing one of the four: (1) Newly created, (2) Based on another project but with different objectives and methods, (3) Sharing data with others but using different methods and models, or (4) Other type (please specify). You need to place this information on the cover page of the project stage-2 report.
   b. The nature and source of the dataset: (1) Real data provided by a business, (2) Real data manually collected from the Internet by yourself, (3) Survey data collected by yourself, (4) Data downloaded from a data mining information service site, such as KDD, (5) Computer-generated data from simulation (not encouraged), (6) Other (please specify).
5) Note: the due date has been extended to Friday, March 27. You need to submit both an electronic copy and a hardcopy by 5 p.m. on that date. Since you will have only a little over one week to complete the third-stage assignment, finishing early is the key to delivering quality work.
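The preprocessing tasks named in item 2 (data format conversion and value recoding) can be illustrated with a short sketch. This is standard-library Python rather than SAS EM, and the field names, the recoding table, and the accepted date formats are hypothetical choices for the example only.

```python
from datetime import datetime

# Hypothetical recoding table: maps inconsistent raw codes to one canonical value.
GENDER_CODES = {"m": "M", "male": "M", "f": "F", "female": "F"}


def recode(value, mapping, default=None):
    """Map a raw code to its canonical value; unknown codes become `default`."""
    return mapping.get(value.strip().lower(), default)


def to_iso_date(text, formats=("%m/%d/%Y", "%Y-%m-%d", "%d.%m.%Y")):
    """Convert a date string in any of the assumed formats to YYYY-MM-DD."""
    for fmt in formats:
        try:
            return datetime.strptime(text.strip(), fmt).date().isoformat()
        except ValueError:
            continue
    return None  # leave unparseable dates as missing for later review


def clean(record):
    """Return a cleaned copy of one record (field names are illustrative)."""
    return {
        "gender": recode(record["gender"], GENDER_CODES),
        "signup": to_iso_date(record["signup"]),
    }


# Made-up raw records with inconsistent coding and date formats.
raw = [
    {"gender": "Male", "signup": "03/27/2009"},
    {"gender": " f ", "signup": "2009-04-17"},
    {"gender": "unknown", "signup": "17.04.2009"},
]
cleaned = [clean(r) for r in raw]
for row in cleaned:
    print(row)
```

Values that cannot be recoded or parsed are kept as missing rather than guessed, so they can be counted and reported in the data quality section.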
Stage 3 (20%):
Choose some data mining techniques, such as Decision Tree, Regression, Clustering, Association Analysis, Link Analysis, or Text Mining, and use SAS Enterprise Miner to develop a data mining model on the dataset.
Submit a data analysis report with three main parts:
1) The path or process by which you study the data. Present a systematic approach and an appropriate methodology, in logical order, for how you will conduct the extended data mining.
2) A data analysis process report, including data preparation, data exploration, and the progressive data mining process.
3) Preliminary results, including data exploration outcomes, preliminary findings, and brief explanations. Any charts or tables must come with sufficient explanation.
You can merge content from previous reports with the necessary modifications, which will make the report at this stage coherent and polished. Simply copying and pasting from the old reports will not help; a long report is not favored.
Due on April 17.
Stage 4 (50%):
Write a final project report based on the data mining analysis outcomes, with the necessary modifications and refinements. Due on May 4.
A. The final project report is the final deliverable for the project. It includes the following parts:
| Project title                        |   |
| Class number / Semester              |   |
| Student name                         |   |
| The type of this project             |   |
| The nature and source of the dataset |   |
| Completion date                      |   |
You may reuse the materials you developed for your previous deliverables.
B. In general, the report must demonstrate your knowledge of both data mining and the business issue addressed. It must look professional. The body of the report should not exceed 30 pages (12-point font). The detailed grading criteria are:
1. The writing quality of the report, such as the completeness of the contents with regard to the above requirements, and the coherence and correctness of the writing.
2. The effort made in data collection and data preprocessing.
3. Data mining skills and strategies.
4. Comprehensiveness of the data analysis results and explanations. In-depth discussions related to the business background are encouraged.
5. Completeness and timeliness of the deliverables.
C. Issues you are to tackle while completing the project:
D. The following situations may negatively affect your report evaluation and should be avoided:
1. An irrational research story and goal.
2. Insufficient information.
3. An incoherent logic flow in the report.
4. Overloading the report with trivial information. That is, you need to know what is important and what is not.
5. Using too many charts generated from the data mining process without enough explanation.
You need to submit both an electronic copy (sent to Zhangxi.lin@hotmail.com) and a hardcopy.
The project report is due at noon on May 5, 2009.
Note: The effort made in data preparation will raise the evaluation of the report.