ISQS 6347 Data & Text Mining Project (spring 2009)


(Check the Example)


This project will allow students to practice the data mining methods and SAS EM skills learned in the class. The project is completed in the following stages:


Stage 1 (15%): Identify a project topic and determine the objectives of the data mining project. Find an available dataset for the project. You can use one of the datasets you found for homework 1. Study and understand the dataset by exploring it. Pay attention to the following (the contents are the same as before, but reformatted):

1)    the quality of the data (any missing values),

2)    the meaningful attributes (variables),

3)    the distributions of the attributes (variables), and

4)    the types of variable values.
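The four checks above can also be run quickly outside SAS Enterprise Miner before you write the proposal. The following is only a minimal sketch in plain Python, not part of the required SAS EM workflow. The function names, the `MISSING_CODES` set, and the rule that a variable is numeric when every non-missing value parses as a number are all assumptions made for illustration; adjust them to your own dataset.

```python
from collections import Counter
from typing import Any, Dict, List

# Assumed missing-value codes; adjust them to your own dataset.
MISSING_CODES = {None, "", "NA", "?"}


def is_number(value: Any) -> bool:
    """Return True if the value can be read as a number."""
    try:
        float(value)
        return True
    except (TypeError, ValueError):
        return False


def profile(rows: List[Dict[str, Any]]) -> Dict[str, Dict[str, Any]]:
    """Report the missing count, value type, and distribution of each variable."""
    report: Dict[str, Dict[str, Any]] = {}
    names = list(rows[0].keys()) if rows else []
    for name in names:
        values = [row.get(name) for row in rows]
        present = [v for v in values if v not in MISSING_CODES]
        entry: Dict[str, Any] = {"missing": len(values) - len(present)}
        if not present:
            # Every value is missing, so no type can be inferred.
            entry.update(type="empty", summary=None)
        elif all(is_number(v) for v in present):
            nums = [float(v) for v in present]
            entry.update(
                type="numeric",
                summary={
                    "min": min(nums),
                    "max": max(nums),
                    "mean": sum(nums) / len(nums),
                },
            )
        else:
            # Categorical variable: list the three most frequent values.
            entry.update(type="categorical", summary=Counter(present).most_common(3))
        report[name] = entry
    return report
```

Calling `profile` on a list of records (for example, the rows read by `csv.DictReader`) returns one entry per variable, which can be copied into the data quality part of the proposal.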

A proposal of 2-4 pages is required, covering:

1)    motivations,

2)    objectives,

3)    business background description,

4)    dataset availability,

5)    data quality, and

6)    the description of data preprocessing tasks if any.

Due on March 13, Friday.


Stage 2 (15%): Perform the necessary data cleansing and conversion tasks. A data cleansing/preparation report of 2-3 pages is due on March 27, Friday.

1)    The proposal must be appended to this report, with the necessary modifications following the feedback from the instructor.

2)    You can report the data quality status, such as missing values, coding conditions, format problems, etc., and the tasks you have done in data preprocessing, such as data format conversion, value recoding, etc. You can contact the instructor for help in data processing.

3)    If the data is already clean enough, you can provide information about the outcomes of data exploration. The reported information can include, but is not restricted to, the size of the data, the distributions of key variables, any interesting preliminary findings, etc.

4)    In the submitted report, you need to highlight the following information on the cover page:

a.    The type of this project, choosing one of the four: (1) Newly created, (2) Based on another project but with different objectives and methods, (3) Sharing data with others but using different methods and models, or (4) Other type (please specify). You need to place this information on the cover page of the project stage-2 report.

b.    The nature and source of the dataset: (1) Real data provided by a business, (2) Real data manually collected from the Internet by yourself, (3) Survey data collected by yourself, (4) Data downloaded from a data mining information service site, such as KDD, (5) Computer-generated data from simulation (not encouraged), (6) Other (please specify).

5)    Note: the due date has been extended to March 27, Friday. You need to submit both an electronic copy and a hardcopy by 5 p.m. on that date. Since you will have only a little more than one week to complete the third stage assignment, finishing this stage early is the key to delivering quality work.
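The value recoding and data format conversion tasks mentioned in item 2) can be illustrated with a short sketch. This is plain Python rather than SAS, and everything specific in it is hypothetical: the `gender` and `signup_date` fields, the `GENDER_CODES` mapping, and the accepted date formats are invented for the example and should be replaced by whatever your dataset actually contains.

```python
from datetime import datetime
from typing import Dict, Optional, Sequence

# Hypothetical mapping from inconsistent raw codes to one standard code.
GENDER_CODES = {"m": "M", "male": "M", "f": "F", "female": "F"}


def recode(value: str, mapping: Dict[str, str]) -> Optional[str]:
    """Map a raw value to its standard code; unknown values become None."""
    return mapping.get(value.strip().lower())


def to_iso_date(
    text: str, formats: Sequence[str] = ("%m/%d/%Y", "%Y-%m-%d")
) -> Optional[str]:
    """Convert a date written in any accepted format to YYYY-MM-DD."""
    for fmt in formats:
        try:
            return datetime.strptime(text.strip(), fmt).date().isoformat()
        except ValueError:
            continue  # Try the next accepted format.
    return None  # Unparseable dates are treated as missing.


def clean_record(record: Dict[str, str]) -> Dict[str, Optional[str]]:
    """Apply recoding and format conversion to one raw record."""
    return {
        "gender": recode(record.get("gender", ""), GENDER_CODES),
        "signup_date": to_iso_date(record.get("signup_date", "")),
    }
```

Returning `None` for values that cannot be recoded or parsed keeps such problems visible, so they can be counted and reported as missing values in the data cleansing report.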


Stage 3 (20%): Choose some data mining techniques, such as Decision Tree, Regression, Clustering, Association Analysis, Link Analysis, or Text Mining, and use SAS Enterprise Miner to develop a data mining model on the dataset. A data analysis report is required, including three main parts:

1)    The path or process by which you study the data. You need to present a systematic approach and an appropriate methodology, in a logical order, for how you will conduct the extended data mining.

2)    A data analysis process report, including data preparation, data exploration, and the progressive data mining process.

3)    Preliminary results, including data exploration outcomes, preliminary findings, and brief explanations. Any charts/tables must come with enough explanation.

You can merge some contents from previous reports with the necessary modifications. This will make your report at this stage coherent and nice-looking. Simply copying and pasting from the old reports into the current one will not help. A long report is not favorable.

Due on April 17.


Stage 4 (50%): Prepare a final project report based on the data mining analysis outcomes, with the necessary modifications and refinements. Due on May 4.


A.    The final project report is the final deliverable for the project. It includes the following parts:


  1. Cover page. The cover page must have a table at the bottom containing the following information, in addition to other information:

a.    Project title

b.    Class number / Semester

c.    Student name

d.    The type of this project

e.    The nature and source of the dataset

f.    Completion date


  2. Table of contents
  3. The project motivation and objectives. This section presents the background of the project, the importance of the project, the research questions, and the project objectives. 1-2 pages
  4. An overview of relevant research efforts or projects, 1 page
  5. Dataset description. It includes: where the dataset comes from, the description of major attributes (variables), the quality of the dataset, and data preprocessing, 2-3 pages
  6. The data mining methods you used and the data mining process, including the problems you encountered and how they were solved, 1 page
  7. Data mining outcomes. You need to attach the necessary charts from SAS Enterprise Miner; the number of pages is not restricted
  8. Discussions, problems, and further work, 1-2 pages
  9. References, if any


You may reuse the materials you developed for your previous deliverables.


B.    In general, the report must demonstrate your knowledge of both data mining and the addressed business issue. It must look professional. The size of the report body should not exceed 30 pages (12-point font). The detailed grading criteria:


1.    The writing quality of the report, such as the completeness of the contents with regard to the above requirements, and the coherence and correctness of the writing.

2.    The effort made in data collection and data preprocessing

3.    Data mining skills and strategies

4.    Comprehensiveness of data analysis results and explanations. Business background related and in-depth discussions are encouraged.

5.    Completeness and timeliness of the deliverables


C.   Issues you should address while carrying out the project:


  1. The effect of dataset quality
  2. How to conduct data pre-processing
  3. How to select the right variables for the model
  4. How to use other SAS EM nodes, such as Transform, Variable selection, Insight, Score, Assessment, Multiplot, Distribution Explorer, to improve the efficiency and effectiveness of data mining
  5. How to combine different data mining skills for the project, such as applying stepwise regression to select variables for a neural network.
  6. How to explain the data mining results
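
To make the idea in item 5 concrete, here is a minimal sketch of variable selection by regression, written in plain Python rather than with the SAS EM Regression node. It is a simplified variant under stated assumptions: it performs forward selection only (variables are added but never removed, unlike full stepwise regression), and it stops when the gain in R-squared falls below `min_gain`, a threshold chosen arbitrarily for illustration. All function names are hypothetical. The selected variables could then be passed to another model, such as a neural network.

```python
from typing import Dict, List, Sequence


def solve_linear(a: List[List[float]], b: List[float]) -> List[float]:
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        s = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - s) / m[i][i]
    return x


def rss(columns: Sequence[Sequence[float]], y: Sequence[float]) -> float:
    """Residual sum of squares of a least-squares fit of y on the columns
    plus an intercept, computed from the normal equations."""
    design = [[1.0] + [col[i] for col in columns] for i in range(len(y))]
    k = len(design[0])
    xtx = [[sum(row[p] * row[q] for row in design) for q in range(k)] for p in range(k)]
    xty = [sum(row[p] * yi for row, yi in zip(design, y)) for p in range(k)]
    beta = solve_linear(xtx, xty)
    return sum(
        (yi - sum(b * v for b, v in zip(beta, row))) ** 2
        for row, yi in zip(design, y)
    )


def forward_select(
    features: Dict[str, Sequence[float]], y: Sequence[float], min_gain: float = 0.05
) -> List[str]:
    """Greedily add the variable that lowers the RSS most; stop when the
    gain in R-squared is smaller than min_gain (an assumed threshold)."""
    total = rss([], y)  # RSS of the intercept-only model (total sum of squares)
    if total == 0:
        return []  # A constant target leaves nothing to explain.
    chosen: List[str] = []
    current = total
    while len(chosen) < len(features):
        candidates = {
            name: rss([features[n] for n in chosen + [name]], y)
            for name in features
            if name not in chosen
        }
        best = min(candidates, key=candidates.get)
        if (current - candidates[best]) / total < min_gain:
            break
        chosen.append(best)
        current = candidates[best]
    return chosen
```

This sketch assumes the candidate variables are not perfectly collinear; with collinear inputs the normal equations become singular and the solver fails, which is one reason to inspect variable correlations during data exploration.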


D.   The following situations may have negative effects on your report evaluation and should be avoided:


1.    Irrational research story and goal

2.    Insufficient information

3.    Incoherent logic flow in the report

4.    Overloading the report with too much trivial information. That is, you need to know what is important and what is not.

5.    Using too many charts generated from the data mining process without enough explanation.

You need to submit both electronic copy (sent to and a hardcopy.


The project report is due at noon on May 5, 2009.


Note: The effort made in data preparation will improve the evaluation of the report.