ISQS 6347 Data & Text Mining Lecture Outlines

 

(Total 29 lectures, January 9 - April 29, 2008)

 

Textbooks

 

SPB: Data Mining for Business Intelligence, Galit Shmueli, Nitin R. Patel, Peter C. Bruce

TSK: Introduction to Data Mining, Pang-Ning Tan, Michael Steinbach, and Vipin Kumar

RG: Data Mining – A Tutorial Based Primer, Richard Roiger, Michael Geatz

 

Lecture 1, Date: 01/09/2008, Wednesday

 

Topic: Introduction to Data Mining

Handout: PowerPoint slides 1

Reading: SPB/RG/TSK Chapter 1

 

------ + ------ + ------- + ------- + ------- + -------

 

Lecture 2, Date: 01/14/2008, Monday

 

Topic: Data mining fundamentals

Handout: PowerPoint slides 2

Reading: SPB/RG/TSK Chapter 2

 

1)       Introduction to SAS Enterprise Miner

2)       Data for data mining

3)       Case study

 

Dataset: Credit card promotion

 

Terminology: predictor, observation, confidence, dependent variable, estimation, response, score, supervised learning, unsupervised learning

 

Review questions:

1)       SPB p31, Problems 2.1 and 2.2, or RG pp30-31, questions 1 and 2

2)       Four different types of attributes of data, their properties

3)       Different types of records and data

4)       Data quality: what it is and how to ensure it

 

------ + ------ + ------- + ------- + ------- + -------

 

Lecture 3, Date: 01/16/2008, Wednesday

 

Topic: Data mining fundamentals

1)       Data preprocessing

2)       How to evaluate the performance of a classification rule

3)       Data mining strategies

4)       An illustrative classification case

 

Handout: slides 3

 

Homework 1 (due 01/23/2008):

 

1)       Develop a decision tree using the credit card promotion data in the slides. You need to choose one of the variables as the target. In addition, construct a confusion matrix and report the lift, coverage rate and accuracy rate.

2)       A dataset has 1000 records and 50 variables, with 5% of the values missing, spread randomly throughout the records and variables. An analyst decides to remove records that have missing values. About how many records would you expect to be removed?

3)       Consider the following three-class confusion matrix. The matrix shows the classification results of a supervised model that uses previous voting records to determine the political party affiliation (Republican, Democrat, or Independent) of members of the United States Senate.

 

 

         Rep   Dem   Ind
Rep       42     2     1
Dem        5    40     3
Ind        0     3     4


  1. What percent of the instances were correctly classified?
  2. According to the confusion matrix, how many Democrats are in the Senate? How many Republicans? How many Independents?
  3. How many Republicans were classified as belonging to the Democratic Party?
  4. How many Independents were classified as Republicans?
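
The arithmetic behind problems 2 and 3 above can be sketched in a few lines of Python. This is only an illustration, not an official solution. Two assumptions are made here that the problem text does not state: missing values are treated as independent across variables, and the rows of the matrix are read as the actual party with the columns as the predicted party. Both row and column totals are computed so that the other reading can be checked as well.

```python
# Problem 2: expected number of records with at least one missing value,
# ASSUMING each of the 50 values is missing independently with probability 0.05.
n_records = 1000
n_variables = 50
p_missing = 0.05

p_complete = (1 - p_missing) ** n_variables      # record has no missing value
expected_removed = n_records * (1 - p_complete)  # records the analyst would drop

# Problem 3: the three-class confusion matrix from the homework.
# ASSUMPTION: rows = actual party, columns = predicted party.
labels = ["Rep", "Dem", "Ind"]
matrix = [
    [42, 2, 1],
    [5, 40, 3],
    [0, 3, 4],
]

total = sum(sum(row) for row in matrix)
correct = sum(matrix[i][i] for i in range(len(labels)))  # diagonal entries
accuracy = correct / total

row_totals = {labels[i]: sum(matrix[i]) for i in range(len(labels))}
col_totals = {labels[j]: sum(row[j] for row in matrix) for j in range(len(labels))}

print(f"Expected records removed: {expected_removed:.0f}")
print(f"Accuracy: {accuracy:.0%}")
print("Row totals:", row_totals)
print("Column totals:", col_totals)
```

The accuracy does not depend on which axis holds the actual class, but the per-party counts do, so check the orientation against the textbook before answering questions 2 to 4.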

 

------ + ------ + ------- + ------- + ------- + -------

 

Lecture 4, Date: 01/23/2008, Wednesday

 

Topic: Classification modeling (I)

1)       Review of data mining tasks

2)       Decision tree modeling

3)       Determining the best split – Using the measure of GINI, Entropy, or misclassification error
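
The three split measures named in item 3 can be compared on a small example. The sketch below uses Python rather than SAS Enterprise Miner, and the class counts are made up for illustration; the function names are hypothetical helpers, not part of any course software.

```python
import math


def gini(counts):
    """Gini index of a node, given the number of records in each class."""
    total = sum(counts)
    if total == 0:
        return 0.0
    return 1.0 - sum((c / total) ** 2 for c in counts)


def entropy(counts):
    """Entropy (in bits) of a node, given the number of records in each class."""
    total = sum(counts)
    if total == 0:
        return 0.0
    return -sum((c / total) * math.log2(c / total) for c in counts if c > 0)


def misclassification_error(counts):
    """Error rate when the node predicts its majority class."""
    total = sum(counts)
    if total == 0:
        return 0.0
    return 1.0 - max(counts) / total


def weighted_impurity(children, measure):
    """Impurity of a split: child impurities weighted by child size."""
    n = sum(sum(child) for child in children)
    return sum(sum(child) / n * measure(child) for child in children)


# Hypothetical two-class node with 5 records per class, split into two children.
parent = [5, 5]
split = [[4, 1], [1, 4]]

for measure in (gini, entropy, misclassification_error):
    before = measure(parent)
    after = weighted_impurity(split, measure)
    print(f"{measure.__name__}: before={before:.3f} after={after:.3f} gain={before - after:.3f}")
```

The best split under a given measure is the candidate with the largest drop in impurity from parent to children.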

 

Handout: slides 4

 

Review readings and questions:

1)       SPB p51, questions 3.1 and 3.2 (or RG p62, questions 1-4)

2)       SPB p74, questions 4.1 and 4.2

3)       Getting Started with SAS EM 4.3, Chapter 1 – 2

4)       SAS data exploration PROCs: CONTENTS, FREQ

 

------ + ------ + ------- + ------- + ------- + -------

 

Lecture 5, Date: 01/28/2008, Monday

 

Topic: Classification modeling (II)

1)       Quiz 1

2)       Determining when to stop splitting

3)       How to use SAS Enterprise Miner

 

Review readings and questions:

1)       Getting Started with SAS EM 4.3, Chapter 3-7

2)       SPB chapter 7, RG Chapter 3

3)       What is overfitting in the decision tree approach? How to prevent overfitting?

4)       How to decide to stop splitting a tree?

5)       How to evaluate the performance of a model?

 

------ + ------ + ------- + ------- + ------- + -------

 

Lecture 6, Date: 01/30/2008, Wednesday

 

Topic: Illustrative case study

 

1)       Decision tree modeling with SAS Enterprise Miner

2)       Model evaluation


Review readings and questions:

1)       Why do we need to split a dataset into training, validation and test datasets? What are the different purposes of using validation and test datasets?

2)       What is Prior Probability? What is its relationship with the sample probability? How to define it in SAS EM?

3)       What is stratification? Why do we need it? How can you set stratification parameters in SAS EM?
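
Questions 1 and 3 can be made concrete with a small stratified partition written in Python. This is a sketch of the general idea only, not of how the SAS EM Data Partition node is implemented. The function name, the 40/30/30 fractions and the toy records are assumptions chosen for illustration.

```python
import random
from collections import defaultdict


def stratified_split(records, target, fractions=(0.4, 0.3, 0.3), seed=0):
    """Split records into training, validation and test sets so that each set
    keeps roughly the same proportion of every target class."""
    rng = random.Random(seed)

    # Group the records by their target value (the strata).
    by_class = defaultdict(list)
    for rec in records:
        by_class[rec[target]].append(rec)

    train, valid, test = [], [], []
    for group in by_class.values():
        rng.shuffle(group)
        n_train = round(len(group) * fractions[0])
        n_valid = round(len(group) * fractions[1])
        train += group[:n_train]
        valid += group[n_train:n_train + n_valid]
        test += group[n_train + n_valid:]  # remainder goes to the test set
    return train, valid, test


# Toy data: 100 records, 20% of which have buy = 1 (made-up values).
records = [{"id": i, "buy": 1 if i % 5 == 0 else 0} for i in range(100)]
train, valid, test = stratified_split(records, "buy")

for name, part in (("train", train), ("valid", valid), ("test", test)):
    rate = sum(r["buy"] for r in part) / len(part)
    print(f"{name}: {len(part)} records, buy rate {rate:.2f}")
```

Without stratification, a rare target class can end up over- or under-represented in one of the partitions purely by chance.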

 

Homework 2 (due a week later):

1)       Read Section 4.1 of “Effective Web Mining” (document name: CCWEB_TKIT.pdf, pages 4-1 to 4-34). Use dataset DMAIL (in the shared space under the /Data directory) to complete the tasks described in the section. Then add a Tree node and an Assessment node to the diagram to compare the performance of the two classification models. The deliverables include (1) the model diagram, (2) one of the Assessment diagrams, and (3) the performance table in the results of the Assessment node. Add a short explanation of the results.

2)       Use the results of the tree node in the above exercise. Input some of the node information to an Excel sheet (http://zlin.ba.ttu.edu/6347/Tree_gain.xls) and calculate:

a.       Gini values of the first two layers (may not have the third layer)

b.       Entropy values of the layers

c.       Gain ratio of layer 2

3)       Focus on: (1) refining the model to get meaningful outcomes, and (2) learning how to explain the results

4)       Email the Word and Excel files to Zhangxi.lin@hotmail.com
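
For item 2c above, one common definition of gain ratio is the information gain of a split divided by its split information. The Python sketch below follows that definition. The class counts are invented for illustration and are not taken from DMAIL, and the Tree_gain.xls worksheet may organize the calculation differently.

```python
import math


def entropy(counts):
    """Entropy (in bits) of a node, given the number of records in each class."""
    total = sum(counts)
    return -sum((c / total) * math.log2(c / total) for c in counts if c > 0)


def gain_ratio(parent, children):
    """Information gain of a split divided by its split information."""
    n = sum(parent)
    weights = [sum(child) / n for child in children]

    # Information gain: parent entropy minus size-weighted child entropy.
    info_gain = entropy(parent) - sum(
        w * entropy(child) for w, child in zip(weights, children)
    )
    # Split information: entropy of the child sizes themselves.
    split_info = -sum(w * math.log2(w) for w in weights if w > 0)

    return info_gain / split_info if split_info > 0 else 0.0


# Hypothetical root node (60 responders, 40 non-responders) and its two children.
root = [60, 40]
layer2 = [[45, 10], [15, 30]]
print(f"Gain ratio of layer 2: {gain_ratio(root, layer2):.3f}")
```

Dividing by the split information penalizes splits that break the data into many small branches.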

 

------ + ------ + ------- + ------- + ------- + -------

 

Lecture 7, Date: 02/04/2008, Monday

 

Topic: Classification modeling (III)

 

1)       Exercise 1

2)       Classification model scoring

3)       Missing value replacement

 

Review readings and questions:

1)       Applying Data Mining Techniques Using Enterprise Miner (ADMT_001.PDF, ADMT in short) chapter 2.

2)       Getting Started with SAS EM 4.3, Chapter 3-7

3)       SPB chapter 7 & 8, RG chapter 10 (pp 291-302)

 

------ + ------ + ------- + ------- + ------- + -------

 

Lecture 8, Date: 02/06/2008, Wednesday

 

Topic: Logistic Regression with SAS EM

 

1) Logistic regression modeling for classification

 

Handout: Slides 5

 

Review readings and questions:

1)       Data Mining Using SAS Enterprise Miner: A Case Study Approach (DMCS), Chapter 1-2

2)       SPB chapter 8

3)       Decision tree vs. logistic regression – which one is better in construction of classification models?

4)       What are the functions of the Distribution Explorer, Multiplot, and Insight nodes?

 

Homework assignment 3 (due a week later):

Go through Chapter 5 of ADMT. Using the same dataset “BUY” (available in the shared directory under /data subdirectory), construct a classification model using neural network, decision tree and regression nodes. Add an Assess node to compare the results. Present the outcomes with the following printouts:

- A lift chart

- A rotating plot as indicated in page 5-19 to 5-20

- The information from the results of Assessment node that can show the performance of three different classification methods

- Anything else (up to two items) that you believe is significant, for example an unexpected outcome of the decision tree, explained in a couple of sentences.

 

------ + ------ + ------- + ------- + ------- + -------

 

Lecture 9, Date: 02/11/2008, Monday

 

Topic: Variable selection

 

1)       Quiz 2 (classification)

2)       Loan application data mining

3)       Classification model deployment

 

 

Review readings and questions:

1)       Applying Data Mining Techniques Using Enterprise Miner (ADMT_001.PDF) chapter 3-4

2)       Data Mining Using SAS Enterprise Miner: A Case Study Approach (DMCS), Chapter 3

3)       How to use two approaches for classification scoring?

 

------ + ------ + ------- + ------- + ------- + -------

 

Lecture 10, Date: 02/13/2008, Wednesday

 

Topic: Neural Network Classification

 

1) Interactive Grouping node

2) Principle of neural network for data mining

3) Loan application data mining – Neural network

 

Handout: Slides 6

 

Review readings and questions:

1)      Applying Data Mining Techniques Using Enterprise Miner (ADMT_001.PDF) chapter 5

2)      SPB chapter 9, or RG Chapter 8

3)      Go through Chapter 5 of ADMT. Using the same dataset “BUY” (available in the shared directory under /data subdirectory), construct a classification model using neural network, decision tree and regression nodes. Add an Assess node to compare the results. Present the outcomes with the following printouts:

- A lift chart

- A rotating plot as indicated in page 5-19 to 5-20

- The information from the results of Assessment node that can show the performance of three different classification methods

- Anything else (up to two items) that you believe is significant, for example an unexpected outcome of the decision tree, explained in a couple of sentences.

 

------ + ------ + ------- + ------- + ------- + -------

 

Lecture 11, Date: 02/18/2008, Monday

 

Topic: Neural Network Classification

 

1)       Ensemble models

2)       Predictive modeling with dataset MYRAW

3)       Classification modeling contest

 

Handout: Slides 7

 

Review readings and questions:

1)       Applying Data Mining Techniques Using Enterprise Miner (ADMT), Chapter 6

2)       Complete scoring and SAS coding following the instructions in Chapter 6 of ADMT.

 

------ + ------ + ------- + ------- + ------- + -------

 

Lecture 12, Date: 02/20/2008, Wednesday

 

Topic:  Introduction to Clustering

1)       Why clustering

2)       Principles of clustering

 

Reading:

1)       Applying Data Mining Techniques Using Enterprise Miner (ADMT), Chapter 7

2)       SPB ch12, or RG Chapter 3 (Section 3.3)

3)       Clustering: An Introduction, http://www.elet.polimi.it/upload/matteucc/Clustering/tutorial_html/index.html

4)       K-means clustering tutorial, http://people.revoledu.com/kardi/tutorial/kMean/index.html

 

Slides: DM8

Other: Clustering demo

 

Review questions:

1)       What are the main differences between clustering and classification in data mining?

2)       Examine the dataset HMEQ. What outcomes would you expect if you clustered it?
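
As a companion to the K-means tutorial in the reading list, the basic loop of the algorithm can be sketched in plain Python. This is a minimal illustration under assumptions: the six points are made up, the initial centroids are drawn at random from the data, and the SAS EM Clustering node is not claimed to work this way.

```python
import random


def nearest(point, centroids):
    """Index of the centroid closest to point (squared Euclidean distance)."""
    return min(
        range(len(centroids)),
        key=lambda j: sum((a - b) ** 2 for a, b in zip(point, centroids[j])),
    )


def kmeans(points, k, iterations=20, seed=0):
    """Alternate between assigning points to centroids and moving centroids."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # initial centroids picked from the data

    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for p in points:
            clusters[nearest(p, centroids)].append(p)

        # Move each centroid to the mean of its members (keep it if empty).
        new_centroids = [
            tuple(sum(coord) / len(members) for coord in zip(*members))
            if members else centroids[j]
            for j, members in enumerate(clusters)
        ]
        if new_centroids == centroids:  # converged: nothing moved
            break
        centroids = new_centroids

    # Final assignment with the final centroids.
    clusters = [[] for _ in range(k)]
    for p in points:
        clusters[nearest(p, centroids)].append(p)
    return centroids, clusters


# Made-up 2-D points forming two well-separated groups.
points = [(1, 1), (1.5, 2), (1, 0.5), (8, 8), (9, 9.5), (8.5, 9)]
centroids, clusters = kmeans(points, k=2)
for c, members in zip(centroids, clusters):
    print(c, members)
```

Unlike the classification models in earlier lectures, no target variable is supplied here; the groups emerge from the distances between records alone.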

 

Homework assignment 4 (due a week later):

 

1)       SPB p237-239, Problem 12.1; or RG p103, Computational Questions: 10 (feel free to use the clustering worksheet http://zlin.ba.ttu.edu/6347/Clustering.xls )

2)       Use a clustering approach to analyze dataset S3358 (ISQS 3358 student survey data) in the shared directory under the \Other_Data subdirectory. Report a few findings with selected screenshots of the representative results. The following are a few questions that could interest the instructor: