ISQS 6347 Data & Text Mining Lecture Outlines

 

(Last update: April 27, 2006)

 

(29 lectures in total, from January 12 to April 27, 2006)

 

Lecture 1, Date: 01/12/06

 

Topic: Introduction to Data Mining

Handout: PowerPoint slides

 

------ + ------ + ------- + ------- + ------- + -------

 

Lecture 2, Date: 01/17/06

 

Topic: Data mining fundamentals (Concepts, case, data for data mining)

Handout: PowerPoint slides

Reading: RG Chapter 1

Dataset: Credit card promotion

Review questions:

1)       RG pp30-31, questions 1 and 2

2)       Four different types of attributes of data, their properties

3)       Different types of records and data

4)       Data quality: what it is and how to ensure it

5)       Data transformation

6)       Data preprocessing

 

------ + ------ + ------- + ------- + ------- + -------

 

Lecture 3, Date: 01/19/06

 

Topic: Data mining fundamentals (Self-taught)

Handout: use the previous

Reading: RG Chapter 2

Review questions:

1)       Euclidean distance

2)       Minkowski distance

3)       Cosine similarity

4)       Data mining strategies
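
Illustration for review questions 1-3: a minimal Python sketch of the three measures, assuming records are plain lists of numbers of equal length. The function names are chosen here for illustration and do not come from RG or the course handouts.

```python
import math

def minkowski(x, y, r):
    """Minkowski distance of order r between two equal-length numeric vectors."""
    return sum(abs(a - b) ** r for a, b in zip(x, y)) ** (1.0 / r)

def euclidean(x, y):
    """Euclidean distance is the special case of Minkowski distance with r = 2."""
    return minkowski(x, y, 2)

def cosine_similarity(x, y):
    """Cosine of the angle between x and y (assumes neither is the zero vector)."""
    dot = sum(a * b for a, b in zip(x, y))
    norm_x = math.sqrt(sum(a * a for a in x))
    norm_y = math.sqrt(sum(b * b for b in y))
    return dot / (norm_x * norm_y)
```

With r = 1 the Minkowski distance becomes the city-block (Manhattan) distance. Cosine similarity depends only on direction, so a vector and any positive multiple of it have similarity 1.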

Homework assignment 1:

1)       RG p30, question 5

2)       RG p63, Computational Question 1

3)       Find two data mining project topics after checking a few cases on the web (such as www.kdnuggets.com). You will likely choose one of them for your midterm data mining project. Write a short paragraph describing the goal and content of each topic.

 

------ + ------ + ------- + ------- + ------- + -------

 

Lecture 4, Date: 01/24/06

 

Topic: Basic data mining techniques & tools

Reading: RG Chapter 2, Online SAS materials (View PDF (2.24MB))

Handout: PowerPoint slides

Dataset for exercise (from RG): http://zlin.ba.ttu.edu/6347/Demons/CreditProm-tree.xls

SAS code for converting Excel files to SAS datasets: http://zlin.ba.ttu.edu/6347/Demons/import_excel.sas

Lecture outline:

1)       Review

2)       Data mining strategies

3)       Data mining model performance evaluation

4)       Introduction to SAS Enterprise Miner 4.3

Review readings and questions:

1)       RG pp62-63, questions 1-4

2)       Getting Started with SAS EM 4.3, Chapter 1 & 2

 

------ + ------ + ------- + ------- + ------- + -------

 

Lecture 5, Date: 01/26/06

 

Topic: SAS Enterprise Miner 4.3

Reading: RG Chapter 2 & 4, Online SAS materials (same as before)

Handout: PowerPoint slides

Review readings and questions:

1)       RG pp62-63, questions 5-9

2)       Getting Started with SAS EM 4.3, Chapter 3-6

3)       How is decision tree induction carried out?

4)       What are the Gini, entropy, and misclassification indices?
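
Illustration for review question 4: a minimal Python sketch of the three node impurity indices, assuming a node is summarized by its list of class counts. The function names and example counts are hypothetical and are not taken from RG or SAS EM.

```python
import math

def gini(counts):
    """Gini index: 1 minus the sum of squared class proportions."""
    n = sum(counts)
    return 1.0 - sum((c / n) ** 2 for c in counts)

def entropy(counts):
    """Entropy in bits; empty classes contribute nothing to the sum."""
    n = sum(counts)
    return -sum((c / n) * math.log2(c / n) for c in counts if c > 0)

def misclassification(counts):
    """Misclassification error: 1 minus the proportion of the majority class."""
    n = sum(counts)
    return 1.0 - max(counts) / n
```

All three indices are 0 for a pure node and reach their maximum when the classes are evenly mixed.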

Homework assignment 2: RG p64, Computational Question 2, 3

 

------ + ------ + ------- + ------- + ------- + -------

 

Lecture 6, Date: 01/31/06

 

Topic: Decision Tree

Reading: RG Chapter 3, Appendix C, Online SAS materials (same as before)

Lecture outline:

1)       Review

2)       Gain Ratio

Review readings and questions:

1)       Search the web for information about overfitting

2)       RG p102, Review question 1.

 

------ + ------ + ------- + ------- + ------- + -------

Lecture 7, Date: 02/02/06

 

Topic: SAS Enterprise Miner 4.3 (Decision Tree)

Reading: RG Chapter 3, Online SAS materials (same as before)

Lecture outline:

1)       Quiz question review

2)       Determine when to stop splitting a tree

3)       Practical issues in classification: Underfitting and overfitting

4)       Lab exercise: Task 1-6, 10 in “Getting Started …” (View PDF (2.24MB))

Homework assignment 3 (due next Thursday):

1)       Review tasks 1-14 in “Getting Started …”, focusing on the decision tree node. Try different splitting criteria and compare the outcomes. Report any problems you encounter so they can be discussed in class (no need to submit).

2)       Open the result of the tree node. Save some of the node information to an Excel sheet (http://zlin.ba.ttu.edu/6347/Tree_gain.xls) and calculate:

a.       Gini values of the top three layers (feel free to do the fourth layer)

b.       Entropy values of the top three layers (feel free to do the fourth layer)

c.       Gain ratio of layers 2 and 3 (feel free to do the fourth layer)

Submit the results by email.
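
Illustration for part c: a minimal Python sketch of the gain ratio of a single split, assuming the parent node and each child node are given as lists of class counts. The counts you use should be read from your own tree node results; the function names here are hypothetical.

```python
import math

def entropy(counts):
    """Entropy in bits of a node described by its class counts."""
    n = sum(counts)
    return -sum((c / n) * math.log2(c / n) for c in counts if c > 0)

def gain_ratio(parent, children):
    """Information gain of a split divided by its split information.

    parent   -- class counts at the parent node
    children -- one list of class counts per child node
    """
    n = sum(parent)
    weights = [sum(child) / n for child in children]
    info_gain = entropy(parent) - sum(
        w * entropy(child) for w, child in zip(weights, children)
    )
    # Split information is the entropy of the child-size distribution.
    split_info = -sum(w * math.log2(w) for w in weights if w > 0)
    # A "split" into a single child has zero split information; return 0.
    return info_gain / split_info if split_info > 0 else 0.0
```

Dividing by the split information penalizes splits that produce many small branches, which plain information gain tends to favor.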

 

 

------ + ------ + ------- + ------- + ------- + -------

 

Lecture 8, Date: 02/07/06

 

Topic: SAS Enterprise Miner 4.3 (Basic skills)

Reading: RG Chapter 3, Online SAS materials (“Getting Started …” Chapter 7)

Lecture outline:

Example flow diagram task 7-14:

1)       Basic use of different model nodes: Decision Tree, Regression, and Neural Network

2)       Variable transformation

3)       Assessment

4)       Scoring

Review questions:

1)       What are the roles of the validation dataset and the test dataset?

2)       What is overfitting in the decision tree approach? How to prevent overfitting?

3)       How do you decide when to stop splitting a tree?

4)       How to evaluate the performance of a model?

5)       What is Prior Probability? What is its relationship with the sample probability? How to define it in SAS EM?

6)       What is stratification? Why do we need it? How can you set stratification parameters in SAS EM? 
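
Illustration for review question 5: a minimal Python sketch of one common way to relate prior probabilities to sample proportions, namely rescaling a model's posterior probabilities when the training sample was drawn with class proportions that differ from the population. This is a general correction, not a description of how SAS EM implements priors; the function and parameter names are hypothetical.

```python
def adjust_posteriors(posteriors, sample_props, priors):
    """Rescale posteriors estimated on a sample to the population priors.

    posteriors   -- model posterior probability for each class
    sample_props -- proportion of each class in the training sample
    priors       -- assumed proportion of each class in the population
    """
    weighted = [
        p * prior / rho
        for p, rho, prior in zip(posteriors, sample_props, priors)
    ]
    total = sum(weighted)
    # Renormalize so the adjusted probabilities sum to 1.
    return [w / total for w in weighted]
```

For example, if a rare event was oversampled to 50% of the training data but makes up 10% of the population, a predicted probability of 0.5 for that event is pulled down toward 0.1 after adjustment.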

 

------ + ------ + ------- + ------- + ------- + -------

 

Lecture 9, Date: 02/09/06

 

Topic: SAS Enterprise Miner 4.3 (Decision tree & regression)

Reading: RG Chapter 3, Chapter 10 (pp291-308)

Lecture outline:

1)       Alternatives based on the example flow diagram tasks 1-16 (make use of the code for Task 15: http://zlin.ba.ttu.edu/6347/score_card.sas)

a.       Understanding data mining outcomes (http://zlin.ba.ttu.edu/6347/Lift.xls)

b.       Changing parameters

c.       Comparison between different outcomes

2)       A home equity data mining example (regression)

a.       How to handle missing values

b.       Understanding data replacement

c.       Fitting and comparing candidate models
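
Illustration for item 1a (understanding lift): a minimal Python sketch of cumulative lift, assuming each case has a model score and a 0/1 actual outcome. The data layout and function name are hypothetical and are not taken from Lift.xls.

```python
def cumulative_lift(scores, actuals, fraction):
    """Response rate among the top-scored fraction divided by the overall rate.

    scores   -- model scores, higher meaning more likely to respond
    actuals  -- 0/1 outcomes in the same order as scores
    fraction -- share of cases to target, e.g. 0.2 for the top 20%
    Assumes at least one actual response, so the overall rate is not zero.
    """
    ranked = sorted(zip(scores, actuals), key=lambda pair: pair[0], reverse=True)
    top_n = max(1, int(round(len(ranked) * fraction)))
    top_rate = sum(actual for _, actual in ranked[:top_n]) / top_n
    overall_rate = sum(actuals) / len(actuals)
    return top_rate / overall_rate
```

A lift above 1 at a given depth means the model finds responders faster than random selection; at a depth of 100% the lift is always 1.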

Homework assignment 4:

1)       RG pp102-103, Computational Questions: 1, 4, 5

2)       Exploring the home equity data mining example (no need to submit)

 

------ + ------ + ------- + ------- + ------- + -------

 

Lecture 10, Date: 02/14/06

 

Topic: SAS Enterprise Miner 4.3 (Variable selection)

Reading: RG Chapter 3

Lecture outline:

1)       Variable Selection node & Score node

2)       A home equity data mining example (continued)

a.       Using Score code

b.       Using Variable Selection node

Review questions:

            N/A

------ + ------ + ------- + ------- + ------- + -------

 

Lecture 11, Date: 02/16/06

 

Topic: Clustering

Reading: RG Chapter 3 (pp84-100) & 10 (pp308-317), TSK Chapter 8

Lecture outline: