ISQS 6347 Data & Text Mining Lecture Outlines

 

(Total 29 lectures, January 7 - April 27, 2009)

 

Textbooks

 

SPB: Data Mining for Business Intelligence, Galit Shmueli, Nitin R. Patel, and Peter C. Bruce

TSK: Introduction to Data Mining, Pang-Ning Tan, Michael Steinbach, and Vipin Kumar

RG: Data Mining: A Tutorial-Based Primer, Richard Roiger and Michael Geatz

 

SAS Course Notes

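EM_GS: Getting Started with SAS® Enterprise Miner 4.3

EM_TMGS (also cited below as GSTM): Getting Started with SAS® 9.1 Text Miner

CSA (also cited below as DMCS): Data Mining Using SAS Enterprise Miner: A Case Study Approach

DMTM (also cited below as TMUS): Text Mining Using SAS® Software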

ADMT: Applying Data Mining Techniques Using Enterprise Miner

CCWEB: Effective Web Mining: Attracting and Keeping Valued Cyber Consumers

 

 

Lecture 1, Date: 01/07/2009, Wednesday

 

Topic: Introduction to Data Mining

Handout: PowerPoint DM1.ppt

Reading: SPB/RG/TSK Chapter 1

 

------ + ------ + ------- + ------- + ------- + -------

 

Lecture 2, Date: 01/12/2009, Monday

 

Topic: Data mining fundamentals

Handout: PowerPoint DM2.ppt

Reading: SPB/RG/TSK Chapter 2

 

1)       Introduction to SAS Enterprise Miner

2)       Data for data mining

3)       Data preprocessing

 

Terminology: predictor, observation, confidence, dependent variable, estimation, response, score, supervised learning, unsupervised learning

 

Review questions:

1)       SPB p31, Problems 2.1 and 2.2, or RG pp30-31, questions 1 and 2

2)       Four different types of attributes of data, their properties

3)       Different types of records and data

4)       Data quality: what it is and how to ensure it

 

------ + ------ + ------- + ------- + ------- + -------

 

Lecture 3, Date: 01/14/2009, Wednesday

 

Topic: Data mining fundamentals

1)       Data mining strategies

2)       Illustrative cases

3)       Confusion matrix and classification performance evaluation
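A minimal Python sketch of classification performance evaluation from a confusion matrix. It is not part of the course materials (which use SAS Enterprise Miner); the counts are hypothetical, and the layout (rows are actual classes, columns are predicted classes) is an assumption. Course-specific measures such as coverage rate and lift should follow the definitions in the slides.

```python
# Hypothetical two-class confusion matrix (illustrative counts only).
# Assumed layout: rows are actual classes, columns are predicted classes.
labels = ["yes", "no"]
matrix = [
    [30, 10],  # actual "yes": 30 predicted "yes", 10 predicted "no"
    [5, 55],   # actual "no":   5 predicted "yes", 55 predicted "no"
]


def accuracy(m):
    """Share of all instances that lie on the main diagonal."""
    correct = sum(m[i][i] for i in range(len(m)))
    total = sum(sum(row) for row in m)
    return correct / total


def precision(m, k):
    """Of the instances predicted as class k, the share that really are k."""
    predicted_k = sum(row[k] for row in m)
    return m[k][k] / predicted_k


def recall(m, k):
    """Of the instances that really are class k, the share predicted as k."""
    return m[k][k] / sum(m[k])


print("accuracy:", accuracy(matrix))
print("precision for 'yes':", precision(matrix, 0))
print("recall for 'yes':", recall(matrix, 0))
```

The same functions work unchanged for a matrix with three or more classes, since each one only reads the diagonal, one row, or one column.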

 

Handout: DM3.ppt

 

Homework 1 (due 01/21/2009):

 

1)       Develop a decision tree using the credit card promotion data in the slides. You need to choose one of the variables as the target. In addition, construct a confusion matrix and indicate the lift, coverage rate, and accuracy rate.

2)       A dataset has 1000 records and 50 variables with 5% of the values missing, spread randomly throughout the records and variables. An analyst decides to remove records that have missing values. About how many records would you expect to be removed?

3)       Consider the following three-class confusion matrix. The matrix shows the classification results of a supervised model that uses previous voting records to determine the political party affiliation (Republican, Democrat, or Independent) of members of the United States Senate.

 

 

         Rep    Dem    Ind
Rep       42      2      1
Dem        5     40      3
Ind        0      3      4

  1. What percent of the instances were correctly classified?
  2. According to the confusion matrix, how many Democrats are in the Senate? How many Republicans? How many Independents?
  3. How many Republicans were classified as belonging to the Democratic Party?
  4. How many Independents were classified as Republicans?
  5. What are the accuracy rates of the classification?
  6. What are the coverage rates of the classification?
  7. What are the numbers of false positives (FP) and false negatives (FN)?

 

Submission format: Hardcopy.

------ + ------ + ------- + ------- + ------- + -------

 

Lecture 4, Date: 01/21/2009, Wednesday

 

Topic: Classification modeling (I)

1)       Review of data mining tasks

2)       Decision tree modeling

3)       Determining the best split using the Gini index, entropy, or misclassification error
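A minimal Python sketch of the three impurity measures named above and of how a candidate split is scored with them. It is an illustration only, not the SAS Enterprise Miner implementation; the class counts and function names are hypothetical.

```python
from math import log2


def gini(counts):
    """Gini index of a node given its class counts."""
    n = sum(counts)
    return 1.0 - sum((c / n) ** 2 for c in counts)


def entropy(counts):
    """Entropy (in bits) of a node given its class counts."""
    n = sum(counts)
    return -sum((c / n) * log2(c / n) for c in counts if c > 0)


def misclassification_error(counts):
    """Error rate if every record in the node gets the majority class."""
    return 1.0 - max(counts) / sum(counts)


def weighted_impurity(children, measure):
    """Impurity of a split: child impurities weighted by child size."""
    total = sum(sum(child) for child in children)
    return sum(sum(child) / total * measure(child) for child in children)


# Hypothetical node with 10 records of each class, split into two children.
parent = [10, 10]
split = [[8, 2], [2, 8]]

for measure in (gini, entropy, misclassification_error):
    gain = measure(parent) - weighted_impurity(split, measure)
    print(measure.__name__, "reduction:", round(gain, 4))
```

The split with the largest impurity reduction is preferred; different measures can rank candidate splits differently.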

 

Handout: DM4.ppt

 

Review readings and questions:

1)       SPB p51, questions 3.2 and 3.4 (or RG p62, questions 1-4)

2)       SPB p74, question 4.1, 4.2

3)       Getting Started with SAS EM 4.3, Chapter 1 – 2

4)       SAS data exploration PROCs: CONTENTS, FREQ

 

------ + ------ + ------- + ------- + ------- + -------

 

Lecture 5, Date: 01/26/2009, Monday

 

Topic: Classification modeling (II)

1)       Quiz 1

2)       Determining when to stop splitting

3)       How to use SAS Enterprise Miner

 

Review readings and questions:

1)       Getting Started with SAS EM 4.3, Chapter 3-7

2)       SPB chapter 7, RG Chapter 3

3)       What is overfitting in the decision tree approach? How to prevent overfitting?

4)       How to decide to stop splitting a tree?

5)       How to evaluate the performance of a model?

 

Homework 2 (due 02/04/2009 Wednesday):

1)       Read Section 4.1 of “Effective Web Mining” (document name: CCWEB_TKIT.pdf, pages 4-1 to 4-34). Use dataset DMAIL (in the shared space under the \Data directory) to develop two decision tree models: a basic one without any parameter changes, and another that uses the Gini splitting criterion. Then add an Assessment node to the diagram to compare the performance of the two classification models. Feel free to try different splitting criteria (Chi-Square, Gini, and Entropy) and other parameters in the “Advanced” tab of the Tree node configuration panel. The deliverables include

a.       the model diagram,

b.       one of the Assessment diagrams, and

c.       the performance table in the results of the Assessment node (see the following). A short explanation of the results is necessary.

2)       Use the results of the tree node in the above exercise. Input some of the node information to an Excel sheet (http://zlin.ba.ttu.edu/6347/Tree_gain.xls) and calculate:

a.       Gini values of the first two layers (may not have the third layer)

b.       Entropy values of the layers

c.       Gain ratio of layer 2

Focuses: (1) Refining the model to get meaningful outcomes, (2) Learning how to explain the results

 

Submission format: Email the Word and Excel files to Zhangxi.lin@hotmail.com

 

------ + ------ + ------- + ------- + ------- + -------

 

Lecture 6, Date: 01/28/2009, Wednesday

 

Topic: Illustrative case study

 

1)       Decision tree modeling with SAS Enterprise Miner

2)       Model evaluation


Review readings and questions:

1)       Why do we need to split a dataset into training, validation and test datasets? What are the different purposes of using validation and test datasets?

2)       What is Prior Probability? What is its relationship with the sample probability? How to define it in SAS EM?

3)       What is stratification? Why do we need it? How can you set stratification parameters in SAS EM?
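A minimal Python sketch of a stratified split into training, validation, and test sets. It is an illustration only and does not reproduce the SAS EM Data Partition node; the 60/20/20 fractions, the field name `bad`, and the function name are assumptions.

```python
import random
from collections import defaultdict


def stratified_split(records, target, fractions=(0.6, 0.2, 0.2), seed=0):
    """Split records so each part keeps the target's class proportions.

    fractions gives the training and validation shares; the test set
    receives whatever remains of each class.
    """
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for rec in records:
        by_class[rec[target]].append(rec)

    train, valid, test = [], [], []
    for group in by_class.values():
        rng.shuffle(group)  # random order within each class
        n_train = round(len(group) * fractions[0])
        n_valid = round(len(group) * fractions[1])
        train += group[:n_train]
        valid += group[n_train:n_train + n_valid]
        test += group[n_train + n_valid:]
    return train, valid, test


# Hypothetical data: 100 records, 20% of which have bad = 1.
records = [{"id": i, "bad": 1 if i < 20 else 0} for i in range(100)]
train, valid, test = stratified_split(records, "bad")

for name, part in (("train", train), ("valid", valid), ("test", test)):
    rate = sum(r["bad"] for r in part) / len(part)
    print(name, len(part), "records, bad rate", rate)
```

Without stratification, a rare target class could by chance be nearly absent from the validation or test set, which makes model comparison unreliable.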

 

 

------ + ------ + ------- + ------- + ------- + -------

 

Lecture 7, Date: 02/02/2009, Monday

 

Topic: Classification modeling (III)

 

1)       Model performance

2)       Logistic regression modeling for classification

3)       Exercise 1
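A minimal Python sketch of how a fitted logistic regression model is used for classification: the linear score is passed through the logistic function and compared with a cutoff. The coefficients, variable names, and the 0.5 cutoff are hypothetical and are not taken from the course data.

```python
from math import exp


def logistic(z):
    """Map a linear score z to a probability between 0 and 1."""
    return 1.0 / (1.0 + exp(-z))


def score(record, intercept, coefficients):
    """Predicted probability of the target event for one record."""
    z = intercept + sum(coefficients[name] * record[name] for name in coefficients)
    return logistic(z)


def classify(probability, cutoff=0.5):
    """Assign class 1 when the probability reaches the cutoff."""
    return 1 if probability >= cutoff else 0


# Hypothetical fitted model and applicant (illustrative values only).
intercept = -2.0
coefficients = {"debt_ratio": 3.0, "delinquencies": 0.8}
applicant = {"debt_ratio": 0.5, "delinquencies": 1}

p = score(applicant, intercept, coefficients)
print("probability:", round(p, 4), "class:", classify(p))
```

Lowering the cutoff classifies more records as the event, trading more false positives for fewer false negatives.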

 

Handout: DM5.ppt

 

Review readings and questions:

1)       Applying Data Mining Techniques Using Enterprise Miner (ADMT_001.PDF, ADMT in short) chapter 2

2)       SPB chapter 7 & 8, RG chapter 10 (pp 291-302)

3)       What do oversampling and undersampling mean respectively?

 

------ + ------ + ------- + ------- + ------- + -------

 

Lecture 8, Date: 02/04/2009, Wednesday

 

Topic: Logistic Regression with SAS EM

 

1)       Missing value replacement

2)       HMEQ data mining demonstration

3)       Classification model scoring

 

Review readings and questions:

1)       Data Mining Using SAS Enterprise Miner: A Case Study Approach (DMCS), Chapter 1-2

2)       SPB chapter 8

3)       Decision tree vs. logistic regression: which one is better for constructing classification models?

4)       How to use the Interactive Grouping node?

5)       What are the functions of the Distribution Explorer, Multiplot, and Insight nodes?

 

Homework assignment 3 (due 2/23/2009, Monday):

Go through Chapter 5 of ADMT. Using the same dataset “BUY” (available in the shared directory under /data subdirectory), construct a classification model using neural network, decision tree and regression nodes. Add an Assessment node to compare the results. Present the outcomes with the following printouts:

- A lift chart

- A rotating plot as indicated on pages 5-19 to 5-20

- The information from the results of the Assessment node that shows the performance of the three classification methods

- Anything else you found (up to two items) that you believe is significant, for example an unexpected outcome of the decision tree. Try to explain each finding in a couple of sentences.

 

Submission format: Hardcopy

------ + ------ + ------- + ------- + ------- + -------

 

Lecture 9, Date: 02/09/2009, Monday

 

Topic: Variable selection

 

1)       Quiz 2 (classification)

2)       Loan application data mining

3)       Classification model deployment

 

Review readings and questions:

1)       Applying Data Mining Techniques Using Enterprise Miner (ADMT_001.PDF) chapter 3-4

2)       Data Mining Using SAS Enterprise Miner: A Case Study Approach (DMCS), Chapter 3

3)       How to use two approaches for classification scoring?

 

------ + ------ + ------- + ------- + ------- + -------

 

Lecture 10, Date: 02/11/2009, Wednesday (rescheduled to 02/16/2009)

 

Topic: Neural Network Classification

 

1)       Quiz 2 review

2)       Principle of neural network for data mining

3)       Loan application data mining – Neural network

Handout: DM6

 

Review readings and questions:

1)      Applying Data Mining Techniques Using Enterprise Miner (ADMT_001.PDF) chapter 5

2)      SPB chapter 9, or RG Chapter 8

 

------ + ------ + ------- + ------- + ------- + -------

 

Lecture 11, Date: 02/16/2009, Monday (to be rescheduled)

 

Topic: Neural Network Classification

 

1)       Node Insight, Variable Selection, Code, etc.

2)       Ensemble models

3)       Review

 

Review reading:

Applying Data Mining Techniques Using Enterprise Miner (ADMT_001.PDF) chapter 6

 

------ + ------ + ------- + ------- + ------- + -------

 

Lecture 12, Date: 02/18/2009, Wednesday

 

Topic: Review

 

1)       Predictive modeling with dataset MYRAW

2)       Classification modeling contest

 

------ + ------ + ------- + ------- + ------- + -------

 

Lecture 13, Date: 02/23/2009, Monday

 

Topic:  Introduction to Clustering

 

1)       Quiz 3

 

Reading:

1)       Applying Data Mining Techniques Using Enterprise Miner (ADMT), Chapter 7

2)       SPB ch12, or RG Chapter 3 (Section 3.3)

3)       Clustering: An Introduction, http://www.elet.polimi.it/upload/matteucc/Clustering/tutorial_html/index.html

4)       K-means clustering tutorial, http://people.revoledu.com/kardi/tutorial/kMean/index.html

 

Homework assignment 4 (due 03/04/2009):

 

Use SAS EM to analyze dataset S3358 (ISQS 3358 student survey data), which is in the shared directory under \3358SURVEY subdirectory. You need to convert the text file into SAS format by importing it. You do not need to use all the variables. So you need to study the survey form and explore the data. Report three findings with selected screenshots of the representative results. The following are a few questions to guide the analysis:

1.       Into how many groups should the students be divided? Why?

2.       What are the characteristics of each group?

3.       Which factors are more important in clustering the students?

 

Datasets: 3358Survey.txt – to be converted into the SAS format

Submission format: Hardcopy

 

------ + ------ + ------- + ------- + ------- + -------

 

Lecture 14, Date: 02/25/2009, Wednesday

 

Topic:  Clustering

1)       Why clustering

2)       Principles of clustering

3)       A clustering case study

 

Slides: DM7

Other: Clustering demo

 

Readings and review questions:

1)  Data Mining Using SAS Enterprise Miner: A Case Study Approach (DMCS), Chapter 4

2)  What are the main differences between clustering and classification in data mining?

3)  Check the dataset HMEQ. What would the outcomes be if you clustered it?

 

------ + ------ + ------- + ------- + ------- + -------

 

Lecture 15, Date: 03/02/2009, Monday

 

Topic:  Clustering

1)       Clustering case study

2)       SOM clustering

3)       Hierarchical clustering

 

Slides: DM8

 

Readings and review questions:

1)       Hierarchical clustering, http://www.elet.polimi.it/upload/matteucc/Clustering/tutorial_html/hierarchical.html,

http://www.resample.com/xlminer/help/HClst/HClst_intro.htm

2)      Use the clustering worksheet (http://zlin.ba.ttu.edu/6347/Clustering.xls) to explore different outcomes of clustering. Modify the coordinates of the instances to obtain different datasets and check the outcomes. Record selected k-means clustering iterations for 2 different sets of instances, including the illustrative charts.
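A minimal Python sketch of the k-means iterations that the worksheet exercise asks you to trace. It is not the worksheet's own logic; the points, the starting centroids, and the function name are hypothetical.

```python
def kmeans(points, centroids, max_iter=100):
    """Plain k-means on 2-D points, printing the centroids at each iteration."""
    clusters = []
    for iteration in range(1, max_iter + 1):
        # Assignment step: each point joins its nearest centroid.
        clusters = [[] for _ in centroids]
        for p in points:
            distances = [(p[0] - c[0]) ** 2 + (p[1] - c[1]) ** 2 for c in centroids]
            clusters[distances.index(min(distances))].append(p)

        # Update step: each centroid moves to the mean of its cluster.
        # An empty cluster keeps its previous centroid.
        new_centroids = [
            (sum(x for x, _ in cl) / len(cl), sum(y for _, y in cl) / len(cl)) if cl else c
            for cl, c in zip(clusters, centroids)
        ]
        print("iteration", iteration, new_centroids)

        if new_centroids == centroids:  # no centroid moved: converged
            break
        centroids = new_centroids
    return centroids, clusters


# Hypothetical instances forming two visible groups, with poor starting centroids.
points = [(1, 1), (1, 2), (2, 1), (8, 8), (9, 8), (8, 9)]
final_centroids, final_clusters = kmeans(points, [(1, 1), (2, 1)])
```

Changing the starting centroids or the coordinates, as the exercise suggests, can change both the number of iterations and the final clusters.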

 

------ + ------ + ------- + ------- + ------- + -------

 

Lecture 16, Date: 03/04/2009, Wednesday

 

Topic:  Association Analysis

1)       Introduction to association analysis

2)       Sequential pattern analysis

3)       Exercise 2

 

Readings and review questions:

1)       ADMT Ch 8.1

2)       SPB chapter 11

3)       Data Mining Using SAS Enterprise Miner: A Case Study Approach (DMCS), Chapter 5

 

------ + ------ + ------- + ------- + ------- + -------

 

Lecture 17, Date: 03/09/2009, Monday

 

Topic:  Association Analysis

 

1)       Quiz 4 (clustering)

2)       Evaluation of association patterns

3)       Dissociation analysis

4)       Link analysis

 

Readings and review questions:

1)       ADMT Ch 8.2

2)       SPB chapter 11

 

Homework assignment 5 (Due on March 23, Monday):

1)       Redo the association analysis example ASSOCS. (The dataset is in the SAS EM library SAMPSIO. All the datasets used in these course notes are in SAMPSIO, though some have different names. For example, DMWEB in the book becomes WEBPATH in SAMPSIO.)

2)       Do link analysis following DMCS with dataset WEBPATH in SAMPSIO.

3)       Report the outcomes

 

------ + ------ + ------- + ------- + ------- + -------

 

Lecture 18, Date: 03/11/2009, Wednesday

 

Topic:  Association Analysis

1)       Other types of association analysis

2)       Itemset generation - Apriori principle

3)       Association rule discovery and generation
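A minimal Python sketch of itemset generation with the Apriori principle (every subset of a frequent itemset must itself be frequent) and of rule evaluation by support, confidence, and lift. It is an illustration only, not the SAS EM Association node; the transactions, the 0.6 support threshold, and the function names are hypothetical.

```python
from itertools import combinations

# Hypothetical market-basket transactions (illustrative only).
transactions = [
    {"bread", "milk"},
    {"bread", "diapers", "beer", "eggs"},
    {"milk", "diapers", "beer", "cola"},
    {"bread", "milk", "diapers", "beer"},
    {"bread", "milk", "diapers", "cola"},
]


def support(itemset):
    """Fraction of transactions that contain every item in itemset."""
    return sum(1 for t in transactions if itemset <= t) / len(transactions)


def frequent_itemsets(min_support):
    """Level-wise search that prunes candidates with an infrequent subset."""
    items = sorted({i for t in transactions for i in t})
    frequent = {}
    current = [frozenset([i]) for i in items]
    k = 1
    while current:
        survivors = {c: support(c) for c in current if support(c) >= min_support}
        frequent.update(survivors)
        k += 1
        # Join surviving itemsets into size-k candidates, then keep only
        # those whose (k-1)-item subsets all survived (Apriori pruning).
        candidates = {a | b for a in survivors for b in survivors if len(a | b) == k}
        current = [
            c for c in candidates
            if all(frozenset(s) in survivors for s in combinations(c, k - 1))
        ]
    return frequent


def confidence(antecedent, consequent):
    """Estimated P(consequent | antecedent)."""
    return support(antecedent | consequent) / support(antecedent)


def lift(antecedent, consequent):
    """Confidence relative to the consequent's baseline support."""
    return confidence(antecedent, consequent) / support(consequent)


frequent = frequent_itemsets(min_support=0.6)
for itemset, s in sorted(frequent.items(), key=lambda kv: (len(kv[0]), sorted(kv[0]))):
    print(sorted(itemset), s)

rule = (frozenset({"diapers"}), frozenset({"beer"}))
print("confidence:", confidence(*rule), "lift:", lift(*rule))
```

A lift above 1 suggests the antecedent and consequent occur together more often than they would if they were independent.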

 

Readings and review questions:

1)       ADMT Ch 8.1

2)       SPB chapter 11

3)       Data Mining Using SAS Enterprise Miner: A Case Study Approach (DMCS), Chapter 6

 

------ + ------ + ------- + ------- + ------- + -------

 

Spring break!

 

------ + ------ + ------- + ------- + ------- + -------

 

Lecture 19, Date: 03/23/2009, Monday

 

Topic:  Review of Association Analysis

1)       Review

2)       Exercise 3

 

------ + ------ + ------- + ------- + ------- + -------

 

Lecture 20, Date: 03/25/2009, Wednesday

 

Topic:  Text Mining – Preliminary (I)

1)      Introduction to text mining

2)       Processing textual data

 

Slides: TM-1

Dataset: SVDTUTOR

 

Readings:

1)       Text Mining Using SAS Software (TMUS), Chapter 1

2)       CCWEB_TKIT Section 2.1, 2.2

3)       RG Ch11 (pp342-343)

 

Online materials for Singular Value Decomposition (SVD):

1)       Basics of Matrix: http://www.xycoon.com/matrix_algebra.htm

2)       http://mathworld.wolfram.com/SingularValueDecomposition.html

3)       http://www.uwlax.edu/faculty/will/svd/

4)       http://web.mit.edu/be.400/www/SVD/Singular_Value_Decomposition.htm

5)       http://kwon3d.com/theory/jkinem/svd.html
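A minimal Python sketch related to the SVD materials above: power iteration finds the leading singular value and singular vectors of a small term-by-document matrix. This is only one simple way to approximate the first SVD component and is not how SAS Text Miner computes it; the matrix and function names are hypothetical.

```python
from math import sqrt


def matvec(m, v):
    """Matrix-vector product."""
    return [sum(a * b for a, b in zip(row, v)) for row in m]


def transpose(m):
    return [list(col) for col in zip(*m)]


def norm(v):
    return sqrt(sum(x * x for x in v))


def leading_singular_triplet(a, iterations=200):
    """Approximate (u, sigma, v) for the largest singular value of a.

    Repeatedly applies A^T A to a starting vector, which converges to the
    leading right singular vector when the largest singular value is unique.
    """
    at = transpose(a)
    v = [1.0] * len(a[0])
    for _ in range(iterations):
        w = matvec(at, matvec(a, v))
        n = norm(w)
        v = [x / n for x in w]
    av = matvec(a, v)
    sigma = norm(av)
    u = [x / sigma for x in av]
    return u, sigma, v


# Hypothetical term-by-document counts: rows are terms, columns are documents.
term_doc = [
    [2, 1, 0],
    [1, 2, 0],
    [0, 0, 1],
]
u, sigma, v = leading_singular_triplet(term_doc)
print("sigma:", round(sigma, 6))
print("document weights:", [round(x, 4) for x in v])
```

In this toy matrix the first two documents share terms, so they receive nearly all the weight in the leading component while the third document receives almost none.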

 

------ + ------ + ------- + ------- + ------- + -------

 

Lecture 21, Date: 03/30/2009, Monday

 

Topic:  Text Mining – Preliminary (II)

 

1)       Quiz 5 (Association analysis)

2)       Transformations

3)       An illustrative example

 

Dataset: ABSTRACT

 

Reading:

1)       Getting Started with SAS Text Miner (GSTM) Ch1-4 (pp1-30)

2)       Text Mining Using SAS Software (TMUS), Chapter 1

 

------ + ------ + ------- + ------- + ------- + -------

 

Lecture 22, Date: 04/01/2009, Wednesday

 

Topic:  Exploratory Analysis of Documents

1)       Simple statistical analysis

2)       Text data exploration

 

Online reading materials on Hidden Markov Models (HMM):

1)       http://jedlik.phy.bme.hu/~gerjanos/HMM/node2.html

2)       http://www.csse.monash.edu.au/~lloyd/tildeMML/Structured/HMM.html

3)       http://www.autonlab.org/tutorials/hmm14.pdf

4)       Getting Started with SAS Text Miner (GSTM) Ch5-7

 

Homework assignment 6 (due on April 14, 2009):

1)       Read the Amazon Book example in TMUS Section 2.3 and replicate it.

2)       Replicate the steps of the insurance claim example in TMUS 3.3 (no need to report all outcomes; report only (1) the model diagram and (2) the response chart from the Assessment node). Why are the outcomes from the SVD-based regression better than those from the cluster-ID-based regression?

3)       Explain the configuration of SVD and roll-up terms. What are the differences between them?

------ + ------ + ------- + ------- + ------- + -------

 

Lecture 23, Date: 04/06/2009, Monday

 

Topic: In-class Exercise 4

 

1)       Text data exploration

2)       Exercise 4 (Text Mining)

 

Slides: TM-2

 

Reading: Text Mining Using SAS Software (TMUS), Chapter 2

 

------ + ------ + ------- + ------- + ------- + -------

 

Lecture 24, Date: 04/08/2009, Wednesday

 

Topic:  Text Mining for Predictive Modeling

 

Case: Insurance subrogation

 

Dataset: INSSUBRO

Slides: TM-3

 

Readings: Text Mining Using SAS Software (TMUS), Chapter 3

 

------ + ------ + ------- + ------- + ------- + -------

 

Date: 04/13/2009, Monday

 

No class

 

------ + ------ + ------- + ------- + ------- + -------

 

Lecture 25, Date: 04/15/2009, Wednesday

 

Topic:  Conducting a Text Mining Project

 

1)       Quiz 6

2)       Text mining data preparation

 

Readings:

1)       Text Mining Using SAS Software (TMUS), Chapter 3

2)       Reference papers (in the subdirectory \SUGI05 and \SUGI06 of the shared diskspace)

 

Homework assignment 7 (optional, due on April 27, 2009):

P5-75, CCWEB_KIT.pdf, Exercises, Problem 1-2.

 

------ + ------ + ------- + ------- + ------- + -------

 

Lecture 26, Date: 04/20/2009, Monday

 

Topic:  Introduction to Web Mining

 

Lecture outline:

1)       Quiz 6 review

2)       Introduction to Web Mining

3)       FSLINKS dataset analysis

4)       Web access analysis

1.       Link analysis

2.       Association analysis

 

Datasets: FSLINKS, RLINKS

Reading: Effective Web Mining: Attracting and Keeping Valued Cyber Consumers (CCWEB), Chapter 1-2

Slides: WebMining1

 

------ + ------ + ------- + ------- + ------- + -------

 

Lecture 27, Date: 04/22/2009, Wednesday

 

Topic:  Online targeted advertising

 

Lecture outline:

1)       Propensity-to-buy

2)       Exercise 5 (web mining)

 

Datasets: PROPBUY, BANNER, BANNERAD

Reading: Effective Web Mining: Attracting and Keeping Valued Cyber Consumers (CCWEB), Chapter 3-4

Slides: WebMining2

 

------ + ------ + ------- + ------- + ------- + -------

 

Lecture 28, Date: 04/27/2009, Monday

 

Topic:  Online recommender systems

 

1)       Introduction to online recommender systems

 

Slides: WebMining3

 

------ + ------ + ------- + ------- + ------- + -------

The following are to be updated.

 


The lecture notes for the spring 2008 class

 

------ + ------ + ------- + ------- + ------- + -------

 

Lecture 26, Date: 04/15/2008, Wednesday

 

Topic:  Web-based Recommender Systems

 

Lecture outline:

1)       Exercise 5 (Web mining)

2)       Principle of online recommendation systems

 

Reading: Effective Web Mining: Attracting and Keeping Valued Cyber Consumers (CCWEB), Chapter 6

 

Slides: WebMining3

 

Datasets: MOVIEBUY

 

------ + ------ + ------- + ------- + ------- + -------

 

Lecture 27, Date: 04/20/2008, Monday

 

Topic: Web-based Recommender Systems

 

Reading: Effective Web Mining: Attracting and Keeping Valued Cyber Consumers (CCWEB), Chapter 6

Slides: WebMining3

 

------ + ------ + ------- + ------- + ------- + -------

 

Lecture 28, Date: 04/22/2008, Wednesday

 

Topic:  Bayes Theorem and Data Mining

Slides: Bayes Theorem

 

------ + ------ + ------- + ------- + ------- + -------

 

Lecture 29, Date: 04/27/2008, Monday

 

Topic:  Review

 

2)       Exercise 6 (Web mining)

3)       Quiz 7 (Web mining)

4)       Review

 

------ + ------ + ------- + ------- + ------- + -------