ISQS 6347 Data & Text Mining Lecture Outlines
(Total 29 lectures, January 7 - April 27, 2009)
Textbooks
SPB: Data Mining for Business Intelligence, Galit Shmueli, Nitin R. Patel, Peter C. Bruce
RG: Data Mining – A Tutorial Based Primer, Richard Roiger, Michael Geatz
SAS Course Notes
EM_GS: Getting Started with SAS® Enterprise Miner 4.3
EM_TMGS: Getting Started with SAS® 9.1 Text Miner
CSA: Data Mining - A Case Study Approach
DMTM: Text Mining Using SAS® Software
ADMT: Applying Data Mining Techniques Using Enterprise Miner
CCWEB: Effective Web Mining: Attracting and Keeping Valued Cyber Consumers
Lecture 1, Date: 01/07/2009, Wednesday
Topic: Introduction to Data Mining
Handout: PowerPoint DM1.ppt
------ +
------ + ------- + ------- + ------- + -------
Lecture 2, Date: 01/12/2009, Monday
Topic: Data mining fundamentals
Handout: PowerPoint DM2.ppt
1) Introduction to SAS
2) Data for data mining
3) Data preprocessing
Terminology: predictor, observation, confidence, dependent variable, estimation, response, score, supervised learning, unsupervised learning
Review questions:
1) SPB p. 31, Problems 2.1 and 2.2, or RG pp. 30-31, questions 1 and 2
2) The four different types of data attributes and their properties
3) Different types of records and data
4) Data quality: what it is and how to ensure it
------ +
------ + ------- + ------- + ------- + -------
Lecture 3, Date: 01/14/2009, Wednesday
Topic: Data mining fundamentals
1) Data mining strategies
2) Illustrative cases
3) Confusion matrix and classification performance evaluation
Handout: DM3.ppt
Homework 1 (due 01/21/2009):
1) Develop a decision tree using the credit card promotion data in the slides. You need to choose one of the variables as the target. In addition, construct a confusion matrix and indicate the lift, coverage rate, and accuracy rate.
2) A dataset has 1000 records and 50 variables, with 5% of the values missing, spread randomly throughout the records and variables. An analyst decides to remove records that have missing values. About how many records would you expect to be removed?
3) Consider the following three-class confusion matrix. The matrix shows the classification results of a supervised model that uses previous voting records to determine the political party affiliation (Republican, Democrat, or Independent) of members of the United States Senate.
|     | Rep | Dem | Ind |
| Rep |  42 |   2 |   1 |
| Dem |   5 |  40 |   3 |
| Ind |   0 |   3 |   4 |
Submission format: Hardcopy.
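As a minimal sketch of how an accuracy rate can be read off a confusion matrix, the Python snippet below uses the counts from the matrix in problem 3. It assumes that the unlabeled third class is Independent and that rows and columns list the classes in the same order; accuracy does not depend on which axis holds the actual class. The function name `accuracy` is illustrative only.

```python
# Counts from the three-class confusion matrix in problem 3.
# Assumption: class order is Rep, Dem, Ind on both rows and columns.
matrix = [
    [42, 2, 1],
    [5, 40, 3],
    [0, 3, 4],
]


def accuracy(cm):
    """Share of all cases that fall on the diagonal (correct classifications)."""
    correct = sum(cm[i][i] for i in range(len(cm)))
    total = sum(sum(row) for row in cm)
    return correct / total


print(accuracy(matrix))  # 86 correct out of 100 cases
```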
------ +
------ + ------- + ------- + ------- + -------
Lecture 4, Date: 01/21/2009, Wednesday
Topic: Classification modeling (I)
1) Review of data mining tasks
2) Decision tree modeling
3) Determining the best split – using the Gini index, entropy, or misclassification error
Handout: DM4.ppt
Review readings and questions:
1) SPB p. 51, questions 3.2 and 3.4 (or RG p. 62, questions 1-4)
2) SPB p. 74, questions 4.1 and 4.2
3) Getting Started with SAS EM 4.3, Chapters 1-2
4) SAS data exploration PROCs: CONTENTS, FREQ
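The three split measures named in this lecture can be sketched in a few lines of Python. This is an illustrative sketch, not the computation SAS Enterprise Miner performs internally; the class counts for the parent node and the two child nodes are hypothetical, and the function names are chosen only for this example.

```python
import math


def gini(counts):
    """Gini index of a node given its class counts."""
    total = sum(counts)
    return 1.0 - sum((c / total) ** 2 for c in counts)


def entropy(counts):
    """Entropy (in bits) of a node given its class counts."""
    total = sum(counts)
    return -sum((c / total) * math.log2(c / total) for c in counts if c > 0)


def misclassification_error(counts):
    """Error rate when the node predicts its majority class."""
    return 1.0 - max(counts) / sum(counts)


def weighted_impurity(children, measure):
    """Impurity of a split: child impurities weighted by child size."""
    n = sum(sum(child) for child in children)
    return sum(sum(child) / n * measure(child) for child in children)


# Hypothetical node: 5 cases of each class, split into two children.
parent = [5, 5]
split = [[4, 1], [1, 4]]
for measure in (gini, entropy, misclassification_error):
    gain = measure(parent) - weighted_impurity(split, measure)
    print(measure.__name__, round(gain, 4))
```

A candidate split with a larger reduction in impurity is preferred; the three measures can rank candidate splits differently.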
------ +
------ + ------- + ------- + ------- + -------
Lecture 5, Date: 01/26/2009, Monday
Topic: Classification modeling (II)
1) Quiz 1
2) Determining when to stop splitting
3) How to use SAS Enterprise Miner
Review readings and questions:
1) Getting Started with SAS EM 4.3, Chapters 3-7
2) SPB chapter 7, RG Chapter 3
3) What is overfitting in the decision tree approach? How can it be prevented?
4) How do you decide when to stop splitting a tree?
5) How do you evaluate the performance of a model?
Homework 2 (due 02/04/2009, Wednesday):
1) Read Section 4.1 of “Effective Web Mining” (document name: CCWEB_TKIT.pdf, pages 4-1 to 4-34). Use dataset DMAIL (in the shared space under the \Data directory) to develop two decision tree models: a basic one without any parameter changes, and one that uses the Gini splitting criterion. Then add an Assessment node to the diagram to compare the performance of the two classification models. Feel free to try the different splitting criteria (Chi-Square, Gini, and Entropy) and other parameters in the “Advanced” tab of the Tree node configuration panel. The deliverables include
a. the model diagram,
b. one of the Assessment diagrams, and
c. the performance table in the results of the Assessment node (see the following). A short explanation of the results is necessary.

2) Use the results of the Tree node in the above exercise. Enter some of the node information into an Excel sheet (http://zlin.ba.ttu.edu/6347/Tree_gain.xls) and calculate:
a. Gini values of the first two layers (the tree may not have a third layer)
b. Entropy values of the layers
c. Gain ratio of layer 2
Focuses: (1) refining the model to get meaningful outcomes, (2) learning how to explain the results
Submission format: Email the Word and Excel files to Zhangxi.lin@hotmail.com
------ +
------ + ------- + ------- + ------- + -------
Lecture 6, Date: 01/28/2009, Wednesday
Topic: Illustrative case study
1) Decision tree modeling with SAS Enterprise Miner
2) Model evaluation
Review readings and questions:
1) Why do we need to split a dataset into training, validation, and test datasets? What are the different purposes of the validation and test datasets?
2) What is prior probability? How does it relate to the sample probability? How do you define it in SAS EM?
3) What is stratification? Why do we need it? How can you set stratification parameters in SAS EM?
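To make the partitioning and stratification questions concrete, here is a minimal Python sketch of a stratified split into training, validation, and test sets. It is not how SAS EM's Data Partition node is implemented; the 40/30/30 fractions, the function name `stratified_split`, and the loan data are assumptions made for illustration.

```python
import random
from collections import defaultdict


def stratified_split(records, label_of, fractions=(0.4, 0.3, 0.3), seed=0):
    """Split records so each class keeps roughly the same share in every part.

    fractions gives the training and validation shares; the test set
    receives whatever remains in each class.
    """
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for rec in records:
        by_class[label_of(rec)].append(rec)

    train, valid, test = [], [], []
    for group in by_class.values():
        rng.shuffle(group)
        n_train = round(len(group) * fractions[0])
        n_valid = round(len(group) * fractions[1])
        train.extend(group[:n_train])
        valid.extend(group[n_train:n_train + n_valid])
        test.extend(group[n_train + n_valid:])
    return train, valid, test


# Hypothetical data: 90 "good" and 10 "bad" loans.
data = [("good", i) for i in range(90)] + [("bad", i) for i in range(10)]
train, valid, test = stratified_split(data, label_of=lambda r: r[0])
print(len(train), len(valid), len(test))
```

Without stratification, a rare class such as the 10 "bad" loans could by chance be nearly absent from one of the three sets.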
------ +
------ + ------- + ------- + ------- + -------
Lecture 7, Date: 02/02/2009, Monday
Topic: Classification modeling (
1) Model performance
2) Logistic regression modeling for classification
3) Exercise 1
Handout: DM5.ppt
Review readings and questions:
1) Applying Data Mining Techniques Using Enterprise Miner (
2) SPB chapters 7 & 8, RG chapter 10 (pp. 291-302)
3) What do oversampling and undersampling mean, respectively?
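One common reading of the two terms in question 3 can be sketched in Python as follows. The terms are used in more than one way in the literature, so this is only an illustrative assumption: here oversampling repeats rare-class cases until the classes are balanced, and undersampling keeps only a random subset of the common class. The function names and the data are hypothetical.

```python
import random


def oversample(minority, majority, seed=0):
    """Balance classes by drawing extra minority cases with replacement."""
    rng = random.Random(seed)
    extra = [rng.choice(minority) for _ in range(len(majority) - len(minority))]
    return minority + extra, majority


def undersample(minority, majority, seed=0):
    """Balance classes by keeping a random subset of the majority cases."""
    rng = random.Random(seed)
    return minority, rng.sample(majority, len(minority))


# Hypothetical case IDs: 3 responders and 12 non-responders.
responders = ["r1", "r2", "r3"]
non_responders = [f"n{i}" for i in range(12)]

over_min, over_maj = oversample(responders, non_responders)
under_min, under_maj = undersample(responders, non_responders)
print(len(over_min), len(over_maj))    # balanced at the majority size
print(len(under_min), len(under_maj))  # balanced at the minority size
```

Either way the class proportions in the modeling sample no longer match the population, which is why prior probabilities need to be specified when the model is assessed or scored.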
------ +
------ + ------- + ------- + ------- + -------
Lecture 8, Date: 02/04/2009, Wednesday
Topic: Logistic Regression with SAS EM
1) Missing value replacement
2) HMEQ data mining demonstration
3) Classification model scoring
Review readings and questions:
1) Data Mining Using SAS Enterprise Miner: A Case Study Approach (DMCS), Chapters 1-2
2) SPB chapter 8
3) Decision tree vs. logistic regression: which one is better for constructing classification models?
4) How do you use the Interactive Grouping node?
5) What are the functions of the Distribution Explorer, Multiplot, and Insight nodes?
Homework assignment 3 (due 2/23/2009, Monday):
Go through Chapter 5 of
- A lift chart
- A rotating plot as indicated on pages 5-19 to 5-20
- The information from the results of the Assessment node that shows the performance of the three different classification methods
- Anything else (up to two items) that you believe is significant, for example an unexpected outcome of the decision tree, with an explanation in a couple of sentences.
Submission format: Hardcopy
------ +
------ + ------- + ------- + ------- + -------
Lecture 9, Date: 02/09/2009, Monday
Topic: Variable selection
1) Quiz 2 (classification)
2) Loan application data mining
3) Classification model deployment
Review readings and questions:
1) Applying Data Mining Techniques Using Enterprise Miner (
2) Data Mining Using SAS Enterprise Miner: A Case Study Approach (DMCS), Chapter 3
3) How do you use the two approaches for classification scoring?
------ +
------ + ------- + ------- + ------- + -------
Lecture 10, Date: 02/11/2009, Wednesday (rescheduled to 02/16/2009)
Topic: Neural Network Classification
1) Quiz 2 review
2) Principle of neural network for data mining
3) Loan application data mining – Neural network
Handout: DM6
Review readings and questions:
1) Applying Data Mining Techniques Using Enterprise Miner (
2) SPB chapter 9, or RG Chapter 8
------ +
------ + ------- + ------- + ------- + -------
Lecture 11, Date: 02/16/2009, Monday (to be rescheduled)
Topic: Neural Network Classification
1) Node Insight, Variable Selection, Code, etc.
2) Ensemble models
3) Review
Review reading: Applying Data Mining Techniques Using Enterprise Miner (
------ +
------ + ------- + ------- + ------- + -------
Lecture 12, Date: 02/18/2009, Wednesday
Topic: Review
1) Predictive modeling with dataset MYRAW
2) Classification modeling contest
------ +
------ + ------- + ------- + ------- + -------
Lecture 13, Date: 02/23/2009, Monday
Topic: Introduction to Clustering
1) Quiz 3
1) Applying Data Mining Techniques Using
2) SPB ch12, or RG Chapter 3 (Section 3.3)
3) Clustering: An Introduction, http://www.elet.polimi.it/upload/matteucc/Clustering/tutorial_html/index.html
4) K-means clustering tutorial, http://people.revoledu.com/kardi/tutorial/kMean/index.html
Homework assignment 4 (due 03/04/2009):
Use SAS EM to analyze dataset S3358 (ISQS 3358 student survey data), which is in the shared directory under the \3358SURVEY subdirectory. You need to convert the text file into SAS format by importing it. You do not need to use all the variables, so study the survey form and explore the data first. Report three findings with selected screenshots of representative results. The following questions may guide the analysis:
1. Into how many groups should the students be divided? Why?
2. What are the characteristics of each group?
3. Which factors are more important in clustering the students?
Datasets: 3358Survey.txt – to be converted into SAS format
Submission format: Hardcopy
------ +
------ + ------- + ------- + ------- + -------
Lecture 14, Date: 02/25/2009, Wednesday
Topic: Clustering
1) Why clustering
2) Principles of clustering
3) A clustering case study
Slides: DM7
Other: Clustering demo
1) Data Mining Using SAS Enterprise Miner: A Case Study Approach (DMCS), Chapter 4
2) What are the main differences between clustering and classification in data mining?
3) Check dataset HMEQ. What would the outcomes be if you clustered it?
------ +
------ + ------- + ------- + ------- + -------
Lecture 15, Date: 03/02/2009, Monday
Topic: Clustering
1) Clustering case study
2) SOM clustering
3) Hierarchical clustering
Slides: DM8
1) Hierarchical clustering, http://www.elet.polimi.it/upload/matteucc/Clustering/tutorial_html/hierarchical.html, http://www.resample.com/xlminer/help/HClst/HClst_intro.htm
2) Use the clustering worksheet (http://zlin.ba.ttu.edu/6347/Clustering.xls) to explore different outcomes of clustering. Modify the coordinates of the instances to obtain different datasets and check the outcomes. Selectively record the k-means clustering iterations for 2 different sets of instances, including the illustrative charts.
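The k-means iterations that the worksheet exercise asks you to record can also be traced with a short Python sketch. This is a plain illustration of the standard assign-then-update loop, not the contents of the Excel worksheet; the six points and the two starting centroids are hypothetical, and `math.dist` requires Python 3.8 or later.

```python
import math


def kmeans(points, centroids, max_iter=100):
    """Basic k-means: assign points to the nearest centroid, then recompute."""
    for iteration in range(1, max_iter + 1):
        # Assignment step: each point joins the cluster of its nearest centroid.
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)),
                          key=lambda k: math.dist(p, centroids[k]))
            clusters[nearest].append(p)

        # Update step: move each centroid to the mean of its cluster.
        # An empty cluster keeps its previous centroid.
        new_centroids = [
            tuple(sum(coord) / len(cl) for coord in zip(*cl)) if cl else centroids[k]
            for k, cl in enumerate(clusters)
        ]
        print(f"iteration {iteration}: {new_centroids}")
        if new_centroids == centroids:  # no centroid moved, so stop
            break
        centroids = new_centroids
    return centroids, clusters


# Hypothetical instances and deliberately poor starting centroids.
points = [(1, 1), (1, 2), (2, 1), (8, 8), (9, 8), (8, 9)]
centroids, clusters = kmeans(points, [(1, 1), (2, 1)])
print(clusters)
```

Changing the starting centroids or the point coordinates, as the exercise suggests, shows how the number of iterations and sometimes the final clusters depend on the initial choice.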
------ +
------ + ------- + ------- + ------- + -------
Lecture 16, Date: 03/04/2009, Wednesday
Topic: Association Analysis
1) Introduction to association analysis
2) Sequential pattern analysis
3) Exercise 2
1) ADMT Ch 8.1
2) SPB chapter 11
3) Data Mining Using SAS Enterprise Miner: A Case Study Approach (DMCS), Chapter 5
------ +
------ + ------- + ------- + ------- + -------
Lecture 17, Date: 03/09/2009, Monday
Topic: Association Analysis
1) Quiz 4 (clustering)
2) Evaluation of association patterns
3) Dissociation analysis
4) Link analysis
1) ADMT Ch 8.2
2) SPB chapter 11
Homework assignment 5 (due on March 23, Monday):
1) Redo the association analysis example ASSOCS. (The dataset is in the SAS EM library SAMPSIO. All the datasets used in the course notes are in SAMPSIO, though some have different names; for example, DMWEB in the book becomes WEBPATH in SAMPSIO.)
2) Do link analysis following DMCS with dataset WEBPATH in SAMPSIO.
3) Report the outcomes
------ +
------ + ------- + ------- + ------- + -------
Lecture 18, Date: 03/11/2009, Wednesday
Topic: Association Analysis
1) Other types of association analysis
2) Itemset generation - Apriori principle
3) Association rule discovery and generation
1) ADMT Ch 8.1
2) SPB chapter 11
3) Data Mining Using SAS Enterprise Miner: A Case Study Approach (DMCS), Chapter 6
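The Apriori principle and the usual rule measures (support, confidence, lift) can be sketched in Python as below. This is a simplified illustration, not the algorithm used by the SAS EM Association node; the five market-basket transactions and the function names are made up for the example.

```python
from itertools import combinations

# Hypothetical market-basket transactions.
transactions = [
    {"bread", "milk"},
    {"bread", "diapers", "beer"},
    {"milk", "diapers", "beer"},
    {"bread", "milk", "diapers"},
    {"bread", "milk", "beer"},
]


def support(itemset):
    """Fraction of transactions that contain every item in the itemset."""
    return sum(itemset <= t for t in transactions) / len(transactions)


def frequent_itemsets(min_support):
    """Level-wise search: build size k+1 candidates from frequent size k sets."""
    items = sorted({i for t in transactions for i in t})
    level = [frozenset([i]) for i in items if support({i}) >= min_support]
    frequent = set(level)
    while level:
        size = len(level[0]) + 1
        candidates = {a | b for a in level for b in level if len(a | b) == size}
        # Apriori principle: a candidate can only be frequent if all of its
        # subsets that are one item smaller are themselves frequent.
        level = [c for c in candidates
                 if all(frozenset(s) in frequent for s in combinations(c, size - 1))
                 and support(c) >= min_support]
        frequent.update(level)
    return frequent


def rule_metrics(lhs, rhs):
    """Confidence and lift of the rule lhs -> rhs."""
    confidence = support(lhs | rhs) / support(lhs)
    return confidence, confidence / support(rhs)


print(sorted(sorted(s) for s in frequent_itemsets(0.6)))
print(rule_metrics({"bread"}, {"milk"}))
```

A lift below 1, as for bread -> milk in this toy data, indicates that the two items appear together less often than independence would predict.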
------ +
------ + ------- + ------- + ------- + -------
Spring break!
------ +
------ + ------- + ------- + ------- + -------
Lecture 19, Date: 03/23/2009, Monday
Topic: Review of Association Analysis
1) Review
2) Exercise 3
------ +
------ + ------- + ------- + ------- + -------
Lecture 20, Date: 03/25/2009, Wednesday
Topic: Text Mining – Preliminary (I)
1) Introduction to text mining
2) Processing textual data
Slides: TM-1
Dataset: SVDTUTOR
1) Text Mining Using SAS Software (TMUS), Chapter 1
2) CCWEB_TKIT Section 2.1, 2.2
3) RG Ch11 (pp. 342-343)
Online materials for Singular Value Decomposition (SVD):
1) Basics of Matrix: http://www.xycoon.com/matrix_algebra.htm
2) http://mathworld.wolfram.com/SingularValueDecomposition.html
3) http://www.uwlax.edu/faculty/will/svd/
4) http://web.mit.edu/be.400/www/SVD/Singular_Value_Decomposition.htm
5) http://kwon3d.com/theory/jkinem/svd.html
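In text mining, SVD is typically applied to a term-by-document frequency matrix. As a minimal sketch of that starting point (not the parsing that SAS Text Miner performs, which also handles stemming, stop lists, and weighting), the Python snippet below builds such a matrix from three hypothetical one-line documents. The tokenizer is a deliberately crude assumption.

```python
import re
from collections import Counter

# Hypothetical mini-corpus; each string is one document.
docs = [
    "Data mining finds patterns in data",
    "Text mining finds patterns in text",
    "Web mining studies web links",
]


def tokenize(text):
    """Lowercase the text and keep runs of letters as terms."""
    return re.findall(r"[a-z]+", text.lower())


# One term-frequency counter per document.
counts = [Counter(tokenize(d)) for d in docs]

# Rows are terms (alphabetical), columns are documents.
terms = sorted(set().union(*counts))
matrix = [[c[t] for c in counts] for t in terms]

for term, row in zip(terms, matrix):
    print(f"{term:10s}", row)
```

Each row is a term and each column a document; SVD then approximates this sparse matrix with a small number of dimensions.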
------ +
------ + ------- + ------- + ------- + -------
Lecture 21, Date: 03/30/2009, Monday
Topic: Text Mining – Preliminary (II)
1) Quiz 5 (Association analysis)
2) Transformations
3) An illustrative example
Dataset: ABSTRACT
1) Getting Started with SAS Text Miner (GSTM) Ch1-4 (pp. 1-30)
2) Text Mining Using SAS Software (TMUS), Chapter 1
------ +
------ + ------- + ------- + ------- + -------
Lecture 22, Date: 04/01/2009, Wednesday
Topic: Exploratory Analysis of Documents
1) Simple statistical analysis
2) Text data exploration
Online reading materials on the Hidden Markov Model (HMM):
1) http://jedlik.phy.bme.hu/~gerjanos/HMM/node2.html
2) http://www.csse.monash.edu.au/~lloyd/tildeMML/Structured/HMM.html
3) http://www.autonlab.org/tutorials/hmm14.pdf
4) Getting Started with SAS Text Miner (GSTM) Ch5-7
Homework assignment 6 (due on April 14, 2009):
1) Read the Amazon Book example in TMUS Section 2.3 and replicate it.
2) Replicate the steps of the insurance claim example in TMUS 3.3. (There is no need to report all outcomes, only (1) the model diagram and (2) the response chart from the Assessment node.) Why are the outcomes from SVD-based regression better than those from cluster-ID-based regression?
3) Explain the configuration of SVD and roll-up terms. What are the differences between them?
------ +
------ + ------- + ------- + ------- + -------
Lecture 23, Date: 04/06/2009, Monday
Topic: In-class Exercise 4
1) Text data exploration
2) Exercise 4 (Text Mining)
Slides: TM-2
------ +
------ + ------- + ------- + ------- + -------
Lecture 24, Date: 04/08/2009, Wednesday
Topic: Text Mining for Predictive Modeling
Case: Insurance subrogation
Dataset: INSSUBRO
Slides: TM-3
------ +
------ + ------- + ------- + ------- + -------
Date: 04/13/2009, Monday
No class
------ +
------ + ------- + ------- + ------- + -------
Lecture 25, Date: 04/15/2009, Wednesday
Topic: Conducting a Text Mining Project
1) Quiz 6
2) Text mining data preparation
1) Text Mining Using SAS Software (TMUS), Chapter 3
2) Reference papers (in the subdirectories \SUGI05 and \SUGI06 of the shared disk space)
Homework assignment 7 (optional, due on April 27, 2009):
P5-75, CCWEB_KIT.pdf, Exercises, Problem 1-2.
------ +
------ + ------- + ------- + ------- + -------
Lecture 26, Date: 04/20/2009, Monday
Topic: Introduction to Web Mining
Lecture outline:
1) Quiz 6 review
2) Introduction to Web Mining
3) FSLINKS dataset analysis
4) Web access analysis
1. Link analysis
2. Association analysis
Datasets: FSLINKS, RLINKS
Slides: WebMining1
------ +
------ + ------- + ------- + ------- + -------
Lecture 27, Date: 04/22/2009, Wednesday
Topic: Online targeted advertising
Lecture outline:
1) Propensity-to-buy
2) Exercise 5 (web mining)
Datasets: PROPBUY, BANNER, BANNERAD
Slides: WebMining2
------ +
------ + ------- + ------- + ------- + -------
Lecture 28, Date: 04/27/2009, Monday
Topic: Online recommender systems
1) Introduction to online recommender systems
Slides: WebMining3
------ +
------ + ------- + ------- + ------- + -------
The following are to be updated.
The lecture notes for the spring 2008 class
------ +
------ + ------- + ------- + ------- + -------
Lecture 26, Date: 04/15/2008, Wednesday
Topic: Web-based Recommender Systems
Lecture outline:
1) Exercise 5 (Web mining)
2) Principle of online recommendation systems
Slides: WebMining3
Datasets: MOVIEBUY
------ +
------ + ------- + ------- + ------- + -------
Lecture 27, Date: 04/20/2008, Monday
Topic: Web-based Recommender Systems
Slides: WebMining3
------ +
------ + ------- + ------- + ------- + -------
Lecture 28, Date: 04/22/2008, Wednesday
Topic: Bayes Theorem and Data Mining
Slides: Bayes Theorem
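As a small worked illustration of Bayes' theorem in a data mining setting, the Python sketch below updates a prior response rate after observing a piece of evidence. The scenario and all three probabilities are hypothetical and are not taken from the slides.

```python
def posterior(prior, p_evidence_given_h, p_evidence_given_not_h):
    """Bayes' theorem: P(H | E) = P(E | H) P(H) / P(E)."""
    # Total probability of the evidence over both hypotheses.
    evidence = p_evidence_given_h * prior + p_evidence_given_not_h * (1 - prior)
    return p_evidence_given_h * prior / evidence


# Hypothetical numbers: 5% of customers respond to a mailing; 60% of
# responders and 10% of non-responders visited the web site beforehand.
p = posterior(prior=0.05, p_evidence_given_h=0.6, p_evidence_given_not_h=0.1)
print(round(p, 4))  # probability that a site visitor is a responder
```

With these numbers, a site visit raises the probability of response from 5% to 24%, since 0.03 / (0.03 + 0.095) = 0.24.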
------ +
------ + ------- + ------- + ------- + -------
Lecture 29, Date: 04/27/2008, Monday
Topic: Review
2) Exercise 6 (Web mining)
3) Quiz 7 (Web mining)
4) Review
------ + ------ + ------- + ------- + ------- + -------