ISQS 6347 Data & Text Mining Lecture Outlines
(Total 29 lectures, January 9 - April 29, 2008)
Textbooks
SPB: Data Mining for Business Intelligence, Galit Shmueli, Nitin R. Patel, Peter C. Bruce
RG: Data Mining: A Tutorial-Based Primer, Richard Roiger, Michael Geatz
Lecture 1, Date: 01/09/2008, Wednesday
Topic: Introduction to Data Mining
Handout: PowerPoint slides 1
------ + ------ + ------- + ------- + ------- + -------
Lecture 2, Date: 01/14/2008, Monday
Topic: Data mining fundamentals
Handout: PowerPoint slides 2
1) Introduction to SAS
2) Data for data mining
3) Case study
Dataset: Credit card promotion
Terminology: predictor, observation, confidence, dependent variable, estimation, response, score, supervised learning, unsupervised learning
Review questions:
1) SPB p31, Problems 2.1 and 2.2, or RG pp30-31, questions 1 and 2
2) The four different types of data attributes and their properties
3) Different types of records and data
4) Data quality: what it is and how to guarantee it
------ + ------ + ------- + ------- + ------- + -------
Lecture 3, Date: 01/16/2008, Wednesday
Topic: Data mining fundamentals
1) Data preprocessing
2) How to evaluate the performance of a classification rule
3) Data mining strategies
4) An illustrative classification case
Handout: slides 3
Homework 1 (due 01/23/2008):
1) Develop a decision tree using the credit card promotion data in the slides. Choose one of the variables as the target. In addition, construct a confusion matrix and report the lift, coverage rate, and accuracy rate.
2) A dataset has 1000 records and 50 variables, with 5% of the values missing, spread randomly throughout the records and variables. An analyst decides to remove records that have missing values. About how many records would you expect to be removed?
3) Consider the following three-class confusion matrix. The matrix shows the classification results of a supervised model that uses previous voting records to determine the political party affiliation (Republican, Democrat, or Independent) of members of the United States Senate.
    | Rep | Dem | Ind
Rep | 42  | 2   | 1
Dem | 5   | 40  | 3
Ind | 0   | 3   | 4
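The arithmetic behind problems 2 and 3 of Homework 1 can be sketched in Python. This is only a sketch under two assumptions: each value is missing independently of the others, and the diagonal of the confusion matrix holds the correctly classified instances. The function names are illustrative and do not come from the course materials.

```python
def expected_removed(n_records, n_variables, missing_rate):
    """Expected number of records with at least one missing value,
    assuming each cell is missing independently with probability missing_rate."""
    p_complete = (1.0 - missing_rate) ** n_variables
    return n_records * (1.0 - p_complete)


def accuracy(matrix):
    """Overall accuracy: diagonal (correct) count divided by total count."""
    correct = sum(matrix[i][i] for i in range(len(matrix)))
    total = sum(sum(row) for row in matrix)
    return correct / total


# Values taken from the homework statement.
removed = expected_removed(1000, 50, 0.05)
senate = [[42, 2, 1], [5, 40, 3], [0, 3, 4]]
senate_accuracy = accuracy(senate)

print(round(removed), senate_accuracy)
```

Accuracy is the same whether rows hold the actual or the predicted classes, so the sketch does not depend on how the matrix is oriented.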
------ + ------ + ------- + ------- + ------- + -------
Lecture 4, Date: 01/23/2008, Wednesday
Topic: Classification modeling (I)
1) Review of data mining tasks
2) Decision tree modeling
3) Determining the best split using the Gini index, entropy, or misclassification error
Handout: slides 4
Review readings and questions:
1) SPB p51, questions 3.1 and 3.2 (or RG p62, questions 1-4)
2) SPB p74, questions 4.1 and 4.2
3) Getting Started with SAS EM 4.3, Chapters 1-2
4) SAS data exploration PROCs: CONTENTS, FREQ
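The three split measures named in the Lecture 4 topics can be sketched in Python using their standard textbook definitions. The class counts below are hypothetical, and this is not how SAS Enterprise Miner is implemented internally; a lower weighted impurity indicates a better split.

```python
import math


def gini(counts):
    """Gini index of a node: 1 - sum(p_i^2)."""
    total = sum(counts)
    return 1.0 - sum((c / total) ** 2 for c in counts)


def entropy(counts):
    """Entropy of a node in bits: -sum(p_i * log2(p_i)), skipping empty classes."""
    total = sum(counts)
    return -sum((c / total) * math.log2(c / total) for c in counts if c > 0)


def misclassification_error(counts):
    """Misclassification error of a node: 1 - max(p_i)."""
    total = sum(counts)
    return 1.0 - max(counts) / total


def weighted_impurity(children, measure):
    """Impurity of a split: child impurities weighted by child size."""
    n = sum(sum(child) for child in children)
    return sum(sum(child) / n * measure(child) for child in children)


# Hypothetical parent node (10 records per class) and two candidate splits.
parent = [10, 10]
split_a = [[8, 2], [2, 8]]
split_b = [[10, 5], [0, 5]]

for name, split in (("A", split_a), ("B", split_b)):
    values = [weighted_impurity(split, m)
              for m in (gini, entropy, misclassification_error)]
    print(name, [round(v, 3) for v in values])
```

With these counts the measures do not agree on the better split, which is one reason the choice of measure is worth discussing.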
------ + ------ + ------- + ------- + ------- + -------
Lecture 5, Date: 01/28/2008, Monday
Topic: Classification modeling (II)
1) Quiz 1
2) Determining when to stop splitting
3) How to use SAS Enterprise Miner
Review readings and questions:
1) Getting Started with SAS EM 4.3, Chapters 3-7
2) SPB Chapter 7, RG Chapter 3
3) What is overfitting in the decision tree approach? How can it be prevented?
4) How do you decide when to stop splitting a tree?
5) How do you evaluate the performance of a model?
------ + ------ + ------- + ------- + ------- + -------
Lecture 6, Date: 01/30/2008, Wednesday
Topic: Illustrative case study
1) Decision tree modeling with SAS Enterprise Miner
2) Model evaluation
Review readings and questions:
1) Why do we need to split a dataset into training, validation, and test datasets? What are the different purposes of the validation and test datasets?
2) What is prior probability? How does it relate to the sample probability? How is it defined in SAS EM?
3) What is stratification? Why do we need it? How can you set stratification parameters in SAS EM?
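As an aid for review questions 1 and 3, a stratified three-way partition can be sketched in Python. This is an illustration only, not the SAS EM Data Partition node; the 60/20/20 fractions, the function name, and the sample data are all assumptions made for the example.

```python
import random
from collections import defaultdict


def stratified_split(records, label_of, fractions=(0.6, 0.2, 0.2), seed=0):
    """Split records into (train, validation, test) so that each class
    keeps roughly the same proportion in every partition."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for record in records:
        by_class[label_of(record)].append(record)

    train, validation, test = [], [], []
    for members in by_class.values():
        rng.shuffle(members)
        n_train = round(fractions[0] * len(members))
        n_valid = round(fractions[1] * len(members))
        train.extend(members[:n_train])
        validation.extend(members[n_train:n_train + n_valid])
        test.extend(members[n_train + n_valid:])  # remainder goes to test
    return train, validation, test


# Hypothetical data: 100 records, 10% responders (label 1).
data = [(i, 1 if i < 10 else 0) for i in range(100)]
train, validation, test = stratified_split(data, label_of=lambda r: r[1])
print(len(train), len(validation), len(test))
```

Without stratification, a rare class such as the 10% responders here could by chance be nearly absent from the validation or test partition.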
Homework 2 (due a week later):
1) Read Section 4.1 of “Effective Web Mining” (document name: CCWEB_TKIT.pdf, pages 4-1 to 4-34). Use dataset DMAIL (in the shared space under the /Data directory) to complete the tasks described in the section. Then add a Tree node and an Assessment node to the diagram to compare the performance of two classification models. The deliverables are (1) the model diagram, (2) one of the Assessment diagrams, and (3) the performance table from the results of the Assessment node. Add a short explanation of the results.
2) Use the results of the Tree node from the above exercise. Enter some of the node information into an Excel sheet (http://zlin.ba.ttu.edu/6347/Tree_gain.xls) and calculate:
a. Gini values of the first two layers (there may not be a third layer)
b. Entropy values of the layers
c. Gain ratio of layer 2
3) Focus on (1) refining the model to get meaningful outcomes and (2) learning how to explain the results.
4) Email the Word and Excel files to Zhangxi.lin@hotmail.com
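For item 2c, the gain ratio calculation can be sketched in Python as a cross-check on the spreadsheet. It uses the usual definition (information gain divided by split information); the class counts are hypothetical and are not taken from the DMAIL results.

```python
import math


def entropy(counts):
    """Entropy of a node in bits, skipping empty classes."""
    total = sum(counts)
    return -sum((c / total) * math.log2(c / total) for c in counts if c > 0)


def gain_ratio(parent, children):
    """Gain ratio = information gain / split information."""
    n = sum(parent)
    weights = [sum(child) / n for child in children]
    info_gain = entropy(parent) - sum(
        w * entropy(child) for w, child in zip(weights, children)
    )
    split_info = -sum(w * math.log2(w) for w in weights if w > 0)
    if split_info == 0:
        return 0.0  # a "split" into a single branch carries no information
    return info_gain / split_info


# Hypothetical class counts: a root node and its two layer-2 children.
root = [10, 10]
layer2 = [[8, 2], [2, 8]]
ratio = gain_ratio(root, layer2)
print(round(ratio, 4))
```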
------ + ------ + ------- + ------- + ------- + -------
Lecture 7, Date: 02/04/2008, Monday
Topic: Classification modeling (
1) Exercise 1
2) Classification model scoring
3) Missing value replacement
Review readings and questions:
1) Applying Data Mining Techniques Using Enterprise Miner (
2) Getting Started with SAS EM 4.3, Chapters 3-7
3) SPB Chapters 7 & 8, RG Chapter 10 (pp 291-302)
------ + ------ + ------- + ------- + ------- + -------
Lecture 8, Date: 02/06/2008, Wednesday
Topic: Logistic Regression with SAS EM
1) Logistic regression modeling for classification
Handout: Slides 5
Review readings and questions:
1) Data Mining Using SAS Enterprise Miner: A Case Study Approach (DMCS), Chapters 1-2
2) SPB Chapter 8
3) Decision tree vs. logistic regression: which one is better for constructing classification models?
4) What are the functions of the Distribution Explorer, Multiplot, and Insight nodes?
Homework assignment 3 (due a week later):
Go through Chapter 5 of
- A lift chart
- A rotating plot as indicated on pages 5-19 to 5-20
- The information from the results of the Assessment node that shows the performance of three different classification methods
- Anything else (up to two items) that you believe is significant, for example an unexpected outcome of the decision tree. Try to explain it in a couple of sentences.
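To help with reading the lift chart asked for above, the quantity it plots can be sketched in Python. The sketch assumes the common definition of cumulative lift (response rate among the top-scored records divided by the overall response rate); the scores and labels are made up, and the Assessment node may bin records differently.

```python
def cumulative_lift(scores, labels, fraction):
    """Lift at the top `fraction` of records ranked by score:
    response rate among the top-ranked records / overall response rate."""
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    n_top = max(1, round(fraction * len(ranked)))
    top_rate = sum(label for _, label in ranked[:n_top]) / n_top
    overall_rate = sum(labels) / len(labels)
    return top_rate / overall_rate


# Hypothetical model scores and actual responses (1 = responder).
scores = [0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2, 0.1, 0.05]
labels = [1, 1, 0, 1, 0, 0, 0, 0, 0, 0]

lift_top_20 = cumulative_lift(scores, labels, 0.2)
lift_all = cumulative_lift(scores, labels, 1.0)
print(round(lift_top_20, 2), lift_all)
```

A lift of 1 means the model does no better than selecting records at random, which is why every cumulative lift curve ends at 1 when all records are included.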
------ + ------ + ------- + ------- + ------- + -------
Lecture 9, Date:
Topic: Variable selection
1) Quiz 2 (classification)
2) Loan application data mining
3) Classification model deployment
Review readings and questions:
1) Applying Data Mining Techniques Using Enterprise Miner (
2) Data Mining Using SAS Enterprise Miner: A Case Study Approach (DMCS), Chapter 3
3) How do you use the two approaches for classification scoring?
------ + ------ + ------- + ------- + ------- + -------
Lecture 10, Date:
Topic: Neural Network Classification
1) Interactive Grouping node
2) Principles of neural networks for data mining
3) Loan application data mining: neural network
Handout: Slides 6
Review readings and questions:
1) Applying Data Mining Techniques Using Enterprise Miner (
2) SPB Chapter 9, or RG Chapter 8
3) Go through Chapter 5 of
- A lift chart
- A rotating plot as indicated on pages 5-19 to 5-20
- The information from the results of the Assessment node that shows the performance of three different classification methods
- Anything else (up to two items) that you believe is significant, for example an unexpected outcome of the decision tree. Try to explain it in a couple of sentences.
------ + ------ + ------- + ------- + ------- + -------
Lecture 11, Date: 02/18/2008, Monday
Topic: Neural Network Classification
1) Ensemble models
2) Predictive modeling with dataset MYRAW
3) Classification modeling contest
Handout: Slides 7
Review readings and questions:
1) Applying Data Mining Techniques Using
2) Complete scoring and SAS coding following the instructions in Chapter 6 of
------ + ------ + ------- + ------- + ------- + -------
Lecture 12, Date: 02/20/2008, Wednesday
Topic: Introduction to Clustering
1) Why clustering
2) Principles of clustering
Review readings:
1) Applying Data Mining Techniques Using
2) SPB ch12, or RG Chapter 3 (Section 3.3)
3) Clustering: An Introduction, http://www.elet.polimi.it/upload/matteucc/Clustering/tutorial_html/index.html
4) K-means clustering tutorial, http://people.revoledu.com/kardi/tutorial/kMean/index.html
Slides: DM8
Other: Clustering demo
Review questions:
1) What are the main differences between clustering and classification in data mining?
2) Examine dataset HMEQ. What outcomes would you expect if you clustered it?
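The k-means procedure covered in the clustering readings can be sketched in plain Python. This is a bare version of the standard algorithm (often called Lloyd's algorithm) using squared Euclidean distance; the function name, the random initialization, and the sample points are assumptions for illustration and are unrelated to the SAS EM Clustering node or to HMEQ.

```python
import random


def kmeans(points, k, iterations=100, seed=0):
    """Plain k-means on a list of equal-length numeric tuples.
    Returns (centroids, assignment), where assignment[i] is the cluster of points[i]."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # start from k distinct data points
    assignment = []
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest centroid.
        assignment = [
            min(range(k),
                key=lambda j: sum((a - b) ** 2 for a, b in zip(p, centroids[j])))
            for p in points
        ]
        # Update step: move each centroid to the mean of its members.
        new_centroids = []
        for j in range(k):
            members = [p for p, a in zip(points, assignment) if a == j]
            if members:
                new_centroids.append(
                    tuple(sum(dim) / len(members) for dim in zip(*members)))
            else:
                new_centroids.append(centroids[j])  # leave an empty cluster in place
        if new_centroids == centroids:
            break  # converged: centroids no longer move
        centroids = new_centroids
    return centroids, assignment


# Hypothetical two-dimensional data with two visibly separate groups.
points = [(1, 1), (1.5, 2), (1, 0.5), (8, 8), (9, 9.5), (8.5, 9)]
centroids, assignment = kmeans(points, k=2)
print(sorted(centroids), assignment)
```

Unlike the classification models in earlier lectures, nothing here uses a target variable: the groups come only from distances between records.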
Homework assignment 4 (due a week later):
1) SPB p237-239, Problem 12.1; or RG p103, Computational Questions: 10 (feel free to use the clustering worksheet http://zlin.ba.ttu.edu/6347/Clustering.xls)
2) Use a clustering approach to analyze dataset S3358 (ISQS 3358 student survey data) in the shared directory under the \Other_Data subdirectory. Report a few findings with selected screenshots of representative results. The following are a few questions that could interest the instructor: