ISQS 6347 Data & Text Mining Lecture Outlines
(Last update: April 27, 2006)
(29 lectures in total, January 12 to April 27, 2006)
Lecture 1, Date: 01/12/06
Topic: Introduction to Data Mining
Handout: PowerPoint slides
------ + ------ + ------- + ------- + ------- + -------
Lecture 2, Date: 01/17/06
Topic: Data mining fundamentals (Concepts, case, data for data mining)
Handout: PowerPoint slides
Dataset: Credit card promotion
Review questions:
1) RG pp30-31, questions 1, 2
2) Four different types of data attributes and their properties
3) Different types of records and data
4) Data quality: what it is, how to guarantee it
5) Data transformation
6) Data preprocessing
------ + ------ + ------- + ------- + ------- + -------
Lecture 3, Date: 01/19/06
Topic: Data mining fundamentals (Self-taught)
Handout: use the previous lecture's handout
Review questions:
1) Euclidean distance
2) Minkowski distance
3) Cosine similarity
4) Data mining strategies
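
The three proximity measures in review questions 1-3 can be sketched in a few lines of Python. This is a minimal illustration using the textbook definitions, not code from the course materials; the function names and the sample vectors are assumptions made for the example.

```python
import math


def minkowski_distance(x, y, r):
    """Minkowski distance of order r between two equal-length numeric vectors."""
    if len(x) != len(y):
        raise ValueError("vectors must have the same length")
    return sum(abs(a - b) ** r for a, b in zip(x, y)) ** (1.0 / r)


def euclidean_distance(x, y):
    """Euclidean distance, i.e. the Minkowski distance with r = 2."""
    if len(x) != len(y):
        raise ValueError("vectors must have the same length")
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


def cosine_similarity(x, y):
    """Cosine of the angle between two non-zero vectors."""
    if len(x) != len(y):
        raise ValueError("vectors must have the same length")
    dot = sum(a * b for a, b in zip(x, y))
    norm_x = math.sqrt(sum(a * a for a in x))
    norm_y = math.sqrt(sum(b * b for b in y))
    if norm_x == 0 or norm_y == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_x * norm_y)


if __name__ == "__main__":
    p = [0.0, 3.0]  # hypothetical sample points
    q = [4.0, 0.0]
    print(euclidean_distance(p, q))     # 5.0
    print(minkowski_distance(p, q, 1))  # 7.0 (r = 1 is the city-block distance)
    print(cosine_similarity(p, q))      # 0.0 (the vectors are orthogonal)
```

Setting r = 1 gives the city-block (Manhattan) distance and r = 2 gives the Euclidean distance, so the two distance questions are special cases of one formula.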
Homework assignment 1:
1) RG p30, question 5
2) RG p63, Computational Question 1
3) Find two data mining project topics after checking a few cases on the web (such as www.kdnuggets.com). You will likely choose one of them for your midterm data mining project. Write a short paragraph describing the goal and contents of each topic.
------ + ------ + ------- + ------- + ------- + -------
Lecture 4, Date: 01/24/06
Topic: Basic data mining techniques & tools
Handout: PowerPoint slides
Dataset for exercise (from RG): http://zlin.ba.ttu.edu/6347/Demons/CreditProm-tree.xls
SAS code for converting Excel files to SAS datasets: http://zlin.ba.ttu.edu/6347/Demons/import_excel.sas
Lecture outline:
1) Review
2) Data mining strategies
3) Data mining model performance evaluation
4) Introduction to SAS
Review readings and questions:
1) RG pp62-63, questions 1-4
2) Getting Started with SAS EM 4.3, Chapters 1 & 2
------ + ------ + ------- + ------- + ------- + -------
Lecture 5, Date: 01/26/06
Topic: SAS
Handout: PowerPoint slides
Review readings and questions:
1) RG pp62-63, questions 5-9
2) Getting Started with SAS EM 4.3, Chapters 3-6
3) How is decision tree induction conducted?
4) What are the Gini, entropy, and misclassification indices?
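
For review question 4, the three node impurity indices can be computed from the class counts at a tree node. The sketch below uses the standard textbook definitions (entropy in bits, base-2 logarithm); the function names are assumptions for this example and the course slides may use different notation.

```python
import math


def class_proportions(counts):
    """Convert per-class record counts at a node into proportions."""
    total = sum(counts)
    if total <= 0:
        raise ValueError("a node must contain at least one record")
    return [c / total for c in counts]


def gini(counts):
    """Gini index: 1 minus the sum of squared class proportions."""
    return 1.0 - sum(p * p for p in class_proportions(counts))


def entropy(counts):
    """Entropy in bits; empty classes contribute nothing."""
    return sum(-p * math.log2(p) for p in class_proportions(counts) if p > 0)


def misclassification_error(counts):
    """Error made by always predicting the majority class."""
    return 1.0 - max(class_proportions(counts))


if __name__ == "__main__":
    node = [5, 5]  # hypothetical node: 5 records of each of two classes
    print(gini(node))                     # 0.5
    print(entropy(node))                  # 1.0
    print(misclassification_error(node))  # 0.5
```

All three indices are zero for a pure node and reach their maximum when the classes are evenly mixed, which is why any of them can serve as a splitting criterion.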
Homework assignment 2: RG p64, Computational Questions 2, 3
------ + ------ + ------- + ------- + ------- + -------
Lecture 6, Date: 01/31/06
Topic: Decision Tree
Lecture outline:
1) Review
2) Gain Ratio
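
As a worked illustration of outline item 2, the gain ratio of a candidate split divides the information gain by the split information, which penalizes splits into many small branches. This is a minimal sketch of the usual C4.5-style definition, assuming each branch is described by its per-class record counts; the function names and example counts are hypothetical.

```python
import math


def entropy(counts):
    """Entropy in bits of a node given its per-class record counts."""
    total = sum(counts)
    return sum(-(c / total) * math.log2(c / total) for c in counts if c > 0)


def gain_ratio(children):
    """Gain ratio of a split.

    children: one list of per-class record counts for each branch,
    e.g. [[4, 1], [1, 4]] for a two-way split on a two-class problem.
    """
    n_classes = len(children[0])
    # The parent node holds the records of all branches combined.
    parent = [sum(child[k] for child in children) for k in range(n_classes)]
    n = sum(parent)
    weights = [sum(child) / n for child in children]
    # Information gain: parent entropy minus weighted child entropy.
    info_gain = entropy(parent) - sum(
        w * entropy(child) for w, child in zip(weights, children) if w > 0
    )
    # Split information: entropy of the branch sizes themselves.
    split_info = sum(-w * math.log2(w) for w in weights if w > 0)
    if split_info == 0:
        return 0.0  # a single branch carries no split information
    return info_gain / split_info


if __name__ == "__main__":
    print(gain_ratio([[5, 0], [0, 5]]))  # 1.0 (a perfect two-way split)
    print(gain_ratio([[4, 1], [1, 4]]))  # roughly 0.278
```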
Review readings and questions:
1) Search the web for information about overfitting
2) RG p102, Review question 1.
------ + ------ + ------- + ------- + ------- + -------
Lecture 7, Date: 02/02/06
Topic: SAS
Lecture outline:
1) Quiz question review
2) Determine when to stop splitting a tree
3) Practical issues in classification: underfitting and overfitting
4) Lab exercise: Tasks 1-6 and 10 in “Getting Started …”
Homework assignment 3 (due next Thursday):
1) Review tasks 1-14 in “Getting Started …”, focusing on the decision tree node. Try different splitting criteria and compare the outcomes. Report any problems you encounter so they can be discussed in class (no need to submit).
2) Open the result of the tree node. Save some of the node information to an Excel sheet (http://zlin.ba.ttu.edu/6347/Tree_gain.xls) and calculate:
   a. Gini values of the top three layers (feel free to do the fourth layer)
   b. Entropy values of the top three layers (feel free to do the fourth layer)
   c. Gain ratio of layers 2 and 3 (feel free to do the fourth layer)
Submit the results by email.
------ + ------ + ------- + ------- + ------- + -------
Lecture 8, Date: 02/07/06
Topic: SAS Enterprise Miner 4.3 (Basic skills)
Lecture outline:
Example flow diagram, tasks 7-14:
1) Basic use of different model nodes: Decision Tree, Regression, and Neural Network
2) Variable transformation
3) Assessment
4) Scoring
Review questions:
1) What are the roles of the validation dataset and the test dataset?
2) What is overfitting in the decision tree approach? How can it be prevented?
3) How do you decide when to stop splitting a tree?
4) How do you evaluate the performance of a model?
5) What is prior probability? What is its relationship with the sample probability? How is it defined in SAS EM?
6) What is stratification? Why do we need it? How can you set stratification parameters in SAS EM?
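
Review question 5 turns on the difference between the event rate in the training sample and the event rate in the population. When the sample over-represents a rare event, model probabilities can be rescaled to the population prior. The sketch below shows the standard correction for a binary target; it is an illustration only and is not claimed to be how SAS EM implements prior probabilities. The function and parameter names are assumptions.

```python
def adjust_posterior(p_sample, sample_prior, true_prior):
    """Rescale an event probability estimated on a resampled training set.

    p_sample:     model's estimated P(event) under the training sample mix
    sample_prior: proportion of events in the training sample
    true_prior:   proportion of events in the population
    """
    if not 0.0 <= p_sample <= 1.0:
        raise ValueError("p_sample must lie in [0, 1]")
    for value in (sample_prior, true_prior):
        if not 0.0 < value < 1.0:
            raise ValueError("priors must lie strictly between 0 and 1")
    # Reweight each class by the ratio of population prior to sample prior,
    # then renormalize so the two class probabilities sum to 1.
    event = p_sample * true_prior / sample_prior
    nonevent = (1.0 - p_sample) * (1.0 - true_prior) / (1.0 - sample_prior)
    return event / (event + nonevent)


if __name__ == "__main__":
    # Hypothetical case: training sample balanced 50/50, true event rate 5%.
    print(adjust_posterior(0.5, sample_prior=0.5, true_prior=0.05))
```

A probability equal to the sample prior maps back to the population prior, and when the two priors coincide the probability is left unchanged.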
------ + ------ + ------- + ------- + ------- + -------
Lecture 9, Date: 02/09/06
Topic: SAS Enterprise Miner 4.3 (Decision tree & regression)
Lecture outline:
1) Alternatives based on the example flow diagram, tasks 1-16 (make use of the code for Task 15: http://zlin.ba.ttu.edu/6347/score_card.sas)
   a. Understanding data mining outcomes (http://zlin.ba.ttu.edu/6347/Lift.xls)
   b. Changing parameters
   c. Comparison between different outcomes
2) A home equity data mining example (regression)
   a. How to handle missing values
   b. Understanding data replacement
   c. Fitting and comparing candidate models
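
Item 1a above refers to a lift worksheet. As a rough illustration of the idea, cumulative lift compares the response rate among the top-scored cases with the overall response rate. The sketch below follows the usual definition and makes no assumption about the contents of the linked spreadsheet; the function name and the sample scores and labels are hypothetical.

```python
def cumulative_lift(scores, labels, fraction):
    """Cumulative lift for the top `fraction` of cases ranked by score.

    scores:   model scores, higher meaning more likely to respond
    labels:   1 for a responder, 0 otherwise
    fraction: share of cases to target, in (0, 1]
    """
    if len(scores) != len(labels) or not scores:
        raise ValueError("scores and labels must be non-empty and equal length")
    if not 0.0 < fraction <= 1.0:
        raise ValueError("fraction must lie in (0, 1]")
    overall_rate = sum(labels) / len(labels)
    if overall_rate == 0:
        raise ValueError("lift is undefined when there are no responders")
    # Rank cases from highest to lowest score.
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    # Number of targeted cases, rounded to the nearest whole case (at least 1).
    top_n = max(1, round(fraction * len(ranked)))
    top_rate = sum(label for _, label in ranked[:top_n]) / top_n
    return top_rate / overall_rate


if __name__ == "__main__":
    scores = [0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2, 0.1, 0.05]
    labels = [1, 1, 0, 1, 0, 0, 0, 0, 0, 0]
    print(cumulative_lift(scores, labels, 0.2))  # roughly 3.33
    print(cumulative_lift(scores, labels, 1.0))  # 1.0
```

A lift of 1.0 means the model does no better than random targeting, so comparing lift at the same depth is one simple way to compare candidate models.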
Homework assignment 4:
1) RG pp102-103, Computational Questions: 1, 4, 5
2) Exploring the home equity data mining example (no need to submit)
------ + ------ + ------- + ------- + ------- + -------
Lecture 10, Date: 02/14/06
Topic: SAS Enterprise Miner 4.3 (Variable selection)
Lecture outline:
1) Variable Selection node & Score node
2) A home equity data mining example (continued)
   a. Using Score code
   b. Using Variable Selection node
Review questions: N/A
------ + ------ + ------- + ------- + ------- + -------
Lecture 11, Date: 02/16/06
Topic: Clustering
Reading: RG Chapters 3 (pp84-100) & 10 (pp308-317), TSK Chapter 8
Lecture outline: