ISQS 6347 Data & Text Mining Lecture Outlines

 

(Total 30 lectures, January 23 - May 7, 2012)

 

SAS Course Notes

AAEM: Applied Analytics Using SAS® Enterprise Miner™

EM_GS: Getting Started with SAS® Enterprise Miner

EM_TMGS: Getting Started with SAS® 9.1 Text Miner

CSA: Data Mining - A Case Study Approach

DMTM: Text Mining Using SAS® Software

ADMT: Applying Data Mining Techniques Using Enterprise Miner

CCWEB: Effective Web Mining: Attracting and Keeping Valued Cyber Consumers

 

Optional Textbooks

 

SPB: Data Mining for Business Intelligence, Galit Shmueli, Nitin R. Patel, and Peter C. Bruce

TSK: Introduction to Data Mining, Pang-Ning Tan, Michael Steinbach, and Vipin Kumar

RG: Data Mining – A Tutorial Based Primer, Richard Roiger, Michael Geatz

 

------ + ------ + ------- + ------- + ------- + -------

 

Homework assignments: http://zlin.ba.ttu.edu/6347/HWISQS6347-12.htm

 

------ + ------ + ------- + ------- + ------- + -------

 

Lecture 1, Date: 01/23/2012, Monday

 

Topic: Introduction to Data Mining

 

Reading: SPB/RG/TSK Chapter 1

 

References:

1.     SAS resources for instructors and students

  1. SAS courses

Homework 1 (due 02/06/2012):

 

1)     Develop a decision tree manually using the credit card promotion data in the slide (the one with 15 observations). Choose one of the variables as the target. Once the decision tree is done, pick one rule that is explanatory enough to construct a confusion matrix, and indicate its lift, coverage rate, and accuracy rate.

2)     A dataset has 1000 records and 50 variables, with 5% of the values missing, spread randomly throughout the records and variables. An analyst decides to remove records that have missing values. About how many records would you expect to be removed?

3)     Consider the following three-class confusion matrix. The matrix shows the classification results of a supervised model that uses previous voting records to determine the political party affiliation (Republican, Democrat, or Independent) of members of the United States Senate.

 

 

         Rep    Dem    Ind
Rep       42      2      1
Dem        5     40      3
Ind        0      3      4

  1. What percent of the instances were correctly classified?
  2. According to the confusion matrix, how many Democrats are in the Senate? How many Republicans? How many Independents?
  3. How many Republicans were classified as belonging to the Democratic Party?
  4. How many Independents were classified as Republicans?
  5. What are the accuracy rates of the classification for each column?
  6. What are the coverage rates of the classification?
  7. What are the values of FP and FN for each class? (Hint: split the matrix into three 2x2 matrices, one each for Rep, Dem, and Ind.)

 

4)     Go through AAEM Chapter 2. Use SAS Enterprise Miner 6.1 to complete the exercise on p.2-62. Take screenshots of the results; a few that explain your work are enough. You need to define a new library “AAEM61” using the aaem61 dataset, which is now available in the shared directory.

 

Submission format: Hardcopy. Please submit the homework to the TA.
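The hint in question 3 (splitting the three-class matrix into three 2x2 matrices) can be sketched in a few lines of Python. This is only an illustration, not part of the SAS course material. It assumes that rows are the actual parties and columns are the predicted parties; the outline does not state the orientation, so swap the roles if the slides define it the other way. The function names are made up for this sketch.

```python
# Assumption: rows = actual party, columns = predicted party.
LABELS = ["Rep", "Dem", "Ind"]
MATRIX = [
    [42, 2, 1],
    [5, 40, 3],
    [0, 3, 4],
]


def overall_accuracy(matrix):
    """Share of all instances that sit on the main diagonal."""
    correct = sum(matrix[i][i] for i in range(len(matrix)))
    total = sum(sum(row) for row in matrix)
    return correct / total


def one_vs_rest(matrix, k):
    """Collapse the matrix into TP, FN, FP, TN counts for class index k."""
    n = len(matrix)
    total = sum(sum(row) for row in matrix)
    tp = matrix[k][k]
    fn = sum(matrix[k][j] for j in range(n) if j != k)  # rest of row k
    fp = sum(matrix[i][k] for i in range(n) if i != k)  # rest of column k
    tn = total - tp - fn - fp
    return {"TP": tp, "FN": fn, "FP": fp, "TN": tn}


if __name__ == "__main__":
    print(f"overall accuracy: {overall_accuracy(MATRIX):.2f}")
    for k, label in enumerate(LABELS):
        print(label, one_vs_rest(MATRIX, k))
```

Each 2x2 result treats one party as the class of interest and lumps the other two together, which is what the hint asks for.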

 

------ + ------ + ------- + ------- + ------- + -------

 

Lecture 2, Date: 01/25/2012, Wednesday

 

Topic: Data mining fundamentals

 

Reading: SPB/RG/TSK Chapter 2

 

1)     Illustrative cases

2)     Confusion matrix

3)     Introduction to SAS Enterprise Miner

 

Terminology: predictor, observation, confidence, dependent variable, estimation, response, score, supervised learning, unsupervised learning

 

Readings and review questions:

1)     What is data mining?

2)     What is a decision tree?

3)     What is a confusion matrix?

4)     Getting Started with SAS EM 5.3/6.1

5)     AAEM Chapter 1

6)     Data mining case:

·         ROI CASE STUDY - SAS BUSINESS INTELLIGENCE IBM

·         Scandinavian Airlines Modernize Business Intelligence Capabilities 

 

SAS EM Exercises:

Explore the features of Citrix SAS Enterprise Miner 6.1

 

------ + ------ + ------- + ------- + ------- + -------

 

Lecture 3, Date: 01/30/2012, Monday

 

Topic: Data mining fundamentals

 

1)     Data for data mining

2)     Data exploration

3)     Data preprocessing

 

Readings:

1)     Getting Started with SAS EM 4.3/5.3/6.1

2)     AAEM Chapter 2

 

Review questions

·         Four different types of attributes of data, their properties

·         Different types of records and data

·         Data quality: what it is and how to ensure it

 

------ + ------ + ------- + ------- + ------- + -------

 

Lecture 4, Date: 02/01/2012, Wednesday

 

Topic: Classification modeling (I)

 

1)     Decision tree modeling

2)     Determining the best split using Gini, entropy, or misclassification error
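As a rough illustration of the three split measures named above, the sketch below computes Gini, entropy, and misclassification error from class counts, and the impurity reduction of a candidate split. The function names and the example counts are hypothetical and chosen only for this sketch; SAS Enterprise Miner has its own implementation and options.

```python
from math import log2


def gini(counts):
    """Gini index: 1 minus the sum of squared class proportions."""
    total = sum(counts)
    return 1.0 - sum((c / total) ** 2 for c in counts)


def entropy(counts):
    """Entropy in bits; empty classes contribute nothing."""
    total = sum(counts)
    return -sum((c / total) * log2(c / total) for c in counts if c > 0)


def misclassification_error(counts):
    """One minus the proportion of the majority class."""
    return 1.0 - max(counts) / sum(counts)


def weighted_impurity(children, measure):
    """Average impurity of the child nodes, weighted by node size."""
    n = sum(sum(child) for child in children)
    return sum(sum(child) / n * measure(child) for child in children)


def gain(parent, children, measure):
    """Impurity reduction achieved by splitting parent into children."""
    return measure(parent) - weighted_impurity(children, measure)


if __name__ == "__main__":
    parent = [6, 6]                # hypothetical class counts at a node
    children = [[5, 1], [1, 5]]    # hypothetical counts after a split
    for measure in (gini, entropy, misclassification_error):
        print(measure.__name__, round(gain(parent, children, measure), 4))
```

The split with the largest impurity reduction under the chosen measure is preferred; the three measures can rank candidate splits differently.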

 

Readings:

1)     TSK chapter 4

2)     Getting Started with SAS EM, Chapter 7

3)     AAEM Chapter 2

 

Exercises:

Exercises of AAEM: p2-14, p2-34, p2-62

 

Demo: Define a project, data exploration

------ + ------ + ------- + ------- + ------- + -------

 

Lecture 5, Date: 02/06/2012, Monday

 

Topic: Classification modeling (II)

 

1)     Quiz 1

2)     Determining when to stop splitting

3)     Other issues in decision tree modeling

 

Review readings and questions:

1)     TSK chapter 4 (It provides extensive information about decision tree modeling.)

2)     AAEM Chapter 3 (self-study 3.4 and 3.5)

3)     What is overfitting in the decision tree approach? How can it be prevented?

4)     How do you decide when to stop splitting a tree?

5)     How do you evaluate the performance of a model?

 

Homework 2 (due 02/20/2012 Monday):

1)     Check Section 4.1 of “Effective Web Mining” (document name: CCWEB_TKIT.pdf, pages 4-1 to 4-34). Use dataset DMAIL (in the shared space under the \Datasets\DATA_WM directory) to develop two decision tree models: a basic one without any parameter changes, and another that uses the Gini splitting criterion. Then add an Assessment node to the diagram to compare the performance of the two classification models. You don’t need to read the section in detail since it is based on an older version of SAS EM, but focus on: (1) the explanations of the variables, (2) which variable is the target, and (3) which variables are configured (see p.4-12). You can also explore the dataset to understand its quality and variable distributions. Feel free to try different splitting criteria (Chi-Square, Gini, and Entropy) and other parameters. If you need more information about how to use SAS EM 5.3 to solve the problem, check Chapter 3 in AAEM61.

2)     Construct a logistic regression model for the same dataset. Compare the results with those from the decision tree models.

3)     AAEM61 p.3-111-112, Exercises for Chapter 3.

 

The deliverables include

a.     the model diagram,

b.     one of the Assessment charts,

c.     the performance table in the results of the Assessment node, and

d.     a short explanation of each of the results.

 

Submission: Hardcopy.

 

Online references:

1)     Statistical hypothesis testing: http://en.wikipedia.org/wiki/Statistical_hypothesis_testing

2)     Null hypothesis: http://en.wikipedia.org/wiki/Null_hypothesis

3)     Alternative hypothesis: http://en.wikipedia.org/wiki/Alternative_hypothesis

4)     p-value (wiki): http://en.wikipedia.org/wiki/P-value

5)     What is p-value: http://www.childrens-mercy.org/stats/definitions/pvalue.htm

6)     Likelihood ratios in diagnostic testing: http://en.wikipedia.org/wiki/Likelihood_ratios_in_diagnostic_testing

 

------ + ------ + ------- + ------- + ------- + -------

 

Lecture 6, Date: 02/08/2012, Wednesday

 

Topic: Decision tree

 

1)     Decision tree modeling with SAS Enterprise Miner 6.1

2)     Classification modeling performance evaluation


Review readings:

1)     TSK chapter 4 (It provides extensive information about decision tree modeling.)

2)     AAEM Chapter 3 (self-study 3.4 and 3.5)

 

Review questions:

1)     Three prediction types in decision tree: decisions, rankings, and estimates (See p.3-70 of AAEM61)

2)     Why do we need to split a dataset into training, validation and test datasets? What are the different purposes of using validation and test datasets?

3)     What is Prior Probability? What is its relationship with the sample probability? How to define it in SAS EM?

4)     What is stratification? Why do we need it? How can you set stratification parameters in SAS EM?
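For review question 3, one common way to relate sample proportions to prior probabilities is to rescale a model's predicted probability after the primary outcome has been oversampled. The sketch below shows that textbook correction for a binary target. It is an assumption-laden illustration, not a description of what SAS EM does internally, and the function and parameter names are hypothetical.

```python
def adjust_for_priors(p_sample, sample_rate, prior_rate):
    """Rescale a primary-outcome probability from sample proportions to priors.

    p_sample    : probability predicted by a model trained on the sample
    sample_rate : proportion of the primary outcome in the training sample
    prior_rate  : proportion of the primary outcome in the population
    """
    primary = p_sample * prior_rate / sample_rate
    secondary = (1.0 - p_sample) * (1.0 - prior_rate) / (1.0 - sample_rate)
    return primary / (primary + secondary)


if __name__ == "__main__":
    # Hypothetical case: 50/50 training sample, 5% primary outcome in the population.
    for p in (0.2, 0.5, 0.8):
        print(p, round(adjust_for_priors(p, 0.5, 0.05), 4))
```

When the sample proportion equals the prior, the probability is returned unchanged, which is a quick sanity check on the formula.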

 

Demo: Decision tree

------ + ------ + ------- + ------- + ------- + -------

 

Lecture 7, Date: 02/13/2012, Monday

 

Topic: Constructing Decision Trees with SAS EM 6.1

 

1)     Prior probability

2)     Weighted decisions

3)     Exercise 1 (decision tree)

 

Review readings and questions:

1)     AAEM61 Chapter 3

2)     What are oversampling and undersampling respectively?

 

Online references:

1)     Logistic regression (Wikipedia): http://en.wikipedia.org/wiki/Logistic_regression

2)     An introduction to logistic regression: http://luna.cas.usf.edu/~mbrannic/files/regression/Logistic.html

3)     Logistic regression vs. OLS regression: http://www.upa.pdx.edu/IOA/newsom/da2/ho_logistic.pdf

4)     Binary vs. multinomial logistic regression: http://faculty.chass.ncsu.edu/garson/PA765/logistic.htm

5)     Odds in logistic regressions: http://www.jerrydallal.com/LHSP/logistic.htm

6)     SAS tutorial: http://support.sas.com/documentation/cdl/en/anlystug/58352/HTML/default/chap11_sect4.htm

 

------ + ------ + ------- + ------- + ------- + -------

 

Lecture 8, Date: 02/15/2012, Wednesday

 

Topic:  Classification Model Assessment

 

1)     Quiz 2

2)     Logistic regression

3)     Model fit statistics

4)     Statistical graphics – ROC Chart and Response Chart

 

Reading:

1)     AAEM61 Chapter 3, p.3-69 to 3-76

2)     AAEM61 Chapter 6

 

Review questions:

 

1)     Concordance vs. discordance (See p.3-73 of AAEM61)

2)     Complexity optimization

a.     Decisions: accuracy/misclassification (not weighted), profit/loss (weighted)

b.    Rankings: concordance/discordance

c.     Estimates: squared errors
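The ranking statistic in item b can be illustrated with a small sketch: take every pair of one primary-outcome case and one secondary-outcome case, and count how often the model scores the primary case higher (concordant), lower (discordant), or the same (tied). The scores and targets below are made up for illustration, and the function name is hypothetical.

```python
def concordance(scores, targets):
    """Return (concordant, discordant, tied) shares over all primary/secondary pairs.

    scores  : model scores, higher meaning more likely primary outcome
    targets : 1 for the primary outcome, 0 for the secondary outcome
    """
    primary = [s for s, t in zip(scores, targets) if t == 1]
    secondary = [s for s, t in zip(scores, targets) if t == 0]
    pairs = len(primary) * len(secondary)
    conc = sum(1 for p in primary for q in secondary if p > q)
    disc = sum(1 for p in primary for q in secondary if p < q)
    tied = pairs - conc - disc
    return conc / pairs, disc / pairs, tied / pairs


if __name__ == "__main__":
    scores = [0.9, 0.8, 0.3, 0.2]   # hypothetical model scores
    targets = [1, 0, 1, 0]          # hypothetical actual outcomes
    print(concordance(scores, targets))
```

The concordant share plus half of the tied share is the usual c-statistic, which equals the area under the ROC curve.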

 

Online references:

1)     Cumulative Lift: http://www.information-management.com/news/5329-1.html

2)     Schwarz Bayesian Criterion (SBC): http://en.wikipedia.org/wiki/Bayesian_information_criterion

3)     Kolmogorov-Smirnov test (K-S test): http://en.wikipedia.org/wiki/Kolmogorov%E2%80%93Smirnov_test

4)     Schwarz-Bayesian Criterion (SBC): http://www.associatedcontent.com/article/2175099/the_schwarz_bayesian_criterion_and.html

5)     Maximum likelihood: http://en.wikipedia.org/wiki/Maximum_likelihood

6)     ROC Chart: http://mrvar.fdv.uni-lj.si/pub/mz/mz3.1/vuk.pdf

7)     ROC Chart video: part 1 (9’41”), part 2 (3’28”).

 

------ + ------ + ------- + ------- + ------- + -------

 

Lecture 9, Date: 02/20/2012, Monday

 

Topic:  Classification Model Assessment

 

1)     Adjusting for separate sampling

2)     Profit matrices

3)     Model evaluations

 

SAS EM Lab: Model assessment

 

------ + ------ + ------- + ------- + ------- + -------

 

Lecture 10, Date: 02/22/2012, Wednesday

 

Topic:  Principles of Pattern Discovery

 

1)     Concepts of pattern discovery

2)     Distance measurements

3)     Term project orientation

4)     Basic concepts

5)     K-means methods

 

Review readings and questions:

1)      AAEM61 Chapter 8

2)      TSK chapter 8

 

Homework 3 (due 03/19/2012, Wednesday):

1) AAEM61 p.6-48, Exercises for Chapter 6.

2) AAEM61 p.8-58 to 8-59, Exercises for Chapter 8 (clustering).

3)  AAEM61 p.8-78 to 8-79, Exercises for Chapter 8 (Association analysis).

The solutions follow each exercise, so you can compare your results with the answer keys.

Deliverables:

1)     The screenshots of the final results

2)     The screenshots demonstrating your specific work

3)     Your answers to the questions with blanks in the exercises

 

------ + ------ + ------- + ------- + ------- + -------

 

Lecture 11, Date: 02/27/2012, Monday

 

Topic: Clustering (1)

 

1)     Exercise 2 review

2)     Introduction to clustering

 

e-learning – Clustering 1

 

Review reading:

1)     TSK chapter 8

2)     AAEM61 chapter 8

3)     Clustering: An Introduction, http://www.elet.polimi.it/upload/matteucc/Clustering/tutorial_html/index.html

4)     K-means clustering tutorial, http://people.revoledu.com/kardi/tutorial/kMean/index.html

 

Review questions:

1)  What are the main differences between clustering and classification in data mining?

2)  Check the dataset HMEQ. What would the outcomes be if you clustered it?

3) Use the clustering worksheet (http://zlin.ba.ttu.edu/6347/Clustering.xls) to explore different outcomes of clustering. Modify the coordinates of the instances to obtain different datasets and check the outcomes. Record the k-means clustering iterations for 2 different sets of instances, including the illustrative charts.
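The iterations that question 3 asks you to record can also be traced outside the worksheet. Below is a minimal k-means sketch in plain Python, assuming Euclidean distance and user-supplied starting centers. The points and starting centers are hypothetical and are not taken from Clustering.xls; the function names are chosen for this sketch.

```python
from math import dist  # Euclidean distance, Python 3.8+


def assign(points, centers):
    """Index of the nearest center for every point."""
    return [min(range(len(centers)), key=lambda k: dist(p, centers[k]))
            for p in points]


def update(points, labels, centers):
    """Move each center to the mean of its points; empty clusters stay put."""
    new_centers = []
    for k, old in enumerate(centers):
        members = [p for p, label in zip(points, labels) if label == k]
        if members:
            new_centers.append(tuple(sum(c) / len(members) for c in zip(*members)))
        else:
            new_centers.append(old)
    return new_centers


def k_means(points, centers, max_iter=100):
    """Alternate assignment and update until the centers stop moving."""
    history = []
    for _ in range(max_iter):
        labels = assign(points, centers)
        new_centers = update(points, labels, centers)
        history.append((labels, new_centers))
        if new_centers == centers:
            break
        centers = new_centers
    return labels, centers, history


if __name__ == "__main__":
    points = [(1, 1), (1, 2), (2, 1), (8, 8), (9, 8), (8, 9)]  # hypothetical
    labels, centers, history = k_means(points, [(1, 1), (8, 8)])
    for step, (step_labels, step_centers) in enumerate(history, start=1):
        print(step, step_labels, step_centers)
```

Changing the starting centers or the point coordinates and rerunning shows how the recorded iterations, and sometimes the final clusters, depend on the initial choice.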

------ + ------ + ------- + ------- + ------- + -------

 

Lecture 12, Date: 02/29/2012, Wednesday

 

Topic: Clustering (2)

 

1)     Hierarchical clustering

2)     Self-organizing map (SOM) method

3)     Applications

 

e-learning - Clustering 2

 

SAS EM Lab: Clustering

 

------ + ------ + ------- + ------- + ------- + -------

Lecture 13, Date: 03/05/2012, Monday

 

Topic: Association Analysis (1)

 

1)     Exercise 3 Part I

2)     Association analysis

 

Exercise 3 Part I

 

e-learning – Association Analysis

 

References:

1) Hierarchical clustering

1.     http://www.elet.polimi.it/upload/matteucc/Clustering/tutorial_html/hierarchical.html

2.     http://www.resample.com/xlminer/help/HClst/HClst_intro.htm

2)     Link analysis

3)     Web link analysis: http://nlp.stanford.edu/IR-book/html/htmledition/link-analysis-1.html

4)     Contingency table: http://en.wikipedia.org/wiki/Contingency_table

 

------ + ------ + ------- + ------- + ------- + -------

 

Lecture 14, Date: 03/07/2012, Wednesday

 

Topic: Association Analysis (2)

1)     Quiz 3

2)     Association analysis

 

e-learning – Pattern discovery Review & Exercise 3 (Part II)

 

Note: Exercise 3 is due on Monday, 3/26/2012.

 

SAS EM Lab: Exercise 3

 

 

------ + ------ + ------- + ------- + ------- + -------

 

Spring Break

 

------ + ------ + ------- + ------- + ------- + -------

 

Lecture 15, Date: 03/19/2012, Monday

 

Topic: Review

 

1)     Review

1.     Exercise 1

2.     Exercise 2

3.     Exercise 3 part I

4.     Quiz 3

2)     Exercise 3 Part II

 

------ + ------ + ------- + ------- + ------- + -------

 

Lecture 16, Date: 03/21/2012, Wednesday

 

Topic: Association Analysis

 

e-learning – Association Analysis

 

 

------ + ------ + ------- + ------- + ------- + -------

 

Lecture 17, Date: 03/26/2012, Monday

 

Topic:  An Introduction to Text Mining

 

1)       What is text mining

2)       Case study - SVDTUTOR

 

Slides: TM-1

Dataset: SVDTUTOR, Fedpapers

 

Readings: DMTM 1.1

 

Online references:

 

------ + ------ + ------- + ------- + ------- + -------

 

Lecture 18, Date: 03/28/2012, Wednesday

 

Topic: Principles of Text Mining (I)

 

1)     Quiz 4 (Association analysis)

2)     Data preparation

3)     Data conversion

 

Slides: TM-1

Readings: DMTM 1.2-1.4

 

Online materials for Singular Value Decomposition (SVD):

1)     Basics of Matrix: http://www.xycoon.com/matrix_algebra.htm

2)     http://mathworld.wolfram.com/SingularValueDecomposition.html

3)     http://www.uwlax.edu/faculty/will/svd/

4)     http://web.mit.edu/be.400/www/SVD/Singular_Value_Decomposition.htm

5)     http://kwon3d.com/theory/jkinem/svd.html

 

 

SAS EM Lab: Text mining

 

------ + ------ + ------- + ------- + ------- + -------

 

Lecture 19, Date: 04/02/2012, Monday

 

Topic:  Principles of Text Mining (II)

 

1)     Term-document matrix

2)     Singular value decomposition (SVD)
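To make the two topics above concrete, here is a small sketch that builds a term-document matrix from raw text and then estimates its largest singular value by power iteration. It is a teaching illustration under simplifying assumptions (whitespace tokenization, raw counts, no stop list or stemming, a nonzero matrix), not the procedure SAS Text Miner uses. The documents and function names are hypothetical.

```python
from math import sqrt


def term_document_matrix(docs):
    """Rows are terms (sorted), columns are documents, cells are raw counts."""
    tokenized = [doc.lower().split() for doc in docs]
    terms = sorted({word for words in tokenized for word in words})
    matrix = [[words.count(term) for words in tokenized] for term in terms]
    return terms, matrix


def top_singular_value(matrix, iterations=200):
    """Power iteration on A^T A: largest singular value and its right vector."""
    n_rows, n_cols = len(matrix), len(matrix[0])
    v = [1.0] * n_cols
    for _ in range(iterations):
        av = [sum(a * x for a, x in zip(row, v)) for row in matrix]  # A v
        atav = [sum(matrix[i][j] * av[i] for i in range(n_rows))    # A^T (A v)
                for j in range(n_cols)]
        norm = sqrt(sum(x * x for x in atav))
        v = [x / norm for x in atav]
    av = [sum(a * x for a, x in zip(row, v)) for row in matrix]
    return sqrt(sum(x * x for x in av)), v


if __name__ == "__main__":
    docs = ["data mining", "text mining mining", "data text"]  # hypothetical
    terms, matrix = term_document_matrix(docs)
    for term, row in zip(terms, matrix):
        print(f"{term:8s}", row)
    sigma, v = top_singular_value(matrix)
    print("largest singular value:", round(sigma, 4))
```

A full SVD yields one such singular value and vector pair per dimension; keeping only the largest few gives the reduced representation of documents that text mining works with.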

 

Slides: TM-1&2

Readings: DMTM 1.2-1.4

 

Homework assignment 5 (optional, due 04/23/2012): (check here)

 

------ + ------ + ------- + ------- + ------- + -------

 

Lecture 20, Date: 04/04/2012, Wednesday

 

Topic: Exploratory analysis of documents (I)

 

1)     Case study – SAS courses

2)     Exercise 4

 

Slides: TM-2

Readings: DMTM Chapter 2

 

SAS EM Lab: Exercise 4

 

------ + ------ + ------- + ------- + ------- + -------

 

Date: 04/09/2012, Monday

 

No class

 

------ + ------ + ------- + ------- + ------- + -------

 

Lecture 21, Date: 04/11/2012, Wednesday

 

Topic:  Exploratory analysis of documents (II)

 

1)     Quiz 5 (Text mining)

2)     Issues in predictive text mining

3)     Case study – Recovering potentials in workers’ compensation insurance claims

 

Slides: TM-2

Readings: DMTM Chapter 2, EM_TMGS

 

------ + ------ + ------- + ------- + ------- + -------

 

Lecture 22, Date: 04/16/2012, Monday

 

Topic: Logistic Regression (I)

 

1)     Selecting Regression Inputs

2)     Optimizing Regression Complexity

3)     Transforming Inputs

 

Review readings and questions:

1)     AAEM61 Chapter 4

2)     SPB chapter 7 & 8, RG chapter 10 (pp 291-302)

3)     Questions

a.     Decision tree vs. logistic regression: which is better for constructing classification models?

b.    What is an odds ratio? How do you interpret the significance of an input variable in a logistic regression?

c.     List the main points to check when reviewing logistic regression results.

 

Online references:

1)     Chi-square significance test: http://faculty.chass.ncsu.edu/garson/PA765/chisq.htm

2)     Odds = p / (1 – p), and odds against = 1 / odds: http://en.wikipedia.org/wiki/Odds

3)     Odds ratio: http://en.wikipedia.org/wiki/Odds_ratio

4)     Type 3 analysis of effect (testing the significance after adding a new input variable to the model which already had other inputs.): http://www.technion.ac.il/docs/sas/stat/chap29/sect31.htm

5)     SAS data analysis (addressing Type 3 analysis): http://www.ats.ucla.edu/stat/sas/dae/intreg.htm
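The odds definitions in reference 2 and the odds ratio in reference 3 can be checked numerically. The sketch below also shows why the exponential of a logistic regression coefficient is read as an odds ratio: raising the linear predictor by a coefficient multiplies the odds by the exponential of that coefficient. The probabilities, coefficient, and function names are hypothetical values chosen for illustration.

```python
from math import exp


def odds(p):
    """Odds in favor: p / (1 - p)."""
    return p / (1.0 - p)


def odds_against(p):
    """Odds against: 1 / odds."""
    return 1.0 / odds(p)


def odds_ratio(p1, p0):
    """Ratio of the odds at probability p1 to the odds at probability p0."""
    return odds(p1) / odds(p0)


def logistic(z):
    """Inverse logit: maps a linear predictor z to a probability."""
    return 1.0 / (1.0 + exp(-z))


if __name__ == "__main__":
    print("odds at p = 0.8:", round(odds(0.8), 4))
    print("odds against at p = 0.8:", round(odds_against(0.8), 4))

    beta = 0.7   # hypothetical coefficient of one input
    z = 0.3      # hypothetical linear predictor before a one-unit increase
    ratio = odds_ratio(logistic(z + beta), logistic(z))
    print("odds ratio:", round(ratio, 6), "exp(beta):", round(exp(beta), 6))
```

The two printed values on the last line agree for any starting value of z, which is why the odds ratio of an input does not depend on the values of the other inputs.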

 

------ + ------ + ------- + ------- + ------- + -------

 

Lecture 23, Date: 04/018/2012, Wednesday

 

Topic: Logistic Regression (II)

 

1)     Categorical Inputs

2)     Classification model scoring

 

Reading: AAEM61 Chapter 4

 

------ + ------ + ------- + ------- + ------- + -------

 

Lecture 24, Date: 04/23/2012, Monday

 

Topic:  Neural networks (I)

 

1)     Quiz 6 (Logistic regression)

2)     Principles of neural networks

3)     Applying neural networks to classification

 

Reading:  AAEM61 chapter 5

 

Online references:

1) Hyperbolic tangent:

a.     http://math.jccc.net:8180/webMathematica/MSP/mmartin/tanh.msp

b.    http://www.2dcurves.com/exponential/exponentialht.html

c.     http://en.wiktionary.org/wiki/hyperbolic_tangent

2)       Hyperbolic sine: http://en.wiktionary.org/wiki/hyperbolic_sine

3)       Hyperbolic cosine: http://en.wiktionary.org/wiki/hyperbolic_cosine
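The hyperbolic functions listed above are presumably referenced because the hyperbolic tangent is a common hidden-unit activation function in neural networks; that reading is an assumption, since the outline does not say so. The sketch below builds tanh from the exponential definitions of sinh and cosh and checks that it is a rescaled logistic function. The helper names are chosen for this sketch.

```python
from math import exp, tanh


def sinh_def(x):
    """Hyperbolic sine from its exponential definition."""
    return (exp(x) - exp(-x)) / 2.0


def cosh_def(x):
    """Hyperbolic cosine from its exponential definition."""
    return (exp(x) + exp(-x)) / 2.0


def tanh_def(x):
    """Hyperbolic tangent as sinh / cosh; output lies between -1 and 1."""
    return sinh_def(x) / cosh_def(x)


def logistic(z):
    """Logistic function; output lies between 0 and 1."""
    return 1.0 / (1.0 + exp(-z))


if __name__ == "__main__":
    for x in (-2.0, -0.5, 0.0, 0.5, 2.0):
        print(x,
              round(tanh_def(x), 6),
              round(tanh(x), 6),                      # library value
              round(2.0 * logistic(2.0 * x) - 1.0, 6))  # rescaled logistic
```

All three columns agree, so a tanh hidden unit and a logistic hidden unit differ only by a shift and scale of input and output.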

 

------ + ------ + ------- + ------- + ------- + -------

 

Lecture 25, Date: 04/25/2012, Wednesday

 

Topic:  Neural networks (II)

 

1)     Principles of neural networks

2)     Applying neural networks to classification

 

Reading: AAEM61 chapter 5

 

------ + ------ + ------- + ------- + ------- + -------

 

Lecture 26, Date: 04/30/2012, Monday

 

Topic:  Model implementation

 

1)     Internally scored data

2)     Score data modules

 

Reading: AAEM61 chapter 7

------ + ------ + ------- + ------- + ------- + -------

 

Lecture 27, Date: 05/02/2012, Wednesday

 

Topic:  Special Topics

 

1)     Quiz 7 (Neural network)

2)     Exercise 6

3)     Ensemble modeling

4)     Variable selection

 

Reading: AAEM61 chapter 9

 

------ + ------ + ------- + ------- + ------- + -------

Lecture 28, Date: 05/07/2012, Monday

 

Topic:  Review

 

------ + ------ + ------- + ------- + ------- + -------