ISQS 6347 Data & Text Mining Lecture Outlines

 

(Total 30 lectures, January 17 - May 2, 2013)

 

SAS Course Notes

AAEM: Applied Analytics Using SAS® Enterprise Miner™

EM_GS: Getting Started with SAS® Enterprise Miner

EM_TMGS: Getting Started with SAS® 9.1 Text Miner

CSA: Data Mining - A Case Study Approach

DMTM: Text Mining Using SAS® Software

ADMT: Applying Data Mining Techniques Using Enterprise Miner

CCWEB: Effective Web Mining: Attracting and Keeping Valued Cyber Consumers

 

Optional Textbooks

 

SPB: Data Mining for Business Intelligence, Galit Shmueli, Nitin R. Patel, Peter C. Bruce

TSK: Introduction to Data Mining, Pang-Ning Tan, Michael Steinbach, and Vipin Kumar

RG: Data Mining – A Tutorial Based Primer, Richard Roiger, Michael Geatz

 

------ + ------ + ------- + ------- + ------- + -------

 


 

Lecture 1, Date: 01/17/2013, Thursday

 

Topic: Introduction to Data Mining

 

Reading: SPB/RG/TSK Chapter 1

 

References:

1.     SAS resources for instructors and students

2.     SAS courses

3.     Data mining tutorials

4.     ZenTut tutorials

 

------ + ------ + ------- + ------- + ------- + -------

 

Lecture 2, Date: 01/22/2013, Tuesday

 

Topic: Data mining fundamentals

 

Reading: SPB/RG/TSK Chapter 2

 

1)     Illustrative cases

2)     Confusion matrix

3)     Introduction to SAS Enterprise Miner
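A confusion matrix for a binary target can be sketched in a few lines. This is a minimal illustration only: the labels below are invented, and the function names are hypothetical rather than anything taken from SAS Enterprise Miner.

```python
from collections import Counter


def confusion_matrix(actual, predicted, positive=1):
    """Count true/false positives and negatives for a binary target."""
    counts = Counter()
    for a, p in zip(actual, predicted):
        if p == positive:
            counts["TP" if a == positive else "FP"] += 1
        else:
            counts["FN" if a == positive else "TN"] += 1
    return {k: counts[k] for k in ("TP", "FP", "FN", "TN")}


def accuracy(cm):
    """Share of cases on the diagonal of the confusion matrix."""
    return (cm["TP"] + cm["TN"]) / sum(cm.values())


# Hypothetical labels, invented for illustration only.
actual = [1, 0, 1, 1, 0, 0, 1, 0]
predicted = [1, 0, 0, 1, 0, 1, 1, 0]

cm = confusion_matrix(actual, predicted)
print(cm)
print("accuracy:", accuracy(cm))
```

The misclassification rate is simply 1 minus the accuracy.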

 

Terminology: predictor, observation, confidence, dependent variable, estimation, response, score, supervised learning, unsupervised learning

 

Readings and review questions:

1)     What is data mining?

2)     What is a decision tree?

3)     What is a confusion matrix?

4)     Getting Started with SAS EM 5.3/6.1

5)     AAEM Chapter 1

6)     Data mining case: Scandinavian Airlines Modernize Business Intelligence Capabilities 

 

SAS EM Exercises:

Explore the features of Citrix SAS Enterprise Miner 6.1

 

------ + ------ + ------- + ------- + ------- + -------

 

Lecture 3, Date: 01/24/2013, Thursday

 

Topic: Data mining fundamentals

 

1)     Data for data mining

2)     Data exploration

3)     Data preprocessing

 

Readings:

1)     Getting Started with SAS EM 4.3/5.3/6.1

2)     AAEM Chapter 2

 

Review questions

·         Four different types of attributes of data, their properties

·         Different types of records and data

·         Data quality: what it is and how to ensure it

 

------ + ------ + ------- + ------- + ------- + -------

 

Lecture 4, Date: 01/29/2013, Tuesday

 

Topic: Classification modeling (I)

 

1)     Decision tree modeling

2)     Determining the best split using the Gini index, entropy, or misclassification error

3)     Determining when to stop splitting
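The three impurity measures named above can be compared on a toy split. This is a sketch using the textbook definitions (as in TSK chapter 4); the class counts are made up, and the function names are illustrative, not SAS EM options.

```python
import math


def gini(counts):
    """Gini index of a node given its class counts."""
    n = sum(counts)
    return 1.0 - sum((c / n) ** 2 for c in counts)


def entropy(counts):
    """Entropy (in bits) of a node given its class counts."""
    n = sum(counts)
    return -sum((c / n) * math.log2(c / n) for c in counts if c > 0)


def misclassification_error(counts):
    """Error rate when the node predicts its majority class."""
    return 1.0 - max(counts) / sum(counts)


def split_impurity(children, measure):
    """Average impurity of the child nodes, weighted by node size."""
    total = sum(sum(c) for c in children)
    return sum(sum(c) / total * measure(c) for c in children)


# Hypothetical node: 5 cases of each class, split into two children.
parent = [5, 5]
children = [[4, 1], [1, 4]]

for measure in (gini, entropy, misclassification_error):
    gain = measure(parent) - split_impurity(children, measure)
    print(measure.__name__, "reduction:", round(gain, 4))
```

The candidate split with the largest impurity reduction is the one a tree would choose under that measure.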

 

Readings:

1)     TSK chapter 4

2)     Getting Started with SAS EM, Chapter 7

3)     AAEM Chapter 2

 

Exercises:

Exercises of AAEM: p2-14, p2-34, p2-62

 

Demo: Define a project, data exploration

------ + ------ + ------- + ------- + ------- + -------

 

Lecture 5, Date: 01/31/2013, Thursday

 

Topic: Classification modeling (II)

 

1)     Quiz 1

2)     Three types of predictive modeling

3)     Data mining demonstration using SAS Enterprise Miner 6.1 – the case of Exercise 1

 

Review readings and questions:

1)     TSK chapter 4 (It provides extensive information about decision tree modeling.)

2)     AAEM Chapter 3 (self-study 3.4 and 3.5)

3)     What is overfitting in the decision tree approach? How can it be prevented?

4)     How do you decide when to stop splitting a tree?

5)     How do you evaluate the performance of a model?

 

Online references:

1)     Statistical hypothesis testing: http://en.wikipedia.org/wiki/Statistical_hypothesis_testing

2)     Null hypothesis: http://en.wikipedia.org/wiki/Null_hypothesis

3)     Alternative hypothesis: http://en.wikipedia.org/wiki/Alternative_hypothesis

4)     p-value (wiki): http://en.wikipedia.org/wiki/P-value

5)     What is p-value: http://www.childrens-mercy.org/stats/definitions/pvalue.htm

6)     Likelihood ratios in diagnostic testing: http://en.wikipedia.org/wiki/Likelihood_ratios_in_diagnostic_testing

 

Homework 2 (due 02/19/2013, Tuesday):

1)     Review Section 4.1 of “Effective Web Mining” (document name: CCWEB_TKIT.pdf, pages 4-1 to 4-34). Use the dataset DMAIL (in the shared space under the \Datasets\DATA_WM directory) to develop two decision tree models: a basic one with no parameter changes, and one that uses the Gini splitting criterion. Then add an Assessment node to the diagram to compare the performance of the two classification models. You don’t need to read the section in detail, since it is based on an older version of SAS EM; focus instead on (1) the explanations of the variables, (2) which variable is the target, and (3) which variables are configured (see p.4-12). You can also explore the dataset to understand its quality and variable distributions. Feel free to try the different splitting criteria (Chi-Square, Gini, and Entropy) and other parameters. If you need more information about how to use SAS EM 5.3 to solve the problem, see Chapter 3 of AAEM61.

2)     AAEM61 p.3-111-112, Exercises for Chapter 3.

 

The deliverables include

a.     the model diagram,

b.     one of the Assessment charts,

c.     the performance table in the results of the Assessment node, and

d.     a short explanation of each result.

 

Submission: Hardcopy.

 

------ + ------ + ------- + ------- + ------- + -------

 

Lecture 6, Date: 02/05/2013, Tuesday

 

Topic: Decision Tree (I)

 

1)     Decision tree modeling with SAS Enterprise Miner 6.1

2)     Other issues in decision tree modeling

3)     Classification modeling performance evaluation

4)     Exercise 1


Review readings:

1)     TSK chapter 4 (It provides extensive information about decision tree modeling.)

2)     AAEM Chapter 3 (self-study 3.4 and 3.5)

 

Review questions:

1)     Three prediction types in decision tree: decisions, rankings, and estimates (See p.3-70 of AAEM61)

2)     Why do we need to split a dataset into training, validation and test datasets? What are the different purposes of using validation and test datasets?

3)     What is prior probability? How does it relate to the sample probability? How is it defined in SAS EM?

4)     What is stratification? Why do we need it? How can you set stratification parameters in SAS EM?
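As a rough illustration of questions 2 and 4, the sketch below partitions a dataset into training, validation, and test sets while stratifying on the target, so each partition keeps the same event rate. The 50/25/25 fractions, the seed, and the function name are assumptions made for this example; in SAS EM this is configured in the Data Partition node rather than coded by hand.

```python
import random
from collections import defaultdict


def stratified_partition(labels, fractions=(0.5, 0.25, 0.25), seed=0):
    """Return (train, validation, test) index lists, split within each class."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for i, y in enumerate(labels):
        by_class[y].append(i)

    parts = ([], [], [])
    for idx in by_class.values():
        rng.shuffle(idx)
        n_train = round(fractions[0] * len(idx))
        n_valid = round(fractions[1] * len(idx))
        parts[0].extend(idx[:n_train])
        parts[1].extend(idx[n_train:n_train + n_valid])
        parts[2].extend(idx[n_train + n_valid:])  # remainder goes to test
    return parts


# Hypothetical target: 20 events and 80 non-events.
labels = [1] * 20 + [0] * 80
train, valid, test = stratified_partition(labels)

for name, part in (("train", train), ("validation", valid), ("test", test)):
    event_rate = sum(labels[i] for i in part) / len(part)
    print(name, len(part), event_rate)
```

Without stratification, a rare target level can end up over- or under-represented in the smaller partitions purely by chance.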

 

------ + ------ + ------- + ------- + ------- + -------

 

Lecture 7, Date: 02/07/2013, Thursday

 

Topic: Decision Tree (II)

 

1)     Prior probability

2)     Weighted decisions

 

Review readings and questions:

1)     AAEM61 Chapter 3

2)     What are oversampling and undersampling?
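When the rare class has been oversampled, model scores reflect the sample proportions rather than the population. A common Bayes-rule correction is sketched below; it is the generic formula, and the source does not say that SAS EM applies it in exactly this form. The function and parameter names are hypothetical.

```python
def adjust_for_priors(p_sample, prior, sample_rate):
    """Rescale a posterior probability estimated on a resampled dataset.

    p_sample    -- predicted event probability from the model
    prior       -- event proportion in the population (prior probability)
    sample_rate -- event proportion in the training sample
    """
    w1 = prior / sample_rate                  # weight for the event class
    w0 = (1.0 - prior) / (1.0 - sample_rate)  # weight for the non-event class
    numerator = p_sample * w1
    return numerator / (numerator + (1.0 - p_sample) * w0)


# Hypothetical case: events are 5% of the population but 50% of the sample.
for p in (0.2, 0.5, 0.9):
    print(p, "->", round(adjust_for_priors(p, prior=0.05, sample_rate=0.5), 4))
```

A score equal to the sample event rate maps back to the prior, and the adjustment leaves scores unchanged when the sample rate equals the prior.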

 

------ + ------ + ------- + ------- + ------- + -------

 

Lecture 8, Date: 02/12/2013, Tuesday

 

Topic: Classification Model Assessment

 

1)     Quiz 2

2)     Model fit statistics

3)     Statistical graphics – ROC Chart and Response Chart

 

Reading:

1)     AAEM61 Chapter 3, p.3-69 to 3-76

2)     AAEM61 Chapter 6

 

Review questions:

 

1)     Concordance vs. discordance (See p.3-73 of AAEM61)

2)     Complexity optimization

a.     Decisions: accuracy/misclassification (not weighted), profit/loss (weighted)

b.     Rankings: concordance/discordance

c.     Estimates: squared errors
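The ranking and estimate statistics listed above can be computed directly from scored cases. This is a minimal sketch with invented scores; the function names are illustrative only.

```python
def concordance(scores, targets):
    """Fractions of (event, non-event) pairs that are concordant, discordant, tied."""
    events = [s for s, t in zip(scores, targets) if t == 1]
    non_events = [s for s, t in zip(scores, targets) if t == 0]
    conc = disc = ties = 0
    for e in events:
        for n in non_events:
            if e > n:
                conc += 1   # event scored higher: concordant pair
            elif e < n:
                disc += 1   # event scored lower: discordant pair
            else:
                ties += 1
    pairs = len(events) * len(non_events)
    return conc / pairs, disc / pairs, ties / pairs


def average_squared_error(scores, targets):
    """Mean squared difference between the estimate and the 0/1 target."""
    return sum((s - t) ** 2 for s, t in zip(scores, targets)) / len(scores)


# Hypothetical scored cases, invented for illustration.
scores = [0.9, 0.8, 0.4, 0.6, 0.3, 0.4]
targets = [1, 1, 1, 0, 0, 0]

conc, disc, ties = concordance(scores, targets)
print("concordance:", conc, "discordance:", disc, "ties:", ties)
print("c-statistic:", conc + 0.5 * ties)
print("ASE:", average_squared_error(scores, targets))
```

The c-statistic printed here (concordance plus half the ties) equals the area under the ROC curve.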

 

Online references:

1)     Cumulative Lift: http://www.information-management.com/news/5329-1.html

2)     Schwarz Bayesian Criterion (SBC): http://en.wikipedia.org/wiki/Bayesian_information_criterion

3)     Kolmogorov-Smirnov test (K-S test): http://en.wikipedia.org/wiki/Kolmogorov%E2%80%93Smirnov_test

4)     Schwarz-Bayesian Criterion (SBC): http://www.associatedcontent.com/article/2175099/the_schwarz_bayesian_criterion_and.html

5)     Maximum likelihood: http://en.wikipedia.org/wiki/Maximum_likelihood

6)     Receiver Operating Characteristics Chart: http://www.predixionsoftware.com/predixion/help/Insight_Analytics/Viewer_v2_topics/Accuracy_Charts/ROC_Chart.htm

7)     ROC Chart video: part 1 (9’41”), part 2 (3’28”).

 

------ + ------ + ------- + ------- + ------- + -------

 

Lecture 9, Date: 02/14/2013, Thursday

 

Topic:  Classification Model Assessment & Logistic Regression

 

1)     Adjusting for separate sampling

2)     Profit matrices

3)     Model evaluations

4)     Introduction to logistic regression

5)     Exercise 2

 

Homework 3 (due 02/28/2013, Thursday):

1)     AAEM61 p.4-82, Exercises for Chapter 4.

2)     AAEM61 p.6-48, Exercises for Chapter 6.

Develop your own solution for each exercise before comparing your results with the answer keys.

Deliverables:

·         The screenshots of the final results

·         The screenshots demonstrating your specific work

·         Your answers to the questions with blanks in the exercises

 

Online references:

1)     Logistic regression (Wikipedia.org): http://en.wikipedia.org/wiki/Logistic_regression

2)     An introduction to logistic regression: http://luna.cas.usf.edu/~mbrannic/files/regression/Logistic.html

3)     Logistic regression vs. OLS regression: http://www.upa.pdx.edu/IOA/newsom/da2/ho_logistic.pdf

4)     Binary vs. multinomial logistic regression: http://faculty.chass.ncsu.edu/garson/PA765/logistic.htm

5)     Odds in logistic regressions: http://www.jerrydallal.com/LHSP/logistic.htm

6)     SAS tutorial: http://support.sas.com/documentation/cdl/en/anlystug/58352/HTML/default/chap11_sect4.htm

 

------ + ------ + ------- + ------- + ------- + -------

 

Lecture 10, Date: 02/19/2013, Tuesday

 

Topic: Logistic Regression (I)

 

1)     Selecting Regression Inputs

2)     Optimizing Regression Complexity

3)     Transforming Inputs

4)     Exercise 2

 

Review readings and questions:

1)     AAEM61 Chapter 4

2)     SPB chapter 7 & 8, RG chapter 10 (pp 291-302)

3)     Questions

a.     Decision tree vs. logistic regression: which is better for constructing classification models?

b.     What is an odds ratio? How do you interpret the significance of an input variable in a logistic regression?

c.     List the main points to focus on when reviewing logistic regression results.

 

Online references:

1)     Chi-square significance test : http://faculty.chass.ncsu.edu/garson/PA765/chisq.htm

2)     Odds = p / (1 – p), and odds against = 1 / odds: http://en.wikipedia.org/wiki/Odds

3)     Odds ratio : http://en.wikipedia.org/wiki/Odds_ratio

4)     Type 3 analysis of effect (testing the significance after adding a new input variable to the model which already had other inputs.): http://www.technion.ac.il/docs/sas/stat/chap29/sect31.htm

5)     SAS data analysis (addressing Type 3 analysis): http://www.ats.ucla.edu/stat/sas/dae/intreg.htm
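The odds definitions in item 2 translate directly into code. This is a small sketch with made-up probabilities; the helper names are illustrative.

```python
import math


def odds(p):
    """Odds in favor of an event with probability p: p / (1 - p)."""
    return p / (1.0 - p)


def odds_against(p):
    """Odds against the event: 1 / odds."""
    return 1.0 / odds(p)


def odds_ratio(p1, p0):
    """Ratio of the odds in two groups (e.g., with and without an input)."""
    return odds(p1) / odds(p0)


def logit(p):
    """Log-odds, the quantity a logistic regression models linearly."""
    return math.log(odds(p))


# Hypothetical event probabilities in two groups.
p_group1, p_group0 = 0.75, 0.5
print("odds:", odds(p_group1), "odds against:", odds_against(p_group1))
print("odds ratio:", odds_ratio(p_group1, p_group0))
print("logit:", logit(p_group1))
```

In a logistic regression, exponentiating an input's coefficient gives the odds ratio for a one-unit increase in that input, holding the other inputs fixed.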

 

------ + ------ + ------- + ------- + ------- + -------

 

Lecture 11, Date: 02/21/2013, Thursday

 

Topic: Classification modeling review

 

Review readings: AAEM61 Chapter 2, 3, 4, 6

------ + ------ + ------- + ------- + ------- + -------

 

Lecture 12, Date: 02/26/2013, Tuesday

 

Topic: Model Implementation

 

1)     Internally scored data

2)     Score data modules

 

Readings:

1)     AAEM61 Chapter 5

2)     http://www.kdnuggets.com/polls/2008/using-PMML-to-deploy-data-mining.htm

3)     PMML http://en.wikipedia.org/wiki/Predictive_Model_Markup_Language

 

------ + ------ + ------- + ------- + ------- + -------

 

Lecture 13, Date: 02/28/2013, Thursday

 

Topic: Principles of Clustering

 

1)     Concepts of pattern discovery

2)     Distance measurements

3)     Basic concepts

4)     K-means methods

 

Review reading:

1)     TSK chapter 8

2)     AAEM61 chapter 8

3)     Clustering: An Introduction, http://www.elet.polimi.it/upload/matteucc/Clustering/tutorial_html/index.html

4)     K-means clustering tutorial, http://people.revoledu.com/kardi/tutorial/kMean/index.html

 

Review questions:

1)  What are the main differences between clustering and classification in data mining?

2)  Examine the HMEQ dataset. What outcomes would you expect if you clustered it?

3) Use the clustering worksheet (http://zlin.ba.ttu.edu/6347/Clustering.xls) to explore different clustering outcomes. Modify the coordinates of the instances to obtain different datasets and check the outcomes. Record the k-means clustering iterations, including the illustrative charts, for 2 different sets of instances.
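For readers who want to reproduce the iterations outside the worksheet, here is a minimal k-means sketch (Lloyd's algorithm with squared Euclidean distance). The points and starting centroids are invented and are not taken from Clustering.xls.

```python
def kmeans(points, centroids, max_iter=100):
    """Alternate between assigning points and recomputing centroids."""
    centroids = [tuple(c) for c in centroids]
    assignment = None
    for _ in range(max_iter):
        # Assignment step: each point goes to its nearest centroid.
        new_assignment = [
            min(range(len(centroids)),
                key=lambda k: sum((a - b) ** 2 for a, b in zip(p, centroids[k])))
            for p in points
        ]
        if new_assignment == assignment:
            break  # converged: no point changed cluster
        assignment = new_assignment
        # Update step: move each centroid to the mean of its members.
        for k in range(len(centroids)):
            members = [p for p, a in zip(points, assignment) if a == k]
            if members:  # an empty cluster keeps its old centroid
                centroids[k] = tuple(sum(c) / len(members) for c in zip(*members))
    return centroids, assignment


# Hypothetical 2-D instances and deliberately poor starting centroids.
points = [(1, 1), (1, 2), (2, 1), (8, 8), (9, 8), (8, 9)]
centroids, assignment = kmeans(points, centroids=[(1, 1), (2, 1)])
print(assignment)
print(centroids)
```

Changing the starting centroids or the point coordinates, as the worksheet exercise suggests, can change both the number of iterations and the final clusters.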

 

------ + ------ + ------- + ------- + ------- + -------

 

Lecture 14, Date: 03/05/2013, Tuesday

 

Topic: Principles of Association Analysis

 

 

------ + ------ + ------- + ------- + ------- + -------

 

Lecture 15, Date: 03/07/2013, Thursday

 

Topic: Pattern Discovery Case Study - Clustering (1)

 

See: http://zlin.ba.ttu.edu/6347/PatternDiscovery-Clustering2013.htm

------ + ------ + ------- + ------- + ------- + -------

 

Spring Break

 

------ + ------ + ------- + ------- + ------- + -------

 

Lecture 16, Date: 03/19/2013, Tuesday

 

Topic: Pattern Discovery Case Study – Clustering (2)

 

See: http://zlin.ba.ttu.edu/6347/PatternDiscovery-Clustering2013-2.htm

 

------ + ------ + ------- + ------- + ------- + -------

 

Lecture 17, Date: 03/21/2013, Thursday

 

Topic: Pattern Discovery Case Study – Association Analysis

 

See: http://zlin.ba.ttu.edu/6347/PatternDiscovery-AA2013.htm

 

------ + ------ + ------- + ------- + ------- + -------

 

Lecture 18, Date: 03/26/2013, Tuesday

 

Topic: Pattern Discovery Review

 

------ + ------ + ------- + ------- + ------- + -------

 

Lecture 19, Date: 03/28/2013, Thursday

 

Topic:  An Introduction to Text Mining (I)

 

1)       What is text mining?

2)       Case study – SVDTUTOR

3)       Quiz 5 (clustering, association analysis)

 

Slides: TM-1

Dataset: SVDTUTOR, Fedpapers

 

Readings: DMTM 1.1

 

Online references:

 

Homework assignment 5 (optional, due 04/16/2013)

 

------ + ------ + ------- + ------- + ------- + -------

 

Lecture 20, Date: 04/02/2013, Tuesday

 

Topic:  An Introduction to Text Mining (II)

 

1)     Review Quiz 5

2)     Review Exercise 2

3)     Fedpapers case

4)     Textual data preparation

 

Slides: TM-1

Dataset: SVDTUTOR, Fedpapers

 

Readings: DMTM 1.1-1.2

 

------ + ------ + ------- + ------- + ------- + -------

 

Lecture 21, Date: 04/04/2013, Thursday

 

Topic: Principles of Text Mining (I)

 

1)     Exercise 4

2)     Data preparation: Synonym, Stop/Start list, Part of Speech, Stemming

3)     Data conversion

 

Slides: TM-1

Readings: DMTM 1.3-1.4

 

Online materials for Singular Value Decomposition (SVD):

1)     Basics of Matrix: http://www.xycoon.com/matrix_algebra.htm

2)     http://mathworld.wolfram.com/SingularValueDecomposition.html

3)     http://www.uwlax.edu/faculty/will/svd/

4)     http://web.mit.edu/be.400/www/SVD/Singular_Value_Decomposition.htm

5)     http://kwon3d.com/theory/jkinem/svd.html

 

 

------ + ------ + ------- + ------- + ------- + -------

 

Lecture 22, Date: 04/09/2013, Tuesday

 

Topic:  Principles of Text Mining (II)

 

1)     Term-document matrix

2)     Singular value decomposition (SVD)
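The two topics above can be illustrated on a toy corpus. The sketch below builds a term-document count matrix and then estimates only the largest singular value by power iteration; a full SVD would normally come from a numerical library. The documents, the stop list, and the function names are assumptions for this example and do not reproduce SAS Text Miner's processing.

```python
import math


def term_document_matrix(docs, stop_words=()):
    """Rows are terms (sorted), columns are documents, entries are raw counts."""
    tokenized = [[w for w in d.lower().split() if w not in stop_words]
                 for d in docs]
    terms = sorted({w for doc in tokenized for w in doc})
    matrix = [[doc.count(t) for doc in tokenized] for t in terms]
    return terms, matrix


def leading_singular_value(a, iterations=200):
    """Power iteration on A^T A; assumes A is nonzero and the all-ones start
    vector is not orthogonal to the leading right singular vector."""
    n_rows, n_cols = len(a), len(a[0])
    v = [1.0] * n_cols
    for _ in range(iterations):
        av = [sum(a[i][j] * v[j] for j in range(n_cols)) for i in range(n_rows)]
        atav = [sum(a[i][j] * av[i] for i in range(n_rows)) for j in range(n_cols)]
        norm = math.sqrt(sum(x * x for x in atav))
        v = [x / norm for x in atav]  # keep v at unit length
    av = [sum(a[i][j] * v[j] for j in range(n_cols)) for i in range(n_rows)]
    return math.sqrt(sum(x * x for x in av))  # ||A v|| for unit v


# Hypothetical three-document corpus.
docs = ["data mining with sas", "text mining with sas", "text mining"]
terms, matrix = term_document_matrix(docs, stop_words={"with"})
for term, row in zip(terms, matrix):
    print(term, row)
print("largest singular value:", leading_singular_value(matrix))
```

Keeping only the few largest singular values and their vectors is what reduces a sparse term-document matrix to a small number of dimensions.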

 

Slides: TM-1&2

Readings: DMTM 1.2-1.4

 

 

 

------ + ------ + ------- + ------- + ------- + -------

 

Lecture 23, Date: 04/11/2013, Thursday

 

Topic: Exploratory analysis of documents (I)

 

1)     Case study – SAS courses

2)     Exercise 4

 

Slides: TM-2

Readings: DMTM Chapter 2

                             

------ + ------ + ------- + ------- + ------- + -------

 

Lecture 24, Date: 04/16/2013, Tuesday

 

Topic:  Exploratory analysis of documents (II)

 

1)     Quiz 6 (Text mining)

2)     Issues in predictive text mining

3)     Case study – Recovering potentials in workers’ compensation insurance claims

 

Slides: TM-2

Readings: DMTM Chapter 2, EM_TMGS

 

Homework assignment 5 (optional, due 04/30/2013)

 

------ + ------ + ------- + ------- + ------- + -------

 

Lecture 25, Date: 04/18/2013, Thursday

 

Topic:  Neural networks (I)

 

1)     Exercise 5

2)     Principles of neural networks

3)     Applying neural networks to classification

 

Reading:  AAEM61 chapter 5

 

Online references:

1) Hyperbolic tangent:

a.     http://math.jccc.net:8180/webMathematica/MSP/mmartin/tanh.msp

b.     http://www.2dcurves.com/exponential/exponentialht.html

c.     http://en.wiktionary.org/wiki/hyperbolic_tangent

2)       Hyperbolic sine: http://en.wiktionary.org/wiki/hyperbolic_sine

3)       Hyperbolic cosine: http://en.wiktionary.org/wiki/hyperbolic_cosine
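The hyperbolic tangent is a commonly used hidden-unit activation function in neural networks. The sketch below writes it out from its exponential definition and applies it to a single hidden unit; the weights and inputs are arbitrary illustrative values, not a fitted model.

```python
import math


def tanh(x):
    """Hyperbolic tangent: sinh(x) / cosh(x), bounded between -1 and 1."""
    return (math.exp(x) - math.exp(-x)) / (math.exp(x) + math.exp(-x))


def hidden_unit(inputs, weights, bias):
    """One hidden unit: tanh of a weighted sum of the inputs plus a bias."""
    return tanh(bias + sum(w * x for w, x in zip(weights, inputs)))


for x in (-2.0, -0.5, 0.0, 0.5, 2.0):
    # The hand-written version should agree with the standard library.
    print(x, tanh(x), math.tanh(x))

# Hypothetical inputs and weights for a single hidden unit.
print(hidden_unit(inputs=[1.0, 2.0], weights=[0.5, -0.25], bias=0.1))
```

For very large |x| the exponentials overflow, so math.tanh is the safer choice in practice.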

------ + ------ + ------- + ------- + ------- + -------

 

Lecture 26, Date: 04/23/2013, Tuesday

 

Topic:  Neural networks (II)

 

1)     Principles of neural networks

2)     Applying neural networks to classification

 

Reading: AAEM61 Chapter 5

 

------ + ------ + ------- + ------- + ------- + -------