ISQS 6347 Data & Text Mining Lecture Outlines

 

SAS Course Notes                                                                                    

AAEM: Applied Analytics Using SAS® Enterprise Miner™

EM_GS: Getting Started with SAS® Enterprise Miner

EM_TMGS: Getting Started with SAS® 9.1 Text Miner

CSA: Data Mining - A Case Study Approach

DMTM: Text Mining Using SAS® Software

ADMT: Applying Data Mining Techniques Using Enterprise Miner

CCWEB: Effective Web Mining: Attracting and Keeping Valued Cyber Consumers

 

Optional Textbooks

 

SPB: Data Mining for Business Intelligence, Galit Shmueli, Nitin R. Patel, Peter C. Bruce

TSK: Introduction to Data Mining, Pang-Ning Tan, Michael Steinbach, and Vipin Kumar

RG: Data Mining – A Tutorial Based Primer, Richard Roiger, Michael Geatz

 

------ + ------ + ------- + ------- + ------- + -------

 


 

Lecture 1, Date: 01/15/2015, Thursday

 

Topic: Introduction to Data Mining

 

Reading: SPB/RG/TSK Chapter 1

 

References:

1.     SAS resources for instructors and students

2.     SAS courses

3.     Data mining tutorials

4.     ZenTut tutorials

 

------ + ------ + ------- + ------- + ------- + -------

 

Lecture 2, Date: 01/20/2015, Tuesday

 

Topic: Data mining fundamentals

 

Reading: SPB/RG/TSK Chapter 2

 

1)     Illustrative cases

2)     Confusion matrix
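
As a minimal sketch of the confusion matrix listed above, the following Python snippet tallies actual versus predicted labels for a binary target. The labels and the function name confusion_matrix are invented for illustration and do not come from the course materials.

```python
def confusion_matrix(actual, predicted, positive=1):
    """Count true/false positives and negatives for a binary target."""
    tp = fp = fn = tn = 0
    for a, p in zip(actual, predicted):
        if p == positive and a == positive:
            tp += 1          # predicted positive, actually positive
        elif p == positive:
            fp += 1          # predicted positive, actually negative
        elif a == positive:
            fn += 1          # predicted negative, actually positive
        else:
            tn += 1          # predicted negative, actually negative
    return {"TP": tp, "FP": fp, "FN": fn, "TN": tn}


# Hypothetical labels: 1 = responder, 0 = non-responder
actual = [1, 0, 1, 1, 0, 0, 1, 0]
predicted = [1, 0, 0, 1, 0, 1, 1, 0]

cm = confusion_matrix(actual, predicted)
accuracy = (cm["TP"] + cm["TN"]) / len(actual)
print(cm, accuracy)
```

Accuracy is only one summary of the four cells; misclassification rate is simply 1 minus accuracy.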

 

Terminology: predictor, observation, confidence, dependent variable, estimation, response, score, supervised learning, unsupervised learning

 

Readings and review questions:

1)     What is data mining?

2)     What is a decision tree?

3)     What is a confusion matrix?

4)     Getting Started with SAS EM 5.3/6.1

5)     AAEM Chapter 1

6)     Data mining case: Scandinavian Airlines Modernize Business Intelligence Capabilities 

 

Homework 1 (due 02/03/2015, Tuesday)

 

------ + ------ + ------- + ------- + ------- + -------

 

Lecture 3, Date: 01/22/2015, Thursday

 

Topic: Data mining fundamentals

 

1)     Introduction to SAS Enterprise Miner

2)     Data for data mining

3)     Data exploration

4)     Data preprocessing

 

Readings:

1)     Getting Started with SAS EM 4.3/5.3/6.1

2)     AAEM Chapter 2

 

Review questions

·         Four different types of attributes of data, their properties

·         Different types of records and data

·         Data quality: what it is, how to ensure it

 

SAS EM Exercises:

Explore the features of Citrix SAS Enterprise Miner 6.1

 

------ + ------ + ------- + ------- + ------- + -------

 

Lecture 4, Date: 01/27/2015, Tuesday

 

Topic: Classification modeling (I)

 

1)     Decision tree modeling

2)     Determining the best split – using the Gini index, entropy, or misclassification error

3)     Determining when to stop splitting
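
The three split measures named above can be sketched in a few lines of Python. This is an illustrative sketch only: the class counts and the candidate split are invented, and the helper names are not taken from TSK or SAS Enterprise Miner.

```python
import math


def gini(counts):
    """Gini index of a node, given its class counts."""
    n = sum(counts)
    return 1.0 - sum((c / n) ** 2 for c in counts)


def entropy(counts):
    """Entropy (in bits) of a node; empty classes contribute zero."""
    n = sum(counts)
    return -sum((c / n) * math.log2(c / n) for c in counts if c > 0)


def misclassification_error(counts):
    """Error rate if the node predicts its majority class."""
    n = sum(counts)
    return 1.0 - max(counts) / n


def weighted_impurity(children, measure):
    """Impurity of a split: child impurities weighted by child size."""
    total = sum(sum(child) for child in children)
    return sum(sum(child) / total * measure(child) for child in children)


# Hypothetical node with 10 cases per class, split into two children
parent = [10, 10]
split = [[8, 2], [2, 8]]

gain = gini(parent) - weighted_impurity(split, gini)
print(gini(parent), entropy(parent), gain)
```

A candidate split is preferred when it reduces the weighted impurity the most; the same comparison can be run with entropy or misclassification_error in place of gini.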

 

Readings:

1)     TSK chapter 4

2)     Getting Started with SAS EM, Chapter 7

3)     AAEM Chapter 2

 

Exercises:

Exercises of AAEM: p2-14, p2-34, p2-62

 

Demo: Define a project, data exploration

------ + ------ + ------- + ------- + ------- + -------

 

Lecture 5, Date: 01/29/2015, Thursday

 

Topic: Classification modeling (II)

 

1)     Data mining demonstration using SAS Enterprise Miner 13.2 – the case of Exercise 1

2)     Exercise 1

 

Review readings and questions:

1)     TSK chapter 4 (It provides extensive information about decision tree modeling.)

2)     AAEM Chapter 3 (self-study 3.4 and 3.5)

3)     What is overfitting in the decision tree approach? How to prevent overfitting?

4)     How to decide to stop splitting a tree?

5)     How to evaluate the performance of a model?

 

Online references:

1)     Statistical hypothesis testing: http://en.wikipedia.org/wiki/Statistical_hypothesis_testing

2)     Null hypothesis: http://en.wikipedia.org/wiki/Null_hypothesis

3)     Alternative hypothesis: http://en.wikipedia.org/wiki/Alternative_hypothesis

4)     p-value (wiki): http://en.wikipedia.org/wiki/P-value

5)     What is p-value: http://www.childrens-mercy.org/stats/definitions/pvalue.htm

6)     Likelihood ratios in diagnostic testing: http://en.wikipedia.org/wiki/Likelihood_ratios_in_diagnostic_testing

 

Homework 2 (due 02/17/2015 Tuesday):

 

Submission: Hardcopy.

 

------ + ------ + ------- + ------- + ------- + -------

 

Lecture 6, Date: 02/03/2015, Tuesday

 

Topic: Decision Tree (I)

 

1)     Quiz 1

2)     Three types of predictive modeling

3)     Dataset PVA97NK

4)     Classification modeling performance evaluation


Review readings:

1)     TSK chapter 4 (It provides extensive information about decision tree modeling.)

2)     AAEM Chapter 3 (self-study 3.4 and 3.5)

 

Review questions:

1)     Three prediction types in decision tree: decisions, rankings, and estimates (See p.3-70 of AAEM61)

2)     Why do we need to split a dataset into training, validation and test datasets? What are the different purposes of using validation and test datasets?

3)     What is Prior Probability? What is its relationship with the sample probability? How to define it in SAS EM?

4)     What is stratification? Why do we need it? How can you set stratification parameters in SAS EM?

 

------ + ------ + ------- + ------- + ------- + -------

 

Lecture 7, Date: 02/05/2015, Thursday

 

Topic: Decision Tree (II)

 

1)     Working with dataset PVA97NK

2)     Understand the results of data mining

3)     Tuning up the decision tree model

 

Review readings and questions:

1)     AAEM61 Chapter 3

2)     Concordance vs. discordance (See p.3-73 of AAEM61)

3)     Complexity optimization

1.     Decisions: accuracy/misclassification (not weighted), profit/loss (weighted)

2.     Rankings: concordance/discordance

3.     Estimates: squared errors

4)     What are oversampling and undersampling respectively?

 

------ + ------ + ------- + ------- + ------- + -------

 

Lecture 8, Date: 02/10/2015, Tuesday

 

Topic: Classification Model Assessment

 

1)     Model fit statistics

2)     Statistical graphics – ROC Chart and Response Chart

 

Reading:

1)     AAEM61 Chapter 3, p.3-69 to 3-76

2)     AAEM61 Chapter 6

 

Review questions:

1)     How to draw an ROC chart

2)     Confusion matrix revisited
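
For the first review question above, one way to draw an ROC chart by hand is to sweep a cutoff over the model scores and compute a confusion matrix at each cutoff. The Python sketch below does this for a binary target; the scores and labels are invented, and the function names are illustrative assumptions.

```python
def roc_points(actual, scores):
    """Return (false positive rate, true positive rate) pairs,
    one per distinct score cutoff, from strictest to loosest."""
    pos = sum(actual)
    neg = len(actual) - pos
    points = [(0.0, 0.0)]  # cutoff above every score: nothing flagged
    for cutoff in sorted(set(scores), reverse=True):
        tp = sum(1 for a, s in zip(actual, scores) if s >= cutoff and a == 1)
        fp = sum(1 for a, s in zip(actual, scores) if s >= cutoff and a == 0)
        points.append((fp / neg, tp / pos))
    return points


def auc(points):
    """Area under the ROC curve by the trapezoidal rule."""
    return sum((x2 - x1) * (y1 + y2) / 2
               for (x1, y1), (x2, y2) in zip(points, points[1:]))


# Hypothetical data: 1 = primary outcome, 0 = secondary outcome
actual = [1, 1, 0, 1, 0, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.2]

points = roc_points(actual, scores)
print(points)
print(auc(points))
```

Plotting the returned pairs with the false positive rate on the horizontal axis and the true positive rate on the vertical axis gives the chart; a curve closer to the top-left corner indicates better separation.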

 

Online references:

1)     Cumulative Lift: http://www.information-management.com/news/5329-1.html

2)     Schwarz Bayesian Criterion (SBC): http://en.wikipedia.org/wiki/Bayesian_information_criterion

3)     Kolmogorov-Smirnov test (K-S test): http://en.wikipedia.org/wiki/Kolmogorov%E2%80%93Smirnov_test

4)     Schwarz-Bayesian Criterion (SBC): http://www.associatedcontent.com/article/2175099/the_schwarz_bayesian_criterion_and.html

5)     Maximum likelihood: http://en.wikipedia.org/wiki/Maximum_likelihood

6)     Receiver Operating Characteristics Chart: http://www.predixionsoftware.com/predixion/help/Insight_Analytics/Viewer_v2_topics/Accuracy_Charts/ROC_Chart.htm

7)     ROC Chart video: part 1 (9’41”), part 2 (3’28”).

 

------ + ------ + ------- + ------- + ------- + -------

 

Lecture 9, Date: 02/12/2015, Thursday

 

Topic:  Classification Model Assessment & Logistic Regression

 

1)     Adjusting for separate sampling

2)     Profit matrices

3)     Model evaluations

4)     Introduction to logistic regression

5)     Exercise 2
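
The first two topics above can be illustrated with a short Python sketch. It uses a common form of the correction for separate sampling (rescaling a posterior probability by the ratio of population prior to sample proportion) and a simple expected-profit calculation. The 5% prior, the 50/50 training sample, and the profit values are invented numbers, and the function names are illustrative; how SAS Enterprise Miner applies these settings internally is not reproduced here.

```python
def adjust_posterior(p, prior1, sample1):
    """Rescale a posterior probability estimated on a separately sampled
    training set so that it reflects the population prior.

    p       -- model probability of the primary outcome (from the sample)
    prior1  -- proportion of the primary outcome in the population
    sample1 -- proportion of the primary outcome in the training sample
    """
    w1 = prior1 / sample1
    w0 = (1 - prior1) / (1 - sample1)
    return p * w1 / (p * w1 + (1 - p) * w0)


def expected_profit(p, profit_if_primary, profit_if_secondary):
    """Expected profit of acting on a case with primary-outcome probability p."""
    return p * profit_if_primary + (1 - p) * profit_if_secondary


# Hypothetical setting: balanced sample, 5% primary outcome in the population
p_adjusted = adjust_posterior(0.5, prior1=0.05, sample1=0.5)

# Hypothetical profit matrix row for the "act" decision
profit = expected_profit(p_adjusted, profit_if_primary=14.0, profit_if_secondary=-1.0)
print(p_adjusted, profit)
```

With these made-up numbers, a case scored at 0.5 on the balanced sample corresponds to a probability of about 0.05 in the population, and acting on it has a negative expected profit.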

 

Homework 3 (due 03/03/2015, Tuesday)

 

Online references:

1)     Logistic regression (Wikipedia.org): http://en.wikipedia.org/wiki/Logistic_regression

2)     An introduction to logistic regression: http://luna.cas.usf.edu/~mbrannic/files/regression/Logistic.html

3)     Logistic regression vs. OLS regression: http://www.upa.pdx.edu/IOA/newsom/da2/ho_logistic.pdf

4)     Binary vs. multinomial logistic regression: http://faculty.chass.ncsu.edu/garson/PA765/logistic.htm

5)     Odds in logistic regressions: http://www.jerrydallal.com/LHSP/logistic.htm

6)     SAS tutorial: http://support.sas.com/documentation/cdl/en/anlystug/58352/HTML/default/chap11_sect4.htm

 

------ + ------ + ------- + ------- + ------- + -------

 

Lecture 10, Date: 02/17/2015, Tuesday

 

Topic: Logistic Regression (I)

 

1)     Quiz 2

2)     Selecting Regression Inputs

3)     Optimizing Regression Complexity

4)     Transforming Inputs

5)     Exercise 2

 

Review readings and questions:

1)     AAEM61 Chapter 4

2)     SPB chapter 7 & 8, RG chapter 10 (pp 291-302)

3)     Questions

a.     Decision tree vs. logistic regression – which one is better for constructing classification models?

b.     What is an odds ratio? Could you interpret the significance of an input variable in a logistic regression?

c.     Itemize the main points of focus when reviewing logistic regression results.

 

Online references:

1)     Chi-square significance test : http://faculty.chass.ncsu.edu/garson/PA765/chisq.htm

2)     Odds = p / (1 – p), and odds against = 1 / odds: http://en.wikipedia.org/wiki/Odds

3)     Odds ratio : http://en.wikipedia.org/wiki/Odds_ratio

4)     Type 3 analysis of effect (testing the significance after adding a new input variable to the model which already had other inputs.): http://www.technion.ac.il/docs/sas/stat/chap29/sect31.htm

5)     SAS data analysis (addressing Type 3 analysis): http://www.ats.ucla.edu/stat/sas/dae/intreg.htm
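
The odds definitions in the references above (odds = p / (1 – p), odds against = 1 / odds) and their link to logistic regression can be checked numerically. The Python sketch below is illustrative: the intercept and slope are invented coefficients, not results from any course dataset.

```python
import math


def odds(p):
    """Odds in favor of an event with probability p."""
    return p / (1 - p)


def odds_ratio(p1, p0):
    """Ratio of the odds at probability p1 to the odds at probability p0."""
    return odds(p1) / odds(p0)


def logistic(x, intercept, slope):
    """Predicted probability from a one-input logistic regression."""
    return 1 / (1 + math.exp(-(intercept + slope * x)))


# Hypothetical fitted coefficients
intercept, slope = -1.0, 0.5

p0 = logistic(0, intercept, slope)   # probability at x = 0
p1 = logistic(1, intercept, slope)   # probability at x = 1

print(odds(0.8), 1 / odds(0.8))              # odds in favor and odds against
print(odds_ratio(p1, p0), math.exp(slope))   # these two values agree
```

The last line shows why exp(coefficient) is read as the odds ratio for a one-unit increase in an input: the ratio of odds at x = 1 and x = 0 equals exp(slope) regardless of the intercept.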

 

------ + ------ + ------- + ------- + ------- + -------

 

Lecture 11, Date: 02/19/2015, Thursday

 

Topic: Model Implementation

 

1)     Internally scored data

2)     Score data modules

3)     Exercise 2 (continued)

 

Review readings: AAEM61 Chapters 2, 3, 4, 6

1)     AAEM61 Chapter 7

2)     http://www.kdnuggets.com/polls/2008/using-PMML-to-deploy-data-mining.htm

3)     PMML http://en.wikipedia.org/wiki/Predictive_Model_Markup_Language

 

------ + ------ + ------- + ------- + ------- + -------

 

Lecture 12, Date: 02/24/2015, Tuesday

 

Topic: Neural Network Classification Modeling

 

1)     Principles of neural network

2)     Applying neural network for classification

 

Reading:  AAEM61 chapter 5

 

Online references:

1) Hyperbolic tangent:

a.     http://math.jccc.net:8180/webMathematica/MSP/mmartin/tanh.msp

b.     http://www.2dcurves.com/exponential/exponentialht.html

c.     http://en.wiktionary.org/wiki/hyperbolic_tangent

2)       Hyperbolic sine: http://en.wiktionary.org/wiki/hyperbolic_sine

3)       Hyperbolic cosine: http://en.wiktionary.org/wiki/hyperbolic_cosine
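
The hyperbolic functions referenced above are related by tanh(x) = sinh(x) / cosh(x), and tanh is commonly used as the activation function of a hidden unit in a neural network. The Python sketch below checks the identity and evaluates one hidden unit; the weights and inputs are invented for illustration.

```python
import math


def tanh_from_exp(x):
    """Hyperbolic tangent written out with exponentials."""
    return (math.exp(x) - math.exp(-x)) / (math.exp(x) + math.exp(-x))


def hidden_unit(inputs, weights, bias):
    """One hidden unit: tanh of a weighted sum of the inputs plus a bias."""
    return math.tanh(bias + sum(w * x for w, x in zip(weights, inputs)))


for x in (-2.0, 0.0, 0.5, 2.0):
    # The three columns agree: exponential form, sinh/cosh, and math.tanh
    print(x, tanh_from_exp(x), math.sinh(x) / math.cosh(x), math.tanh(x))

# Hypothetical weights and inputs
h = hidden_unit(inputs=[1.2, -0.7], weights=[0.8, 0.3], bias=-0.1)
print(h)
```

Because tanh maps any real number into the interval (-1, 1), the hidden unit output stays bounded no matter how large the weighted sum becomes.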

 

------ + ------ + ------- + ------- + ------- + -------

 

Lecture 13, Date: 02/26/2015, Thursday

 

Topic: Classification modeling review

 

Reading:  AAEM61 chapters 2-7

 

Homework 4 (due 03/31/2015, Tuesday)

 

Review questions:

1)  What are the main differences between clustering and classification in data mining?

2)  Check the HMEQ dataset. What would the outcomes be if you clustered it?

3) Use the clustering worksheet (http://zlin.ba.ttu.edu/6347/Clustering.xls) to explore different outcomes of clustering. Modify the coordinates of the instances to obtain different datasets and check the outcomes. Selectively record the k-means clustering iterations for 2 different sets of instances, including the illustrative charts.

 

------ + ------ + ------- + ------- + ------- + -------

 

Lecture 14, Date: 03/03/2015, Tuesday

 

Topic: Principles of Clustering

 

1)     Concepts of pattern discovery

2)     Distance measurements

3)     Basic concepts

4)     K-means methods
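
The distance measurement and k-means topics above can be sketched together in plain Python. This is a minimal illustration that assumes Euclidean distance and hand-picked starting centroids; the points are invented, and the code is not the algorithm as implemented in SAS Enterprise Miner.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two points of equal dimension."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def k_means(points, initial_centroids, max_iter=100):
    """Alternate assignment and centroid update until assignments stop changing."""
    centroids = list(initial_centroids)
    assignment = None
    for _ in range(max_iter):
        # Assignment step: each point goes to its nearest centroid
        new_assignment = [
            min(range(len(centroids)), key=lambda k: euclidean(p, centroids[k]))
            for p in points
        ]
        if new_assignment == assignment:
            break  # converged
        assignment = new_assignment
        # Update step: move each centroid to the mean of its members
        for k in range(len(centroids)):
            members = [p for p, a in zip(points, assignment) if a == k]
            if members:
                centroids[k] = tuple(sum(c) / len(members) for c in zip(*members))
    return centroids, assignment


# Hypothetical 2-D instances forming two visible groups
points = [(1, 1), (1, 2), (2, 1), (8, 8), (9, 8), (8, 9)]
centroids, assignment = k_means(points, initial_centroids=[(1, 1), (2, 1)])
print(centroids, assignment)
```

Changing the starting centroids or the coordinates and rerunning shows how the iterations and the final clusters can differ from one dataset to another.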

 

Online instructions: http://zlin.ba.ttu.edu/6347/PatternDiscovery-Clustering2013.htm

 

Review reading:

1)     TSK chapter 8

2)     AAEM61 chapter 8

3)     Clustering: An Introduction, http://www.elet.polimi.it/upload/matteucc/Clustering/tutorial_html/index.html

4)     K-means clustering tutorial, http://people.revoledu.com/kardi/tutorial/kMean/index.html

 

------ + ------ + ------- + ------- + ------- + -------

 

Lecture 15, Date: 03/05/2015, Thursday

 

Topic: Principles of Association Analysis

 

1)     Association analysis

·         Concepts

·         Market basket analysis (AAEM61 Section 8.3)

·         Performance assessment and contingency table

2)     Theoretic issues in pattern discovery (optional)

·         Itemset generation

·         Association rule discovery
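
The standard measures used to assess association rules in market basket analysis (support, confidence, and lift) can be computed directly from a list of transactions. The Python sketch below uses an invented set of five baskets; the item names and function names are illustrative only.

```python
def support(transactions, itemset):
    """Fraction of transactions that contain every item in itemset."""
    itemset = set(itemset)
    return sum(1 for t in transactions if itemset <= set(t)) / len(transactions)


def confidence(transactions, lhs, rhs):
    """Estimated probability of rhs given lhs, for the rule lhs => rhs."""
    return support(transactions, set(lhs) | set(rhs)) / support(transactions, lhs)


def lift(transactions, lhs, rhs):
    """Confidence of lhs => rhs relative to the baseline support of rhs."""
    return confidence(transactions, lhs, rhs) / support(transactions, rhs)


# Hypothetical market baskets
transactions = [
    {"bread", "milk"},
    {"bread", "diapers", "beer"},
    {"milk", "diapers", "beer"},
    {"bread", "milk", "diapers"},
    {"bread", "milk", "beer"},
]

s = support(transactions, {"diapers", "beer"})
c = confidence(transactions, {"diapers"}, {"beer"})
l = lift(transactions, {"diapers"}, {"beer"})
print(s, c, l)
```

A lift above 1 suggests the two sides of the rule occur together more often than they would if they were independent; a lift near 1 suggests the rule adds little.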

 

Review reading:

1)     TSK chapter 6

2)     AAEM61 chapter 8

 

------ + ------ + ------- + ------- + ------- + -------

 

Lecture 16, Date: 03/10/2015, Tuesday

 

Topic: Pattern Discovery Case Study – Clustering

 

See: http://zlin.ba.ttu.edu/6347/PatternDiscovery-Clustering2013-2.htm

 

------ + ------ + ------- + ------- + ------- + -------

 

Lecture 17, Date: 03/12/2015, Thursday

 

Topic: Pattern Discovery Case Study – Association Analysis

 

See: http://zlin.ba.ttu.edu/6347/PatternDiscovery-AA2013.htm

 

------ + ------ + ------- + ------- + ------- + -------

 

Spring Break


------ + ------ + ------- + ------- + ------- + -------

 

The following are to be revised.

 

 

 

 

 

 

------ + ------ + ------- + ------- + ------- + -------

 

Lecture 18, Date: 03/24/2015, Tuesday

 

Topic: Pattern Discovery Review

 

------ + ------ + ------- + ------- + ------- + -------

 

Lecture 19, Date: 03/26/2015, Thursday

 

Topic:  An Introduction to Text Mining (I)

 

1)       What is text mining

2)       Case study – SVDTUTOR

 

Slides: TM-1

Dataset: SVDTUTOR, Fedpapers

 

Readings: DMTM 1.1

 

Online references:

 

Homework assignment 5 (optional, due 04/14/2015)

 

------ + ------ + ------- + ------- + ------- + -------

 

Lecture 20, Date: 03/31/2015, Tuesday

 

Topic:  An Introduction to Text Mining (II)

 

1)     Review Exercise 2

2)     Fedpapers case

3)     Textual data preparation

 

Slides: TM-1

Dataset: SVDTUTOR, Fedpapers

 

Readings: DMTM 1.1-1.2

 

------ + ------ + ------- + ------- + ------- + -------

 

Lecture 21, Date: 04/02/2015, Thursday

 

Topic: Principles of Text Mining (I)

 

1)     Exercise 4

2)     Data preparation: Synonym, Stop/Start list, Part of Speech, Stemming

3)     Data conversion

 

Slides: TM-1

Readings: DMTM 1.3-1.4

 

Online materials for Singular Value Decomposition (SVD):

1)     Basics of Matrix: http://www.xycoon.com/matrix_algebra.htm

2)     http://mathworld.wolfram.com/SingularValueDecomposition.html

3)     http://www.uwlax.edu/faculty/will/svd/

4)     http://web.mit.edu/be.400/www/SVD/Singular_Value_Decomposition.htm

5)     http://kwon3d.com/theory/jkinem/svd.html

 

 

------ + ------ + ------- + ------- + ------- + -------

 

Lecture 22, Date: 04/07/2015, Tuesday

 

Topic:  Principles of Text Mining (II)

 

1)     Term-document matrix

2)     Singular value decomposition (SVD)
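
To make the two topics above concrete, the Python sketch below builds a small term-document matrix with a stop list and then estimates its largest singular value by power iteration. The documents and stop list are invented. Power iteration is used here only to keep the example dependency-free; it is an assumption of this sketch, not the method used by SAS Text Miner, and it recovers only the leading singular value rather than a full SVD.

```python
import math


def term_document_matrix(docs, stop_words=()):
    """Rows are terms, columns are documents, entries are raw term counts."""
    tokenized = [[w for w in d.lower().split() if w not in stop_words]
                 for d in docs]
    terms = sorted({w for doc in tokenized for w in doc})
    matrix = [[doc.count(t) for doc in tokenized] for t in terms]
    return terms, matrix


def leading_singular_value(a, iterations=200):
    """Largest singular value of a nonzero matrix, by power iteration on A^T A."""
    n_rows, n_cols = len(a), len(a[0])
    v = [1.0] * n_cols
    for _ in range(iterations):
        av = [sum(a[i][j] * v[j] for j in range(n_cols)) for i in range(n_rows)]
        w = [sum(a[i][j] * av[i] for i in range(n_rows)) for j in range(n_cols)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]          # keep v at unit length
    av = [sum(a[i][j] * v[j] for j in range(n_cols)) for i in range(n_rows)]
    return math.sqrt(sum(x * x for x in av))  # |A v| for the unit vector v


# Hypothetical mini-corpus and stop list
docs = ["the cat sat", "the dog sat", "the cat ran"]
terms, matrix = term_document_matrix(docs, stop_words={"the"})
print(terms)
print(matrix)
print(leading_singular_value(matrix))
```

In text mining the full decomposition is used to project documents onto a small number of dimensions; the leading singular value shown here corresponds to the first of those dimensions.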

 

Slides: TM-1&2

Readings: DMTM 1.2-1.4

 

 

 

------ + ------ + ------- + ------- + ------- + -------

 

Lecture 23, Date: 04/09/2015, Thursday

 

Topic: Exploratory analysis of documents (I)

 

1)     Case study – SAS courses

2)     Exercise 4

 

Slides: TM-2

Readings: DMTM Chapter 2

                             

------ + ------ + ------- + ------- + ------- + -------

 

Lecture 24, Date: 04/14/2015, Tuesday

 

Topic:  Exploratory analysis of documents (II)

 

1)     Issues in predictive text mining

2)     Case study – Recovering potentials in workers’ compensation insurance claims

 

Slides: TM-2

Readings: DMTM Chapter 2, EM_TMGS

 

Homework assignment 5 (optional, due 04/28/2015)

 

------ + ------ + ------- + ------- + ------- + -------

 

Lecture 25, Date: 04/16/2015, Thursday

 

------ + ------ + ------- + ------- + ------- + -------

 

Lecture 26, Date: 04/21/2015, Tuesday

 

 

------ + ------ + ------- + ------- + ------- + -------

 

To be continued