ISQS 3358 Business Intelligence

 

Instructor: Zhangxi Lin

 

------ + ------ + ------- + ------- + ------- + -------

 

Project 5, Due: 05/02, Wednesday

 

Exploratory analysis of eBay trader reputation scores

 

Please find two eBay trader reputation score datasets: S-0301.xls and B-0301.xls. You are going to use them to conduct an exploratory study and reveal a few important findings from the datasets.

 

About datasets: S-0301 contains the seller reputation scores. The so-called “sellers” could be buyers sometimes but we collected the data when the traders were selling something. There are two Datasets in the file. One is initially collected data and another is the dataset collected 6 months later using the same seller IDs. Similarly, B-0301 contains buyer reputation scores which were collected when the traders were bidding something online. The datasets were also collected with a gap of six months between, using the same IDs.

 

So you have the above four datasets and you are wondering the following questions:

 

  1. What are the distributions of the datasets? (You can see the formula used in file NORMALTEST.XLS in the shared directory. If the Wald-Stat is less than the Chi-critical value 5.99, then the normality hypothesis is accepted. For more information, please read the paper: http://129.118.51.86/zlin/pdf/DSS-reputation.pdf)
  2. What are differences between the reputation scores of sellers and buyers? (Try to use t-test to examine)
  3. Are there any relationship between the scores generated in two different periods, say, current one with the one 6-month ago? (Try regression)
  4. How previous negative reputation rate related to the later negative reputation rate? (Compare the ratio of negative and total scores in two periods. Regression is encouraged.)
  5. Anything else that may interest you.

 

To study the datasets, you may use regression, graphing, t-test, etc. with SAS EG or MS Excel. You can refer to three papers for the ideas:

1)       The Decision Support Systems paper: http://129.118.51.86/zlin/pdf/DSS-reputation.pdf - longer one

2)       The IRMA conference paper: http://zlin.ba.ttu.edu/papers/outgoing/IRMA-paper.pdf - shorter but less informative

3)       The book chapter: http://zlin.ba.ttu.edu/papers/published/14%20Lin.pdf

 

Or normality test formula, check http://zlin.ba.ttu.edu/3358/NORMALTEST.XLS

 

Deliverables:

1)       General descriptive analysis of the datasets: descriptive analysis, graphs/charts

2)       Findings – at least four findings supported by explanations, charts or tables.

3)       Discussions and comments

 

Any creative ideas and insightful findings are welcome and will be rewarded with extra credits. You need to submit at least the electronic copy, and the hardcopy of the report is a favorable one unless you have difficulty to make it.

 

FAQ:

 

The following are answers to some common questions:

 

  1. What are in the datasets?

 

In this project, there are two pairs of datasets for the reputation scores of sellers and buyers collected at two different time points with a gap of 6-month. One pairs of datasets is from the online traders when they were selling something and another is from the trader when they were buying something.

 

Online reputation system in eBay is getting very popular for business and for research as well. After an auction, both the buyer and the seller can rate each other using the services provided by Feedback Forum. A positive comment from a unique trading partner adds one point to the net reputation score, a negative comment deducts one point from the net reputation score, and a neutral comment does not affect the net reputation score. The data can be used for analyze the status of the market.

 

  1. What are the observations in the dataset?

 

They are 6-month positive scores, 6-month neutral scores, 6-month negative scores, 6-month total scores.

 

  1. What kinds of analyses can be done with the datasets?

 

1)       Test the growth of the transactions in different period

2)       Test the difference of the transactions between sellers and buyers

3)       Test the relevance of reputation scores, or negative score ratios between different periods

4)       The graph of the score distribution

5)       Histograms

6)       Differences between sellers and buyers in their negative or neutral scores

7)       Clustering the traders based on their reputation scores

8)       Others

 

  1. How to test the normality of a distribution?

 

Based on the current datasets, do the following:

1)       Sort the dataset observations from high to low

2)       If this is for lognormality test, apply logarithm function to the observations of the variable. If not the case skip this step

3)       Apply “Descriptive statistics” function of data analysis in Excel to get the statistics from the variable in the dataset

4)       Calculate the Wald-statistics value -- Wald-stat = degree_of_freedom * (skewness2 / 6 + kurtosis2 / 24). You can make use of the formula in http://zlin.ba.ttu.edu/3358/NORMALTEST.XLS

5)       If the Wald-stat is less than the chi-critical value 5.99, then the hypothesis of normality is accepted, otherwise, the hypothesis is rejected.

 

  1. How to compare the means of a variable in two datasets?

 

You may use t-test which can be done using the Data Analysis function in Excel.

 

Note: BOTH Enterprise Guide and Miner are available on Tech main library’s computers

 

 

------ + ------ + ------- + ------- + ------- + -------

 

Project 4, Due: 04/20, Friday

 

Internet log data analysis

 

I. Web Log: Find the dataset commrex.com-access_log-9711.txt in the shared directory. This is a web log file with 15517 web transaction records happening in November 1997 on http://www.commrex.com, hosted by IMW (http://www.inetworks.com). Fulfill the following tasks:

 

  1. Convert the file into SAS dataset. You need to use Excel to split the columns. Then use Base SAS to do the conversion. The format of the dataset is: “User”, “Date”, “Time”, “Method”, “Page”, “Status”, “Size”. You need “User”, “Date”, “Time”, and “Page”. You may remove other columns. See the sample file.
  2. Combine “User” and “Date” as the session ID; use “Time” as sequence; use “Page” as target. Then do association analysis. Report the outcomes.
  3. Use Tabular Summary function of SAS Enterprise Guide to produce the summary table with rows as “User”, columns as “Date”, and the cells as frequencies visiting different web pages. Export the table to a SAS dataset.

 

II. Online Database Log: Find the dataset DATALOG17_1.SAS7BDAT in the shared directory. This is the one from the database log #5 file. Fulfill the following tasks:

1.       Use Tabular Summary function of SAS Enterprise Guide to produce the summary table with rows as “ID” (User Id and date are in the same column in the dataset), columns as “Actions”, and the cells as frequencies doing different actions.

2.       Export the table to a SAS dataset for clustering. You need to perform the following tasks in order to obtain a workable file (any other alternative approach is encouraged):

a.       Copy the html page to an Excel file

b.       Remove the “Total” column and row, as well as other useless information.

3.       Use the above outcomes in the Excel file to perform clustering. Report the outcomes.

 

Deliverables:

1.       A cover page with necessary information

2.       A summary of the dataset conversion for any problems and how you solve the problems (1 page)

3.       Report the main findings from Association analysis with supporting evidence

4.       The tabular summary reports from each of the datasets (one sample page for each is enough)

5.       Report the main findings from Clustering analysis with supporting evidence

6.       Anything else that could add values to this project

 

------ + ------ + ------- + ------- + ------- + -------

 

Project 3, Due: 04/13, Friday

 

BI/DM Class Survey data comparison

 

 

I. Based on the answer for Problem 1 of Project 1 “Find a piece of news, short video (less than 10 minutes), or an online story, which is a case of business intelligence. Provide the url with a short comment on it, explaining why and how it is related to BI.”, provide further comments from one or two of the following perspectives (referring to the textbook by David Loshin):

  1. Data quality
  2. Information integration
  3. Data mining
  4. Others if any.

 

    Deliverables:

  1. Copy and paste the previous answers in Project 1.
  2. A short paragraph for the new comments

 

II. Use s3358 and s6347 SAS datasets in the shared directory to conduct the following decision tree analysis with SAS Enterprise Miner:

 

  1. Create the decision trees from the two datasets with the target variable respectively: VoIP, Math3
  2. Study and compare the decision trees. Identify the differences between these two trees and provide necessary explanations. Please retrieve the rules for the desired results from each decision tree. There are four decision trees.
  3. Draw confusion matrices for the four rules. Calculate coverage rate, accuracy rate, and lift for each of the rules.
  4. Write down a paragraph of conclusive comments on the results.

 

Note: CPM = Critical path method.

 

   Deliverables:

  1. A cover page with necessary information
  2. Descriptions of the parameter configurations for the decision trees (1 page)
  3. Decision tree diagrams
  4. Rules
  5. Confusion matrices and the values of coverage rate, accuracy rate, and lift for each of the rules.
  6. A paragraph of conclusive comments
  7. Anything else that could add values to this project

 

The above are compiled into a Word file and mailed to zhangxi.lin@gmail.com.

 

------ + ------ + ------- + ------- + ------- + -------

 

Project 2, Due: 04/04, Wednesday

 

Cotton/cottonseed data query

 

The Excel files containing cotton/cottonseed dataset is available in the shared directory. The following are tasks to be fulfilled:

1)       Study the dataset “Cotton.xls” carefully. The main dataset is in “data” table. You also need to use “location” table”. Another two tables are optional. You will notice that the main dataset has missed some value but this does not have major effect on this project.

2)       Convert the “Data” table into SAS format. If you get problems using SAS Enterprise Guide, you may try to use base SAS (main platform) to do the task.

3)       Use “Data” and “Location” table to implement the following parameterized query (See Section 6.2, SAS EG Course Notes)

a.       “State” selection

b.       “Location” (“Loc” in “Data” table) selection

c.       “Variety” selection (multiple selections are allowed)

d.       Any other parameter selection

4)       Do a summary statistics based on the query outcome using a few selected variables. Draft a few sentences to explain the outcome (See Section 5.1, SAS EG Course Notes)

5)       Use “Location” as the row and “Irrigation” as the column to create a tabular summary report (See Section 5.3, SAS EG Course Notes). Write a few sentences to explain the data.

6)       Import the summary statistics outcome to an Excel file

 

Deliverables:

1)       2-3 screenshots for showing the outcomes of the query

2)       A summary statistics report and the explanation of the report

3)       A tabular summary report and the explanation of the report

4)       Compile the report with a cover page indicating: Project 2, your name, course number, completion date, etc. and submit the report.

5)       Email the report file and the EG project file to the instructor

 

Note: Two groups will be randomly chosen to demonstrate their projects in the class.

 

------ + ------ + ------- + ------- + ------- + -------

 

Project 1, Due: 03/26, Monday

 

Database log file analysis

 

  1. Find a piece of news, short video (less than 10 minutes), or an online story, which is a case of business intelligence. Provide the url with a short comment on it, explaining why and how it is related to BI. The comment should not exceed 200 words.
  2. Right   Download dataset “data.log.5” in the shared server under \ISQS3358.

1)       Preprocess the dataset to a computable format

2)       Convert the dataset into SAS format

3)       Use at least four different functions of Enterprise Guide in “Describe” pull-down menu to explore the dataset

4)       Identify 4-6 findings from the dataset that you believe valuable for decision making. Write a couple of sentences for each finding to explain the statistics data.

5)       The submissions of the report includes:

a.       The findings

b.       At least one one-way frequency table with a short explanation

c.       At least one two-way frequency table with a short explanation

6)       Staple the report with a cover page indicating: Project 1, your name, course number, completion date, etc.

 

------ + ------ + ------- + ------- + ------- + -------