ISQS 3358 Business Intelligence
Instructor:
------ +
------ + ------- + ------- + ------- + -------
Project 5, Due: 05/02, Wednesday
Exploratory analysis of eBay trader
reputation scores
Please find
two eBay trader reputation score datasets: S-0301.xls and B-0301.xls. You are
going to use them to conduct an exploratory study and reveal a few important
findings from the datasets.
About
datasets: S-0301 contains the seller reputation scores. The so-called “sellers”
could be buyers sometimes but we collected the data when the traders were
selling something. There are two Datasets in the file. One is initially
collected data and another is the dataset collected 6 months later using the same
seller IDs. Similarly, B-0301 contains buyer reputation scores which were
collected when the traders were bidding something online. The datasets were
also collected with a gap of six months between, using the same IDs.
So you have
the above four datasets and you are wondering the following questions:
To study
the datasets, you may use regression, graphing, t-test, etc. with SAS EG or MS
Excel. You can refer to three papers for the ideas:
1) The Decision Support Systems paper: http://129.118.51.86/zlin/pdf/DSS-reputation.pdf
- longer one
2) The IRMA conference paper: http://zlin.ba.ttu.edu/papers/outgoing/IRMA-paper.pdf
- shorter but less informative
3) The book chapter: http://zlin.ba.ttu.edu/papers/published/14%20Lin.pdf
Or
normality test formula, check http://zlin.ba.ttu.edu/3358/NORMALTEST.XLS
Deliverables:
1) General descriptive analysis of the
datasets: descriptive analysis, graphs/charts
2) Findings – at least four findings
supported by explanations, charts or tables.
3) Discussions and comments
Any
creative ideas and insightful findings are welcome and will be rewarded with
extra credits. You need to submit at least the electronic copy, and the
hardcopy of the report is a favorable one unless you have difficulty to make
it.
FAQ:
The
following are answers to some common questions:
In this
project, there are two pairs of datasets for the reputation scores of sellers
and buyers collected at two different time points with a gap of 6-month. One pairs
of datasets is from the online traders when they were selling something and
another is from the trader when they were buying something.
Online
reputation system in eBay is getting very popular for business and for research
as well. After an auction, both the buyer and the seller can rate each other
using the services provided by Feedback Forum. A positive comment from a unique
trading partner adds one point to the net reputation score, a negative comment
deducts one point from the net reputation score, and a neutral comment does not
affect the net reputation score. The data can be used for analyze the status of
the market.
They are
6-month positive scores, 6-month neutral scores, 6-month negative scores,
6-month total scores.
1) Test the growth of the transactions in
different period
2) Test the difference of the transactions
between sellers and buyers
3) Test the relevance of reputation
scores, or negative score ratios between different periods
4) The graph of the score distribution
5) Histograms
6) Differences between sellers and
buyers in their negative or neutral scores
7) Clustering the traders based on
their reputation scores
8) Others
Based on
the current datasets, do the following:
1) Sort the dataset observations from
high to low
2) If this is for lognormality test,
apply logarithm function to the observations of the variable. If not the case
skip this step
3) Apply “Descriptive statistics”
function of data analysis in Excel to get the statistics from the variable in
the dataset
4) Calculate the Wald-statistics value
-- Wald-stat = degree_of_freedom * (skewness2 / 6 + kurtosis2
/ 24). You can make use of the formula in http://zlin.ba.ttu.edu/3358/NORMALTEST.XLS
5) If the Wald-stat is less than the
chi-critical value 5.99, then the hypothesis of normality is accepted,
otherwise, the hypothesis is rejected.
You may use
t-test which can be done using the Data Analysis function in Excel.
Note: BOTH
------ +
------ + ------- + ------- + ------- + -------
Project 4, Due: 04/20, Friday
Internet log data analysis
I. Web Log:
Find the dataset commrex.com-access_log-9711.txt
in the shared directory. This is a web log file with 15517 web transaction
records happening in November 1997 on http://www.commrex.com,
hosted by IMW (http://www.inetworks.com).
Fulfill the following tasks:
II. Online
Database Log: Find the dataset DATALOG17_1.SAS7BDAT in the shared directory.
This is the one from the database log #5 file. Fulfill the following tasks:
1. Use Tabular Summary function of SAS
Enterprise Guide to produce the summary table with rows as “ID” (User Id and
date are in the same column in the dataset), columns as “Actions”, and the
cells as frequencies doing different actions.
2. Export the table to a SAS dataset
for clustering. You need to perform the following tasks in order to obtain a workable
file (any other alternative approach is encouraged):
a. Copy the html page to an Excel file
b. Remove the “Total” column and row,
as well as other useless information.
3. Use the above outcomes in the Excel
file to perform clustering. Report the outcomes.
Deliverables:
1. A cover page with necessary
information
2. A summary of the dataset conversion
for any problems and how you solve the problems (1 page)
3. Report the main findings from
Association analysis with supporting evidence
4. The tabular summary reports from
each of the datasets (one sample page for each is enough)
5. Report the main findings from
Clustering analysis with supporting evidence
6. Anything else that could add values
to this project
------ + ------ + ------- + ------- + ------- + -------
Project 3, Due: 04/13,
Friday
BI/DM Class Survey
data comparison
I. Based on the answer for Problem 1 of Project 1
“Find a piece of news, short video (less than 10 minutes), or an online story,
which is a case of business intelligence. Provide the url with a short comment
on it, explaining why and how it is related to BI.”, provide further comments
from one or two of the following perspectives (referring to the textbook by
David Loshin):
Deliverables:
II. Use s3358 and s6347 SAS datasets in the shared
directory to conduct the following decision tree analysis with SAS Enterprise
Miner:
Note: CPM = Critical path method.
Deliverables:
The above are compiled into a Word file and mailed to
zhangxi.lin@gmail.com.
------ + ------ + ------- + ------- + ------- +
-------
Project 2, Due: 04/04,
Wednesday
Cotton/cottonseed data
query
The Excel files containing cotton/cottonseed dataset
is available in the shared directory. The following are tasks to be fulfilled:
1)
Study
the dataset “Cotton.xls” carefully. The main dataset is in “data” table. You
also need to use “location” table”. Another two tables are optional. You will
notice that the main dataset has missed some value but this does not have major
effect on this project.
2)
Convert
the “Data” table into SAS format. If you get problems using SAS Enterprise
Guide, you may try to use base SAS (main platform) to do the task.
3)
Use
“Data” and “Location” table to implement the following parameterized query (See Section 6.2, SAS EG Course Notes)
a.
“State”
selection
b.
“Location”
(“Loc” in “Data” table) selection
c.
“Variety”
selection (multiple selections are allowed)
d.
Any
other parameter selection
4)
Do
a summary statistics based on the query outcome using a few selected variables.
Draft a few sentences to explain the outcome (See Section 5.1, SAS EG Course
Notes)
5)
Use
“Location” as the row and “Irrigation” as the column to create a tabular
summary report (See Section 5.3, SAS EG Course Notes). Write a few sentences to
explain the data.
6)
Import
the summary statistics outcome to an Excel file
Deliverables:
1)
2-3
screenshots for showing the outcomes of the query
2)
A
summary statistics report and the explanation of the report
3)
A
tabular summary report and the explanation of the report
4)
Compile
the report with a cover page indicating: Project 2, your name, course number,
completion date, etc. and submit the report.
5)
Email
the report file and the EG project file to the instructor
Note: Two groups will be randomly chosen to
demonstrate their projects in the class.
------ + ------
+ ------- + ------- + ------- + -------
Project 1, Due: 03/26,
Monday
Database log file
analysis
1)
Preprocess
the dataset to a computable format
2)
Convert
the dataset into SAS format
3)
Use
at least four different functions of Enterprise Guide in “Describe” pull-down
menu to explore the dataset
4)
Identify
4-6 findings from the dataset that you believe valuable for decision making.
Write a couple of sentences for each finding to explain the statistics data.
5)
The
submissions of the report includes:
a.
The
findings
b.
At
least one one-way frequency table with a short explanation
c.
At
least one two-way frequency table with a short explanation
6)
Staple
the report with a cover page indicating: Project 1, your name, course number,
completion date, etc.
------ + ------ + ------- + ------- + ------- +
-------