STAT 29000

Project 1

due September 4, 2015

Put all of your solutions into one text file for your group.
For example, the file for project 1, group 3, should be called: p01g3.R and
it should be stored in the folder: /proj/gpproj15/p01g3/

Question 1

1a. The “.Rda” file format is used to store data for use with R. Use the “load” function to load review.Rda into the R environment. Use the “ls” function to discover the data frame’s name. What is the data frame’s name?
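
To illustrate how load and ls interact, here is a minimal sketch on a throwaway file. The object name toy_frame, its contents, and the temporary path are all made up for illustration; they are not the contents of review.Rda.

```r
# Toy stand-in for the real file: the name toy_frame and its values are invented.
toy_frame <- data.frame(stars = c(5, 3, 4))
rda_path <- tempfile(fileext = ".Rda")
save(toy_frame, file = rda_path)
rm(toy_frame)  # remove it so that load() has something to restore

# load() restores objects under the names they were saved with and
# invisibly returns those names as a character vector.
loaded_names <- load(rda_path)
print(loaded_names)

# ls() lists every object currently in the environment.
print(ls())
```

Capturing the return value of load is optional, but it gives the same information as comparing ls() before and after the call.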

1b. When dealing with an unfamiliar dataset, it is typically best to get a bird’s-eye view of the data first. Use the summary function to find the names of the columns.
The data frame contains 8 components (the votes component has 3 columns):


Alternatively, you can use the names function to learn this information:


You can see that votes has some embedded columns this way:


You can check the type of each vector as follows:


You can also see that review_frame is a data.frame:


and that the votes are an embedded data.frame within the review_frame:
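
The inspection steps described in 1b can be tried on a small stand-in. The sketch below assumes a toy data frame with invented values and only three components; it is not the real review_frame, and its output will differ from what you see on the actual data.

```r
# Toy stand-in for review_frame; every value here is invented.
toy <- data.frame(stars = c(5, 3, 4),
                  text = c("Great", "Okay", "Good"),
                  stringsAsFactors = FALSE)

# Assigning a data frame to a column embeds it, mimicking the votes component.
toy$votes <- data.frame(funny = c(0, 2, 1),
                        useful = c(1, 5, 0),
                        cool = c(0, 1, 1))

top_names <- names(toy)            # top-level components
vote_names <- names(toy$votes)     # columns embedded inside votes
col_classes <- sapply(toy, class)  # class of each top-level component

print(top_names)
print(vote_names)
print(col_classes)
print(class(toy))        # the whole object
print(class(toy$votes))  # the embedded component
```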


1c. How long is each of the columns?

1d. What is the average number of stars given to a review?

1e. What is the user id of the individual who wrote the review with the most “useful” votes?

1f. Assuming the funniest reviews got the most funny votes, print the text of the funniest review.
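
Questions 1e and 1f both ask for the row where some column is largest, followed by a lookup in a different column of that same row. One possible way to do this is with which.max, sketched here on invented data (the user ids, texts, and vote counts are all hypothetical).

```r
# Toy data; all values are invented for illustration.
toy <- data.frame(user_id = c("u1", "u2", "u3"),
                  text = c("Fine.", "Hilarious place!", "Very helpful staff."),
                  stringsAsFactors = FALSE)
toy$votes <- data.frame(funny = c(0, 7, 1),
                        useful = c(4, 2, 9),
                        cool = c(1, 1, 1))

# which.max() returns the index of the first maximum of a numeric vector.
most_useful_row <- which.max(toy$votes$useful)
top_user <- toy$user_id[most_useful_row]

# The same index trick pulls a value from any other column of that row.
funniest_text <- toy$text[which.max(toy$votes$funny)]

print(top_user)
print(funniest_text)
```

Note that which.max reports only the first maximum, so ties would need separate handling.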

1g. What is the distribution of the number of stars?
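
One common way to summarize a distribution of a few discrete values is the table function. This is a sketch on an invented vector of star ratings, not on the real data.

```r
# Invented star ratings for illustration.
stars <- c(5, 3, 4, 5, 1, 5, 4)

# table() counts how many times each distinct value occurs.
star_counts <- table(stars)
print(star_counts)

# prop.table() converts the counts to proportions that sum to 1.
star_props <- prop.table(star_counts)
print(star_props)
```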

Question 2

2a. Create a new variable called totalvotes, which sums the numbers of funny, useful, and cool votes.

2b. How many of the reviews received at least 160 votes?

2c. Print the user_ids of the people who wrote the ten reviews that were voted on the most.
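
The three parts of Question 2 rely on element-wise addition, counting with a logical condition, and sorting by index. A minimal sketch on invented data follows; the names total, n_at_least_8, and top2, the threshold of 8, and the choice of the top two are all hypothetical and chosen only to suit the toy values.

```r
# Invented vote counts and user ids.
votes <- data.frame(funny = c(0, 2, 1, 10),
                    useful = c(1, 5, 0, 20),
                    cool = c(0, 1, 1, 5))
user_id <- c("u1", "u2", "u3", "u4")

# Vectors of equal length add element by element.
total <- votes$funny + votes$useful + votes$cool

# Summing a logical vector counts its TRUE entries.
n_at_least_8 <- sum(total >= 8)

# order() returns the indices that would sort the vector;
# the first two indices pick out the two largest totals.
top2 <- user_id[order(total, decreasing = TRUE)[1:2]]

print(total)
print(n_at_least_8)
print(top2)
```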

Question 3

3a. Now use the “load” function to load business.Rda into the R environment. Once again, use the “ls” function to discover the data frame’s name. What is the data frame’s name?

3b. Use the names command on the data frame to find out what variables are stored in the data frame.

3c. How many unique states are a part of this data set? Hint: The factor “state” is useful here.

3d. The state closest to Purdue that is also in Yelp’s dataset is Illinois. How many businesses in Illinois are in the dataset?

3e. How many Illinois businesses have strictly more than 50 reviews?
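
Counting distinct values and counting rows that satisfy one or more conditions can be sketched as follows. The data are invented, the column name review_count is an assumption about how such a column might be named, and the two-letter state codes are an assumption about how states might be encoded; check the real data frame with names before relying on either.

```r
# Invented business data; column names and state codes are assumptions.
biz <- data.frame(state = factor(c("AZ", "IL", "AZ", "NV", "IL")),
                  review_count = c(10, 60, 75, 3, 12))

# unique() keeps one copy of each distinct value.
n_states <- length(unique(biz$state))

# Summing a logical vector counts the rows where the condition holds.
n_az <- sum(biz$state == "AZ")

# A single & combines two conditions element by element.
n_az_busy <- sum(biz$state == "AZ" & biz$review_count > 50)

print(n_states)
print(n_az)
print(n_az_busy)
```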

Question 4

4a. How many businesses are listed in Illinois?

4b. How many businesses are listed in Arizona?

4c. The review dataset and the business dataset have a single variable in common: business_id. The business dataset has the state in which the business resides, and the review dataset doesn’t. Let’s say that a state is more popular if it has more votes per business (regardless of whether the votes are high or low). Which state’s businesses are more popular by this measure (i.e., by most votes per business), Arizona or Illinois? (You will need to use both data sets for this.)

Sketch of one possible method: First identify which business_ids are in Illinois. Then use the %in% operator to identify which of the businesses in the review_frame correspond to Illinois, tally their number of reviews, and divide by the answer to 4a. Finally, repeat this process for Arizona, dividing by the answer to 4b.
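
The %in% step in the method above can be illustrated on two small invented data frames. The names biz and rev, the ids, and the state codes are hypothetical; only the matching technique carries over to the real data.

```r
# Invented stand-ins for the business and review data frames.
biz <- data.frame(business_id = c("b1", "b2", "b3"),
                  state = c("IL", "AZ", "IL"),
                  stringsAsFactors = FALSE)
rev <- data.frame(business_id = c("b1", "b1", "b2", "b3", "b2", "b2"),
                  stringsAsFactors = FALSE)

# Step 1: the ids of the businesses located in one state.
il_ids <- biz$business_id[biz$state == "IL"]

# Step 2: x %in% y is TRUE for each element of x that appears anywhere in y.
in_il <- rev$business_id %in% il_ids

# Step 3: tally the matching reviews and divide by the number of businesses.
il_reviews <- sum(in_il)
per_business <- il_reviews / length(il_ids)

print(in_il)
print(per_business)
```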

Question 5

5a. What does the function tolower do?

5b. How many of the review texts contain the word happy? (case-insensitive) Hint: it will be helpful to read about the grepl command.

5c. How many of the review texts contain the word good? (case-insensitive)

5d. How many of the review texts contain both of these two words? (case-insensitive) Hint: You can use one ampersand for the logical “and”.
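
The combination of tolower, grepl, and a single ampersand can be sketched on a few invented sentences. The search here is a plain substring match, so a word such as "unhappy" also counts as containing "happy"; whether that is the intended behavior is a judgment call for your solution.

```r
# Invented review texts.
texts <- c("So HAPPY with this place",
           "Good food, good service",
           "Happy hour was good",
           "Unhappy with the wait")

# Lowercasing first makes the search case-insensitive.
lower <- tolower(texts)

# grepl() returns one TRUE/FALSE per element; fixed = TRUE treats the
# pattern as a literal string rather than a regular expression.
has_happy <- grepl("happy", lower, fixed = TRUE)
has_good <- grepl("good", lower, fixed = TRUE)

# A single & compares the two logical vectors element by element.
has_both <- has_happy & has_good

print(sum(has_happy))
print(sum(has_good))
print(sum(has_both))
```

An alternative that skips tolower is grepl("happy", texts, ignore.case = TRUE), which uses a regular-expression match instead of a fixed one.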