STAT 29000

Project 2

due September 11, 2015

Put all of your solutions into one text file for your group.
For example, the file for project 2, group 7, should be called: p02g7.Rmd and
it should be stored in the folder: /proj/gpproj15/p02g7/


Please submit your solutions in R Markdown (i.e., as an Rmd file) at the end of the project. You do not need to use R Markdown while you are initially solving the problems, but please format the final submission in R Markdown by the time you are finished.

NOTE: If you are using R Markdown, when you load the review.Rda file and the business.Rda in question 1b (below), please use {r cache=TRUE} instead of {r} on the line where the Rda files are loaded, so that R only loads each Rda file once. Similarly, when you load the csv files, it is worthwhile to use cache=TRUE again.


Question 1

1a. The “.csv” file extension stands for “comma-separated values”, a very popular format for storing data. The file review.csv was extracted from review.Rda. Even though the file contains the same data, it is twice the size!

Import review.csv into the variable called “review” using the read.csv function. Using the function proc.time(), find out how long it takes R to load this csv file. Simply run

startingtime <- proc.time()

before the command runs and then

stoppingtime <- proc.time()

after the read.csv command runs, and then take the difference of the two times, to find out how long it took to load the data.

Notice that read.csv has the parameter header=TRUE by default, which is good, since the csv file stores the variable names on line 1 as a header. This lets R know the intended names of the variables found on the remaining lines of the file.
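The timing pattern described above can be sketched on toy data. This is only an illustration: the data frame toy and the temporary file are made up for the example and stand in for review.csv.

```r
# Toy data written to a temporary csv file (a stand-in, not the course's review.csv).
toy <- data.frame(id = 1:1000, stars = rep(1:5, 200))
tmp <- tempfile(fileext = ".csv")
write.csv(toy, tmp, row.names = FALSE)

startingtime <- proc.time()
toy_in <- read.csv(tmp)          # header = TRUE is the default
stoppingtime <- proc.time()

# The difference has user, system, and elapsed components.
elapsed <- stoppingtime - startingtime
elapsed["elapsed"]
```

The "elapsed" component is the wall-clock time, which is usually the number of interest when comparing file formats.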

1b. Now load the equivalent “review.Rda” file into R using the load function. As above, use proc.time() to time how long this takes.

1c. Which format is faster to read into R? Rda or csv?

1d. Make sure that the data frames loaded from review.Rda and review.csv have the same dimensions.
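As a rough sketch of how such a comparison might look, the example below saves one toy data frame in both formats, reads each back, and compares dimensions. The object and file names are assumptions made for the illustration.

```r
# Toy data frame saved in both formats (temporary files, not the course data).
toy <- data.frame(id = 1:6, stars = c(5, 4, 3, 5, 1, 2))
csv_path <- tempfile(fileext = ".csv")
rda_path <- tempfile(fileext = ".Rda")
write.csv(toy, csv_path, row.names = FALSE)
save(toy, file = rda_path)

from_csv <- read.csv(csv_path)
rm(toy)
load(rda_path)                   # restores the object under its saved name, "toy"

dim(from_csv)
dim(toy)
identical(dim(from_csv), dim(toy))
```

Note that load restores objects under the names they were saved with, so it does not need an assignment.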

Question 2

2a. Use the strptime function to convert the “date” column from a factor to a POSIXt data type. This will allow you to add and subtract dates easily.
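A minimal sketch of this kind of conversion, on a made-up factor of dates. The explicit as.character step and the tz = "UTC" choice are assumptions for the example, not requirements of the project.

```r
# Toy factor of dates in %Y-%m-%d form (made up for illustration).
date_factor <- factor(c("2014-03-15", "2012-11-02", "2013-07-30"))

# Convert the factor labels to character first, then parse them.
dates <- strptime(as.character(date_factor), format = "%Y-%m-%d", tz = "UTC")
class(dates)
```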

2b. Find the date (in the format %Y-%m-%d) of the first review (chronologically). Find the date of the last review (chronologically). Now take the difference. This allows us to see the length of the time period in which reviews were collected.
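One way this might be approached, shown on three made-up dates: min and max pick out the endpoints, format prints them, and difftime gives the span.

```r
# Three toy dates, deliberately out of chronological order.
dates <- strptime(c("2014-01-11", "2014-01-01", "2014-01-05"),
                  format = "%Y-%m-%d", tz = "UTC")

first <- min(dates)
last <- max(dates)
format(first, "%Y-%m-%d")
format(last, "%Y-%m-%d")

# Length of the period between the two endpoints, in days.
span <- difftime(last, first, units = "days")
span
```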

Question 3

3a. Use strsplit (with “-” as the split parameter) to break the strings in the dates of the reviews into their component years, months, and days. Then use unlist to combine the results into a vector that has all of the years, months, and days.

3b. From the vector above, extract the years of each review. (Hint: You can use the seq command, with by=3, as an index to your vector; this will allow you to extract every third element of your vector.) Check to make sure that the number of years in the vector that you created is the same as the number of reviews in the data set.
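The split-and-index idea from 3a and 3b can be sketched on a small made-up character vector. If the dates are stored as a factor, they would need as.character first, since strsplit expects character input.

```r
# Toy date strings (made up for illustration).
dates <- c("2014-03-15", "2012-11-02", "2013-07-30")

# strsplit returns a list with one vector per date; unlist flattens it.
pieces <- unlist(strsplit(dates, "-"))

# Years sit at positions 1, 4, 7, ... of the flattened vector.
years <- pieces[seq(1, length(pieces), by = 3)]
length(years) == length(dates)
```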

Question 4

4a. Use tapply to find the average number of stars per year. Hint: Use the vector of years that you created in question 3b above.
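For reference, tapply takes a vector of values, a grouping vector of the same length, and a function to apply within each group. A small sketch with made-up stars and years:

```r
# Toy ratings and the year of each rating (made up for illustration).
stars <- c(5, 3, 4, 2, 4)
years <- c("2012", "2012", "2013", "2013", "2013")

# Mean of stars within each year.
avg_by_year <- tapply(stars, years, mean)
avg_by_year
```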

4b. Similarly to when you identified the total votes in Project 1, create a new column in the review_frame data frame that contains the mean of all three votes. Hint: If you use the mapply function, it is necessary to take a sum first, and then divide by 3. It is not necessary to use mapply. It is probably just easier to take a mean directly.
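The two routes mentioned in the hint can be compared on toy data. The data frame votes and its column names funny, useful, and cool are hypothetical and may not match the names in the actual data set.

```r
# Toy vote counts; the column names are assumptions for this sketch.
votes <- data.frame(funny = c(0, 3, 1), useful = c(3, 0, 1), cool = c(0, 0, 4))

# Direct route: row-wise mean of the three columns.
votes$mean_votes <- rowMeans(votes[, c("funny", "useful", "cool")])

# mapply route: sum the three values for each row, then divide by 3.
via_mapply <- mapply(function(f, u, k) sum(f, u, k) / 3,
                     votes$funny, votes$useful, votes$cool)

all.equal(votes$mean_votes, via_mapply)
```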

4c. Use tapply with the business data set to see how many reviews have been made of open businesses and how many have been made of closed businesses. (Use the review_count column to get the number of reviews of each business.)
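A sketch of summing a count column within groups, using a made-up data frame. The column name open is an assumption here; only review_count is named in the question.

```r
# Toy business data; "open" is a hypothetical column name.
biz <- data.frame(open = c(TRUE, TRUE, FALSE, TRUE, FALSE),
                  review_count = c(10, 4, 7, 1, 3))

# Total number of reviews within each group of the logical column.
reviews_by_status <- tapply(biz$review_count, biz$open, sum)
reviews_by_status
```

Using sum here totals the reviews, whereas length would count the businesses in each group instead.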

Question 5

5a. Use tapply to get the number of businesses per state.

5b. Use tapply to get the average number of reviews within each state.

5c. How many businesses in each state have karaoke?

5d. With regard to alcohol service, how many businesses are listed as having a full_bar? beer_and_wine? none? For how many businesses is it unknown whether alcohol is served? Use the table command to answer all four of these questions at once. The table command has a parameter that allows the NA elements to show up (check the documentation for the table command).
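To illustrate how table treats missing values, here is a sketch on a made-up factor. By default NA entries are dropped from the counts; the useNA argument ("no", "ifany", or "always") controls this.

```r
# Toy factor with two missing entries (made up for illustration).
alcohol <- factor(c("full_bar", "none", NA, "beer_and_wine", "full_bar", NA))

table(alcohol)                            # NA entries are not counted
counts <- table(alcohol, useNA = "ifany") # NA gets its own cell
counts
```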

Question 6

6a. Create a function that takes a factor (i.e., a categorical variable) and produces a labeled pie chart (it is up to you whether or not to include the NA values). Call the function: givemepie. You can use the function “pie” within your function.
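As background for using pie, the sketch below draws a labeled chart directly from a table of counts on a made-up factor. It is not wrapped in a function, and the label format is just one possible choice.

```r
# Toy factor (made up for illustration).
drinks <- factor(c("tea", "coffee", "tea", "water", "tea", "coffee"))
counts <- table(drinks)

# pie accepts a vector of non-negative counts; labels are built from the level names.
pie(counts,
    labels = paste0(names(counts), " (", counts, ")"),
    main = "Toy drink preferences")
```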

6b. Use your function on the alcohol factor.

Question 7

7a. What fraction of businesses have latitude between 32 and 40? 40 and 48? 48 and 57? Use only 1 line of code. (Hint: Use the tapply function together with the cut function and specified values of breaks.)
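The cut function bins a numeric vector into intervals defined by breaks, and the resulting factor can serve as the grouping for tapply. A sketch on five made-up latitudes, spread over several lines for readability:

```r
# Toy latitudes (made up for illustration).
lat <- c(33.5, 36.1, 43.0, 55.9, 35.2)

# Intervals are (32,40], (40,48], (48,57] by default (open on the left).
bands <- cut(lat, breaks = c(32, 40, 48, 57))

# Count per band divided by the total number of values.
frac <- tapply(lat, bands, length) / length(lat)
frac
```

One caveat: tapply returns NA rather than 0 for a band that contains no values.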

7b. What fraction of businesses have longitude between -120 and -80? -80 and -40? -40 and 0? 0 and 40? Use only 1 line of code.

7c. In one line of code, show what percent of businesses lie within each combination of the breaks from (a) and (b). You should end up with a 3x4 matrix of percentages.
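Passing two factors to table produces a two-way table of counts, which is one possible route to such a matrix. The sketch below uses five made-up coordinate pairs and is split across lines for readability.

```r
# Toy coordinates (made up for illustration).
lat <- c(33.5, 36.1, 43.0, 55.9, 35.2)
lon <- c(-112.0, -115.2, -89.4, -3.2, -80.8)

# Rows are latitude bands, columns are longitude bands; cells are percentages.
pct <- 100 * table(cut(lat, c(32, 40, 48, 57)),
                   cut(lon, c(-120, -80, -40, 0, 40))) / length(lat)
dim(pct)
pct
```

Unlike tapply, table fills empty combinations with 0, so the full 3x4 shape is kept even when some cells have no businesses.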