STAT 29000

Project 2 Solutions

Question 1

1a. The time it takes to read the review.csv file is:

startingtime <- proc.time()
myreviewDF <- read.csv("review.csv")
stoppingtime <- proc.time()
stoppingtime - startingtime
##    user  system elapsed 
##  99.676   1.508 101.359

1b. The time it takes to read the review.Rda file is:

startingtime <- proc.time()
load("review.Rda")
stoppingtime <- proc.time()
stoppingtime - startingtime
##    user  system elapsed 
##  16.462   0.215  17.278

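1c. The Rda file is much faster to read: about 17 seconds of elapsed time, versus about 101 seconds for the csv file, i.e., nearly six times faster.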
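As a side note, the same kind of timing can be obtained with system.time, which wraps the start/stop bookkeeping into one call. The following is only a minimal sketch on a small made-up data frame (toyDF, csvfile, and rdafile are hypothetical names, not part of the project data), so the timings it prints will be tiny and will not resemble the ones above.

```r
# Toy data frame standing in for the review data (hypothetical contents).
toyDF <- data.frame(id = 1:1000, stars = sample(1:5, 1000, replace = TRUE))

# Write the same data in both formats to temporary files.
csvfile <- tempfile(fileext = ".csv")
rdafile <- tempfile(fileext = ".Rda")
write.csv(toyDF, csvfile)
save(toyDF, file = rdafile)

# system.time reports user, system, and elapsed time for one expression.
csvtime <- system.time(fromcsv <- read.csv(csvfile))
rdatime <- system.time(load(rdafile))
csvtime["elapsed"]
rdatime["elapsed"]
```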

1d. We can check that the number of rows is the same in the two resulting data frames.

dim(review_frame)
## [1] 1569264       8
dim(myreviewDF)
## [1] 1569264      11

The numbers of columns are not the same because, as you will recall, review_frame has a data frame embedded within one column, and dim counts that embedded data frame as a single column. Moreover, the data frame from the csv file has an extra column (called X) at the start, which just contains the row numbers of the reviews. This is not always the case, but it happens to be the case for this particular data frame.
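Both effects can be reproduced on a small scale. The sketch below uses made-up data frames (toy, flat, and back are hypothetical names) and assumes the csv file was written with row names, which is the default behavior of write.csv; we do not know exactly how review.csv was produced.

```r
# Hypothetical toy data frame with a data frame embedded in one column.
toy <- data.frame(stars = c(5, 3))
toy$votes <- data.frame(funny = c(0, 1), useful = c(2, 0), cool = c(1, 1))
dim(toy)         # the embedded data frame counts as one column
dim(toy$votes)   # but it has three columns of its own

# A flat toy data frame, written to a temporary csv file with row names.
flat <- data.frame(stars = c(5, 3), funny = c(0, 1))
csvfile <- tempfile(fileext = ".csv")
write.csv(flat, csvfile)
back <- read.csv(csvfile)
names(back)      # the row names come back as a first column named X
```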

Question 2

2a. Here we convert the dates into POSIXt data format.

mytimes <- strptime(review_frame$date, "%Y-%m-%d")

2b. The time of the first review chronologically is:

min(mytimes)
## [1] "2004-10-12 EDT"

The time of the last review chronologically is:

max(mytimes)
## [1] "2015-01-08 EST"

The length of time during which reviews were collected is:

max(mytimes) - min(mytimes)
## Time difference of 3740.042 days

This could also be performed by computing:

diff(range(mytimes))
## Time difference of 3740.042 days

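In the command above, range returns the minimum and the maximum values of the vector, and diff takes the difference between them. The fractional part of the answer (0.042 days, i.e., one hour) appears because the first date falls in daylight saving time (EDT) and the last date falls in standard time (EST).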
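If a whole number of days is preferred, one option (not the approach used above, just a sketch) is to work with the Date class, which carries no time zone, or to parse the strings in UTC. The two dates below are the minimum and maximum reported above; firstday, lastday, and utctimes are names chosen only for this illustration.

```r
# Dates have no time zone, so the difference is a whole number of days.
firstday <- as.Date("2004-10-12")
lastday <- as.Date("2015-01-08")
lastday - firstday

# Parsing in UTC also avoids the one-hour daylight saving shift.
utctimes <- strptime(c("2004-10-12", "2015-01-08"), "%Y-%m-%d", tz = "UTC")
diff(range(utctimes))
```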

Question 3

3a. We split the strings and then unlist the results:

v <- unlist(strsplit(review_frame$date,"-"))

3b. The years are the first of every three entries of v:

myyears <- v[seq(1,length(v),by=3)]

The length of this vector agrees with the number of rows in the data frame, i.e., is the same as the number of reviews in the data set:

length(myyears)
## [1] 1569264
dim(review_frame)[1]
## [1] 1569264
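For comparison, here is a small sketch showing that the strsplit approach gives the same years as two other possible approaches, namely substr and format. It runs on a made-up vector of dates (toydates is a hypothetical name), and it assumes the dates are character strings in the form YYYY-MM-DD.

```r
# Hypothetical toy dates in the same format as review_frame$date.
toydates <- c("2004-10-12", "2010-06-30", "2015-01-08")

# Approach from above: split on "-" and keep every third piece.
v <- unlist(strsplit(toydates, "-"))
years1 <- v[seq(1, length(v), by = 3)]

# Alternative: the year is the first four characters of each string.
years2 <- substr(toydates, 1, 4)

# Alternative: parse the dates and format only the year.
years3 <- format(strptime(toydates, "%Y-%m-%d"), "%Y")

identical(years1, years2)
identical(years1, years3)
```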

Question 4

4a. The number of reviews per year is found by taking the length of each of the parts of review_frame$stars, grouped according to the corresponding values of myyears:

tapply(review_frame$stars, myyears, length)
##   2004   2005   2006   2007   2008   2009   2010   2011   2012   2013 
##     13    680   4239  17724  45117  72948 137764 209429 244106 336273 
##   2014   2015 
## 486306  14665
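To see what tapply is doing here, it may help to carry out the two steps by hand on a tiny example: first split the values into pieces by year, then take the length of each piece. This is only a sketch on made-up data (toystars and toyyears are hypothetical names).

```r
# Hypothetical toy data: six reviews with their star ratings and years.
toystars <- c(5, 3, 4, 1, 5, 2)
toyyears <- c("2013", "2014", "2014", "2013", "2014", "2015")

# Step 1: break the stars into pieces, one piece per year.
pieces <- split(toystars, toyyears)
pieces

# Step 2: take the length of each piece.
bysplit <- sapply(pieces, length)
bysplit

# tapply performs both steps at once.
bytapply <- tapply(toystars, toyyears, length)
bytapply
```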

4b. One possible method, using mapply, is to do this:

mynewcolumn <- mapply(sum, review_frame$votes$funny, review_frame$votes$useful, review_frame$votes$cool)/3

Alternatively, since arithmetic on vectors in R is elementwise, we could add the three vectors directly:

mynewcolumn <- (review_frame$votes$funny + review_frame$votes$useful + review_frame$votes$cool)/3

Finally, we can append mynewcolumn to the data frame as a new column (which I am calling averagevotes) by writing:

review_frame$averagevotes <- mynewcolumn
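A quick check on a tiny example shows that the two methods agree. The sketch also includes a third possibility, rowMeans, which was not used above. The data frame toyvotes is made up for illustration, with column names assumed to match those of review_frame$votes.

```r
# Hypothetical toy votes, one row per review.
toyvotes <- data.frame(funny = c(0, 3, 1), useful = c(3, 0, 1), cool = c(0, 3, 4))

# Method 1: mapply calls sum on the first entries, then the second entries, etc.
viamapply <- mapply(sum, toyvotes$funny, toyvotes$useful, toyvotes$cool) / 3

# Method 2: elementwise addition of the three vectors.
viaplus <- (toyvotes$funny + toyvotes$useful + toyvotes$cool) / 3

# Method 3: the mean of each row of the three columns.
viarowmeans <- rowMeans(toyvotes[, c("funny", "useful", "cool")])

viamapply
viaplus
unname(viarowmeans)
```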

4c. We first load the business_frame:

load("business.Rda")

We sum the review_count values, according to the value (TRUE or FALSE) of whether the business is open.

tapply(business_frame$review_count, business_frame$open, sum)
##   FALSE    TRUE 
##  159561 1570264

Question 5

5a. We can extract the number of businesses per state by splitting the state vector according to its own values and taking the length of each piece.

tapply(business_frame$state, business_frame$state, length)
##    AZ    BW    CA   EDH   ELN   FIF   HAM    IL   KHL    MA   MLN    MN 
## 25230   934     3  2971    10     4     1   627     1     1   123     1 
##    NC   NTH    NV    NW    ON    OR    PA    QC    RP    SC   SCB    WA 
##  4963     1 16485     1   351     1  3041  3921    13   189     3     1 
##    WI   XGL 
##  2307     1

This can also be done, equivalently, in the following way:

table(business_frame$state)
## 
##    AZ    BW    CA   EDH   ELN   FIF   HAM    IL   KHL    MA   MLN    MN 
## 25230   934     3  2971    10     4     1   627     1     1   123     1 
##    NC   NTH    NV    NW    ON    OR    PA    QC    RP    SC   SCB    WA 
##  4963     1 16485     1   351     1  3041  3921    13   189     3     1 
##    WI   XGL 
##  2307     1

5b. Now we take a mean of the review_count, splitting up the data according to the states in which the businesses are found.

tapply(business_frame$review_count, business_frame$state, mean)
##        AZ        BW        CA       EDH       ELN       FIF       HAM 
## 25.238962  9.368308 14.333333 11.560417  5.100000  9.000000  3.000000 
##        IL       KHL        MA       MLN        MN        NC       NTH 
## 20.987241  8.000000  4.000000  7.593496  4.000000 20.651823 17.000000 
##        NV        NW        ON        OR        PA        QC        RP 
## 45.672066  5.000000  8.826211  4.000000 23.810917 13.917113  5.769231 
##        SC       SCB        WA        WI       XGL 
## 12.798942  5.666667  9.000000 20.669267  3.000000

5c. We just sum the TRUE’s (which become 1’s) and the FALSE’s (which become 0’s) in the karaoke vector, split according to the states, and we use na.rm to ignore all of the NA’s.

tapply(business_frame$attributes$Music$karaoke, business_frame$state, sum, na.rm=TRUE)
##  AZ  BW  CA EDH ELN FIF HAM  IL KHL  MA MLN  MN  NC NTH  NV  NW  ON  OR 
##  35   0   0   5   0   0   0   1   0   0   0   0   6   0  30   0   1   0 
##  PA  QC  RP  SC SCB  WA  WI XGL 
##  10   8   0   0   0   0   2   0
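The role of na.rm is easiest to see on a tiny example. In this sketch (toykaraoke and toystate are made-up vectors), every state has at least one NA, so without na.rm every sum is NA.

```r
# Hypothetical toy data: karaoke flag (often missing) and state per business.
toykaraoke <- c(TRUE, NA, FALSE, TRUE, NA, NA)
toystate <- c("AZ", "AZ", "NV", "NV", "NV", "WI")

# Without na.rm, a single NA in a state makes that state's sum NA.
withNA <- tapply(toykaraoke, toystate, sum)
withNA

# With na.rm=TRUE, the NA's are dropped before summing.
withoutNA <- tapply(toykaraoke, toystate, sum, na.rm = TRUE)
withoutNA
```

Note that WI gets a count of 0 here even though its only value is NA, so a 0 in the output does not distinguish "no karaoke" from "no information".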

5d. We can use useNA="always" to see the NA values for the alcohol service:

table(business_frame$attributes$Alcohol,useNA="always")
## 
## beer_and_wine      full_bar          none          <NA> 
##          2983          9069          8405         40727
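For reference, here is a small sketch of how the useNA argument changes the output of table, on a made-up vector (toyalcohol and nomissing are hypothetical names). By default the NA's are silently dropped, so the counts no longer add up to the length of the vector.

```r
# Hypothetical toy vector with two missing values.
toyalcohol <- c("full_bar", NA, "none", "full_bar", NA)

# Default: NA's are not counted at all.
dropped <- table(toyalcohol)
dropped

# useNA="always": an <NA> column is always shown.
kept <- table(toyalcohol, useNA = "always")
kept

# useNA="ifany": an <NA> column is shown only when NA's are present.
nomissing <- c("full_bar", "none")
table(nomissing, useNA = "ifany")
table(nomissing, useNA = "always")
```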

Question 6

6a. There are several possible ways to accomplish this. Here is one such function:

givemepie <- function (x) {
  pie(table(x,useNA="always"))
}

6b. Here is the resulting pie chart.

givemepie(business_frame$attributes$Alcohol)
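As one of the other possible ways, here is a sketch of a variant that adds percentages to the slice labels. The helper names pielabels and givemepie2 are made up for this illustration, and the toy vector stands in for the real data.

```r
# Build labels such as "full_bar (50%)" from the counts, including NA's.
pielabels <- function(x) {
  counts <- table(x, useNA = "always")
  labels <- names(counts)
  labels[is.na(labels)] <- "NA"
  percents <- round(100 * counts / sum(counts), 1)
  paste0(labels, " (", percents, "%)")
}

# Same pie chart as givemepie, but with the percentage labels.
givemepie2 <- function(x) {
  pie(table(x, useNA = "always"), labels = pielabels(x))
}

# Hypothetical toy vector, to show what the labels look like.
toyalcohol <- c("full_bar", NA, "none", "full_bar")
pielabels(toyalcohol)
```

With the real data, the chart would then be drawn with givemepie2(business_frame$attributes$Alcohol).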

Question 7

7a. We can cut the latitude according to the specified values.

tapply(business_frame$latitude, cut(business_frame$latitude, breaks=c(32,40,48,57)), length)/61184
##    (32,40]    (40,48]    (48,57] 
## 0.76608264 0.16751111 0.06640625

Since we are just using the “length” function, it does not actually matter what we put into the first argument of tapply, as long as it has the right length. Any of these would do, since they all have the same length:

length(business_frame$latitude)
## [1] 61184
length(business_frame$longitude)
## [1] 61184
length(seq(1,61184))
## [1] 61184

Many other possibilities exist. So we get the same answer as above, for instance, if we substitute a dummy vector into the first argument. This only works because we are just taking the length of whatever pieces we get.

tapply(seq(1,61184), cut(business_frame$latitude, breaks=c(32,40,48,57)), length)/61184
##    (32,40]    (40,48]    (48,57] 
## 0.76608264 0.16751111 0.06640625

For a humorous way of seeing this, we could even write the following, and we would get the same answer:

tapply(rep("pizza",times=61184), cut(business_frame$latitude, breaks=c(32,40,48,57)), length)/61184
##    (32,40]    (40,48]    (48,57] 
## 0.76608264 0.16751111 0.06640625
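Two details of cut are worth seeing on a small example: the intervals are closed on the right by default, so a value exactly equal to 40 lands in (32,40], and the denominator can be computed with length rather than typed in by hand. The sketch below uses made-up latitudes (toylat is a hypothetical name) and uses table as an equivalent way of counting.

```r
# Hypothetical toy latitudes; note the value exactly equal to 40.
toylat <- c(33.4, 36.1, 40, 43.1, 55.9, 35.2, 33.5, 45.5)

# Each latitude is assigned to one of the three intervals.
bins <- cut(toylat, breaks = c(32, 40, 48, 57))
bins

# Proportion of values in each interval, via tapply as above.
viatapply <- tapply(toylat, bins, length) / length(toylat)
viatapply

# The same proportions, via table.
viatable <- table(bins) / length(toylat)
viatable
```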

7b. We handle the longitude in a similar way:

tapply(business_frame$longitude, cut(business_frame$longitude, breaks=c(-120,-80,-40,0,40)), length)/61184
## (-120,-80]  (-80,-40]    (-40,0]     (0,40] 
## 0.83472803 0.09886572 0.05091200 0.01549425

but we could also have used any of the approaches similar to the above, e.g.,

tapply(seq(1,61184), cut(business_frame$longitude, breaks=c(-120,-80,-40,0,40)), length)/61184
## (-120,-80]  (-80,-40]    (-40,0]     (0,40] 
## 0.83472803 0.09886572 0.05091200 0.01549425

7c. Now we apply the cuts from both 7a and 7b simultaneously. You could put this all onto one line, but it might look a little long.

tapply(seq(1,61184), list( 
    cut(business_frame$latitude, breaks=c(32,40,48,57)),
    cut(business_frame$longitude, breaks=c(-120,-80,-40,0,40))
), length)/61184
##         (-120,-80]  (-80,-40]  (-40,0]     (0,40]
## (32,40]  0.7660826         NA       NA         NA
## (40,48]  0.0686454 0.09886572       NA         NA
## (48,57]         NA         NA 0.050912 0.01549425

If you want to do it in a slightly more readable way, here is one possible method:

myfirstcut <- cut(business_frame$latitude, breaks=c(32,40,48,57))
mysecondcut <- cut(business_frame$longitude, breaks=c(-120,-80,-40,0,40))
tapply(seq(1,61184), list(myfirstcut, mysecondcut), length)/61184
##         (-120,-80]  (-80,-40]  (-40,0]     (0,40]
## (32,40]  0.7660826         NA       NA         NA
## (40,48]  0.0686454 0.09886572       NA         NA
## (48,57]         NA         NA 0.050912 0.01549425
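The NA entries in the output above correspond to combinations of latitude and longitude ranges that contain no businesses: tapply has no piece to take the length of, so it reports NA. If zeros are preferred, one possible alternative (a sketch, not the method used above) is table, which counts empty combinations as 0. The toy coordinates below are made up for illustration.

```r
# Hypothetical toy coordinates for four businesses.
toylat <- c(33.4, 36.1, 43.1, 55.9)
toylon <- c(-112.0, -115.2, -79.4, -3.2)

latcut <- cut(toylat, breaks = c(32, 40, 48, 57))
loncut <- cut(toylon, breaks = c(-120, -80, -40, 0, 40))

# tapply gives NA for combinations with no businesses.
viatapply <- tapply(seq_along(toylat), list(latcut, loncut), length) / length(toylat)
viatapply

# table gives 0 for those combinations instead.
viatable <- table(latcut, loncut) / length(toylat)
viatable
```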