# STAT 29000
# Project 2 Solutions
### Question 1
1a. The time it takes to read the review.csv file is:
```{r cache=TRUE}
startingtime <- proc.time()
myreviewDF <- read.csv("review.csv")
stoppingtime <- proc.time()
stoppingtime - startingtime
```
1b. The time it takes to read the review.Rda file is:
```{r cache=TRUE}
startingtime <- proc.time()
load("review.Rda")
stoppingtime <- proc.time()
stoppingtime - startingtime
```
1c. The Rda file should be much faster to read, because it stores the data in R's binary serialization format, whereas the csv file has to be parsed as text.
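The comparison can be illustrated with a self-contained sketch (toy data and temporary files, so it does not depend on review.csv or review.Rda being present):
```{r}
# Toy data frame; the timing gap grows with the size of the data
df <- data.frame(x = runif(1e5), y = sample(letters, 1e5, replace = TRUE))

csvfile <- tempfile(fileext = ".csv")
rdafile <- tempfile(fileext = ".Rda")
write.csv(df, csvfile, row.names = FALSE)
save(df, file = rdafile)

csvtime <- system.time(read.csv(csvfile))  # parses text field by field
rdatime <- system.time(load(rdafile))      # restores R's binary serialization
csvtime["elapsed"]
rdatime["elapsed"]
```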
1d. We can check that the number of rows of the two resulting data frames are the same.
```{r}
dim(review_frame)
dim(myreviewDF)
```
The numbers of columns are not exactly the same because, as you will recall, the review_frame has a data frame embedded within one of its columns. Moreover, the data frame read from the csv file has an extra column (called X) at the start, which contains the row numbers of the reviews. This is not always the case, but it happens to be the case for this particular data frame.
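The embedded-column structure can be reproduced in miniature (toy data, mimicking the shape of review_frame$votes):
```{r}
# A data frame stored as a single column of another data frame
votes <- data.frame(funny = c(1, 0), useful = c(2, 3), cool = c(0, 1))
df <- data.frame(stars = c(5, 4))
df$votes <- votes   # votes counts as ONE column of df

dim(df)             # 2 rows, 2 columns (stars and votes)
df$votes$funny      # the nested columns are still accessible
```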
### Question 2
2a. Here we convert the dates into POSIXlt format.
```{r}
mytimes <- strptime(review_frame$date, "%Y-%m-%d")
```
2b. The time of the first review chronologically is
```{r}
min(mytimes)
```
The time of the last review chronologically is:
```{r}
max(mytimes)
```
The length of time during which reviews were collected is:
```{r}
max(mytimes) - min(mytimes)
```
This could also be performed by computing:
```{r}
diff(range(mytimes))
```
In the command above, range gives the minimum and maximum values of the vector, and diff takes their difference.
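A toy example (hypothetical dates) shows the equivalence:
```{r}
d <- strptime(c("2010-01-15", "2012-06-30", "2011-03-01"), "%Y-%m-%d")
min(d)           # earliest date
max(d)           # latest date
max(d) - min(d)  # elapsed time, as a difftime
diff(range(d))   # the same value
```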
### Question 3
3a. We split the strings and then unlist the results:
```{r}
v <- unlist(strsplit(review_frame$date,"-"))
```
3b. Here are the data from the years:
```{r}
myyears <- v[seq(1,length(v),by=3)]
```
The length of this vector agrees with the number of rows in the data frame, i.e., is the same as the number of reviews in the data set:
```{r}
length(myyears)
dim(review_frame)[1]
```
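A miniature version (made-up dates) shows how the positions line up: after unlisting, the years occupy positions 1, 4, 7, and so on, which is exactly what `seq(1, length(v), by=3)` picks out.
```{r}
dates <- c("2009-05-01", "2011-12-24", "2008-07-04")
v <- unlist(strsplit(dates, "-"))    # "2009" "05" "01" "2011" "12" "24" ...
myyears <- v[seq(1, length(v), by = 3)]
myyears                              # "2009" "2011" "2008"
length(myyears) == length(dates)     # TRUE
```
(An alternative would be `substr(dates, 1, 4)`, which extracts the years directly.)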
### Question 4
4a. The number of reviews per year is found by splitting review_frame$stars according to the corresponding values of myyears, and taking the length of each piece:
```{r}
tapply(review_frame$stars, myyears, length)
```
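A toy version (made-up stars and years) of the same call:
```{r}
stars <- c(5, 3, 4, 1, 2, 5)
myyears <- c("2008", "2009", "2008", "2009", "2009", "2010")
tapply(stars, myyears, length)   # 2 reviews in 2008, 3 in 2009, 1 in 2010
```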
4b. One possible method uses mapply to sum the three vote counts element by element:
```{r}
mynewcolumn <- mapply(sum, review_frame$votes$funny, review_frame$votes$useful, review_frame$votes$cool)/3
```
Alternatively, we can use vectorized addition directly:
```{r}
mynewcolumn <- (review_frame$votes$funny + review_frame$votes$useful + review_frame$votes$cool)/3
```
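We can verify on toy vote counts that the two methods agree:
```{r}
funny  <- c(1, 0, 2)
useful <- c(3, 1, 0)
cool   <- c(2, 2, 4)
viaMapply     <- mapply(sum, funny, useful, cool) / 3
viaVectorized <- (funny + useful + cool) / 3
all.equal(viaMapply, viaVectorized)   # TRUE
```
The vectorized form is also considerably faster on large data frames, since it avoids one function call per row.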
Finally, we can append mynewcolumn to the data frame as a new column (which I am calling averagevotes) by writing:
```{r}
review_frame$averagevotes <- mynewcolumn
```
4c. We first load the business_frame:
```{r cache=TRUE}
load("business.Rda")
```
We sum the review_count values, split according to whether each business is open (TRUE) or closed (FALSE):
```{r}
tapply(business_frame$review_count, business_frame$open, sum)
```
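In miniature (made-up review counts and open flags), the same call looks like this:
```{r}
review_count <- c(10, 5, 8, 2)
open <- c(TRUE, FALSE, TRUE, TRUE)
tapply(review_count, open, sum)   # FALSE: 5, TRUE: 20
```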
### Question 5
5a. We can count the businesses per state by splitting the state vector according to its own values and taking the length of each piece:
```{r}
tapply(business_frame$state, business_frame$state, length)
```
This can also be done, equivalently, in the following way:
```{r}
table(business_frame$state)
```
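A toy check (made-up state abbreviations) that the two approaches give the same counts:
```{r}
state <- c("AZ", "NV", "AZ", "PA", "NV", "AZ")
tapply(state, state, length)   # AZ: 3, NV: 2, PA: 1
table(state)                   # the same counts
```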
5b. Now we take the mean of review_count, splitting the data according to the state in which each business is found.
```{r}
tapply(business_frame$review_count, business_frame$state, mean)
```
5c. We sum the TRUE's (which count as 1's) and the FALSE's (which count as 0's) in the karaoke vector, split according to state, ignoring all of the NA's via na.rm=T.
```{r}
tapply(business_frame$attributes$Music$karaoke, business_frame$state, sum, na.rm=T)
```
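The coercion rules can be seen on a toy karaoke vector:
```{r}
karaoke <- c(TRUE, NA, FALSE, TRUE, NA)
state   <- c("AZ", "AZ", "NV", "NV", "PA")
sum(karaoke, na.rm = TRUE)                 # 2: each TRUE counts as 1
tapply(karaoke, state, sum, na.rm = TRUE)  # AZ: 1, NV: 1, PA: 0
```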
5d. We can use useNA = "always" to see the NA values for the alcohol service:
```{r}
table(business_frame$attributes$Alcohol,useNA="always")
```
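A toy vector with missing values shows the effect of the useNA argument:
```{r}
alcohol <- c("full_bar", "none", NA, "full_bar", NA)
table(alcohol)                    # NA's are dropped by default
table(alcohol, useNA = "always")  # NA's get their own count (here, 2)
```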
### Question 6
6a. There are several possible ways to accomplish this. Here is one such function:
```{r}
givemepie <- function(x) {
  pie(table(x, useNA = "always"))
}
```
6b. Here is the resulting pie chart.
```{r}
givemepie(business_frame$attributes$Alcohol)
```
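The function can be exercised on a toy vector; drawing to a temporary PDF device lets the example run non-interactively:
```{r}
givemepie <- function(x) {
  pie(table(x, useNA = "always"))
}

f <- tempfile(fileext = ".pdf")
pdf(f)                                    # off-screen graphics device
givemepie(c("beer", "wine", NA, "beer"))  # slices of sizes 2, 1, 1
dev.off()
```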
### Question 7
7a. We can cut the latitudes at the specified break points; dividing by 61184, the number of rows of business_frame, converts the counts into proportions.
```{r}
tapply(business_frame$latitude, cut(business_frame$latitude, breaks=c(32,40,48,57)), length)/61184
```
Since we are just applying the length function, it does not actually matter what we pass as the first argument, as long as it has the right length. Any of these would do, since they all have the same length:
```{r}
length(business_frame$latitude)
length(business_frame$longitude)
length(seq(1,61184))
```
Many other possibilities exist. For instance, we get the same answer as above if we substitute a dummy vector as the first argument. This only works because we are just taking the length of whatever pieces we get.
```{r}
tapply(seq(1,61184), cut(business_frame$latitude, breaks=c(32,40,48,57)), length)/61184
```
For a humorous way of seeing this, we could even write the following, and we would get the same answer:
```{r}
tapply(rep("pizza",times=61184), cut(business_frame$latitude, breaks=c(32,40,48,57)), length)/61184
```
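The mechanics of cut() can be checked on a handful of toy latitudes:
```{r}
lat <- c(33.5, 41.2, 36.0, 50.1, 39.9)
bins <- cut(lat, breaks = c(32, 40, 48, 57))
bins                                       # each value labeled with its interval
tapply(lat, bins, length) / length(lat)    # proportions, summing to 1
```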
7b. We handle the longitude in a similar way:
```{r}
tapply(business_frame$longitude, cut(business_frame$longitude, breaks=c(-120,-80,-40,0,40)), length)/61184
```
but we could also have used any of the approaches similar to the above, e.g.,
```{r}
tapply(seq(1,61184), cut(business_frame$longitude, breaks=c(-120,-80,-40,0,40)), length)/61184
```
7c. Now we apply the cuts from parts (a) and (b) simultaneously. You could put this all onto one line, but it might look a little long.
```{r}
tapply(seq(1,61184), list(
cut(business_frame$latitude, breaks=c(32,40,48,57)),
cut(business_frame$longitude, breaks=c(-120,-80,-40,0,40))
), length)/61184
```
If you want to do it in a slightly more readable way, here is one possible method:
```{r}
myfirstcut <- cut(business_frame$latitude, breaks=c(32,40,48,57))
mysecondcut <- cut(business_frame$longitude, breaks=c(-120,-80,-40,0,40))
tapply(seq(1,61184), list(myfirstcut, mysecondcut), length)/61184
```
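A toy two-way version (made-up coordinates) shows the shape of the result: a matrix of proportions, with NA in the empty latitude/longitude cells.
```{r}
lat <- c(33.5, 41.2, 36.0, 50.1)
lon <- c(-112.0, -75.3, -115.1, -110.0)
myfirstcut  <- cut(lat, breaks = c(32, 40, 48, 57))
mysecondcut <- cut(lon, breaks = c(-120, -80, -40, 0, 40))
tapply(seq_along(lat), list(myfirstcut, mysecondcut), length) / length(lat)
```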