STAT 29000

Project 1 Solutions

Question 1

1a. First we load the review.Rda file

load("review.Rda")

Then we use the “ls” function to see that “review_frame” is the name of the data frame:

ls()
## [1] "review_frame"

1b. A summary of each column (which also shows the column names) is:

summary(review_frame)
##      votes.funny         votes.useful          votes.cool     
##  Min.   :  0.00000    Min.   :  0.00000    Min.   :  0.00000  
##  1st Qu.:  0.00000    1st Qu.:  0.00000    1st Qu.:  0.00000  
##  Median :  0.00000    Median :  0.00000    Median :  0.00000  
##  Mean   :  0.47889    Mean   :  1.07162    Mean   :  0.59418  
##  3rd Qu.:  0.00000    3rd Qu.:  1.00000    3rd Qu.:  1.00000  
##  Max.   :141.00000    Max.   :166.00000    Max.   :137.00000  
##    user_id           review_id             stars           date          
##  Length:1569264     Length:1569264     Min.   :1.000   Length:1569264    
##  Class :character   Class :character   1st Qu.:3.000   Class :character  
##  Mode  :character   Mode  :character   Median :4.000   Mode  :character  
##                                        Mean   :3.743                     
##                                        3rd Qu.:5.000                     
##                                        Max.   :5.000                     
##      text               type           business_id       
##  Length:1569264     Length:1569264     Length:1569264    
##  Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character  
##                                                          
##                                                          
## 

Alternatively, you can use the names function to learn this information:

names(review_frame)
## [1] "votes"       "user_id"     "review_id"   "stars"       "date"       
## [6] "text"        "type"        "business_id"

You can see that votes has some embedded columns this way:

names(review_frame$votes)
## [1] "funny"  "useful" "cool"
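To make the idea of an embedded data frame concrete, here is a minimal sketch on a made-up three-row stand-in for review_frame. The object names toy and flat, and all of the values, are hypothetical and are not taken from the Yelp data. It shows one way such a nested column can arise and one base R idiom for flattening it into the votes.funny style of names that summary displays:

```r
# Toy stand-in for review_frame (hypothetical values, not the real data).
toy <- data.frame(user_id = c("u1", "u2", "u3"),
                  stars = c(5L, 3L, 4L),
                  stringsAsFactors = FALSE)

# Assigning a data frame to a column nests it, like review_frame$votes.
toy$votes <- data.frame(funny = c(0L, 2L, 1L),
                        useful = c(1L, 0L, 4L),
                        cool = c(0L, 0L, 3L))

names(toy)        # top-level columns only
names(toy$votes)  # the embedded columns

# Flatten: each embedded column becomes a top-level column,
# with its name prefixed by "votes.".
flat <- do.call(data.frame, toy)
names(flat)
```

After flattening, the columns can be reached with a single $, for example flat$votes.useful.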

You can check the type of each vector as follows:

class(review_frame$votes$funny)
## [1] "integer"
class(review_frame$votes$useful)
## [1] "integer"
class(review_frame$votes$cool)
## [1] "integer"
class(review_frame$user_id)
## [1] "character"
class(review_frame$review_id)
## [1] "character"
class(review_frame$stars)
## [1] "integer"
class(review_frame$date)
## [1] "character"
class(review_frame$text)
## [1] "character"
class(review_frame$type)
## [1] "character"
class(review_frame$business_id)
## [1] "character"
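Rather than calling class once per column, the same check can be done in one step with sapply. The sketch below uses a small made-up data frame (toy2 and its values are hypothetical); on the real data, the calls would presumably be sapply(review_frame, class) and sapply(review_frame$votes, class):

```r
# Toy data frame with an embedded data frame, mimicking review_frame.
toy2 <- data.frame(user_id = c("u1", "u2"),
                   stars = c(5L, 3L),
                   stringsAsFactors = FALSE)
toy2$votes <- data.frame(funny = c(0L, 2L), useful = c(1L, 0L))

# Apply class() to every top-level column at once.
top_classes <- sapply(toy2, class)
top_classes

# The embedded data frame needs its own call.
vote_classes <- sapply(toy2$votes, class)
vote_classes
```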

You can also see that review_frame is a data frame:

class(review_frame)
## [1] "data.frame"

and that votes is a data frame embedded within review_frame:

class(review_frame$votes)
## [1] "data.frame"

1c. The dimension of the data frame is:

dim(review_frame)
## [1] 1569264       8

so the number of rows is:

dim(review_frame)[1]
## [1] 1569264

1d. The average number of stars given in a review is:

mean(review_frame$stars)
## [1] 3.742656

1e. The row of the data frame that has the review with the most “useful” votes is:

which.max(review_frame$votes$useful)
## [1] 1179107

So the user id of the individual who wrote that review is:

review_frame$user_id[which.max(review_frame$votes$useful)]
## [1] "WJSNywtir04BgDDpZVZMpg"
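One caveat worth keeping in mind: which.max returns only the first position of the maximum, so a tie would go unnoticed. A small sketch on a made-up vector (useful_toy is hypothetical) shows how to list every position that attains the maximum:

```r
# Hypothetical vote counts with a tie for the maximum.
useful_toy <- c(3L, 7L, 2L, 7L)

# which.max reports only the first maximum.
first_max <- which.max(useful_toy)
first_max

# Comparing against max() reports every position that ties for it.
all_max <- which(useful_toy == max(useful_toy))
all_max
```

On the real data, the analogous check would be which(review_frame$votes$useful == max(review_frame$votes$useful)).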

1f. The row of the data frame that has the review with the most “funny” votes is:

which.max(review_frame$votes$funny)
## [1] 1179107

So the text of that review is:

review_frame$text[which.max(review_frame$votes$funny)]
## [1] "I'm the first real person to review this place, let all other fake spammers be gone!  Yelp should really work that shizzle out.  Zack S...um this place closes at 11 so you couldn't have possibly hit the bar...so yeah if your gonna post fake reviews at least check your facts..\n\nThis place has so much potential, yet the ridiculously bad service just overshadowed everything good they did.\n\nThis is probably the worst service I have ever received in my life.  \n\nDo you guys remember in Pretty Woman when Julia Roberts goes into the first store on Rodeo and the snooty lady acts like she's too good for this place?  Well that's what they did to us.\n\nWhere's the hostess?  Oh she's chatting with her friend...*ignoring me*\n\nWe finally get seated and then we sit and wait...\n\nand wait...\n\nand wait...\n\nfinally I get up and ask one of the waiters at the cash register to send over someone...\n\nShe treats us like we're a nuisance to her, she's singing along to the song that's playing...(WTF?)  I think they turned up the music louder as more people left because by the time it was near closing time it was blaring hip hop music...really weird considering it was a nice casual Italian restaurant.\n\nWe order almost 200 bucks worth of food, and then she's like \"is that it?\"\n\nexcuse me bitch...watch the attitude...\n\nWait, wait, watch as all the waiters get together to talk about us...seriously I can see you whispering about us....this is so unprofessional right now....\n\nWe seriously waited for like an hour and a half, it was ridiculous!  The place was practically empty!  The waiters were standing around and chatting with each other.  I took pictures as evidence!\n\nThis place is supposed to be a nice place, for the prices they charge they should have a whole restaurant re-staffed because it was ridiculous!  
I have never felt more uncomfortable and treated so rudely in my life!\n\nI also wanted to order take out for later and apparently they don't have take out boxes?  You're a restaurant?  You don't have boxes?  Seriously?\n\nThis is the worst experience in a restaurant I have ever had.  I've gotten better service at Carls Jr in a shady neighborhood than this.  \n\nI won't hesitate to tell everyone to avoid this restaurant when you're in Vegas because this was outrageous!  No one treats me like that and gets away with it!  I wish I had a You've been yelped card so I could give them a piece of my mind.\n\nI'm like those crazy housewives who have nothing better to do...I won't stop calling to speak to management until someone is fired.  I'm serious, no one gets away with treating me like that without suffering repercussions.  We don't play that, I'm sorry I'm not one of those quiet asian people who take your shit...I will not be silenced... You best believe.  You're done..The end.\n\n\n*note I left for the manager*\n\n\"This is probably the most unprofessional and ridiculous restaurant I have ever had the misfortune to experience, not just from the our waitress, but the waitstaff and blatant ignorance from most of the employees I had to encounter.\"\n                                                                   Anthony Nguyen"

1g. The distribution of the number of stars is:

table(review_frame$stars)
## 
##      1      2      3      4      5 
## 159811 140608 222719 466599 579527
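The counts are often easier to read as proportions. The sketch below simply retypes the five counts printed above (the names star_counts and star_props are made up for illustration) and, as a sanity check, recovers the average from 1d as a weighted mean. On the real data, prop.table(table(review_frame$stars)) should give the same proportions directly:

```r
# Counts copied from the table() output above, named by star rating.
star_counts <- c("1" = 159811, "2" = 140608, "3" = 222719,
                 "4" = 466599, "5" = 579527)

# Total should equal the number of rows found in 1c.
sum(star_counts)

# Share of reviews at each star level, as percentages.
star_props <- star_counts / sum(star_counts)
round(100 * star_props, 1)

# The mean star rating is the count-weighted mean of 1:5.
weighted.mean(1:5, star_counts)
```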

Question 2

2a. The sum of the numbers of funny, useful, and cool votes is:

totalvotes <- review_frame$votes$funny + review_frame$votes$useful + review_frame$votes$cool

(OK, I stored it into a vector, not a factor.)

2b. The number of reviews that received at least 160 votes is:

sum(totalvotes >= 160)
## [1] 28

2c. The user_ids of the people who wrote the ten reviews that received the most votes are:

topreviewcounts <- sort(totalvotes, decreasing = TRUE)[1:10]
review_frame$user_id[totalvotes >= min(topreviewcounts)]
##  [1] "8j5rre5uA2TxjX8Fk9Je3Q" "zfb_dSwWV5mV4f_ZAgkYbg"
##  [3] "fr3HXiNw5JiIIspADCS5gA" "zfb_dSwWV5mV4f_ZAgkYbg"
##  [5] "C8ZTiwa7qWoPSMIivTeSfw" "gFTglOy-Skssv7TiuW-D8g"
##  [7] "WJSNywtir04BgDDpZVZMpg" "YpvGOfegYJ2w8CNITiIv1A"
##  [9] "ptFwVDjiEKug1qGmYyZ_yw" "WmAyExqSWoiYZ5XEqpk_Uw"
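Note that the threshold comparison above returns the users in row order rather than in order of votes, and it could return more than ten rows if several reviews tie at the cutoff. An alternative is to use order, which returns exactly the requested number of row indices, ranked from most votes to fewest. Here is a minimal sketch on made-up data (votes_toy, users_toy, and k are hypothetical):

```r
# Hypothetical vote totals and user ids, with a tie at the cutoff.
votes_toy <- c(4L, 9L, 6L, 9L, 6L)
users_toy <- c("a", "b", "c", "d", "e")
k <- 3

# order() gives row indices ranked by votes; keep the first k.
top_idx <- order(votes_toy, decreasing = TRUE)[1:k]
by_order <- users_toy[top_idx]
by_order

# The threshold approach keeps every row at or above the k-th largest value,
# so ties at the cutoff can yield more than k results.
cutoff <- sort(votes_toy, decreasing = TRUE)[k]
by_threshold <- users_toy[votes_toy >= cutoff]
by_threshold
```

On the real data, this would presumably read review_frame$user_id[order(totalvotes, decreasing = TRUE)[1:10]].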

Question 3

3a. First we load the business.Rda file:

load("business.Rda")

Then we use the “ls” function to see that “business_frame” is the name of the data frame:

ls()
## [1] "business_frame"  "review_frame"    "topreviewcounts" "totalvotes"

3b. The variables stored in the data frame are:

names(business_frame)
##  [1] "business_id"   "full_address"  "hours"         "open"         
##  [5] "categories"    "city"          "review_count"  "name"         
##  [9] "neighborhoods" "longitude"     "state"         "stars"        
## [13] "latitude"      "attributes"    "type"

3c. The states that appear in this data set, with the number of businesses in each, are:

table(business_frame$state)
## 
##    AZ    BW    CA   EDH   ELN   FIF   HAM    IL   KHL    MA   MLN    MN 
## 25230   934     3  2971    10     4     1   627     1     1   123     1 
##    NC   NTH    NV    NW    ON    OR    PA    QC    RP    SC   SCB    WA 
##  4963     1 16485     1   351     1  3041  3921    13   189     3     1 
##    WI   XGL 
##  2307     1

So the number of such states is:

length(table(business_frame$state))
## [1] 26
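If only the distinct values are needed, and not their counts, unique is a more direct tool than table. The two can disagree when missing values are present, because table drops NA by default while unique keeps it. A small sketch on a made-up vector (state_toy is hypothetical):

```r
# Hypothetical state column containing one missing value.
state_toy <- c("AZ", "IL", NA, "AZ", "NV")

# Distinct values, in order of first appearance (NA included).
distinct_states <- unique(state_toy)
distinct_states

# table() ignores NA unless useNA is set, so the counts differ.
n_from_table <- length(table(state_toy))
n_from_unique <- length(unique(state_toy))
c(n_from_table, n_from_unique)
```

On the real data, the analogous call would be length(unique(business_frame$state)); whether it matches the count above depends on whether any state values are missing, which this document does not show.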

3d. The number of businesses in Illinois that are in the dataset is:

sum(business_frame$state == "IL")
## [1] 627

3e. The number of Illinois businesses that have strictly more than 50 reviews is:

sum(business_frame$review_count[business_frame$state == "IL"] > 50)
## [1] 64

Question 4

4a. The number of businesses that are listed in Illinois is:

ILcount <- sum(business_frame$state == "IL")
ILcount
## [1] 627

4b. The number of businesses that are listed in Arizona is:

AZcount <- sum(business_frame$state == "AZ")
AZcount
## [1] 25230

4c. The business IDs of the companies in IL are:

ILbusinesses <- business_frame$business_id[business_frame$state == "IL"]

So the total number of votes for IL businesses is:

sum(totalvotes[review_frame$business_id %in% ILbusinesses])
## [1] 16970

So the number of votes per business in IL is:

sum(totalvotes[review_frame$business_id %in% ILbusinesses])/ILcount
## [1] 27.06539

The business IDs of the companies in AZ are:

AZbusinesses <- business_frame$business_id[business_frame$state == "AZ"]

So the total number of votes for AZ businesses is:

sum(totalvotes[review_frame$business_id %in% AZbusinesses])
## [1] 1333428

So the number of votes per business in AZ is:

sum(totalvotes[review_frame$business_id %in% AZbusinesses])/AZcount
## [1] 52.85089

So by this measure, the Arizona businesses are more popular: about 52.9 votes per business, versus about 27.1 for Illinois.
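The same votes-per-business comparison can be computed for every state at once, instead of repeating the steps per state. The following is a sketch under assumptions, using two tiny made-up data frames (business_toy, review_toy, and their values are hypothetical): match looks up the state of each review's business, and tapply sums the votes within each state.

```r
# Hypothetical businesses and reviews (not the real Yelp data).
business_toy <- data.frame(business_id = c("b1", "b2", "b3"),
                           state = c("IL", "AZ", "AZ"),
                           stringsAsFactors = FALSE)
review_toy <- data.frame(business_id = c("b1", "b1", "b2", "b3", "b3"),
                         total = c(2L, 3L, 10L, 1L, 4L),
                         stringsAsFactors = FALSE)

# State of the business that each review refers to.
review_state <- business_toy$state[match(review_toy$business_id,
                                         business_toy$business_id)]

# Total votes per state, and number of businesses per state.
votes_by_state <- tapply(review_toy$total, review_state, sum)
biz_by_state <- table(business_toy$state)

# Votes per business, aligning the two by state name.
per_business <- votes_by_state / as.vector(biz_by_state[names(votes_by_state)])
per_business
```

As in the answer above, businesses with no reviews still count in the denominator, since the business counts come from the business table rather than from the reviews.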

Question 5

5a. The function tolower changes a string into its lower-case representation.

5b. The number of review texts that contain the word happy is:

happytexts <- grepl("happy", tolower(review_frame$text))
sum(happytexts)
## [1] 108090

5c. The number of review texts that contain the word good is:

goodtexts <- grepl("good", tolower(review_frame$text))
sum(goodtexts)
## [1] 613548

5d. The number of review texts that contain both the word happy and the word good is:

sum(happytexts & goodtexts)
## [1] 51428
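Note that grepl("happy", ...) matches the string anywhere in the text, so a review containing "unhappy" is counted as well. Whether that is the intended reading of the question is a judgment call. If whole-word matches are wanted, one option is a word-boundary pattern, as in this sketch on made-up sentences (texts_toy is hypothetical):

```r
# Hypothetical review texts.
texts_toy <- c("I am happy", "So unhappy with it",
               "Happy hour was good", "goodness no")

# Substring match, as used above: "unhappy" also counts.
substring_hits <- grepl("happy", tolower(texts_toy))
substring_hits

# Whole-word match: \\b marks a word boundary.
word_hits <- grepl("\\bhappy\\b", tolower(texts_toy), perl = TRUE)
word_hits

# ignore.case = TRUE is an alternative to calling tolower() first.
nocase_hits <- grepl("happy", texts_toy, ignore.case = TRUE)
nocase_hits
```

Applied to review_frame$text, the whole-word pattern would be expected to give counts no larger than those reported above.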