STAT 29000

Project 5

due October 9, 2015

Put all of your solutions into one text file for your group.
For example, the file for project 5, group 7, should be called: p05g7.Rmd and
it should be stored in the folder: /proj/gpproj15/p05g7/


Please submit your solutions in R Markdown (i.e., as an Rmd file) at the end of the project. You do not need to use R Markdown while you are initially solving the problems, unless you want to, but please format the final submission in R Markdown.

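NOTE: In some questions, it will be helpful to use the chunk option cache=TRUE, i.e., to start the chunk with {r cache=TRUE}, so that slow chunks are not re-run every time you knit.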

SECOND NOTE: Although you will be able to see the ggmap maps in RStudio and when you “knit” a pdf document, you will probably not be able to see them properly when you “knit” an html document. It seems that there is some incompatibility with html.

Question 1

1a. Load the airplane data from 2008. Make a new data frame that contains only the 15th, 16th, and 19th columns, i.e., the ArrDelay, DepDelay, and the Distance, and that only contains every 1000th row of the original data frame, i.e., it contains the 1st row, 1001st row, 2001st row, etc. (You can either index the columns by the numbers 15, 16, 19, or by the names of the columns; it is worthwhile to make sure that you know how to do this both ways.)
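
As a minimal sketch of the two indexing styles mentioned in 1a, the following uses a small synthetic data frame in place of the 2008 data. The name `flights` and its random contents are assumptions made purely for illustration; only the positions of the three named columns mirror the description above.

```r
# Synthetic stand-in for the 2008 data: 5000 rows, 19 columns of random values.
# Only the column positions 15, 16, 19 follow the description in 1a.
set.seed(1)
flights <- as.data.frame(matrix(rnorm(5000 * 19), nrow = 5000))
names(flights)[c(15, 16, 19)] <- c("ArrDelay", "DepDelay", "Distance")

# Row indices 1, 1001, 2001, ...
keep_rows <- seq(1, nrow(flights), by = 1000)

# Select the columns by position ...
by_number <- flights[keep_rows, c(15, 16, 19)]
# ... or by name.
by_name <- flights[keep_rows, c("ArrDelay", "DepDelay", "Distance")]

# Both approaches should give the same data frame.
identical(by_number, by_name)
```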

1b. Read the help documentation for the “pairs” function (which generates scatterplot matrices) and take a look at the examples at the end of the “pairs” documentation.

1c. Use the pairs function to build a scatterplot matrix of the data frame that you built in 1a.
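
A rough sketch of how `pairs` behaves on a three-column data frame is given below. The data frame `small` is synthetic (an assumption for illustration), and the relationship between its first two columns is built in by construction, so it says nothing about the real airline data.

```r
# Synthetic three-column data frame; the values are made up for illustration.
set.seed(2)
dep <- rexp(200, rate = 1 / 10)
small <- data.frame(
  ArrDelay = dep + rnorm(200, sd = 5),  # tied to DepDelay by construction
  DepDelay = dep,
  Distance = runif(200, min = 100, max = 2500)
)

# One scatterplot for every pair of columns.
pairs(small, main = "Scatterplot matrix (synthetic data)")

# A numeric companion to the plot: the matrix of pairwise correlations.
cor_matrix <- round(cor(small), 2)
cor_matrix
```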

1d. Which two of the three variables (ArrDelay, DepDelay, and Distance) do you think are most correlated? Why?

Question 2

2a. Using Google and the help utility in RStudio, install the package called “ggmap”.

2b. Using Google and the help utility in RStudio, load ggmap into the R environment.

2c. Create a map containing all of Europe.

2d. Create a map containing the United States (excluding Hawaii and Alaska).

2e. Map the points of each business from the business_frame (in the business.Rda from the Yelp Dataset Challenge) on the USA map.

2f. Map only the Illinois businesses from business_frame on the USA map.

2g. Repeat 2e, but this time make the size of the point for each business proportional to the square root of that business's review count.

Question 3

3a. Use ggmap to plot out the locations of the airports in the United States.

3b. Add 5 lines to the USA map. Each line corresponds to one of the 5 most popular departure-to-arrival paths in the USA, as studied in Question 4 on Problem Set 3.

Question 4

4a. The england_outcome data contains a lot of cool information about the outcomes of the crimes in the city of London. It shows the outcome of the crime, and the longitude and latitude. Whenever there is longitude and latitude, you should know that you can easily use ggmap to plot. For this question, however, please create a colored bar graph of the counts of the 20 different outcomes of the crimes denoted by the factor “V1”. What is the most common outcome? Give it in non-numeric form (e.g., 2 is “court case unable to proceed”).
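
One possible shape for a colored bar graph of factor counts is sketched below with ggplot2. The data frame `outcomes` is a hypothetical stand-in with random codes; the real england_outcome data and its code-to-label mapping are not reproduced here.

```r
library(ggplot2)

# Hypothetical stand-in: V1 holds outcome codes 1..20 as a factor (random values).
set.seed(3)
outcomes <- data.frame(V1 = factor(sample(1:20, 1000, replace = TRUE)))

# One bar per outcome code, filled with a color that depends on the code.
p <- ggplot(outcomes, aes(x = V1, fill = V1)) +
  geom_bar() +
  labs(x = "Outcome code", y = "Count") +
  theme(legend.position = "none")
print(p)

# The most frequent level can be read off the table of counts.
counts <- table(outcomes$V1)
most_common <- names(counts)[which.max(counts)]
most_common
```

The code returned in `most_common` still has to be translated to its text label by hand or through whatever lookup the data set provides.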

4b. Do the same thing (using ggplot) to similarly plot the crime_data crime types. What is the most common crime?

4c. Stack the different types of crimes (as in 4b), and then put the stacked bars side by side by “Month”. Describe what you observe as time goes on.

4d. Repeat 4c, but this time use the outcomes from 4a in place of the crime types from 4b. Make an observation.

4e. As time goes on, what appears to change more, the outcomes of the crimes or the crimes?

Question 5

5a. Use ggmap to get a map of London. Show the map.

5b. Plot the crimes as points on the map you made in (a). Use zoom = 12.

5c. Add color to (b).

5d. Repeat (c) but limit to “Violent Crimes” and “Violent and Sexual offenses”.

Question 6

6a. Plot a density map of the United States (zoom = 4) of airports.

6b. Plot a density map of the United States with a color gradient where low is green and high is red.

6c. On top of the map in (b), add points to the map that represent the airports. Size those points based on the “total” factor. The “total” factor is simply the frequency of inbound and outbound flights.

Question 7

Generate the first 20 Lucas numbers and store them in a vector. You can either use recursion or an explicit formula. If you are able to do both, which way is faster? How much faster?
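
The two approaches named in Question 7 can be sketched as follows. This assumes the sequence starts at L_0 = 2, L_1 = 1 (some texts start at L_1 = 1, L_2 = 3 instead); the function names `lucas_rec` and `lucas_closed` are made up for this illustration.

```r
# Assumption: the sequence starts at L_0 = 2, L_1 = 1.

# Naive recursion: L_n = L_(n-1) + L_(n-2).
lucas_rec <- function(n) {
  if (n == 0) return(2)
  if (n == 1) return(1)
  lucas_rec(n - 1) + lucas_rec(n - 2)
}

# Closed form: L_n = phi^n + psi^n, where phi and psi are the roots of x^2 = x + 1.
# round() removes the small floating-point error.
lucas_closed <- function(n) {
  phi <- (1 + sqrt(5)) / 2
  psi <- (1 - sqrt(5)) / 2
  round(phi^n + psi^n)
}

n <- 0:19
time_rec <- system.time(by_recursion <- sapply(n, lucas_rec))
time_closed <- system.time(by_formula <- lucas_closed(n))

# Both vectors should agree.
all.equal(by_recursion, by_formula)

# Timings side by side (in seconds).
rbind(recursion = time_rec, closed_form = time_closed)
```

For only 20 terms, both timings may be too small to measure reliably in a single run; repeating each computation many times (for example with `replicate`) gives a steadier comparison.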

Question 8

8a. Create a data frame called random_vars where:

  • the first column contains 10000 Bernoulli random variables, each with p=1/3.
  • the second column contains 10000 Binomial random variables, each with n=5 and p=1/3.
  • the third column contains 10000 Geometric random variables, each with expected value 3.
  • the fourth column contains 10000 Negative Binomial random variables, each of which is a sum of 5 Geometric random variables, and each of those Geometric random variables has expected value 3.
  • the fifth column contains 10000 Poisson random variables, each with expected value 3.
  • the sixth column contains 10000 Hypergeometric random variables, each with parameters N=20, M=5, and n=3 (using the notation from STAT/MA 41600).
  • the seventh column contains 10000 continuous Uniform random variables, each with min 5 and max 10.
  • the eighth column contains 10000 discrete Uniform random variables, each with min 5 and max 10.
  • the ninth column contains 10000 Exponential random variables, each with expected value 3.
  • the tenth column contains 10000 Gamma random variables, each with lambda = 3 and r = 5 (using the notation from STAT/MA 41600).
  • the eleventh column contains 10000 Beta random variables, each with alpha = 3 and beta = 8 (using the notation from STAT/MA 41600).
  • the twelfth column contains 10000 Normal random variables, each with mean = 3 and variance = 5.
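
R's random-number functions do not always use the same parameterization as a textbook, so a few of the columns above need some care. The sketch below covers only those cases and rests on stated assumptions: that a Geometric random variable counts the trials up to and including the first success (so expected value 3 means p = 1/3), and that in the Hypergeometric notation N is the total number of items, M the number of "good" items, and n the number drawn. Please check both against the course notation. All variable names are made up for this illustration.

```r
set.seed(4)
n_draws <- 10000

# Geometric with mean 3, ASSUMING it counts trials including the first
# success (p = 1/3). rgeom() counts only the failures, hence the + 1.
geo <- rgeom(n_draws, prob = 1/3) + 1

# Sum of 5 such Geometrics. rnbinom() also counts only failures, hence the + 5.
negbin <- rnbinom(n_draws, size = 5, prob = 1/3) + 5

# Hypergeometric, ASSUMING N = 20 items in total, M = 5 of them "good",
# and n = 3 drawn. In R: rhyper(nn, m = good, n = bad, k = drawn).
hyper <- rhyper(n_draws, m = 5, n = 20 - 5, k = 3)

# Discrete Uniform on the integers 5, 6, ..., 10.
disc_unif <- sample(5:10, n_draws, replace = TRUE)

# Exponential with mean 3: rexp() takes the rate, i.e. 1 / mean.
expo <- rexp(n_draws, rate = 1/3)

# Normal with variance 5: rnorm() takes the standard deviation, not the variance.
normal <- rnorm(n_draws, mean = 3, sd = sqrt(5))

sketch <- data.frame(geo, negbin, hyper, disc_unif, expo, normal)
colMeans(sketch)
```

Comparing the sample means with the expected values stated in the list is a quick sanity check that each parameterization was translated correctly.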

8b. Find the mean and variance of each column. (Do this efficiently, i.e., do not write 12 separate lines of code.)
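
One way to avoid repeating code for each column is to apply a function over the columns of the data frame. The sketch below does this on a small three-column data frame called `toy`, which is an assumption for illustration and not the random_vars data frame from 8a.

```r
# Small illustrative data frame (not the random_vars from 8a).
set.seed(5)
toy <- data.frame(
  a = rbinom(1000, size = 5, prob = 1/3),
  b = rpois(1000, lambda = 3),
  c = runif(1000, min = 5, max = 10)
)

# sapply() calls the function once per column and simplifies the result:
# here, a matrix with two rows and one column per variable.
summary_stats <- sapply(toy, function(x) c(mean = mean(x), variance = var(x)))
summary_stats
```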