STAT 29000

Project 5

due October 12, 2016

Put all of your solutions into one text file for your group. For example, the file for project 5, group 7, should be called p05g7.Rmd, and it should be stored in the folder /proj/gpproj16/p05g7/. Also include the output of your RMarkdown file, i.e., the html, pdf, or Word output.


Use the following function to extract data from the database of the NSF Center for Coastal Margin Observation & Prediction:


library(ncdf4)  # provides nc_open, ncvar_get, and nc_close

myfunction <- function( mystation, mylength, mymonth, myyear ) {
  # path to the monthly NetCDF file: <base>/saturnNN/saturnNN.<length>.A.CT/YYYYMM.nc
  mystring <- paste("", sprintf("saturn%02d", mystation),"/", sprintf("saturn%02d", mystation), ".", mylength , ".A.CT/", myyear, sprintf("%02d",mymonth), ".nc", sep="")
  mync <- nc_open(mystring)
  # read every variable in the file into one column of a data.frame
  tempDF <- data.frame(lapply(1:mync$nvars, function(j) {ncvar_get(mync, mync$var[[j]])} ))
  names(tempDF) <- sapply(1:mync$nvars, function(j) mync$var[[j]]$name)
  tempDF$time <- ncvar_get(mync, "time")
  tempDF$length <- mylength
  tempDF$year <- myyear
  tempDF$month <- mymonth
  # day of the month, in Pacific time
  tempDF$days <- as.POSIXlt(tempDF$time, tz="PST8PDT", origin = "1970-01-01")$mday
  nc_close(mync)
  tempDF
}

Question 1

1a. Create a vector corresponding to the 84 months from Nov 2009 through Oct 2016, and create a second vector containing the corresponding years.
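
Hint: one possible way to build the two vectors is sketched below. The names mymonths and myyears are only illustrative; other constructions (for example with seq.Date) work just as well.

```r
# Months run Nov-Dec 2009, then six full years (2010-2015), then Jan-Oct 2016.
mymonths <- c(11:12, rep(1:12, times = 6), 1:10)
myyears  <- c(rep(2009, 2), rep(2010:2015, each = 12), rep(2016, 10))

# Sanity checks: both vectors should have 84 entries that line up pairwise.
length(mymonths)
length(myyears)
head(paste(myyears, sprintf("%02d", mymonths), sep = "-"))
```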

1b. Use these vectors with the mapply function to obtain the 84 months of data about the water temperature, salinity, and electrical conductivity at the SATURN03 station at a depth of 2.4m. The result should be a list that contains 84 data.frames.
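
Hint: the sketch below shows the general shape of such an mapply call. To keep it runnable without network access, it uses a hypothetical stand-in called fakefetch in place of myfunction, only three months, and a placeholder value of 240 for mylength; none of these come from the assignment.

```r
# Hypothetical stand-in for myfunction(), so the sketch runs offline.
fakefetch <- function(mystation, mylength, mymonth, myyear) {
  data.frame(water_temperature = c(10.5, 11.0),
             month = mymonth, year = myyear)
}

mymonths <- c(11, 12, 1)
myyears  <- c(2009, 2009, 2010)

# The month and year vectors are walked through in parallel.
# MoreArgs holds the arguments that are the same for every call.
# SIMPLIFY = FALSE keeps the result as a list of data.frames.
mylist <- mapply(fakefetch, mymonth = mymonths, myyear = myyears,
                 MoreArgs = list(mystation = 3, mylength = 240),
                 SIMPLIFY = FALSE)
length(mylist)
```

Without SIMPLIFY = FALSE, mapply tries to collapse the results into a matrix, which is usually not what is wanted when each call returns a data.frame.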

1c. Use the sapply function to verify that all 84 data.frames have the variable names in the same order.
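
Hint: a minimal sketch of such a check on a toy list (the data.frames here are made up). Each element's names are compared to the names of the first element.

```r
# Toy list; the third data.frame has its columns in a different order.
mylist <- list(data.frame(a = 1, b = 2),
               data.frame(a = 3, b = 4),
               data.frame(b = 5, a = 6))

# TRUE for each data.frame whose names match the first one exactly, in order.
samenames <- sapply(mylist, function(df) identical(names(df), names(mylist[[1]])))
samenames
all(samenames)
```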

1d. Use rbind to combine these 84 data.frames into one data.frame called bigDF24. Check that the resulting data.frame has a little more than 7 million observations.
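
Hint: one common way to stack a whole list of data.frames is do.call with rbind, sketched here on a toy list (the data are made up).

```r
mylist <- list(data.frame(x = 1:2, y = c("a", "b")),
               data.frame(x = 3:5, y = c("c", "d", "e")))

# do.call passes every element of the list as a separate argument to rbind.
bigDF <- do.call(rbind, mylist)
dim(bigDF)
```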

1e. Repeat the steps above, to gather the data about these same 3 variables from depth 8.2m into one data.frame called bigDF82 (which will have a little less than 6 million observations).

Question 2

2a. Restricting attention to the 2.4m data, what is the longest time period for which no data is available, i.e., what is the longest time period in which no data is collected?

2b. On which day does that biggest gap occur?
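
Hint: a sketch of one approach, on made-up timestamps. It assumes the time variable is measured in seconds since 1970-01-01, as the origin used in the function above suggests.

```r
# Toy timestamps in seconds since 1970-01-01 (hypothetical values).
mytimes <- sort(c(0, 60, 120, 86400 * 3, 86400 * 3 + 60))

# Differences between consecutive readings; the largest one is the longest gap.
gaps <- diff(mytimes)
biggest <- which.max(gaps)
max(gaps)   # length of the longest gap, in seconds

# The reading just before the gap tells us when the gap begins.
as.POSIXct(mytimes[biggest], origin = "1970-01-01", tz = "PST8PDT")
```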

Question 3

3a. Find the daily mean values for water_temperature at depth 2.4m.

3b. Plot the resulting daily mean values for water_temperature at depth 2.4m.

3c. Re-consider 3a and 3b for water_electrical_conductivity, and then also for water_salinity.
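
Hint: a sketch of daily means with tapply, on a made-up data.frame called toyDF. UTC is used here only to keep the toy example simple; the function above works in PST8PDT.

```r
toyDF <- data.frame(
  time = c(0, 3600, 86400, 90000, 172800),   # seconds since 1970-01-01
  water_temperature = c(10, 12, 11, 13, 9)
)

# Convert each timestamp to a calendar date, then average within each date.
toyDF$date <- as.Date(as.POSIXct(toyDF$time, origin = "1970-01-01", tz = "UTC"), tz = "UTC")
dailymeans <- tapply(toyDF$water_temperature, toyDF$date, mean)
dailymeans

# Plot the daily means against the dates.
plot(as.Date(names(dailymeans)), dailymeans, type = "l",
     xlab = "date", ylab = "daily mean water temperature")
```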

Question 4

4a. Decide what constitutes a false reading, i.e., data that is probably an outlier. What are your criteria for having a false reading?

4b. How many false readings occur at depth 2.4m? Please break your responses down to a month-by-month tally, for each variable.
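
Hint: the criterion is up to you. As one possibility only, the sketch below flags readings that fall more than 1.5 interquartile ranges outside the quartiles, and then tallies the flags by month. The data and the helper name isfalse are made up.

```r
toyDF <- data.frame(
  month = c(1, 1, 1, 1, 1, 2, 2, 2, 2, 2),
  water_temperature = c(10, 11, 10.5, 11.5, 99, 10, 10.2, -50, 10.8, 11)
)

# TRUE for values outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR].
isfalse <- function(x) {
  q <- unname(quantile(x, c(0.25, 0.75), na.rm = TRUE))
  fence <- 1.5 * (q[2] - q[1])
  x < q[1] - fence | x > q[2] + fence
}

toyDF$flag <- isfalse(toyDF$water_temperature)

# Month-by-month tally of flagged readings.
tapply(toyDF$flag, toyDF$month, sum)
```

A physically motivated rule (for instance, a fixed plausible range for each variable) may be easier to justify than a purely statistical one.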

Question 5

The goal of this question is to scrape the Hot 100 chart from Billboard. This chart is posted every Saturday. The first chart is here:

and the most current chart is here:

Use the system function in R, together with either the wget or curl command, to scrape all of these charts (in XML format) into the scratch folder for your team.

Hint: It might be helpful to use the sapply and paste commands, as well as the seq.Date help page. After you have scraped all of the charts in XML format, then zip the results into one file, so that you can use them during a later project.

It is NOT NECESSARY to extract the titles and artists for the songs in the database. Just download the 3000+ webpages from the web (each one in XML), and we’ll come back later to this data, to scrape the titles and artists, and do some analysis. For now, we just want to download the data files.
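
A sketch of how the download commands might be assembled is given below. The base URL, the destination folder, and the date range are all hypothetical placeholders, and the sketch assumes that each chart's address ends in its date; substitute the real values for the charts. The download and zip calls are left commented out so that the sketch runs without touching the network.

```r
# Hypothetical placeholders: replace with the real URL prefix and your scratch folder.
baseurl <- "https://example.com/charts/hot-100/"
destdir <- "/tmp/hot100"

# One date per week; this short range is a stand-in for the full history.
mydates <- seq(as.Date("2016-01-02"), as.Date("2016-01-30"), by = "week")

# One wget command per chart, each saved under its date.
mycommands <- paste0("wget -q -O ", destdir, "/", mydates, ".xml ", baseurl, mydates)
mycommands[1]

# Uncomment to actually download the charts and zip the results:
# dir.create(destdir, showWarnings = FALSE)
# invisible(sapply(mycommands, system))
# system(paste("zip -r -q hot100.zip", destdir))
```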

Question 6

Consider the New York City taxi data located at:

Here is a data dictionary:

Use the system function in R, together with either the wget or curl command, to scrape all of the yellow taxi cab data (in CSV format) into the scratch folder for your team. You can scrape these directly using bash if you prefer (in fact, that is probably recommended), but make sure that the code you use to scrape them is succinct, and if you make bash calls, please use the system function in R to make them.
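
A sketch of one succinct way to generate the monthly download commands follows. The base URL, the destination folder, the range of years, and the file naming pattern are assumptions made for illustration; check them against the actual data page. The download itself is commented out.

```r
# Hypothetical placeholders.
baseurl <- "https://example.com/trip-data/"
destdir <- "/tmp/taxi"

# Every (month, year) combination; month varies fastest, so files come out in order.
mygrid <- expand.grid(month = 1:12, year = 2015:2016)
myfiles <- sprintf("yellow_tripdata_%d-%02d.csv", mygrid$year, mygrid$month)

# wget -P saves each file into the given folder.
mycommands <- paste0("wget -q -P ", destdir, " ", baseurl, myfiles)
head(mycommands, 2)

# Uncomment to actually download:
# invisible(sapply(mycommands, system))
```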

Question 7

You may want to cut the data in various ways in bash (again using the system command in R), before answering the following questions:

7a. On which day did the most taxi cab rides occur? If a ride goes past midnight, use the start of the ride for the date of the ride.

7b. For each day, determine the distribution of the number of passengers. Your output should allow you to answer questions like the following: On January 1, 2016, how many rides had 1 passenger? 2 passengers? 3 passengers? Etc.?
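
Hint: once the pickup times and passenger counts have been cut out of the CSV files, the table function does most of the work. The sketch below uses a made-up data.frame, and the column names pickup_datetime and passenger_count are assumptions; check the data dictionary for the real ones.

```r
toyDF <- data.frame(
  pickup_datetime = c("2016-01-01 00:05:00", "2016-01-01 23:59:00",
                      "2016-01-02 08:00:00", "2016-01-01 12:00:00"),
  passenger_count = c(1, 2, 1, 1)
)

# The date of a ride is the date part of its start time.
toyDF$day <- substr(toyDF$pickup_datetime, 1, 10)

# 7a: number of rides per day, and the busiest day.
ridesperday <- table(toyDF$day)
names(ridesperday)[which.max(ridesperday)]

# 7b: one row per day, one column per passenger count.
passtable <- table(toyDF$day, toyDF$passenger_count)
passtable["2016-01-01", "1"]
```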

Question 8

8a. For each day, determine the average distance of a taxi cab ride.

8b. For each day, determine the average number of passengers.
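
Hint: both parts are grouped averages, which tapply handles directly. The data.frame and the column names below are made up for illustration.

```r
toyDF <- data.frame(
  day = c("2016-01-01", "2016-01-01", "2016-01-02"),
  trip_distance = c(1.0, 3.0, 2.5),
  passenger_count = c(1, 3, 2)
)

# Average of each column within each day.
avgdistance   <- tapply(toyDF$trip_distance, toyDF$day, mean)
avgpassengers <- tapply(toyDF$passenger_count, toyDF$day, mean)
avgdistance
avgpassengers
```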

Question 9

9a. On which type of day (Sun, Mon, …, Sat) is the average distance of a ride the longest?

9b. On which type of day (Sun, Mon, …, Sat) is the average number of passengers in a car the largest?
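
Hint: a sketch of grouping by day of the week, on made-up data. The weekday is taken from the wday field of as.POSIXlt (0 = Sunday), which avoids depending on the locale. Note that this averages over rides, not over daily averages.

```r
toyDF <- data.frame(
  day = as.Date(c("2016-01-01", "2016-01-02", "2016-01-03", "2016-01-08")),
  trip_distance = c(2, 5, 1, 4)
)

# Map wday (0..6, starting at Sunday) to a label.
daynames <- c("Sun", "Mon", "Tue", "Wed", "Thu", "Fri", "Sat")
toyDF$weekday <- factor(daynames[as.POSIXlt(toyDF$day)$wday + 1], levels = daynames)

# Average distance per type of day; days with no rides come out as NA.
avgbyweekday <- tapply(toyDF$trip_distance, toyDF$weekday, mean)
avgbyweekday
names(which.max(avgbyweekday))
```

The same pattern applies to the passenger counts in 9b.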

Question 10

Put the resulting answers from this entire project into an RMarkdown file.