STAT 29000
Project 5
due Wednesday, October 8, at 9:30 AM


Put all of your solutions into one text file for your group. The file should be called, for instance:
p05g1.R for project 5, group 1, and should be stored in the folder: /proj/gpproj14/p05g1/
   Group 1 consists of: kidd6,philliw,malek,peter188
p05g2.R for project 5, group 2, and should be stored in the folder: /proj/gpproj14/p05g2/
   Group 2 consists of: ffranci,rcrutkow,cringwa,cdesanti
p05g3.R for project 5, group 3, and should be stored in the folder: /proj/gpproj14/p05g3/
   Group 3 consists of: lyoke,enorlin,avorhies,reno0
p05g4.R for project 5, group 4, and should be stored in the folder: /proj/gpproj14/p05g4/
   Group 4 consists of: boydp,gallaghp,john1209,marti748
p05g5.R for project 5, group 5, and should be stored in the folder: /proj/gpproj14/p05g5/
   Group 5 consists of: fu82,vincentc,zhan1460,omalleyb


The code found in the Week 6 examples should be helpful in this problem set.

1. Practice using the sapply function:

1a. Find, with only one line (altogether) of sapply code, the 5 lengths of the following 5 vectors:
  1. the LakeHuron vector,
  2. the waiting vector in the geyser data (remember to load the MASS library first)
  3. the duration vector in the geyser data
  4. the chickwts$weight vector
  5. the mtcars$mpg vector
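For reference, the one-line sapply pattern for 1a looks like this (a sketch, assuming the MASS package is installed so that the geyser data is available):

```r
library(MASS)   # loads the geyser data

# wrap the 5 vectors in a list, then apply length() to each element
lens <- sapply(list(LakeHuron, geyser$waiting, geyser$duration,
                    chickwts$weight, mtcars$mpg), length)
lens
```

Swapping length for mean or var gives the same one-line shape for 1b and 1e.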
1b. Now find the average value stored in each of the 5 vectors, using sapply.

1c. Check that R did the right thing in 1b by manually taking the mean of each vector, using 5 separate lines of code.

1d. If you accidentally use "c" instead of "list" in 1b, the five vectors get flattened into one long vector, so R takes the average of each individual value; but the average of a single value is just the value itself, so R returns the full vector of values. Please give this (incorrect) behavior a try, just to see how it misbehaves!
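To see the misbehavior from 1d in miniature (a toy sketch, not the full answer): c() flattens everything into one numeric vector, so sapply visits one number at a time:

```r
# c() flattens; each "mean" is the mean of a single number,
# so the values come back unchanged
wrong <- sapply(c(3.79, 4.12), mean)
wrong    # 3.79 4.12
```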

1e. Now find the variance of the values stored in each of the 5 vectors, using sapply.

1f. Check that R did the right thing in 1e by manually finding the variance of each vector, using 5 separate lines of code.

1g. If you accidentally use "c" instead of "list" in 1e, R just takes the variance of each individual value, but R gives an NA when taking the variance of an individual value (you can try this, e.g., var(3.79) gives NA), so R returns NA for each value. Please give this (incorrect) behavior a try!

1h. Examine the head of the Cars93 data. This data set has many different types of columns. Use sapply to find the class of each of the 27 columns in this data.frame (using just one call to sapply; hint: use "class" for the function).
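The one-call pattern for 1h might look like this (sapply over a data.frame visits each column in turn):

```r
library(MASS)       # Cars93 lives in MASS
head(Cars93)        # peek at the data first
cls <- sapply(Cars93, class)   # class of every column, in one call
cls
```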

2a. Use the mapply function, with the paste function, and the vectors
c("a","b","c","d","e")
and
c("A","B","C","D","E")
and the parameters
sep="" and USE.NAMES=FALSE
to print these five sentences:

[1] "The uppercase version of a is A" "The uppercase version of b is B" "The uppercase version of c is C"
[4] "The uppercase version of d is D" "The uppercase version of e is E"
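One possible shape for the 2a call (a sketch; the anonymous function is just one of several ways to interleave the fixed text with the two vectors):

```r
lower <- c("a", "b", "c", "d", "e")
upper <- c("A", "B", "C", "D", "E")
sentences <- mapply(function(lo, up)
                      paste("The uppercase version of ", lo,
                            " is ", up, sep = ""),
                    lower, upper, USE.NAMES = FALSE)
sentences
```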

2b. Use the row.names function, and the column of population data,
both with the state.x77 data set, as well as the mapply function,
to print a vector of 50 sentences. (It might be helpful to use the parameters USE.NAMES=FALSE and sep="".) The vector should start with the following six sentences:
[1] "Alabama has 3615 thousand people." "Alaska has 365 thousand people."
[3] "Arizona has 2212 thousand people." "Arkansas has 2110 thousand people."
[5] "California has 21198 thousand people." "Colorado has 2541 thousand people."
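A sketch of the 2b pattern: state.x77 is a matrix, so row.names(state.x77) supplies the state names and state.x77[, "Population"] supplies the populations (in thousands):

```r
pops <- mapply(function(state, pop)
                 paste(state, " has ", pop, " thousand people.", sep = ""),
               row.names(state.x77), state.x77[, "Population"],
               USE.NAMES = FALSE)
head(pops)   # first six of the 50 sentences
```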

2c. Revise your answer to 2b by actually multiplying the population data by 1000, so that the vector starts with the following six sentences:
[1] "Alabama has 3615000 people." "Alaska has 365000 people."
[3] "Arizona has 2212000 people." "Arkansas has 2110000 people."
[5] "California has 21198000 people." "Colorado has 2541000 people."

3a. Make a data.frame containing all of the cars from mtcars with hp>100 and 8 cylinders. Then create a new data.frame with these cars that displays only the mpg, cyl, hp, and qsec columns.
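One idiomatic way to do this kind of filter-and-project in base R is subset() (a sketch; logical indexing with square brackets works equally well):

```r
# rows: hp > 100 AND 8 cylinders; columns: only mpg, cyl, hp, qsec
fast8 <- subset(mtcars, hp > 100 & cyl == 8,
                select = c(mpg, cyl, hp, qsec))
head(fast8)
```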

3b. Make a data.frame containing all of the rows describing provinces from the swiss data set with 50% or more Catholics and 50% or more of males involved in agriculture. Within this specific data.frame, find the mean and standard deviation of the Fertility data.

3c. Make a data.frame containing all of the rows in the chickwts data for which the feed is either horsebean or soybean. What is the average weight (altogether) across these two kinds of feed?
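For the "either horsebean or soybean" condition in 3c, the %in% operator is convenient (a sketch):

```r
# keep only the rows whose feed is one of the two bean types
beans <- subset(chickwts, feed %in% c("horsebean", "soybean"))
mean(beans$weight)    # one overall average across both feeds together
```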

4. Step through Dr. Ward's R code for apply examples with SATURN data. It takes a little time to understand completely what is happening, but essentially we are able to read data from dozens of files with ease (i.e., without having to download them individually, by hand), and to extract and assemble the data in them. Note that the time parameter in these files is the same data we had in the earlier project, but it is stored (and, hence, extracted) differently than before.

4a. Use the R code for apply examples with SATURN data to extract the temperature, electrical conductivity, salinity, and time data from the SATURN03 station at depth 2.4m.

4b. Extract the temperature, electrical conductivity, salinity, and time data from the SATURN03 station at depth 8.2m, from: http://amb6400b.stccmop.org:8080/thredds/dodsC/preliminary_data/saturn03/saturn03.820.A.CT/ (Beware: The starting month is not the same for this data set, compared to the previous one.)

4c. Extract the temperature, electrical conductivity, salinity, and time data from the SATURN03 station at depth 13.0m, from: http://amb6400b.stccmop.org:8080/thredds/dodsC/preliminary_data/saturn03/saturn03.1300.R.CT/ (Beware: Again, the starting month is not the same for this data set, compared to the previous two.)

5. Extract the Phycoerythrin and time data from the SATURN03 station at depths 2.4m, 8.2m, and 13.0m from:

6. Extract the Oxygen Concentration (oxygen), Oxygen Saturation (oxygensat), and time data from the SATURN03 station at depths 2.4m, 8.2m, and 13.0m from:

7a. For each of the 3 types of data listed above (in questions 4, 5, 6) at each of the 3 depths, find the number of data points per month. For instance, starting with the temperature/conductivity/salinity data at depth 2.4m, find the number of data points per month. Then do the same for 8.2m and for 13.0m. Then do this again for the Phycoerythrin data, and again for the Oxygen data. To express your answers, use the mapply function to print sentences like:

"Month 06 of year 2012 of the saturn03.240.A.CT data contains 202,702 data points at depth 2.4m."

7b. Re-calculate your answer to 7a so that it is normalized by the number of days in the month. In other words, divide the number of data points by the number of days in the month. To express your answers, use the mapply function to print sentences like:

"Month 06 of year 2012 contains an average of 6756.733 data points per day, during the 30 day period, for a total of 202,702 data points during the month, at depth 2.4m."
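The counting and normalizing in 7a/7b can be sketched with synthetic timestamps (the real times come from the SATURN files): table() on a year-month string does the per-month counting, and the number of days per month can be read off the dates themselves.

```r
# synthetic hourly readings spanning June and July 2012
times  <- seq(as.POSIXct("2012-06-01", tz = "UTC"),
              as.POSIXct("2012-07-31", tz = "UTC"), by = "hour")
counts <- table(format(times, "%Y-%m"))                   # points per month
days   <- table(format(unique(as.Date(times)), "%Y-%m"))  # days per month
counts / days                                             # points per day
mapply(function(m, n) paste("Month ", m, " contains ", n,
                            " data points.", sep = ""),
       names(counts), as.vector(counts), USE.NAMES = FALSE)
```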

8a. Extract the temperature data from the SATURN03 station for June 2012 at depth 2.4m. There are 202702 data points. Save these in a variable called tempdata. Also get the analogous 202702 time data points. Save these in a variable called temptimes.

8b. Extract the oxygen saturation data (the 2nd parameter in the data set "saturn03.240.A.Oxygen") from the SATURN03 station for June 2012. There are 15725 data points. Save these in a variable called oxydata. Also get the analogous 15725 time data points. Save these in a variable called oxytimes.

8c. Notice that we would be hard-pressed to compare the temperature and oxygen saturation data, because there are vastly different amounts of data in the two vectors and (perhaps more importantly) they were measured at different points in time. We can, however, build a function that predicts the behavior of the temperature data at ALL points in time. We can then use that function to figure out how the temperature data would have behaved if it had been measured at the same 15725 time points as the oxygen saturation data, which lets us compare the temperature and oxygen saturation data directly. This can be done as follows:
tempfunction <- approxfun(temptimes, tempdata)
#This makes tempfunction into a function that can predict the temperature behavior at any time we like. Then we run it on the oxygen saturation times, to see how temperature would have behaved at the 15725 times when oxygen saturation was measured:
tempatoxygentimes <- tempfunction(oxytimes)
#Finally, we can plot the temperature versus the oxygen saturation data this way:
plot(tempatoxygentimes,oxydata)
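Here is the same approxfun() idea on toy numbers (stand-ins for temptimes/tempdata), to make the interpolation concrete:

```r
t1 <- c(0, 10, 20, 30)    # times where instrument 1 measured
y1 <- c(5, 15, 25, 35)    # its measurements
f  <- approxfun(t1, y1)   # linear interpolator through (t1, y1)
t2 <- c(5, 15, 25)        # times where instrument 2 measured
f(t2)                     # predicted instrument-1 values: 10 20 30
```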

9a. Make comparisons between some of the other variables, in the style of question 8.

9b. Which pair of variables (in the June 2012 data sets) seem to be the most strongly correlated? Why do you think so?

9c. What could go wrong with the method discussed in 8 and 9? Hint: for instance, in part 8c, take a look at:
range(temptimes)
plot(tempfunction(seq(1338537601, 1341129599, by=100)))
How could we potentially fix the problem that happens when missing data occurs? [You do not have to actually fix it; but briefly mention some way that you might fix it.] Can you see this problem in the plot? [We will discuss this problem more, in a set of future questions, in another project.]
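The problem the hint points at: approxfun() returns NA for any time outside the range of the original measurements. A tiny illustration, including one possible fix (rule = 2, which clamps to the nearest observed value; another option would be to drop the missing stretches entirely):

```r
f <- approxfun(c(0, 10), c(5, 15))
f(20)      # NA: 20 is outside the measured range [0, 10]
g <- approxfun(c(0, 10), c(5, 15), rule = 2)
g(20)      # 15: clamped to the last observed value
```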

10a. There are 7 data sets inside the directory: /data/public/NARR/pressure. How many variables do they each contain?

10b. How many pieces of data does the lat variable contain in each file? How about the lon variable? How about the Lambert Conformal variable? Are all of the lat variables identical across all 7 files? If so, how do you know? If not, how are they different? What about the lon variable? What about the Lambert Conformal variable?

10c. What are the sizes (i.e., dimensions) of the 4th variable in each of the 7 files? What percent of the 4th variable is missing in each of the 7 files?

10d. If you store the time vector from a file in a vector t, then the code: format(as.POSIXct(3600*t, origin="1800-01-01"), tz="UTC+0:00") will convert the times into a human-readable format. The 3600 converts the hours into seconds, because the times are stored as hours elapsed since January 1, 1800. (Dr. Ward fiddled around with this for a while to figure this out.) Question: Do all 7 files have the same time vector?
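A quick sanity check of the conversion (using tz = "UTC" throughout; the check is simply that t = 0 hours should land exactly on the origin):

```r
t0 <- as.POSIXct(3600 * 0, origin = "1800-01-01", tz = "UTC")
format(t0, tz = "UTC")    # "1800-01-01"
```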

10e. What is the time interval between consecutive entries in each of these time vectors?
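diff() answers this kind of question directly: applied to the numeric time vector it gives the gaps between consecutive entries, and a single unique gap means a regular sampling interval (toy values shown; the real t comes from the files):

```r
t <- c(0, 3, 6, 9, 12)   # toy: hours since the origin
unique(diff(t))          # 3 -> one reading every 3 hours
```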