STAT 29000
Project 6
due Wednesday, October 22, at 8:30 AM


Put all of your solutions into one text file for your group. The file should be called, for instance:
p06g1.R for project 6, group 1, and should be stored in the folder: /proj/gpproj14/p06g1/
   Group 1 consists of: kidd6,rcrutkow,zhan1460,marti748
p06g2.R for project 6, group 2, and should be stored in the folder: /proj/gpproj14/p06g2/
   Group 2 consists of: ffranci,enorlin,malek,omalleyb
p06g3.R for project 6, group 3, and should be stored in the folder: /proj/gpproj14/p06g3/
   Group 3 consists of: lyoke,gallaghp,cringwa,peter188
p06g4.R for project 6, group 4, and should be stored in the folder: /proj/gpproj14/p06g4/
   Group 4 consists of: boydp,vincentc,avorhies,cdesanti
p06g5.R for project 6, group 5, and should be stored in the folder: /proj/gpproj14/p06g5/
   Group 5 consists of: fu82,philliw,john1209,reno0


The code found in the Week 8 examples should be helpful in this problem set.

1. Compare the 3 variables found in the first SATURN03 data set we studied, namely, the saturn03.240.A.CT_2012_06_PD0.csv data set, from depth 2.4m. Compare them in pairs, to see if any pair of them yields a very good linear model. In all of these cases, be sure to remove any outliers, if necessary.
1a. Make a simple linear regression to try to predict the electrical conductivity from the temperature.
1b. Make a simple linear regression to try to predict the salinity from the temperature.
1c. Make a simple linear regression model to try to predict the electrical conductivity from the salinity.
1d. Which one of these three pairs of variables seems most amenable to linear modeling? Why?

2a. Make a simple linear regression model to predict the mpg from the mtcars data, based on the hp. Plot the two variables, along with the line suggested by a simple linear regression model.
2b. Make a multiple regression model to predict the mpg from the mtcars data, based on the hp and the disp.
2c. Using the multiple regression model, what mpg would we predict for a car that has 147 hp and 230 disp?
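
One possible way to set up problem 2 is sketched below. It is only an outline, assuming the column names mpg, hp, and disp from the built-in mtcars data; the object names fit_simple, fit_multi, new_car, and pred_mpg are made up for illustration.

```r
# Simple linear regression: mpg explained by hp (mtcars ships with base R).
fit_simple <- lm(mpg ~ hp, data = mtcars)

# Scatterplot of the two variables with the fitted line overlaid.
plot(mtcars$hp, mtcars$mpg, xlab = "hp", ylab = "mpg")
abline(fit_simple)

# Multiple regression: mpg explained by hp and disp.
fit_multi <- lm(mpg ~ hp + disp, data = mtcars)

# Predicted mpg for a car with 147 hp and 230 disp.
# (new_car is a hypothetical one-row data frame built only for predict.)
new_car <- data.frame(hp = 147, disp = 230)
pred_mpg <- predict(fit_multi, newdata = new_car)
print(pred_mpg)
```

Running summary(fit_simple) and summary(fit_multi) afterwards is one way to compare how well the two models fit.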

3a. Load the 1990 airline data from the dataexpo into a data.frame.
3b. Use the subset command to extract only the flights from June 1990.
3c. Build a simple linear regression model that predicts the arrival delays from the departure delays.
3d. Plot both the delays, putting the arrival delays on the y-axis and the departure delays on the x-axis.
3e. Draw the line from the simple linear regression model on the plot.
3f. Repeat steps 3c through 3e after removing the outliers, i.e., restrict attention to flights with departure delays between -50 and 500 (drop the flights with departure delays above 500 or below -50).
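
The subset, fit, and plot steps in problem 3 might look roughly like the sketch below. It does not read the real airline file: the flights data frame is a small simulated stand-in, and the column names Month, DepDelay, and ArrDelay are assumed to match the dataexpo file. Replace the simulated block with your own read.csv call for the 1990 data.

```r
# Simulated stand-in for the 1990 flights (NOT the real airline data).
# Two artificial outliers (900 and -120) are appended on purpose.
set.seed(1)
n <- 500
dep <- c(round(rexp(n - 2, rate = 1 / 10)) - 5, 900, -120)
flights <- data.frame(Month = 6,
                      DepDelay = dep,
                      ArrDelay = dep + round(rnorm(n, mean = 0, sd = 8)))

# 3b: keep only the June flights.
june <- subset(flights, Month == 6)

# 3f: keep only departure delays between -50 and 500.
trimmed <- subset(june, DepDelay >= -50 & DepDelay <= 500)

# 3c: arrival delay modeled as a linear function of departure delay.
fit_trimmed <- lm(ArrDelay ~ DepDelay, data = trimmed)

# 3d and 3e: departure delays on the x-axis, arrival delays on the y-axis,
# with the regression line drawn on top.
plot(trimmed$DepDelay, trimmed$ArrDelay,
     xlab = "Departure delay", ylab = "Arrival delay")
abline(fit_trimmed)
```

With the real data, rows with missing delays (NA) are dropped by lm automatically, but it is worth checking how many rows remain after the subset.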

4a. Generate 100 (continuous) uniform random numbers, uniformly distributed between 0 and 1.
4b. For each uniform random number U in part 4a, define V = -log(U)/3. Apply this transformation to all 100 numbers from 4a.
4c. Generate 100 exponential random numbers with rate 3.
4d. Use a qqplot to convince yourself that the numbers from 4b have the same kind of distribution as the numbers in 4c. I.e., if U is a continuous uniform random variable, then -log(U)/3 is an exponential random variable with rate 3, i.e., with mean 1/3.
4e. Re-do parts 4a through 4d with millions of numbers instead of just 100 numbers, to reinforce this notion in your mind.
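
The reason the transformation in 4b works: for t >= 0, P(-log(U)/3 > t) = P(U < exp(-3t)) = exp(-3t), which is exactly the tail probability of an exponential random variable with rate 3. A minimal sketch of parts 4a through 4d follows; the seed and the names n, u, v, and e are arbitrary choices, not part of the assignment.

```r
set.seed(29000)  # arbitrary seed, only so the run is repeatable

n <- 100             # change to several million for part 4e
u <- runif(n)        # 4a: uniforms on (0, 1)
v <- -log(u) / 3     # 4b: transformed values
e <- rexp(n, rate = 3)  # 4c: exponentials generated directly

# 4d: if v and e share a distribution, the points fall near the line y = x.
qqplot(v, e, xlab = "-log(U)/3", ylab = "rexp(n, rate = 3)")
abline(0, 1)

# Both sample means should be near 1/3.
print(c(mean(v), mean(e)))
```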

5a. Generate 1,000,000 (continuous) uniform random numbers (each between 0 and 1) and store them in a matrix M with 1000 rows and 1000 columns.
5b. Use the apply function to sum each row of M. So we get 1000 numbers, each of which is equal to the sum of the 1000 uniforms. Store the result in a vector v.
5c. Subtract 500 (the mean of each row sum) from each entry of v and then (afterwards) divide each number by sqrt(1000/12), i.e., by approximately 9.1287 (the standard deviation of each row sum). Store the result in a new vector w.
5d. Use a qqplot to convince yourself that the entries of w are approximately standard normal random numbers, i.e., normal random numbers with mean 0 and standard deviation 1.
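
A sketch of problem 5 is given below. It uses qqnorm and qqline, which compare a sample against the standard normal quantiles; calling qqplot against a sample from rnorm would be an alternative. The seed is an arbitrary choice.

```r
set.seed(6)  # arbitrary seed, only so the run is repeatable

# 5a: one million uniforms arranged as a 1000 x 1000 matrix.
M <- matrix(runif(1000 * 1000), nrow = 1000, ncol = 1000)

# 5b: sum across each row (MARGIN = 1 means rows).
v <- apply(M, 1, sum)

# 5c: each row sum has mean 1000 * (1/2) = 500 and
# variance 1000 * (1/12), so standardize with those values.
w <- (v - 500) / sqrt(1000 / 12)

# 5d: points close to the reference line suggest w is roughly standard normal.
qqnorm(w)
qqline(w)

print(c(mean(w), sd(w)))
```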

6a. Generate 100,000,000 exponential random numbers, each with rate = 5, and store them in a matrix M with 10000 rows and 10000 columns.
6b. Use the apply function to sum each row of M. So we get 10000 numbers, each of which is equal to the sum of the 10000 exponential random numbers. Store the result in a vector v.
6c. Subtract 2000 from each entry of v and then (afterwards) divide each number by sqrt(10000/5^2) = 100/5 = 20. Store the result in a new vector w.
6d. Use a qqplot to convince yourself that the entries of w are approximately standard normal random numbers, i.e., normal random numbers with mean 0 and standard deviation 1.
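
The same idea for problem 6 can be wrapped in a small helper, sketched below. The function name standardized_row_sums is made up for illustration. For an n x n matrix of exponentials with a given rate, each row sum has mean n/rate and standard deviation sqrt(n)/rate; with n = 10000 and rate = 5 these are the 2000 and 20 from part 6c. The demonstration call uses n = 1000 only to keep memory use small, since the full 10000 x 10000 matrix needs roughly 800 MB.

```r
# Hypothetical helper: build an n x n matrix of exponentials,
# sum each row, and standardize the row sums.
standardized_row_sums <- function(n, rate) {
  M <- matrix(rexp(n * n, rate = rate), nrow = n, ncol = n)
  v <- apply(M, 1, sum)
  (v - n / rate) / (sqrt(n) / rate)
}

set.seed(5)  # arbitrary seed, only so the run is repeatable

# Smaller demonstration size; use n = 10000 for the assignment itself.
w <- standardized_row_sums(n = 1000, rate = 5)

qqnorm(w)
qqline(w)

print(c(mean(w), sd(w)))
```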

7a. Use the built-in R data set for "Pharmacokinetics of Theophylline" (stored in Theoph) to build a multiple linear regression model of the concentration, based on the weight, dose, and time.
7b. If a person weighed 66 kg, and received a dose of 4 mg/kg, and it has been 6 hours since the dose was administered, what is the predicted level of concentration?
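
A possible outline for problem 7 follows, assuming the column names conc, Wt, Dose, and Time from the built-in Theoph data (check them with head(Theoph)). The names fit_theoph, new_obs, and pred_conc are made up for illustration.

```r
# 7a: concentration modeled from weight, dose, and time.
fit_theoph <- lm(conc ~ Wt + Dose + Time, data = Theoph)

# 7b: a hypothetical subject weighing 66 kg, dosed at 4 mg/kg,
# observed 6 hours after the dose.
new_obs <- data.frame(Wt = 66, Dose = 4, Time = 6)
pred_conc <- predict(fit_theoph, newdata = new_obs)
print(pred_conc)
```

Looking at summary(fit_theoph) and at a plot of conc against Time is a useful check on whether a linear model is reasonable here.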

8. Look at the departure delays from the June 1990 flights. If we restrict attention to departure delays of 30 minutes or more, what kind of distribution do you think the data has? Normal? Uniform? Exponential? Justify your answer with a qqplot. How closely can you estimate the parameter(s) of the distribution you think that this data has?
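
If you want to test one candidate distribution in problem 8, the general recipe is to estimate its parameter(s) from the data and then compare the data against theoretical quantiles. The sketch below checks the exponential candidate only, as an example of the recipe, and it runs on simulated delays, not on the airline data. The names delays, excess, and rate_hat are made up for illustration.

```r
set.seed(8)  # arbitrary seed, only so the run is repeatable

# Simulated stand-in for delays of 30 minutes or more (NOT the real data).
delays <- 30 + rexp(2000, rate = 1 / 25)

# Shift so the values start at 0, then estimate the rate as 1 / mean.
excess <- delays - 30
rate_hat <- 1 / mean(excess)

# Compare the data with exponential quantiles at the estimated rate;
# points near the line y = x support the exponential candidate.
theoretical <- qexp(ppoints(length(excess)), rate = rate_hat)
qqplot(theoretical, excess,
       xlab = "Exponential quantiles", ylab = "Delay minus 30")
abline(0, 1)

print(rate_hat)
```

The same recipe works for the other candidates by swapping qexp for qnorm or qunif and estimating the matching parameters.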