STAT 29000

Project 3

due September 21, 2015

Put all of your solutions into one text file for your group.
For example, the file for project 3, group 5, should be called: p03g5.Rmd and
it should be stored in the folder: /proj/gpproj15/p03g5/


Please submit your solutions in R Markdown (i.e., as an Rmd file). You do not need to use R Markdown while you are initially solving the problems, but the final submission for the project must be formatted in R Markdown.

NOTE: As in the previous project, it will be helpful to use {r cache=TRUE} instead of {r} in the header of any chunk where csv files are loaded, so that R only loads each csv file once. Moreover, it is necessary to turn off lazy loading of the cache, to prevent your workspace from becoming too large. In other words, in this project, it is helpful to use {r cache=TRUE, cache.lazy=FALSE}

This project is all about the “Airline on-time performance” data, from the American Statistical Association’s 2009 Data Expo. The ASA also provides some supplemental data.

You can see the data on the ASA site. In particular, there is a listing of all of the parameters, which might be helpful for you to print.

I already downloaded the data, to make things a little easier for you. Since the data is so large, I saved it in a common data directory: /data/public/dataexpo2009/

Notes: If you want to read ALL of the data into R at once, you can do it, but it takes quite a while (it might take more than 15 minutes to initially load the data).

You can import just a year or two of the data at a time, to start working with the data. You are not expected to import all of the data while you are solving the questions. You can wait until you have solved the questions, and then come back and try to get the answers with all of the data. So, for instance, you might want to start with just a few specific years only:

bigDF <- rbind( read.csv("/data/public/dataexpo2009/2006.csv"), read.csv("/data/public/dataexpo2009/2007.csv"), read.csv("/data/public/dataexpo2009/2008.csv") )

and once you are sure that everything works, before you submit your solutions, you can load all of the years by typing:

bigDF <- read.csv("/data/public/dataexpo2009/allyears.csv")

There are over 3.5 billion pieces of data in the files altogether, if you load all of the years from 1987 through 2008.
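If you would rather not type one read.csv call per year, one possible pattern is sketched below. This is only an illustration: it writes two tiny stand-in files to a temporary directory (the toy columns and values are made up) so that it runs anywhere. On the server you would point the vector files at the real yearly csv files instead.

```r
# Hypothetical stand-in data: two tiny "year" files in a temporary directory
tmpdir <- tempdir()
years <- c(2006, 2007)
for (y in years) {
  toy <- data.frame(Year = y, Month = 1:3, DepDelay = c(5, NA, -2))
  write.csv(toy, file.path(tmpdir, paste0(y, ".csv")), row.names = FALSE)
}

# Build the file names, read each file, and stack the pieces row-wise
files <- file.path(tmpdir, paste0(years, ".csv"))
bigDF <- do.call(rbind, lapply(files, read.csv))
dim(bigDF)
```

Changing the vector years is then enough to switch between a small test run and a larger one.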

Question 1

1a. What percentage of data is missing (NA) from DepTime? How about from ArrTime?

1b. Focus on DepTime, CRSDepTime, ArrTime, and CRSArrTime. These are times in the hhmm format. Use the strptime function to convert the time “1359” to POSIXlt. What is the resulting output?

1c. Now use the strptime function to convert the time “1360” to POSIXlt. What happens? Why?

1d. Treat times that cannot exist (as in 1c) as erroneous data (they make no sense!). Are there any erroneous times in DepTime, CRSDepTime, ArrTime, and CRSArrTime? If so, how many such times, in each category?
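One possible way to flag impossible times is sketched below on a small made-up vector, not on the real columns. The idea is to pad each value to four digits, parse it with strptime, and count the values that are present but fail to parse. The vector times and its values are hypothetical.

```r
# Hypothetical hhmm values, stored as integers (so 5 means 00:05); NA is missing
times <- c(1359L, 1360L, 5L, 2359L, 2575L, NA)

# Pad to four characters so that 5 becomes "0005" before parsing
padded <- sprintf("%04d", times)
parsed <- strptime(padded, format = "%H%M", tz = "UTC")

# Erroneous = the value is present, but it does not parse as a valid time
erroneous <- !is.na(times) & is.na(parsed)
sum(erroneous)
```

Missing values are kept separate from erroneous ones here, since 1a already counts the NAs. How to treat a boundary value such as 2400 is a judgment call that you should check and state in your solution.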

Question 2

2a. Everyone hates late departure times. Of the late departures (DepDelay), what percentage of flights depart 0-5 minutes late? 5-10 minutes late? 10-15? 15-20? 20-25? Etc.?
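As a hint for the binning step, here is a minimal sketch on a made-up vector of delays, assuming that "late" means a delay greater than 0. It uses cut to assign each delay to a 5-minute interval and table to count them. The values and the choice of right-closed intervals such as (0,5] are assumptions for illustration only.

```r
# Hypothetical departure delays in minutes (negative = early, NA = missing)
delays <- c(3, 7, 12, 1, 18, 4, 9, 22, -5, NA, 0)

# Keep only the late departures
late <- delays[!is.na(delays) & delays > 0]

# Bin into 5-minute classes: (0,5], (5,10], ..., (20,25]
breaks <- seq(0, 25, by = 5)
bins <- cut(late, breaks = breaks)

# Percentage of late flights in each class
pct <- 100 * prop.table(table(bins))
round(pct, 1)
```

With the real data, the last break has to be at least as large as the longest delay; otherwise cut returns NA for the flights beyond it.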

2b. Make boxplots that show, for each of the 7 days of the week, the degree to which departure times are delayed.

[If you want to only plot a selection of the points, that is OK too. The reason is that it will probably take your R session forever to render the plot with all of the millions of dots for the millions of flights. If you choose to only plot a selection of the points, please do not just plot the points at the start of the vector, since that would just correspond to the 1987 data. Instead, for instance, take every 1000th point. I.e., if the points that you wanted to plot are stored in vector v, then instead of plotting all of v, you could plot v[seq(1,length(v),by=1000)]. This will save you a lot of time when you render your plot in R, and it will still give you a very good picture of what is going on, i.e., it will still give you a good understanding of the behavior of your data. In this case, you would need to be sure to take every 1000th point of your data, and also every 1000th day-of-week value too, so that your data and the days of the week are in agreement.]
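To keep the delays and the days of the week in agreement, it helps to build the index once and apply it to both vectors. The sketch below uses simulated data (the vectors delay and dow are made up, not read from the csv files).

```r
# Simulated stand-ins for the delay and day-of-week vectors
set.seed(1)                      # only so that the toy data are reproducible
n <- 100000
delay <- rexp(n, rate = 1 / 15)  # hypothetical delays in minutes
dow <- sample(1:7, n, replace = TRUE)

# One index, used for BOTH vectors, so that the pairs stay aligned
idx <- seq(1, n, by = 1000)
sub <- data.frame(delay = delay[idx], dow = dow[idx])

boxplot(delay ~ dow, data = sub,
        xlab = "Day of week", ylab = "Departure delay (minutes)")
```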

Question 3

3a. Give a table with 12 columns (corresponding to the months) and 22 rows (corresponding to the years), which shows how many flights have DepDelay > 0 in each of the months and years.
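A two-way table of this shape can be built with the table function. The sketch below uses a tiny made-up data frame, so it only has 2 rows of years; turning the month into a factor with levels 1 to 12 keeps all 12 columns, even for months with no delayed flights.

```r
# Hypothetical toy flights; the real data use the same column names
flights <- data.frame(
  Year     = c(1987, 1987, 1987, 1988, 1988, 1988),
  Month    = c(10, 10, 11, 1, 1, 2),
  DepDelay = c(5, -3, 12, NA, 7, 0)
)

# Keep the flights with a positive (and non-missing) departure delay
delayed <- flights[!is.na(flights$DepDelay) & flights$DepDelay > 0, ]

# Rows = years, columns = months 1..12
counts <- table(Year = delayed$Year,
                Month = factor(delayed$Month, levels = 1:12))
counts
```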

3b. Restrict attention to only the flights with delays. You can find whether a flight is delayed by checking whether the DepDelay is positive. What are the 5 carriers who are most responsible for these delays?

Question 4

4a. The airports.csv file contains data on each of the airports. Load airports.csv into a data frame called airports.

4b. Add a column to the airports data frame called “freq”, which gives the total number of flights both into and out of the respective airport.
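One way to think about 4b is to count how often each code appears as an origin or as a destination, and then to look up each airport's count. The sketch below is only an illustration on made-up data; the airport XYZ is hypothetical and is there to show what happens for an airport with no flights.

```r
# Hypothetical toy versions of the airports and flights data
airports <- data.frame(iata = c("IND", "ORD", "LAX", "XYZ"),
                       stringsAsFactors = FALSE)
flights <- data.frame(Origin = c("IND", "ORD", "ORD", "LAX"),
                      Dest   = c("ORD", "IND", "LAX", "ORD"),
                      stringsAsFactors = FALSE)

# Count every appearance of a code, as origin or as destination
counts <- table(c(flights$Origin, flights$Dest))

# Look up each airport's count; airports that never appear get 0
pos <- match(airports$iata, names(counts))
airports$freq <- as.integer(counts)[pos]
airports$freq[is.na(airports$freq)] <- 0L
airports
```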

4c. Identify the 5 most popular departure-to-arrival paths in the USA.

4d. Find the very most popular departure-to-arrival path in each year.

Question 5

5a. The file plane-data.csv contains data on the planes. Load plane-data.csv into a data frame called planes.

5b. Rank the top 10 manufacturers, according to the total number of miles flown. It will be necessary to use the TailNum information from the plane-data file (which has tailnum and manufacturer) and from the large dataexpo data (which has TailNum and Distance).
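Because the flight data are so large, it may be easier to add up the miles per tail number first, and only then join the (much smaller) result to the plane data. The following is a sketch of that idea on made-up data; the tail numbers, manufacturers, and distances are hypothetical.

```r
# Hypothetical toy versions of the plane data and the flight data
planes <- data.frame(tailnum = c("N1", "N2", "N3"),
                     manufacturer = c("BOEING", "AIRBUS", "BOEING"),
                     stringsAsFactors = FALSE)
flights <- data.frame(TailNum = c("N1", "N2", "N1", "N3", "N9"),
                      Distance = c(500, 300, 700, 200, 100),
                      stringsAsFactors = FALSE)

# Step 1: total miles per tail number
milesByTail <- tapply(flights$Distance, flights$TailNum, sum, na.rm = TRUE)
perPlane <- data.frame(tailnum = names(milesByTail),
                       miles = as.vector(milesByTail),
                       stringsAsFactors = FALSE)

# Step 2: attach the manufacturer (tail numbers without plane data drop out)
merged <- merge(perPlane, planes, by = "tailnum")

# Step 3: total miles per manufacturer, largest first
milesByMaker <- sort(tapply(merged$miles, merged$manufacturer, sum),
                     decreasing = TRUE)
milesByMaker
```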

5c. Consider all of the planes that flew over 10000 miles in 2008. How many such planes are there? How old is the oldest such plane?

5d. There are 5 airplane types in the plane-data (“Co-Owner”, “Corporation”, “Foreign Corporation”, “Individual”, “Partnership”, and also one unknown “”). Show the total breakdown of miles, according to these types of plane.

Question 6

6a. Use the airports.csv file to determine how many airports are listed for each state.

6b. Using the iata codes from the airports.csv file, and restricting attention to the airports from Indiana, which 5 airports in Indiana had the most arriving flights?

6c. Using the iata codes from the airports.csv file, and restricting attention to the airports from the Midwest (which we will call “IL”, “IN”, “MI”, “OH”, “WI”), identify the 5 most popular departure-to-arrival paths within the Midwest (i.e., which both depart and also arrive in the Midwest).

Question 7

Use mapply to print sentences corresponding to question 4c, e.g., the sentences might say something like “The number 1 departure-to-arrival path in the USA is ORD to IND with 000000 flights altogether.” (but of course use the actual values for the origin, destination, and number of flights, and do this for all 5 results in 4c, by using the mapply function with the paste command.)
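If you have not used mapply before: it calls a function once per position, taking the first elements of all of its vector arguments together, then the second elements, and so on. Here is a small sketch with made-up paths and counts (they are NOT the answers to 4c).

```r
# Hypothetical results; replace these with your own values from 4c
origin <- c("AAA", "BBB", "CCC")
dest   <- c("DDD", "EEE", "FFF")
nfl    <- c(300, 250, 100)

# One sentence per position: rank, origin, destination, and count together
sentences <- mapply(function(rank, o, d, k) {
  paste("The number", rank, "departure-to-arrival path in the toy data is",
        o, "to", d, "with", k, "flights altogether.")
}, seq_along(origin), origin, dest, nfl)

print(sentences)
```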

Question 8

8a. One way that we might try to predict the hub airport for each of the airlines is to find the airport where that airline departs most often, i.e., the airport that is most often used as the origin for that airline. Print a table that shows, for each airline, this top airport origin.
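As a hint for the "most frequent value within each group" idea, here is a minimal sketch on made-up data. It assumes that the carrier column is called UniqueCarrier, as in the ASA variable listing; check the listing for the name used in your files. Ties are broken by taking the first maximum, which is an arbitrary choice.

```r
# Hypothetical toy flights with a carrier code and an origin airport
flights <- data.frame(
  UniqueCarrier = c("AA", "AA", "AA", "DL", "DL", "DL"),
  Origin        = c("DFW", "DFW", "ORD", "ATL", "ATL", "JFK"),
  stringsAsFactors = FALSE
)

# For each carrier: tabulate its origins and keep the name of the largest count
topOrigin <- tapply(flights$Origin, flights$UniqueCarrier,
                    function(x) names(which.max(table(x))))
topOrigin
```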

8b. Solve question 8a again, using the destination airports instead of origin airports this time.

8c. Now consider each airport, and find which airline departs from that airport most often.

8d. Solve question 8c again, this time finding which airline arrives at that airport most often.

Question 9

9a. If we classify flights by their distance (e.g., 0 to 500 miles; 500 to 1000 miles; 1000 to 1500 miles; etc.), which class of flights has the longest delays, on average? This will give us some information about whether shorter or longer flights have a longer average delay.

9b. If we classify flights by their departure time (e.g., before 6 AM; 6 AM to 12 noon; 12 noon to 6 PM; 6 PM to 12 midnight), which class of flights has the longest delays, on average? This will give us some information about whether it is preferable to depart earlier or later in the day.
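Both parts of Question 9 follow the same pattern: cut a numeric variable into classes, then average the delay within each class. A sketch of the pattern for distance, on made-up data and using DepDelay as the delay (an assumption; the question does not say which delay to use):

```r
# Hypothetical toy flights
flights <- data.frame(Distance = c(100, 450, 600, 900, 1200, 1400),
                      DepDelay = c(10, 20, 5, NA, 30, 10))

# Classes (0,500], (500,1000], (1000,1500]; dig.lab = 4 keeps labels readable
distClass <- cut(flights$Distance, breaks = seq(0, 1500, by = 500),
                 dig.lab = 4)

# Average delay within each class, ignoring missing delays
avgDelay <- tapply(flights$DepDelay, distClass, mean, na.rm = TRUE)
avgDelay
names(which.max(avgDelay))
```

For 9b, the same two steps should work with the hhmm departure times and breaks at the class boundaries, although you will need to decide how to treat the boundary values themselves.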

Question 10

10a. Write a function that takes two airports as inputs and finds the number of flights from the first airport to the second airport (you can call it numflightsfunc).

10b. Try your function from 10a on a pair of airports, e.g., flights from IND to ORD.
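If writing functions in R is new to you, the general shape might look like the sketch below. It is only one possible design: the third argument (the data frame to search) is an assumption added here so that the example can run on made-up data, and your own version may simply use bigDF directly.

```r
# A possible shape for numflightsfunc; the flights argument is an assumption
numflightsfunc <- function(from, to, flights) {
  # Count the rows whose origin and destination both match
  sum(flights$Origin == from & flights$Dest == to, na.rm = TRUE)
}

# Hypothetical toy data to try it on
toy <- data.frame(Origin = c("IND", "IND", "ORD", "IND"),
                  Dest   = c("ORD", "ORD", "IND", "LAX"),
                  stringsAsFactors = FALSE)
numflightsfunc("IND", "ORD", toy)
```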

10c. Write a “most popular destination function” (you can call it mostpopfunc) that takes a group of airports as the input and finds which of them is the most popular destination, i.e., which airport has the most arrivals.

10d. Try your function from 10c on 3 popular airports, e.g., JFK, ORD, and LAX, to see which of these 3 airports is the most popular destination.