due Friday, September 19, at 8:30 AM
Put all of your solutions into one text file for your group. The file should be called, for instance:
p03g1.R for project 3, group 1, and should be stored in the folder: /proj/gpproj14/p03g1/
Group 1 consists of:
p03g2.R for project 3, group 2, and should be stored in the folder: /proj/gpproj14/p03g2/
Group 2 consists of:
p03g3.R for project 3, group 3, and should be stored in the folder: /proj/gpproj14/p03g3/
Group 3 consists of:
p03g4.R for project 3, group 4, and should be stored in the folder: /proj/gpproj14/p03g4/
Group 4 consists of:
p03g5.R for project 3, group 5, and should be stored in the folder: /proj/gpproj14/p03g5/
Group 5 consists of:
This project is all about the "Airline on-time performance", from the American Statistical Association's 2009 Data Expo.
The ASA also provides some supplemental data.
You can see the data on the ASA site too. In particular, there is a listing of all of the parameters, which might be helpful for you to print.
I already downloaded it for you, to make things a little easier for you. Since the data itself is so large, I saved it into a common data directory:
Notes: If you want to read ALL of the data into R at once, you can do it, but it takes quite a while (it might take more than 15 minutes to initially load the data).
You can import just a year or two of the data at a time, to start working with the data. You are not expected to import all of the data while you are solving the questions. You can wait until you have solved the questions, and then come back and try to get the answers with all of the data. So, for instance, you might want to start with just a few specific years only:
bigDF <- rbind( read.csv("/data/public/dataexpo2009/2006.csv"), read.csv("/data/public/dataexpo2009/2007.csv"), read.csv("/data/public/dataexpo2009/2008.csv") )
and once you are sure that everything works, before you get ready to submit your solutions, you can load all of the years.
There are over 3.5 billion pieces of data in the files altogether, if you load all of the years from 1987 through 2008.
Just loading the data itself (if you choose all of the years) might take roughly 15 or 20 minutes to accomplish. It would be done with some code like this: (WARNING! This will take quite a long time to load, if you load all years at once.)
bigDF <- rbind(
read.csv("/data/public/dataexpo2009/1987.csv"), read.csv("/data/public/dataexpo2009/1988.csv"), read.csv("/data/public/dataexpo2009/1989.csv"),
read.csv("/data/public/dataexpo2009/1990.csv"), read.csv("/data/public/dataexpo2009/1991.csv"), read.csv("/data/public/dataexpo2009/1992.csv"),
read.csv("/data/public/dataexpo2009/1993.csv"), read.csv("/data/public/dataexpo2009/1994.csv"), read.csv("/data/public/dataexpo2009/1995.csv"),
read.csv("/data/public/dataexpo2009/1996.csv"), read.csv("/data/public/dataexpo2009/1997.csv"), read.csv("/data/public/dataexpo2009/1998.csv"),
read.csv("/data/public/dataexpo2009/1999.csv"), read.csv("/data/public/dataexpo2009/2000.csv"), read.csv("/data/public/dataexpo2009/2001.csv"),
read.csv("/data/public/dataexpo2009/2002.csv"), read.csv("/data/public/dataexpo2009/2003.csv"), read.csv("/data/public/dataexpo2009/2004.csv"),
read.csv("/data/public/dataexpo2009/2005.csv"), read.csv("/data/public/dataexpo2009/2006.csv"), read.csv("/data/public/dataexpo2009/2007.csv"),
read.csv("/data/public/dataexpo2009/2008.csv") )
Therefore, it is probably better (instead) to test your code on (say) three years of data, e.g., 2006-2008, before working on the full data set.
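The repeated read.csv calls above can also be written as a loop over years. The following is only a sketch: the helper name load_years and its arguments are made up for illustration, and the demonstration writes tiny temporary files so that it runs without access to the shared data directory.

```r
# Hypothetical helper: read the yearly CSV files for the given years and
# stack them into one data frame. The layout <dir>/<year>.csv follows the
# read.csv calls shown above.
load_years <- function(years, dir = "/data/public/dataexpo2009") {
  paths <- file.path(dir, paste0(years, ".csv"))
  do.call(rbind, lapply(paths, read.csv))
}

# Demonstration on small temporary files (made-up values, not real data).
demo_dir <- tempfile("dataexpo_demo")
dir.create(demo_dir)
for (y in 2006:2008) {
  write.csv(data.frame(Year = y, DepTime = c(930, NA)),
            file.path(demo_dir, paste0(y, ".csv")), row.names = FALSE)
}
demoDF <- load_years(2006:2008, dir = demo_dir)
nrow(demoDF)
```

With the real data, a call such as load_years(2006:2008) would correspond to the three-year example, and load_years(1987:2008) to the full (slow) load.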
1a. Consider the departure times (DepTime). What fraction of the data are missing, i.e., are stored as NA values?
1b. Within the departure times that are recorded (i.e., that are not NA values), the times are stored in hhmm format. So there should be at most 24*60 = 1440 such possible times. Are there other DepTime values? Are they correct or perhaps erroneous? How many such DepTime values (overall) seem to be erroneous?
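As an illustration of the hhmm format, here is a minimal sketch of a validity check on a toy vector. The helper name is_valid_hhmm is hypothetical, and treating 2400 as invalid is an assumption; you may decide that it is a legitimate way of recording midnight.

```r
# Hypothetical helper: TRUE where an hhmm-coded value is a valid clock time
# (hours 0-23, minutes 0-59). NA inputs give NA.
is_valid_hhmm <- function(x) {
  hh <- x %/% 100   # hour part
  mm <- x %% 100    # minute part
  x >= 0 & hh <= 23 & mm <= 59
}

# Toy values (not real data): some valid, some not, one missing.
toy <- c(5, 930, 1260, 2359, 2400, 2515, NA)
valid <- is_valid_hhmm(toy)

mean(is.na(toy))            # fraction of missing values
sum(!valid, na.rm = TRUE)   # number of recorded values that fail the check
```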
2a. Which departure times are the best, for minimizing the arrival delay (ArrDelay)? More specifically, if our goal is to minimize the arrival delay, which of these 4 time categories is the best time of day for our departure? Between 12 midnight and 6 AM? Between 6 AM and 12 noon? Between 12 noon and 6 PM? Or between 6 PM and 12 midnight?
2b. Which of the 4 time categories for the departure will have the highest variance for arrival delay?
2c. Now please solve 2a and 2b again, splitting the data not only by the time of day but also by the airline. That way, we can know what time of day and which airline we might prefer to use.
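One possible way to form the 4 time categories is with cut, followed by tapply for the per-category summaries. This is only a sketch on made-up values; the category labels are arbitrary, and the carrier column name UniqueCarrier is an assumption that should be checked against the parameter listing.

```r
# Toy data (made-up values), two flights per time category.
toy <- data.frame(
  DepTime       = c(30, 545, 600, 1159, 1200, 1730, 1800, 2359),
  ArrDelay      = c(5, -3, 10, 0, 20, 40, 15, -5),
  UniqueCarrier = c("AA", "UA", "AA", "UA", "AA", "UA", "AA", "UA")  # assumed column name
)

# Intervals are closed on the left: [0,600), [600,1200), [1200,1800), [1800,2400].
toy$TimeOfDay <- cut(toy$DepTime,
                     breaks = c(0, 600, 1200, 1800, 2400),
                     labels = c("midnight-6AM", "6AM-noon", "noon-6PM", "6PM-midnight"),
                     right = FALSE, include.lowest = TRUE)

# Mean and variance of the arrival delay within each time category.
mean_delay <- tapply(toy$ArrDelay, toy$TimeOfDay, mean, na.rm = TRUE)
var_delay  <- tapply(toy$ArrDelay, toy$TimeOfDay, var, na.rm = TRUE)

# Splitting by two factors at once gives a time-of-day by carrier matrix.
mean_by_both <- tapply(toy$ArrDelay, list(toy$TimeOfDay, toy$UniqueCarrier),
                       mean, na.rm = TRUE)
```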
3a. Which 10 airports have the most departures?
3b. Which 10 airports have the most arrivals?
3c. If we reconsider 3a and 3b, by splitting the data year by year, are the answers to 3a and 3b relatively consistent from year to year?
3d. Which are the 10 most popular departure/arrival city pairs? (For instance, IND-to-ORD might be one such popular pair.)
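Counting questions of this kind can be approached with table and sort. The sketch below uses made-up rows, and the column names Origin and Dest are assumptions to be checked against the parameter listing.

```r
# Toy data (made-up rows); Origin and Dest are assumed column names.
toy <- data.frame(
  Origin = c("IND", "IND", "ORD", "ORD", "IND", "ATL"),
  Dest   = c("ORD", "ORD", "IND", "ATL", "ATL", "ORD")
)

# Number of departures per airport, largest first.
dep_counts <- sort(table(toy$Origin), decreasing = TRUE)
head(dep_counts, 2)

# Number of flights per origin-destination pair, largest first.
pair_counts <- sort(table(paste(toy$Origin, toy$Dest, sep = "-")),
                    decreasing = TRUE)
head(pair_counts, 2)
```

On the real data, head(..., 10) would list the top 10 instead of the top 2.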
4a. Which 5 airports are most likely to be on time for arrivals (on average)?
4b. Which 5 airports are most likely to be on time for departures (on average)?
4c. Which 5 airports are most likely to be delayed for arrivals (on average)?
4d. Which 5 airports are most likely to be delayed for departures (on average)?
5a. Which is the best day of the week to fly, if you want to minimize delayed arrivals?
5b. What portion of the flights departs on each day of the week?
5c. What percent of flights depart between 12 midnight and 6 AM? Between 6 AM and 12 noon? Between 12 noon and 6 PM? Between 6 PM and 12 midnight?
5d. Can you study 5b and 5c simultaneously, e.g., can you give an analysis by day of the week and time of day (in tandem), so that we know precisely which days of the week and which portions of the days are busiest for departures, i.e., so that we have a finer breakdown of the departure data?
6a. Which 5 carriers are the most likely to be delayed?
6b. Which 5 carriers are the most likely to be on time?
7a. Give a month-by-month breakdown of the percentage of cancelled flights.
7b. What are the worst 3 months of the year for cancelled flights? I.e., during which 3 months are the most flights cancelled?
(Since 1987 is an incomplete year, please avoid the data from 1987 for 7a and 7b, because we do not want to weight the months unevenly.)
8. Make a plot that shows how the number of flights departing ORD has changed, year by year. Then add similar data to the same plot, for the number of flights departing IND, year by year.
9. Read the documentation for the dotchart function. Make a dotchart as follows: The x-axis should be the percentage of the time that flights are delayed more than 30 minutes. On the y-axis, the main groupings should be according to month, and within each month, please show O'Hare and Indianapolis as cities of departure for flights. The data to be displayed are the DepDelay data for 2007 only. So the overall plot will show, month-by-month, a comparison of the DepDelay data for O'Hare and Indianapolis.
10. Make another dotchart, similar to the one in question 9, where the main groupings on the y-axis are O'Hare and Indianapolis, and within each city, display all 12 months. Again, the data to be displayed are the DepDelay data for 2007. The x-axis should again be the percentage of the time that flights are delayed more than 30 minutes. So the overall plot will show, for each of the two cities, a month-by-month comparison of the DepDelay data. If you are able, you can organize the months according to their percentage of time delayed more than 30 minutes, rather than according to alphabetic order.
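For questions 9 and 10, it may help to know that when dotchart is given a matrix, the columns become the main groupings and the rows are shown within each grouping. The sketch below uses made-up values for only two months; the column names Month and Origin are assumptions, and the real data would first need to be restricted to 2007 and to the two airports.

```r
# Toy data (made-up values); Month and Origin are assumed column names.
toy <- data.frame(
  Month    = c(1, 1, 1, 1, 2, 2, 2, 2),
  Origin   = c("ORD", "ORD", "IND", "IND", "ORD", "ORD", "IND", "IND"),
  DepDelay = c(45, 10, 5, NA, 31, 60, 0, 35)
)

# Percentage of flights delayed more than 30 minutes:
# rows are airports, columns are months.
pct_late <- tapply(toy$DepDelay > 30, list(toy$Origin, toy$Month),
                   function(v) 100 * mean(v, na.rm = TRUE))

# Main groupings by month (columns), airports within each month.
dotchart(pct_late, xlab = "% of flights delayed more than 30 minutes")

# Transposing swaps the roles: main groupings by airport, months within each.
dotchart(t(pct_late), xlab = "% of flights delayed more than 30 minutes")
```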