Project 4
1. Find all of the (origin) airports
from which you can fly to 100 or more (distinct) destinations. (You may use either the 2005 data or the data from all years.)
Hint: R has a unique command, which returns the distinct values of a vector.
Hint 2: sapply(mydata, myfunction)
can be used to apply myfunction
to each element of mydata
Hint 3: Alternatively, it is possible to
build custom functions. For example,
we can get the mean of the DepDelay
for each origin airport this way:
tapply(myDF$DepDelay, myDF$Origin, mean, na.rm=TRUE)
but we *cannot* get 3 times the mean this way,
because 3*mean is not a function:
tapply(myDF$DepDelay, myDF$Origin, 3*mean, na.rm=TRUE)
Instead, we would need to build our own function:
tapply(myDF$DepDelay, myDF$Origin, function(x) {3*mean(x, na.rm=TRUE)})
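The three hints can be tried out on a tiny example before touching the real data. The sketch below uses a made-up data.frame (the rows are invented; only the column names Origin, Dest, and DepDelay follow the hints above), so treat it as an illustration of the functions rather than a solution to the question.

```r
# Made-up flight data; only the column names come from the hints above.
myDF <- data.frame(
  Origin   = c("IND", "IND", "IND", "ORD", "ORD"),
  Dest     = c("ORD", "ORD", "ATL", "IND", "LAX"),
  DepDelay = c(10, NA, 20, 5, 7),
  stringsAsFactors = FALSE
)

# Hint 1: unique() keeps one copy of each distinct value.
distinctDests <- unique(myDF$Dest)

# Hint 2: sapply() applies a function to each element of a list or vector.
toyList <- list(a = 1:3, b = 1:5)
elementLengths <- sapply(toyList, length)

# Hint 3: tapply() with a built-in function, then with a custom function.
meanDelay   <- tapply(myDF$DepDelay, myDF$Origin, mean, na.rm = TRUE)
tripleDelay <- tapply(myDF$DepDelay, myDF$Origin,
                      function(x) {3 * mean(x, na.rm = TRUE)})

print(distinctDests)
print(elementLengths)
print(meanDelay)
print(tripleDelay)
```

The custom function receives the DepDelay values for one origin airport at a time, so anything that can be computed from a vector can go inside the braces.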
2a. Consider the election donation data:
https://www.fec.gov/data/advanced/?tab=bulk-data
from "Contributions by individuals"
for 2017-18. Download this data.
2b. Unzip the file (in the terminal).
2c. Use the cat command to concatenate all of the files into
one large file (in the terminal).
2d. Read the data dictionary:
https://www.fec.gov/campaign-finance-data/contributions-individuals-file-description/
Hint: When working with a file that
is not comma separated,
you can use the read.delim
command, and just specify
(with the sep argument) the
character that separates
the various pieces of data
on a row.
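As a minimal sketch of this hint, the code below reads three made-up rows that are separated by the pipe character. The separator, the missing header row, and the column names are all assumptions made for the illustration; check the data dictionary for the actual layout of the file you downloaded.

```r
# Three invented rows, separated by "|", with no header row.
rawText <- "C001|IN|46032|250\nC002|CA|94105|100\nC003|IN|47906|75"

# text= lets us read from a string; for a real file, pass the file name instead,
# e.g. read.delim("myfile.txt", sep = "|", header = FALSE)  (hypothetical file name)
toyDF <- read.delim(text = rawText, sep = "|", header = FALSE,
                    stringsAsFactors = FALSE)

# Hypothetical column names, chosen only for this example.
names(toyDF) <- c("id", "state", "zip", "amount")

print(dim(toyDF))
print(head(toyDF))
```

Setting header = FALSE matters whenever the first row of the file is data rather than column names; otherwise read.delim would use that row as the header.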
3a. Which state's citizens gave the largest number of contributions?
3b. Which state's citizens gave the greatest amount of money?
4a. Now turn attention to
the "all candidates" file
for 2017-18, which contains
summarized data.
4b. Download the data and unzip it.
4c. Read the data dictionary.
4d. Create a new data.frame that contains
only these columns:
6-18, 26-27, 29-30
5a. Convert the data.frame to a matrix
5b. Sum these columns (all at once), using the apply function.
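Steps 4d through 5b can be rehearsed on a small made-up data.frame first. In the sketch below, toyDF is invented (one text column followed by 29 numeric columns, 30 in total); only the column positions 6-18, 26-27, 29-30 come from the project.

```r
# Invented data: 1 text column plus 29 numeric columns = 30 columns, 3 rows.
toyDF <- data.frame(name = c("a", "b", "c"),
                    matrix(1:87, nrow = 3, ncol = 29),
                    stringsAsFactors = FALSE)

# 4d. Keep only columns 6-18, 26-27, and 29-30.
keep    <- c(6:18, 26:27, 29:30)
smallDF <- toyDF[, keep]

# 5a. Convert to a matrix (numeric, because the text column was left out).
m <- as.matrix(smallDF)

# 5b. Sum every column at once: the 2 means "work column by column".
totals <- apply(m, 2, sum)

print(dim(m))
print(totals)
```

Using 1 instead of 2 as the second argument of apply would sum each row instead. If a text column were kept, as.matrix would produce a character matrix and sum would fail, which is one reason to select the columns first.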
6a. Consider the Lahman baseball database available at:
http://www.seanlahman.com/baseball-archive/statistics/
Download the comma-delimited version and unzip it.
Inside the "core" folder of the unzipped file, you will find many csv files.
If you want to better understand the contents of the files,
there is a helpful readme file available here:
http://www.seanlahman.com/files/database/readme2017.txt
6b. Read the Teams.csv file into a data.frame called myDF.
6c. We can break the data.frame into smaller data.frames,
according to the teamID, using this code:
by(myDF, myDF$teamID, function(x) {plot(x$W)} )
For each team, this will draw 1 plot of the number of wins per year.
The number of wins will be on the y-axis of the plots.
6d. For an improved version, we can add the years on the x-axis, as follows:
by(myDF, myDF$teamID, function(x) {plot(x$yearID, x$W)} )
6e. Change your working directory in R to a new folder,
using the menu option: Session -> Set Working Directory -> Choose Directory
We are going to make 149 new plots!
6f. After changing the directory, try this code, which makes 149 separate pdf files:
by(myDF, myDF$teamID, function(x) {pdf(paste0(x$teamID[1], ".pdf")); plot(x$yearID, x$W); dev.off()} )
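To see the one-file-per-team idea on a small scale, here is a sketch with an invented stand-in for Teams.csv. The team codes and win totals are made up, the column names teamID, yearID, and W are assumed to match the Teams table, and the output folder (a temporary directory) is a choice made only for this example.

```r
# Invented stand-in for Teams.csv with only the columns the plots need.
myDF <- data.frame(
  teamID = c("AAA", "AAA", "AAA", "BBB", "BBB"),
  yearID = c(2001, 2002, 2003, 2001, 2002),
  W      = c(80, 85, 90, 70, 95),
  stringsAsFactors = FALSE
)

# Write the plots into a temporary folder instead of the working directory.
outDir <- file.path(tempdir(), "teamPlots")
dir.create(outDir, showWarnings = FALSE)

# by() hands the function one team's rows at a time.
invisible(by(myDF, myDF$teamID, function(x) {
  outFile <- file.path(outDir, paste0(x$teamID[1], ".pdf"))
  pdf(outFile)                       # open a pdf file for this team
  plot(x$yearID, x$W, xlab = "Year", ylab = "Wins", main = x$teamID[1])
  dev.off()                          # close the file so it is written to disk
}))

print(list.files(outDir))
```

Every call to pdf needs a matching call to dev.off; if dev.off is skipped, the file stays open and may be empty or unreadable.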
7. Experiment with this concept yourself! Make three more series of plots, using the baseball tables. You are welcome to choose which kinds of series of plots you make. Enjoy, and be creative!
8. Put the project into RMarkdown and submit when your code is polished and ready.