Project 4


1. Find all of the (origin) airports from which you can fly to 100 or more (distinct) destinations. (Either from 2005 or from all years.)

Hint 1: R has a unique command, which returns the distinct values of a vector.
Hint 2: sapply(mydata, myfunction) can be used to apply myfunction to each element of mydata.
Hint 3: ALTERNATIVELY, it is possible to build custom functions. For example, we can get the mean of the DepDelay for each origin airport this way:
tapply(myDF$DepDelay, myDF$Origin, mean, na.rm=TRUE)
but we *cannot* get 3 times the mean this way, because 3*mean is not a function:
tapply(myDF$DepDelay, myDF$Origin, 3*mean, na.rm=TRUE)
Instead, we would need to build our own function:
tapply(myDF$DepDelay, myDF$Origin, function(x) {3*mean(x, na.rm=TRUE)})
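
As a small illustration of these hints, here is a sketch on a made-up data frame. The name toyDF and all of its values are invented for this example; it is not the airline data, and the threshold of 2 is only chosen to fit the toy data. It counts the distinct destinations for each origin with a custom function.

```r
# Toy data, invented for illustration only (not the real airline data)
toyDF <- data.frame(
  Origin = c("IND", "IND", "IND", "ORD", "ORD", "ATL"),
  Dest   = c("ORD", "ORD", "ATL", "IND", "ATL", "IND")
)

# unique() drops repeated values
unique(toyDF$Dest)

# number of distinct destinations per origin, using a custom function
destCounts <- tapply(toyDF$Dest, toyDF$Origin, function(x) {length(unique(x))})

# the same idea with sapply, after splitting the destinations by origin
destCounts2 <- sapply(split(toyDF$Dest, toyDF$Origin), function(x) {length(unique(x))})

# keep only the origins with at least 2 distinct destinations
bigOrigins <- names(destCounts[destCounts >= 2])
bigOrigins
```

Both tapply and sapply give the same counts here; which one to use is a matter of taste.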

2a. Consider the election donation data at https://www.fec.gov/data/advanced/?tab=bulk-data and download the "Contributions by individuals" data for 2017-18.
2b. Unzip the file (in the terminal).
2c. Use the cat command to concatenate all of the files into one large file (in the terminal).
2d. Read the data dictionary: https://www.fec.gov/campaign-finance-data/contributions-individuals-file-description/
Hint: When working with a file that is not comma separated, you can use the read.delim command, and just specify the character that separates the various pieces of data on a row.
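
A minimal sketch of the read.delim hint follows. It assumes a file whose fields are separated by the | character and which has no header row; the two rows, the temporary file, and the column names are invented for this example and are not real FEC records or field names.

```r
# Two invented rows in a pipe-separated layout (not real FEC records)
toyLines <- c("C001|IN|250", "C002|OH|100")

# write them to a temporary file so that read.delim has something to read
toyFile <- tempfile(fileext = ".txt")
writeLines(toyLines, toyFile)

# sep gives the separating character; header=FALSE because there is no header row
toyDF <- read.delim(toyFile, sep = "|", header = FALSE, stringsAsFactors = FALSE)

# columns are named V1, V2, V3 by default; hypothetical names can be assigned
names(toyDF) <- c("committee", "state", "amount")
toyDF
```

For a real file, check the data dictionary to see whether a header row is present and what each column means.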

3a. Which state's citizens gave the largest number of contributions?
3b. Which state's citizens gave the greatest amount of money?
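
Note that the number of contributions and the amount of money are different questions and can have different answers. The sketch below shows this on invented data; toyDF and the column names state and amount are hypothetical, not the FEC field names.

```r
# Invented donations (hypothetical column names, not the FEC field names)
toyDF <- data.frame(
  state  = c("IN", "IN", "OH", "IN", "OH"),
  amount = c(10, 20, 500, 5, 100)
)

# number of rows per state, sorted from largest to smallest
counts <- table(toyDF$state)
sort(counts, decreasing = TRUE)

# total amount per state, sorted from largest to smallest
totals <- tapply(toyDF$amount, toyDF$state, sum, na.rm = TRUE)
sort(totals, decreasing = TRUE)
```

In this toy data, IN has the most rows but OH has the largest total.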

4a. Now turn attention to the "all candidates" file for 2017-18, which contains summarized data.
4b. Download the data and unzip it.
4c. Read the data dictionary.
4d. Create a new data.frame that contains only these columns: 6-18, 26-27, 29-30.

5a. Convert the data.frame to a matrix.
5b. Sum these columns (all at once), using the apply function.
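
The column selection and the apply function can be sketched as follows, on an invented data frame with 30 numeric columns. The names toyDF, smallDF, myMatrix, and colTotals are made up for this example, and the values are not from the candidates file.

```r
# Invented 3-row data frame with 30 numeric columns (stand-in for the real file)
toyDF <- as.data.frame(matrix(1:90, nrow = 3, ncol = 30))

# keep columns 6-18, 26-27, and 29-30
smallDF <- toyDF[ , c(6:18, 26:27, 29:30)]

# convert to a matrix; this works cleanly because every column is numeric
myMatrix <- as.matrix(smallDF)

# the 2 means "work on columns"; a 1 would mean "work on rows"
colTotals <- apply(myMatrix, 2, sum, na.rm = TRUE)
colTotals
```

If some of the selected columns in the real data are not numeric, as.matrix will produce a character matrix and sum will fail, so it is worth checking the column types first.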

6a. Consider the Lahman baseball database available at:
http://www.seanlahman.com/baseball-archive/statistics/
Download the comma-delimited version and unzip it.
Inside the "core" folder of the unzipped file, you will find many csv files.
If you want to better understand the contents of the files,
there is a helpful readme file available here:
http://www.seanlahman.com/files/database/readme2017.txt
6b. Read the Teams.csv file into a data.frame called myDF.
6c. We can break the data.frame into smaller data.frames,
according to the teamID, using this code:
by(myDF, myDF$teamID, function(x) {plot(x$W)} )
For each team, this will draw one plot of the number of wins per year.
The number of wins will be on the y-axis of the plots.
6d. For an improved version, we can add the years on the x-axis, as follows:
by(myDF, myDF$teamID, function(x) {plot(x$yearID, x$W)} )
6e. Change your working directory in R to a new folder,
using the menu option: Session -> Set Working Directory -> Choose Directory
We are going to make 149 new plots!
6f. After changing the directory, try this code, which makes 149 separate PDF files (one per team, named after the teamID):
by(myDF, myDF$teamID, function(x) {pdf(paste0(as.character(x$teamID[1]), ".pdf")); plot(x$yearID, x$W); dev.off()} )
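
To see the pattern on a small scale before making all of the plots, here is a sketch with an invented two-team data frame. The values are made up; the column names teamID and W follow the code above, and yearID is assumed to be the name of the year column in Teams.csv. The plots are written to a temporary folder (outDir is a name chosen for this example) instead of the working directory.

```r
# Invented two-team data frame (made-up values; yearID is an assumed column name)
toyDF <- data.frame(
  teamID = c("AAA", "AAA", "BBB", "BBB"),
  yearID = c(2001, 2002, 2001, 2002),
  W      = c(80, 85, 70, 90)
)

# write the plots into a temporary folder rather than the working directory
outDir <- tempdir()

# for each team: open a pdf device, draw the plot, close the device
result <- by(toyDF, toyDF$teamID, function(x) {
  team <- as.character(x$teamID[1])
  pdf(file.path(outDir, paste0(team, ".pdf")))
  plot(x$yearID, x$W, xlab = "year", ylab = "wins", main = team)
  dev.off()
})

# check which files were created
list.files(outDir, pattern = "\\.pdf$")
```

Each call to pdf must be matched by a call to dev.off, otherwise the file stays open and may be empty or unreadable.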

7. Experiment with this concept yourself! Make three more series of plots, using the baseball tables. You are welcome to choose which kinds of series of plots you make. Enjoy, and be creative!

8. Put the project into RMarkdown and submit when your code is polished and ready.