STAT 29000
Project 10
due Monday, November 24, at 9:30 AM


Put all of your solutions into one text file for your group. The file should be called, for instance:
p10g1.R for project 10, group 1, and should be stored in the folder: /proj/gpproj14/p10g1/
   Group 1 consists of: peter188,marti748,boydp,gallaghp
p10g2.R for project 10, group 2, and should be stored in the folder: /proj/gpproj14/p10g2/
   Group 2 consists of: reno0,malek,ffranci,kidd6
p10g3.R for project 10, group 3, and should be stored in the folder: /proj/gpproj14/p10g3/
   Group 3 consists of: omalleyb,zhan1460,philliw,fu82
p10g4.R for project 10, group 4, and should be stored in the folder: /proj/gpproj14/p10g4/
   Group 4 consists of: cdesanti,avorhies,vincentc,rcrutkow
p10g5.R for project 10, group 5, and should be stored in the folder: /proj/gpproj14/p10g5/
   Group 5 consists of: john1209,cringwa,enorlin,lyoke


The code found in the Week 13 examples should be helpful in this problem set.

Please answer questions 1 to 3 in R, by making calls to your MySQL database.

1a. Who are the 10 pitchers with the highest tallies of strikeouts throughout their careers?
1b. Who are the 10 wildest pitchers, i.e., which pitchers have the highest tallies of wild pitches during their whole careers?
1c. Who are the 10 pitchers with the most Outs Pitched (IPOuts) during their careers?

2a. Which team has the most home runs of all time (summed over all years)?
2b. Which team has the largest average number of home runs per year, where this is averaged over all years?

3a. Rank the 50 states according to the number of baseball players who were born in the state.
3b. What percentage of players bat left-handed? Right-handed? Both?
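
As a rough illustration of the query pattern behind questions 1 to 3 (sum a column per group, sort, keep the top few), here is a minimal sketch. It makes several assumptions: it uses an in-memory SQLite database through the DBI package as a stand-in so that it runs without a server, and the table toyPitching with its columns and values is made up for the example. Against your own MySQL database you would connect with your own driver and credentials instead (for example via the RMySQL package) and query your real tables.

```r
library(DBI)

# Stand-in connection: an in-memory SQLite database (assumption, not the course setup).
con <- dbConnect(RSQLite::SQLite(), ":memory:")

# Made-up toy data: one row per pitcher per season.
toy <- data.frame(
  playerID = c("a01", "a01", "b02", "b02", "c03"),
  yearID   = c(2001, 2002, 2001, 2002, 2001),
  SO       = c(100, 150, 90, 80, 200)
)
dbWriteTable(con, "toyPitching", toy)

# Sum strikeouts over each pitcher's seasons, sort descending, keep the top 2.
res <- dbGetQuery(con, "
  SELECT playerID, SUM(SO) AS totalSO
  FROM toyPitching
  GROUP BY playerID
  ORDER BY totalSO DESC
  LIMIT 2")
print(res)

dbDisconnect(con)
```

The same GROUP BY / ORDER BY / LIMIT shape should carry over to MySQL; only the connection line and the table and column names would change.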

Please answer questions 4 to 6 in R, by using XML tools.

4. Consider the 11 pages of college rankings found at the US News and World Report page: http://colleges.usnews.rankingsandreviews.com/best-colleges/rankings/national-universities/data
4a. Which 10 universities have the highest enrollments?
4b. Extract the states where the colleges are located. Then use this information to make a list that shows, for each state, how many colleges (from this list) are contained in that state.
4c. Which 10 universities have the highest tuition charges? (For universities with both in-state and out-of-state tuitions listed, use the out-of-state listing.)
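
One possible way to approach problem 4 is to parse each page, pull the relevant cells out with XPath, and clean the text before tallying. The sketch below shows this on a tiny made-up HTML fragment; the class names (loc, enroll), the schools, and the numbers are all hypothetical and will not match the real page, whose structure you need to inspect yourself.

```r
library(XML)

# Made-up HTML fragment; the real page's tags and class names will differ.
toyHTML <- '<html><body><table>
<tr><td class="name">Alpha University</td><td class="loc">Springfield, IL</td><td class="enroll">10,500</td></tr>
<tr><td class="name">Beta College</td><td class="loc">Chicago, IL</td><td class="enroll">42,000</td></tr>
<tr><td class="name">Gamma Institute</td><td class="loc">Austin, TX</td><td class="enroll">7,250</td></tr>
</table></body></html>'

doc <- htmlParse(toyHTML, asText = TRUE)

# Extract the text of the cells of interest.
loc    <- xpathSApply(doc, "//td[@class='loc']", xmlValue)
enroll <- xpathSApply(doc, "//td[@class='enroll']", xmlValue)

# Remove thousands separators before converting to numbers.
enroll <- as.numeric(gsub(",", "", enroll))

# Keep only what follows the last ", " (the state abbreviation), then tally.
state <- sub(".*, ", "", loc)
print(table(state))
```

For the real data you would repeat the parsing step over all 11 pages and combine the results before sorting or tabulating.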

5. Consider Dr. Ward's iTunesMusicLibrary.xml file (the listing of the songs), located in /data/public/iTunes/iTunesMusicLibrary.xml (this has not been updated for a few years, but it is still relatively large and perhaps interesting).
5a. According to their numbers of songs in the playlist, what are the top 10 artists who appear in the playlist?
5b. According to their numbers of songs in the playlist, what are the top 3 genres?
There are several ways that you could work on problem 5. One possible method is to extract all of the keys, as follows:
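library(XML)   # provides xmlParse, xpathSApply, and xmlValue
iTunesDoc <- xmlParse("/data/public/iTunes/iTunesMusicLibrary.xml")
# collect the text of every key and value inside each track's dict
iTunesvec <- xpathSApply(iTunesDoc, "//*/dict/dict/dict/child::*/child::text()", xmlValue)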

Warning: This code takes several minutes to run.
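
Once the keys and values are in one vector, one possible next step is to look up the entry that follows each occurrence of a key such as "Artist" and tabulate those entries. The sketch below tries this on a tiny made-up fragment that mimics the nesting used in the XPath expression above; the track names, artists, and genres are hypothetical. It also assumes that no value is itself the literal string "Artist", which you may want to check on the real file.

```r
library(XML)

# Made-up fragment with the same dict/dict/dict nesting as the XPath above.
toyXML <- '<plist><dict>
<key>Tracks</key>
<dict>
<key>1</key>
<dict><key>Name</key><string>Song A</string><key>Artist</key><string>Band X</string><key>Genre</key><string>Rock</string></dict>
<key>2</key>
<dict><key>Name</key><string>Song B</string><key>Artist</key><string>Band X</string><key>Genre</key><string>Rock</string></dict>
<key>3</key>
<dict><key>Name</key><string>Song C</string><key>Artist</key><string>Band Y</string><key>Genre</key><string>Jazz</string></dict>
</dict>
</dict></plist>'

toyDoc <- xmlParse(toyXML, asText = TRUE)
toyvec <- xpathSApply(toyDoc, "//*/dict/dict/dict/child::*/child::text()", xmlValue)

# Each value sits immediately after its key in the vector.
artists <- toyvec[which(toyvec == "Artist") + 1]

# Count songs per artist, most frequent first.
artistCounts <- sort(table(artists), decreasing = TRUE)
print(head(artistCounts, 10))
```

The same lookup with "Genre" in place of "Artist" would give a genre tally.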

6. Consider the example with the Presidential votes in Indiana.
6a. Build your own county-by-county summaries for each state. Try to do this as efficiently as possible.
6b. Do the summaries agree with Politico's state-by-state summaries? If not, what are the differences?