STAT 29000
Project 8
due Wednesday, November 5, at 8:30 AM


Put all of your solutions into one text file for your group. The file should be called, for instance:
p08g1.R for project 8, group 1, and should be stored in the folder: /proj/gpproj14/p08g1/
   Group 1 consists of: john1209,avorhies,ffranci,gallaghp
p08g2.R for project 8, group 2, and should be stored in the folder: /proj/gpproj14/p08g2/
   Group 2 consists of: omalleyb,marti748,vincentc,lyoke
p08g3.R for project 8, group 3, and should be stored in the folder: /proj/gpproj14/p08g3/
   Group 3 consists of: reno0,cringwa,boydp,fu82
p08g4.R for project 8, group 4, and should be stored in the folder: /proj/gpproj14/p08g4/
   Group 4 consists of: peter188,malek,philliw,rcrutkow
p08g5.R for project 8, group 5, and should be stored in the folder: /proj/gpproj14/p08g5/
   Group 5 consists of: cdesanti,zhan1460,enorlin,kidd6


The code found in the Week 10 examples should be helpful in this problem set.

1. Consider the file yow.lines, which is distributed with emacs 21.4. It can be downloaded from the llc server or you can access it directly from /proj/www/2014/29000/projects/yow.lines if you prefer. (Some of the lines in this file are very strange, but this is a standard, widely known text file, distributed on every Linux and UNIX system that contains emacs 21 or earlier.)
1a. How many lines start with a capital letter I?
1b. How many lines end with a question mark?
1c. How many lines end with an exclamation point?
1d. How many lines contain 3 or more exclamation points in a row (which may or may not be at the end of the line)?

2. Continuing to study yow.lines:
2a. How many lines contain 3 or more exclamation points altogether (which may or may not be consecutive)?
2b. Print yow.lines with all uppercase letters converted to lowercase letters.
2c. On how many lines does the word "yow" appear (regardless of capitalization)?

3. Consider the file /usr/share/dict/words on the llc server.
3a. How many words have exactly 6 characters?
3b. How many words have an occurrence of dog as a subword?
3c. How many words have the letters d, o, g, in that order, but not necessarily consecutively?

4. Continuing to study /usr/share/dict/words on the llc server:
4a. How many words start with the 2-letter phrase de?
4b. How many words end with the 2-letter phrase ly?
4c. How many words do not start with the 3-letter phrase con?

5. This question is based on the Social Security baby names data set. You can read about the Social Security baby names at: http://www.ssa.gov/OACT/babynames/namesbystate.html. The data set itself can be downloaded from the llc server or you can access it directly from /proj/www/2014/29000/projects/babynames.txt if you prefer. The data set contains 134 years of data (1880 to 2013), with 1000 boy names and 1000 girl names per year. The rank of each name is given within each year. The number of boys or girls born with each name is given in each year.
5a. How many children were named Mary during 1880-2013?
5b. What is the rank of the name Mary in each of these 134 years?
5c. How many different girl names (from this data set) start with the letter A? Be sure to remove duplicated names, i.e., count each name just once.
5d. How many different boy names (from this data set) have 4 letters? Be sure to remove duplicated names, i.e., count each name just once.

6. Continuing to study the baby names:
6a. What are the names (in alphabetic order, without duplicates) that have a double consecutive vowel, e.g., Aa or aa or Ee or ee or Ii or ii or Oo or oo or Uu or uu? Be sure to remove duplicated names, i.e., display each name just once. [Hint: We saw && is used for "and"; similarly, || is used for "or".]
6b. Which names have an occurrence of q that is not followed by a u? Be sure to remove duplicated names, i.e., display each name just once.
6c. Which names have two or more z's (regardless of uppercase or lowercase), which are not necessarily consecutive? Be sure to remove duplicated names, i.e., display each name just once.
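
As a small illustration of the hint in 6a, here is a sketch of how && and || combine awk patterns. The list of names is made up for this example and is not taken from babynames.txt, and the patterns are deliberately different from the ones the questions ask for.

```shell
# Toy list of names (hypothetical, not from the data set), one per line.
names='Aaron
Lee
Quinn
Bob'

# || keeps a line if EITHER pattern matches:
# here, lines that start with B or end with n.
either=$(printf '%s\n' "$names" | awk '/^B/ || /n$/')

# && keeps a line only if BOTH patterns match:
# here, lines that start with A and end with n.
both=$(printf '%s\n' "$names" | awk '/^A/ && /n$/')

printf 'either:\n%s\n' "$either"
printf 'both:\n%s\n' "$both"
```

When no action is given after the pattern, awk prints the matching line, which is why no explicit {print} is needed here.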

7. Consider the airline flight files stored in this directory: /data/public/dataexpo2009 on the llc server. We reconsider a few questions that we solved earlier in R. The advantage of using awk is that it runs faster, and it processes the data one line at a time, so we do not have to load all of the data at the start (recall we had to pre-load all of the data in R).
7a. Which 10 airports have the most departures? [It might help to use awk and sort and uniq and another sort in conjunction, with a count flag for uniq.]
7b. Which 10 airports have the most arrivals?
7c. What are the 10 most popular departure/arrival airport pairs? (For instance, IND-to-ORD might be one such popular pair.)
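
The pipeline suggested in the hint for 7a (awk, then sort, then uniq with a count flag, then another sort) can be sketched on toy data as follows. The five records and their column layout (id, origin, destination) are invented for this example. The real flight files have many more columns and a header line, so the field numbers here are only an assumption and will need to be changed.

```shell
# Toy comma-separated records (hypothetical layout: id,origin,dest).
flights='1,IND,ORD
2,IND,ATL
3,ORD,IND
4,IND,ORD
5,ORD,ATL'

# awk -F, splits each line on commas and prints field 2 only.
# sort puts identical values next to each other, which uniq requires.
# uniq -c collapses each run of identical lines and prefixes its count.
# sort -nr orders by that count, largest first; head keeps the top 2.
# The last awk just tidies the output into "value count".
top=$(printf '%s\n' "$flights" \
  | awk -F, '{print $2}' \
  | sort \
  | uniq -c \
  | sort -nr \
  | head -n 2 \
  | awk '{print $2, $1}')

printf '%s\n' "$top"
```

Because every stage reads its input line by line, nothing has to be loaded into memory all at once before the counting starts.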

8. Continuing to study the airline data:
8a. Make a new file called weekend1995.csv that contains only the flights that were on a weekend, from the 1995 flights file.
8b. Make a new file called longdelays1995.csv that contains only the flights that had a departure delay of 1 hour or more, from the 1995 flights file.
8c. Make a new file called JFKtoLAX1995.csv that contains only the flights that were from JFK to LAX, from the 1995 flights file.

9, 10. I might provide another couple of questions soon, as usual, depending on how students seem to be doing with these questions, but I want to see how things go with the problems outlined above. I like to be flexible, as you know!