STAT 29000
Project 7
due Wednesday, October 29, at 8:30 AM


Put all of your solutions into one text file. The file should be called, for instance: mdw.txt (where mdw is Dr. Ward's username, but please use your own username). When you are finished with the problem set, please email your solutions to mdw@purdue.edu and kamstut@purdue.edu.

Kevin provided some instructions for submitting the project.

The following resources might be helpful for you:

1a. How many lines are found in the file /etc/passwd?
1b. Recall (from the notes) that the command:
cat /etc/passwd | cut -f5 -d:
prints the full name of each user on the system. Instead of printing the full names of the users, print their usernames (e.g., mdw).
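
As a reminder of how cut selects fields, here is a small sketch on two made-up lines in the /etc/passwd layout (the accounts alice and bob are hypothetical, not real users). The -d flag sets the delimiter and the number after -f chooses which field is printed.

```shell
# Two made-up lines in /etc/passwd format (7 colon-separated fields); not real accounts.
sample='alice:x:1001:1001:Alice Example:/home/alice:/bin/bash
bob:x:1002:1002:Bob Sample:/home/bob:/bin/sh'

# Field 5 holds the full name, as in the command from the notes.
names=$(printf '%s\n' "$sample" | cut -f5 -d:)

# Changing the number after -f selects a different field; field 7 is the login shell.
shells=$(printf '%s\n' "$sample" | cut -f7 -d:)

printf '%s\n' "$names"
printf '%s\n' "$shells"
```

The same pattern should work on the real file once you decide which of the seven fields you need.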

2a. How many users have their home directory in the /home filesystem? (The home directory is given in the 6th of the 7 fields on each line.)
2b. Now extract the first name of each user whose home directory is in the /home filesystem.
2c. Save the results into a file in your home directory, named firstnames.txt

3a. How many words in the file /usr/share/dict/words contain the letter q?
3b. Convince yourself that the command: awk '{print length}'
will print the length of each line of its input, which is the length of each word when the file has one word per line.
3c. Find the length of the words, line by line, in the file /usr/share/dict/words
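
For 3b, one way to convince yourself is to try the command on a few words whose lengths you already know. The three words below are made up for illustration and are not taken from the dictionary file.

```shell
# Three sample words, one per line, with known lengths 3, 5, and 10.
words='cat
queue
statistics'

# awk runs the action once per input line; "length" with no argument
# is the number of characters in the current line.
lengths=$(printf '%s\n' "$words" | awk '{print length}')

printf '%s\n' "$lengths"
```

If the three numbers printed match the lengths you counted by hand, the same command can be pointed at a file with one word per line.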

4a. What is the longest word length, among all word lengths in the file /usr/share/dict/words?
Hint: you might need to use awk '{print length}' and sort, with a certain flag on the sort command. Such a flag comes in handy when you are sorting text that is numeric. It would be helpful to read the manual for the sort command, to see which flag to use.
4b. Instead of looking for the longest word length, after you sort the word lengths numerically, pipe the output to the uniq command, and use a flag on the uniq command to count the number of words of each length.
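
To see why the flags mentioned in 4a and 4b matter, here is a small sketch of what sort and uniq do with no flags at all, run on a few made-up numbers. It is only an illustration of the default behavior; LC_ALL=C is set here as an assumption, to make the text ordering predictable across systems.

```shell
# Made-up numbers, one per line; not taken from the dictionary file.
nums='9
100
10
9'

# With no flags, sort compares lines as text, so "10" and "100" come before "9".
text_sorted=$(printf '%s\n' "$nums" | LC_ALL=C sort)

# With no flags, uniq only merges duplicates that sit on adjacent lines,
# so the two 9s (first and last line) are both kept.
unsorted_uniq=$(printf '%s\n' "$nums" | uniq)

printf '%s\n' "$text_sorted"
echo ---
printf '%s\n' "$unsorted_uniq"
```

The first listing is not in numeric order, and the second still contains a repeated value, which is why the input to uniq should be sorted first and why the manual pages for sort and uniq are worth reading.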

5. In the directory /data/public/election2008
there is some data related to the 2008 election. There are 49 files, namely, one for each of the 48 mainland states, and one for DC. Each line has seven pieces of data, namely, the percent and number of people who voted for Obama in 2008, the percent and number who voted for McCain in 2008, the overall percent of registered votes, the state, and the county.
5a. How many counties are represented in these 49 files?
5b. What is the largest number of votes in one county for Obama?
5c. What is the largest number of votes in one county for McCain?
5d. How many counties have one or more of the following words in their names: north, east, south, west?
5e. How many characters are found in the longest county name?

6. In the directory /data/public/dataexpo2009
there is the airline flight data, which we are already familiar with.
6a. How many flights were taken in 2006?
6b. How many flights were taken altogether, from 1987 to 2008?
6c. How many flights had IND as the origin city in 2006?
6d. How many flights had IND as the origin city altogether, from 1987 to 2008?

7a. In the airline data, how many unique carriers are there, in the 2006 data set?
7b. In the airline data, how many unique carriers are there altogether, from 1987 to 2008?
7c. What was the longest flight taken in 2006, in terms of miles?
7d. What was the longest flight taken altogether, from 1987 to 2008, in terms of miles?
7e. How many flights had this longest flight distance (in terms of miles), from 1987 to 2008?

8. Most of the UNIX commands you can access are contained in one of three places, namely:
Print a list of the programs (be sure to check all three of these places) whose names contain the word "zip".