Brief Notes from Day 1

Once logged onto the server, you can see the structure of the server using the File Manager, which is found under the Applications menu in the upper-left corner.
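The same structure can also be explored from a terminal. As a small sketch using the ls command (introduced below), the top-level directories of the server can be listed with:
ls /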

Places to note are:

/home/mdw
This is your home directory.
(except that you use your Purdue username instead of Dr Ward's username)
Your home directory is the right place to store code that you want to be backed up. It is not the right place to store data, because your quota there is relatively small compared to the scratch space, which is described below.
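If you want a rough sense of how much space you are using in your home directory, the standard du command can summarize it. This is just a sketch; the ~ is shell shorthand for your own home directory:
du -sh ~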

/scratch/scholar/mdw
(except that again you use your Purdue username instead of Dr Ward's username)
Your scratch directory is the right place to store large data that does not need to be backed up. Compared to your home directory, you have a lot of space in the scratch directory. Please note that any file in the scratch directory that has not been touched for 60 days will be automatically removed.
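If you want to check which of your scratch files might be at risk, the find command can list files that have not been accessed recently. This is a sketch only: it assumes the purge is based on access time, and it uses the shell variable $USER as a stand-in for your Purdue username:
find /scratch/scholar/$USER -type f -atime +60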

/depot/statclass/data/
We will store data for the class projects in this directory. The data is read-only, i.e., you will be able to read the data, but you will not be able to make changes to it unless you first copy the data into your scratch directory.
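For example, to get a writable copy of one of the files used below, you could copy it into your scratch directory first (again using $USER as a stand-in for your Purdue username):
cp /depot/statclass/data/dataexpo2009/2005.csv /scratch/scholar/$USER/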

The pwd (print working directory) command is used to show where you are currently working on the server.
pwd

The cd command is used to change the current working directory.
cd /depot/statclass/data/

The ls command shows the contents of the current directory.
ls

We can also use the -la flags to see more information about the contents of the current directory.
ls -la

Assuming that we have already moved into the data directory (as given above), we can now change the current working directory to view the Data Expo 2009 data:
cd dataexpo2009

The contents can be seen by again using the ls command.
ls

The first 10 lines of a file can be displayed with the head command.
head 2005.csv

The last 10 lines of a file can be displayed with the tail command.
tail 2005.csv

We can see more (or fewer) lines by using the -n flag. For instance, the first 20 lines of the file can be displayed this way:
head -n20 2005.csv

More information about the ASA Data Expo 2009 can be found online here:
http://stat-computing.org/dataexpo/2009/

and, in particular, the data dictionary is found here:
http://stat-computing.org/dataexpo/2009/the-data.html

Piping the output of one command into the input of another command is one of the most powerful and useful tools in UNIX/Linux. It allows us to daisy-chain commands in innovative ways.
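As a small first illustration using two commands we have already seen: head extracts the first 20 lines of the file, and tail then keeps the last 10 of those, so together the pipeline displays lines 11 through 20.
head -n20 2005.csv | tail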

As an example, instead of displaying the last 10 lines of the file, we can send the last 10 lines as input to the cut command. We use the -d flag to specify the delimiter between pieces of data (in this case, a comma is found between each piece of data), and we use the -f flag to specify which field to cut out and display.
tail 2005.csv | cut -d, -f17

The command above will take the last 10 lines of the file and will display the 17th field from each such line. These correspond to the origin airports for the last 10 flights in the 2005 data set.
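One way to double-check the field numbering is to cut the 17th field out of just the first line of the file; assuming the file begins with a header row (as the Data Expo 2009 files should), this displays the column name, Origin:
head -n1 2005.csv | cut -d, -f17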

Instead of displaying the output of cut directly, we can sort these 10 origin airports in alphabetical order by piping the output to the sort command.
tail 2005.csv | cut -d, -f17 | sort

Instead of displaying the sorted data, we can send the output to the uniq command, which collapses repeated lines when multiple identical lines appear in a row. If we add the -c flag, it will also display a count of how many times each line appeared. We need to use the sort command beforehand, because uniq does not sort on its own; it relies on repeated lines already being next to each other.
tail 2005.csv | cut -d, -f17 | sort | uniq -c

We can do the same thing for the last 1000 lines of the file. In other words, we can take the last 1000 lines of the file, cut out the 17th field of each line (these are the origin airports), sort this data, and then display the number of occurrences of each. This gives the number of times that each airport is used as an origin airport among the last 1000 flights in the 2005 data set.
tail -n1000 2005.csv | cut -d, -f17 | sort | uniq -c

If we want to see these results in sorted order, where the sort happens numerically (by count), we can use sort with the -n flag.
tail -n1000 2005.csv | cut -d, -f17 | sort | uniq -c | sort -n

Finally, once we have this code working well, instead of using the last 1000 lines of the file, we can do the same thing with the entire file. We just use cat instead of tail -n1000, because the cat command displays the entire file. It is called "cat" because it can also be used to "concatenate" files together, but here we are just using it to display one file.
cat 2005.csv | cut -d, -f17 | sort | uniq -c | sort -n

We will learn about more UNIX/Linux commands on Friday.
Perhaps the best way to learn them is simply to try using them and to discuss with your peers how they can be used in tandem, with piping.
Remember that a pipe is used to send the output of one command into the input of the next command.
This is a really powerful technique, and it also promotes reproducibility: you can string together a whole sequence of data manipulations and show someone how to clean the data with a single chain of commands.