# 1. We switch to the directory for the airline data
cd /depot/statclass/data/dataexpo2009

# 1a. The number of flights that occurred, on every day of the year,
# can be obtained by extracting the 1st, 2nd, and 3rd fields,
# sorting the data, and then
# summarizing the data using the uniq command with the -c flag
sort 2005.csv | cut -d, -f1-3 | sort | uniq -c
# The first few lines of the output are:
# 16477 2005,10,1
# 19885 2005,10,10
# 19515 2005,10,11
# 19701 2005,10,12
# 19883 2005,10,13
# and the last few lines of the output are:
# 20051 2005,9,6
# 19629 2005,9,7
# 19968 2005,9,8
# 19938 2005,9,9
# 1 Year,Month,DayofMonth

# 1b. The day of the year on which the most flights occur can be found by
# sorting the results above, in numerical order, using sort -n
# and then (if desired, although it is optional) we can
# extract the last line of the output using tail -n1
sort 2005.csv | cut -d, -f1-3 | sort | uniq -c | sort -n | tail -n1
# and we conclude that the most flights occur on August 5:
# 21041 2005,8,5

# 2. We can extract the 9th field, which is the carrier (i.e., the airline company)
# and then, in the same way as above, we can sort the data, and then we can
# summarize the data using uniq -c
# This yields the number of flights for each carrier.
# We can either read the number of United or Delta flights with our eyeballs,
# or we can use the grep command, searching for both of the patterns UA and DL,
# to isolate (only) the number of flights for United and Delta, respectively.
sort 2005.csv | cut -d, -f9 | sort | uniq -c | grep "UA\|DL"
# The output is
# 658302 DL
# 485918 UA
# so Delta has more flights than United in 2005.

# 3. Now we change directories to consider the taxi cab data
cd ../taxi2018
# The ".." in the previous command just indicates that we want to go up one level to
# /depot/statclass/data
# and then, from that point, we want to go into the taxi cab directory.
# If this sounds complicated, then (instead) it is safe to use the longer version:
cd /depot/statclass/data/taxi2018
# The number of passengers is given in the 4th column, passenger_count
# We use a method that is similar to the one from the first three questions:
# we extract the 4th column, sort the data, and then
# summarize the data using the uniq command with the -c flag
sort yellow_tripdata_2017-06.csv | cut -d, -f4 | sort | uniq -c
# and the distribution of the number of passengers is:
# 1
# 548 0
# 6933189 1
# 1385066 2
# 406162 3
# 187979 4
# 455753 5
# 288220 6
# 26 7
# 30 8
# 20 9
# 1 passenger_count
# Notice that we have some extraneous information, i.e., there is
# one blank line and also one line for the passenger_count (from the header)

# 4. To revisit question 1a, we first change back to the airline directory using
cd ../dataexpo2009
# or using
cd /depot/statclass/data/dataexpo2009
# Then we perform the same operations as above, but working on the file allyears.csv
sort allyears.csv | cut -d, -f1-3 | sort | uniq -c
# As an alternative, we could work on the collection of all of the files
# whose names start with a 1 or a 2.
# We do not just use *.csv because, in this method,
# we do not want to work on the file allyears.csv
# (that would double our answers)
sort [1-2]*.csv | cut -d, -f1-3 | sort | uniq -c
# The first few lines of the output, with either method, are:
# 14766 1987,10,1
# 13421 1987,10,10
# 14020 1987,10,11
# 14795 1987,10,12
# 14865 1987,10,13
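# A side note (optional, not part of the official solution): if the full-file sorts
# above feel slow, the same daily counts can be tallied in a single pass with awk,
# assuming awk is available on the system. For example, for 2005.csv:
awk -F, 'NR > 1 { count[$1","$2","$3]++ } END { for (d in count) print count[d], d }' 2005.csv | sort -n
# Here NR > 1 skips the header line, and the array "count" tallies one entry per
# Year,Month,DayofMonth combination, so the output matches the uniq -c counts above.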
# 5a. To find the most popular day of the week for travel, we extract the 4th field,
# which contains the DayOfWeek, and we summarize it,
# in the same way that we did above.
# Either of these two methods will work:
sort allyears.csv | cut -d, -f4 | sort | uniq -c | sort -n
# or we can compute:
sort [1-2]*.csv | cut -d, -f4 | sort | uniq -c | sort -n
# We get the following distribution of the days of the week:
# 15915382 6
# 17143178 7
# 18061938 2
# 18083800 4
# 18091338 5
# 18103222 3
# 18136111 1
# 1 DayOfWeek
# 5b. We conclude that Monday (day 1) is the most popular day for travel,
# 5c. and Saturday (day 6) is the least popular day for travel.

# 6a. We can build such a file and store it in our home directory
# (please note that the tilde ~ refers to the home directory)
# as follows:
grep "IND,ORD" allyears.csv >~/INDtoORD.txt
# or, alternatively, like this:
grep IND,ORD [1-2]*.csv >~/INDtoORD.txt
# The first method yields a file that starts as follows:
# 1987,10,31,6,1720,1721,1712,1714,UA,334,NA,52,53,NA,-2,-1,IND,ORD,177,NA,NA,0,NA,0,NA,NA,NA,NA,NA
# 1987,10,1,4,816,816,920,909,UA,453,NA,64,53,NA,11,0,IND,ORD,177,NA,NA,0,NA,0,NA,NA,NA,NA,NA
# 1987,10,2,5,816,816,921,909,UA,453,NA,65,53,NA,12,0,IND,ORD,177,NA,NA,0,NA,0,NA,NA,NA,NA,NA
# 1987,10,3,6,833,816,932,909,UA,453,NA,59,53,NA,23,17,IND,ORD,177,NA,NA,0,NA,0,NA,NA,NA,NA,NA
# 1987,10,4,7,814,816,906,909,UA,453,NA,52,53,NA,-3,-2,IND,ORD,177,NA,NA,0,NA,0,NA,NA,NA,NA,NA
# The second method yields a file that starts as follows:
# (the extra characters at the start of each line show the file where the pattern was found)
# 1987.csv:1987,10,31,6,1720,1721,1712,1714,UA,334,NA,52,53,NA,-2,-1,IND,ORD,177,NA,NA,0,NA,0,NA,NA,NA,NA,NA
# 1987.csv:1987,10,1,4,816,816,920,909,UA,453,NA,64,53,NA,11,0,IND,ORD,177,NA,NA,0,NA,0,NA,NA,NA,NA,NA
# 1987.csv:1987,10,2,5,816,816,921,909,UA,453,NA,65,53,NA,12,0,IND,ORD,177,NA,NA,0,NA,0,NA,NA,NA,NA,NA
# 1987.csv:1987,10,3,6,833,816,932,909,UA,453,NA,59,53,NA,23,17,IND,ORD,177,NA,NA,0,NA,0,NA,NA,NA,NA,NA
# 1987.csv:1987,10,4,7,814,816,906,909,UA,453,NA,52,53,NA,-3,-2,IND,ORD,177,NA,NA,0,NA,0,NA,NA,NA,NA,NA

# 6b. Now we zip the file (again the tilde ~ just refers to the fact that
# these files are both in the home directory)
zip ~/INDtoORD.zip ~/INDtoORD.txt

# 7a. We identify the airports by extracting the 17th field and then
# sorting and counting the lines, and then sorting the results,
# as we did several times in the previous problems.
# The tail contains the 10 most popular airports.
sort allyears.csv | cut -d, -f17 | sort | uniq -c | sort -n | tail
# 2733910 SFO
# 2754997 MSP
# 2884518 IAH
# 2979158 DTW
# 3319905 DEN
# 3491077 PHX
# 4089012 LAX
# 5710980 DFW
# 6100953 ATL
# 6597442 ORD

# 7b. We can grep for any line that contains data about these 10 airports:
grep "SFO\|MSP\|IAH\|DTW\|DEN\|PHX\|LAX\|DFW\|ATL\|ORD" allyears.csv >~/popularairportdata.txt
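# One caution about the grep in 7b (optional, not required for the solution): grep matches
# the pattern anywhere on the line, so a three-letter code could, in principle, also match
# inside some other field. If we wanted to be stricter, one possible sketch is to use awk
# to keep only the rows whose Origin (17th field) or Dest (18th field) is one of the 10 codes:
awk -F, '$17 ~ /^(SFO|MSP|IAH|DTW|DEN|PHX|LAX|DFW|ATL|ORD)$/ ||
         $18 ~ /^(SFO|MSP|IAH|DTW|DEN|PHX|LAX|DFW|ATL|ORD)$/' allyears.csv >~/popularairportdata.txt
# This builds the same kind of file, but it only looks at the origin and destination columns.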
# 8. How many distinct airports are represented in the ASA Data Expo?
# One method is to put the 17th and 18th fields in a file with two commands.
# Please note that we used a double right caret >> so that the output gets appended
# to the file without deleting the file's earlier contents
cat allyears.csv | cut -d, -f17 >>~/airportcodes.txt
cat allyears.csv | cut -d, -f18 >>~/airportcodes.txt
# and then we could sort the file and find the number of unique airport codes.
# The wc command counts the number of lines, words, and characters,
# but here we use wc -l because we only need the number of lines (one line per airport)
sort ~/airportcodes.txt | uniq | wc -l
# There are 354 airports altogether.
# An alternative, one-line method is to tell the cut command to print a newline
# after each field, so that it prints the 17th field and then a newline,
# followed by the 18th field and then a newline.
# That way, we still get one airport per line.
cat allyears.csv | cut -d, -f17-18 --output-delimiter=$'\n' | sort | uniq | wc -l
# Again we see that there are 354 airports altogether.

# 9a. We change directories to consider the taxi cab data
cd ../taxi2018
# We extract the 4th field, across all files now,
# using a method similar to the one that we used above.
# Since this operation takes a long time to run, we store the results in a file.
# We also use the "nohup" command,
# which needs an ampersand (this is the "&") at the end of the line,
# so that this process can run in the background while we are working on other things.
nohup cat yellow*.csv | cut -d, -f4 | sort | uniq -c | sort -n >~/taxidistribution.txt &
# and the distribution of the number of passengers includes (among many others):
# 974 8
# 1165 7
# 1495 208
# 3800950 0
# 29462601 4
# 32640245 6
# 60751390 3
# 92684661 5
# 205576933 2
# 971920078 1
# 9b. Some of the taxi cab rides seem to have a very large number of passengers, but
# this is a result of some errors in the data set. For instance, we see taxi cab rides here
# that have 208 passengers, which is impossible.

# 10a. We first extract the 2nd field, which has the date and time of the departure of each trip:
head yellow_tripdata_2017-06.csv | cut -d, -f2
# Then we need to run this through the cut command again, this time using a space as the delimiter,
# and (since this intermediate result has only two fields, namely, the date and the time)
# this time we extract the first field.
head yellow_tripdata_2017-06.csv | cut -d, -f2 | cut -d' ' -f1
# and (changing head to cat, so that we examine the entire file)
# now we are prepared to use this strategy on the full month of taxi cab rides:
cat yellow_tripdata_2017-06.csv | cut -d, -f2 | cut -d' ' -f1 | sort | uniq -c
# The first few lines of output are the following (notice that we had one blank line):
# 1
# 344507 2017-06-01
# 347404 2017-06-02
# 341807 2017-06-03
# 294236 2017-06-04
# and the last few lines of output are the following (notice that
# we have the header in the output too):
# 321083 2017-06-27
# 316000 2017-06-28
# 313277 2017-06-29
# 302847 2017-06-30
# 1 tpep_pickup_datetime

# 10b. Now we run this code on all of the taxi cab rides from all of the years,
# and we again use the "nohup" command, again with an ampersand at the end of the line,
# so that this process can run in the background while we are working on other things.
# We store the results in a file:
nohup cat yellow*.csv | cut -d, -f2 | cut -d' ' -f1 | sort | uniq -c | sort -n >~/taxicountsbyday.txt &
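# If we wanted to clean up the blank line and the header line noted in 10a,
# one optional approach (just a sketch, not required for the solution) is to keep only
# the lines that look like a date in YYYY-MM-DD form before counting:
cat yellow_tripdata_2017-06.csv | cut -d, -f2 | cut -d' ' -f1 | grep -E '^[0-9]{4}-[0-9]{2}-[0-9]{2}$' | sort | uniq -c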
# As with lots of data, there are some strange properties. For instance, here are the
# first several lines of the file:
head -n30 ~/taxicountsbyday.txt
# 1 2001-01-06
# 1 2002-12-31
# 1 2018-01-20
# 1 2018-02-04
# 1 2041-11-15
# 1 2053-03-21
# 2 2001-01-01
# 2 2003-01-01
# 2 2018-02-07
# 2 2018-03-01
# 2 2018-03-19
# 3 2003-01-14
# 3 2018-01-17
# 3 2018-02-25
# 4 2018-04-30
# 4 2018-05-22
# 5 2018-04-09
# 12
# 12 Trip_Pickup_DateTime
# 26 2018-01-01
# 35 2008-12-31
# 36 tpep_pickup_datetime
# 46 pickup_datetime
# 82
# 29028 2011-08-28
# 72025 2010-12-27
# 78133 2016-01-23
# 100325 2017-03-14
# 113808 2012-10-29
# 135500 2015-01-27

# This is the end of the first project!
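# (One optional postscript, beyond what the project asks for: if we wanted to set aside
# the strange dates shown above, we could keep only the rows of ~/taxicountsbyday.txt whose
# dates fall in a plausible range. The 2009 through 2018 range below is just an assumption,
# and this is only a sketch of one way to do it:)
awk '$2 >= "2009-01-01" && $2 <= "2018-12-31"' ~/taxicountsbyday.txt
# Here $2 is the date column of the counts file, and the string comparison works because
# the dates are in YYYY-MM-DD form; the blank and header rows are dropped as well.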