1a. In R, use the "system" function with the parameter intern=TRUE to solve question 1a from project 3. Inside the system function, you can use any method from bash that you like. The goal is to be able to solve this question relatively quickly, without having to import the complete file allyears.csv into R.
1b. In R, use the "pipe" function, wrapped inside the read.csv function, to solve question 1a in a different way, without using the "system" function.
1c. Use system.time to see which of these two methods is faster. By the way, both methods should be MUCH faster than importing the entire allyears.csv file, as we naively did back in project 3.
2. Find the quickest method that you can to solve question 4c from project 3, using your knowledge of bash and/or awk tools, as well as the system or pipe functions in R.
3. Solve questions 8a and 8c from project 3 again, using your knowledge of bash and/or awk tools, as well as the system or pipe functions in R.
4. Use awk (and the system or pipe function in R) to solve question 1 from project 5 again. How much faster is your solution, using these tools, compared to the method you used in project 5?
5a. Use awk to find the lengths of the lines in the yow.lines file, and then use R to make a plot of the distribution of the lengths.
5b. Is it faster to (a) use awk to find the lengths of the lines, and then import these lengths into R (instead of the whole lines themselves), or (b) use R to import all of the lines and find the lengths within R?
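One possible shape for the awk side of 5a, offered as a rough sketch rather than the intended solution: the helper names line_lengths and length_table are made up for illustration, and the sample file is a tiny stand-in because yow.lines is not reproduced here. Note that, depending on the awk implementation and locale, length may count bytes rather than characters.

```shell
# line_lengths FILE: print the length of every line, one number per line.
# (Hypothetical helper name; empty lines are reported as 0.)
line_lengths() {
    awk '{ print length($0) }' "$1"
}

# length_table FILE: tabulate the lengths as "length count" pairs,
# sorted by length, so that R only has to read a small summary.
length_table() {
    line_lengths "$1" | sort -n | uniq -c | awk '{ print $2, $1 }'
}

# Demo on a small stand-in file (NOT the real yow.lines).
demo=$(mktemp)
printf '%s\n' 'Yow!' 'Are we having fun yet?' '' 'ab' 'cd' > "$demo"
line_lengths "$demo"
length_table "$demo"
rm -f "$demo"
```

From R, the same awk command could be handed to pipe (for example inside scan), so that only the numbers are imported; wrapping each approach in system.time then gives the comparison that 5b asks for.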
5c. Find the distribution of the words in the /usr/share/dict/words file, according to the starting character. Treat the letters as case-insensitive.
5d. Use R to plot the distribution from part c. Plot the letters in decreasing order, according to how many words start with those letters.
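For 5c, one way the counting could be sketched in awk is shown here. The function name first_letter_counts is hypothetical, and the demo runs on a small made-up word list so that it does not depend on /usr/share/dict/words being installed. The output is already in decreasing order of count, which is the order that 5d asks for.

```shell
# first_letter_counts FILE: print "count letter" for the first character
# of every non-empty line, folding case with tolower, largest count first
# (ties broken alphabetically).
first_letter_counts() {
    awk '{
             c = tolower(substr($0, 1, 1))   # first character, lower-cased
             if (c != "") n[c]++             # skip empty lines
         }
         END { for (k in n) print n[k], k }' "$1" | sort -k1,1nr -k2,2
}

# Demo on a made-up word list (NOT the real dictionary file).
words=$(mktemp)
printf '%s\n' Apple apple banana Cherry avocado blueberry > "$words"
first_letter_counts "$words"
rm -f "$words"
```

On a machine that has the dictionary file, the call would be first_letter_counts /usr/share/dict/words, and the two-column result is small enough to read into R with read.table for plotting.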
6. Working with the DataFest 2015 visitor.csv file, use question 1c from project 6 to make a dotchart in R of the twenty cities with the most entries, showing the number of entries per city. Please put the data in the dotchart into numerical order, according to the number of entries for the city.
In question 7, for parts a, b, c, use bash or awk tools.
7a. The file babynames.txt has 134 years of data, with all of the baby names from 1880 to 2013. Extract a list of all of the names (regardless of gender).
7b. Remove the duplicates from the list in part a.
7c. Count the number of (unique) names that remain, according to the length of the name.
7d. Finally, import the resulting distribution of lengths to R, and make a plot of the distribution of the number of names, according to the length of the name.
7e. Redo parts 7a through 7d using only R functions, without resorting to bash or awk.
7f. Which method was faster? The method that blended bash/awk/R tools, or the method that used only tools from R?
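Parts 7a through 7c chain together naturally as a single pipeline. The sketch below rests on a loud assumption: the layout of babynames.txt is not described here, so the code assumes comma-separated records with the name in the first field. The function name name_length_counts, the optional field argument, and the sample records are all made up for illustration; adjust the delimiter and field number to the real file.

```shell
# name_length_counts FILE [FIELD]
# ASSUMPTION: FILE is comma-separated and the name sits in FIELD (default 1).
#   7a: cut      extracts the name column
#   7b: sort -u  removes duplicate names
#   7c: awk      counts the unique names by length
# Output: "length count" pairs, sorted by length.
name_length_counts() {
    file=$1
    field=${2:-1}
    cut -d, -f"$field" "$file" |
        sort -u |
        awk '{ n[length($0)]++ } END { for (k in n) print k, n[k] }' |
        sort -n
}

# Demo on made-up records (NOT real babynames.txt data).
names=$(mktemp)
printf '%s\n' 'Mary,F,10,1880' 'Anna,F,5,1880' 'John,M,12,1880' \
              'Mary,F,9,1881' 'Leslie,M,3,1881' 'Leslie,F,2,1881' > "$names"
name_length_counts "$names"
rm -f "$names"
```

Only the final two-column table needs to reach R for part 7d, which is where most of the time savings over an all-R approach would be expected to come from, though that is worth measuring rather than assuming.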
8. Make a list (in increasing order) of all of the integers from 1 to 1000000 whose prime factors are only 2's and/or 3's. Hint: It might help to think cleverly and use an inner product, but you can do this in any way that you like. Time your solution. What is the fastest way that you can solve the problem? Compare with your peers to see what kinds of solutions they found, and how fast their solutions worked. [Hint: there are 142 such numbers, starting with 1, 2, 3, 4, 6, 8, 9, 12, 16, 18, 24, ..., and ending with 995328.]
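As one point of comparison for question 8, here is a minimal awk sketch. It does not test every integer up to the limit; instead it enumerates each product of a power of 3 and a power of 2 directly with two nested loops and then sorts the results. The function name smooth_23 and its optional limit argument are made up for illustration, and this is only one of many reasonable approaches.

```shell
# smooth_23 [LIMIT]: print, in increasing order, every integer in 1..LIMIT
# whose only prime factors are 2 and/or 3 (LIMIT defaults to 1000000).
smooth_23() {
    limit=${1:-1000000}
    awk -v limit="$limit" 'BEGIN {
        # Outer loop: powers of 3 (1, 3, 9, ...) that do not exceed limit.
        for (p3 = 1; p3 <= limit; p3 *= 3)
            # Inner loop: keep doubling while the product stays in range.
            for (n = p3; n <= limit; n *= 2)
                print n
    }' | sort -n
}

smooth_23 | head -n 11   # the smallest eleven values
smooth_23 | wc -l        # how many values there are in total
```

In bash, prefixing the call with time (for example time smooth_23 > /dev/null) gives a rough timing; from R, the same function body could be run through system or pipe and wrapped in system.time.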