Introduction To Big Data Analysis
STAT 29000, Fall 2015

Class times: MWF, 8:30 AM -- 9:20 AM, in SC 183
(STAT 29000-001; Banner CRN 15200)

Professor: Mark Daniel Ward
Email: mdw@purdue.edu
Office: MATH 540
Phone: 765-496-9563

Office hours: Dr. Ward is always happy to meet with students.
He is available for walk-in or scheduled appointments throughout the week.
He is also guaranteed to be available MWF, 7:30 AM -- 8:20 AM, in MATH 540.

Teaching Assistant: Chen Chen
Email: chen1167@purdue.edu

We will attempt to cover some or all of the following technologies this semester:

Some interesting websites for data are given here. Additional sources of data or data representations are very welcome:

Course description: click here

Course policy: click here

Plan to be present for class every day.

The group assignment for the projects is given here.

Projects: Project solutions will be collected via the computing server, on the due date. Project solutions will be distributed.

Outline of Topics
Week 1: Mon, Aug 24
Wed, Aug 26
Fri, Aug 28
topics: introduction to the course; discussion about project 1;
introduction to the R platform; vectors; parameters; missing values; indexes; recycling; functions; data frames; help and documentation systems; CRAN
Week 2: Mon, Aug 31
Wed, Sep 2
Fri, Sep 4
evaluation with Dr. Loran Carleton Parker
R Markdown Dynamic Documents for R,
Introduction to R Markdown
topics: factors; tapply; data frames; importing and exporting data from csv files; strings and dates
Week 3: Mon, Sep 7
(no class)
Wed, Sep 9
Fri, Sep 11
short week: finishing Project 2, starting Project 3 on Friday
Week 4: Mon, Sep 14
Wed, Sep 16
Fri, Sep 18
topics: best practices for data visualization; graphics
Week 5: Mon, Sep 21
Wed, Sep 23
Fri, Sep 25
topics: concluding the discussion of graphing; matrices and arrays; lists; subsetting data; the family of apply functions; data transformations; verifying and cleaning data
Also: visualizing data using the topics from week 4 and the handouts
Week 6: Mon, Sep 28
Wed, Sep 30
Fri, Oct 2
topics: examples from the suite of apply functions, including how to download and assemble many data files all at once for analysis; and short, specific examples about the apply, sapply, and subset functions
Week 7: Mon, Oct 5
Wed, Oct 7
Fri, Oct 9
topics: generating random numbers; connections with probability; simulations; regression; linear models; multiple linear regression
Week 8: Mon, Oct 12
October Break (no class)
Wed, Oct 14
Fri, Oct 16
topics: introduction to the bash shell
notes from Wed, Oct 14, 2015 and from Fri, Oct 16, 2015
Week 9: Mon, Oct 19
Wed, Oct 21
Fri, Oct 23
topics: introduction to awk, pattern matching, and regular expressions
notes from Mon, Oct 19, 2015 and from Wed, Oct 21, 2015
Week 10: Mon, Oct 26
Wed, Oct 28
Fri, Oct 30
visit and seminar by Gary McDonald (Department of Mathematics and Statistics, Oakland University)
time to work on project 6
Week 11: Mon, Nov 2
Wed, Nov 4
Fri, Nov 6
time to work on project 7
Week 12: Mon, Nov 9
Wed, Nov 11
Fri, Nov 13
topics: databases; SQL/MySQL; interacting with databases from R
Week 13: Mon, Nov 16
Wed, Nov 18
Fri, Nov 20
topics: XML and extracting/scraping data from the web; parsing data; XPath and XPointer
Week 14: Mon, Nov 23
Wed, Nov 25
Thanksgiving Vacation (no class)
Fri, Nov 27
Thanksgiving Vacation (no class)
topics: Thanksgiving vacation
Week 15: Mon, Nov 30
Wed, Dec 2
Fri, Dec 4
topics: Final project preparation time / presentations
Week 16: Mon, Dec 7
Wed, Dec 9
Fri, Dec 11
topics: Final project presentations
Final exam date/time/location:


This material is based upon work supported by the National Science Foundation under Grant Numbers 0939370, 1140489, 1246818. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the National Science Foundation.