Applied Data Analytics (IE 2064) Spring 2024
Description
This is an introduction to applied data analytics. The first part of the course focuses on practical skills: data wrangling, visualization, data processing, exploratory data analysis, and scoping projects. The second and main part of the course focuses on building predictive models for regression and classification: linear models, support vector machines, kernel methods, nearest neighbor, and tree-based models. The primary assessment is a project in which students apply their acquired skills to a real dataset. All course work will be done using R.
Prerequisites:
Probability and statistics
Computer programming.
Note: it is suggested (but not mandatory) that students start working on module 1 - `software tools' prior to the starting date of this course; the lectures for this module are recorded. Two weeks after the course starts, students will be expected to have completed all of module 1 and the associated learning checkpoints. Module 1 provides an introduction to the software tools that will be used in this course.
Modules
Title Reference
Part 0 - Introduction to software tools Chapters 1-4 of Modern Dive
Part 1 - Overview and general strategies Chapters 1-5 of Applied Predictive Modeling
Part 2 - Regression models Chapters 6-8 of Applied Predictive Modeling
Part 3 - Classification models Chapters 11-16 of Applied Predictive Modeling
Textbooks:
(APM) Max Kuhn, Kjell Johnson, (2013), Applied Predictive Modeling, Springer, ISBN: 978-1-4614-6850-9
This book is available online at the Pitt Library.
Supplements:
Coding
Chester Ismay and Albert Y. Kim, (2017), Modern Dive: An Introduction to Statistical and Data Science via R, http://moderndive.com/. Online book that is a good source on how to do data manipulation tasks and basic statistical techniques.
Bill Venables, David Smith, and the R Core Team, (2013), An Introduction to R - Notes on R: A Programming Environment for Data Analysis and Graphics, available at http://cran.r-project.org/manuals.html and distributed along with R
Hadley Wickham and Garrett Grolemund, R for Data Science, https://r4ds.had.co.nz/
Hadley Wickham, (2016), ggplot2: Elegant Graphics for Data Analysis, 2nd ed., Springer, ISBN: 331924275X, available in PDF from the Pitt Library
Modelling and theory
James, Witten, Hastie, Tibshirani, (2013) An Introduction to Statistical Learning: with Applications in R (Springer Texts in Statistics) ISBN: 1461471370 https://www.statlearning.com/ Highly recommended. Very detailed, but focuses on practice in contrast to the Elements of Statistical Learning. Freely available for download.
Hastie, Tibshirani, Friedman (2009) The Elements of Statistical Learning, ISBN: 978-0-387-84857-0 http://statweb.stanford.edu/~tibs/ElemStatLearn/ Probably the best standard textbook. Tends toward the theoretical. You may want it for completeness or for research purposes, but Introduction to Statistical Learning is a better place to start.
Soft skills
Software tools
R
R Studio
Git Version Control
An introduction to these tools is provided in the software tools module. Installation instructions appear in install-software-tools.pdf on Canvas.
Assessment
Learning checkpoints, equally weighted 5% of grade
Three homeworks, equally weighted 30% of grade
Competition 10% of grade
Midterm 15% of grade
Project 40% of grade
Late penalties: less than one hour late, 2% penalty; less than two days late, 5% penalty. Anything submitted later receives no points except in extraordinary circumstances.
Learning checkpoints
Learning checkpoints are due one week after the corresponding lecture (with the exception of the software tools module). Unlike homeworks, the goal is not to assess students but to give them an opportunity to practice skills and check that they are following the lectures. They are also generally much shorter than homeworks. It is acceptable to look at the answers before you submit. Generally, brief feedback will be provided for learning checkpoints. A good faith effort (more than 50% of questions attempted and most of the answers correct) will be considered complete and receive full points. Otherwise zero points will be awarded.
Competition
I will provide a dataset from an unknown source; your goal is to predict the outcome as accurately as possible. You will be scored on the quality of your code and on whether you meet certain prediction performance thresholds.
Project
The project will involve taking a real dataset and applying the skills that you have learnt in this course to solve a problem for a `stakeholder'.
Collaboration and academic integrity
Collaboration and discussion between students are generally encouraged, with some restrictions. For the competition assignment, students will be assigned to teams. For the homeworks, students may discuss with each other, but final answers should be written independently. For projects, students should submit independent reports, but it is acceptable for students to use the same datasets and to discuss their projects with each other. Remember that you are strongly encouraged to discuss homeworks and projects at office hours, where I am happy to offer advice.
If you copy code in any homework or project (e.g., from stackoverflow), please make this clear and cite where you copied it from; otherwise there is a risk of being accused of plagiarism. Furthermore, please keep in mind that a correctly cited chunk of code will not be considered plagiarism, but it could detract from demonstrating that you understand the course material, causing you to lose points on an assignment. If you are unsure what to do for a particular assignment, I recommend discussing it with me at office hours.
Please contact me if you have any questions and make sure you have read the academic integrity section below.