Applied Data Analytics (IE 2064) Spring 2024
Description
This is an introduction to applied data analytics. The first part of the course focuses on practical skills: data wrangling, visualization, data processing, exploratory data analysis, and scoping projects. The second and main part of the course focuses on building predictive models for regression and classification: linear models, support vector machines, kernel methods, nearest neighbor, and tree-based models. The primary assessment is a project in which students apply their acquired skills to a real dataset. All course work will be done using R.
Prerequisites:
Probability and statistics
Computer programming
Note: it is suggested (but not mandatory) that students start working on module 1, "software tools", prior to the start date of this course; the lectures for this module are recorded. Students will be expected to have completed all of module 1 and the associated learning checkpoints two weeks after the course starts. Module 1 provides an introduction to the software tools that will be used in this course.
Modules
Title                                       Reference
Part 0 - Introduction to software tools     Chapters 1-4 of Modern Dive
Part 1 - Overview and general strategies    Chapters 1-5 of Applied Predictive Modeling
Part 2 - Regression models                  Chapters 6-8 of Applied Predictive Modeling
Part 3 - Classification models              Chapters 11-16 of Applied Predictive Modeling
Textbooks:
(APM) Max Kuhn, Kjell Johnson, (2013), Applied Predictive Modeling, Springer, ISBN: 978-1-4614-6850-9
This book is available online through the Pitt Library.
Supplements:
Coding
Chester Ismay and Albert Y. Kim, (2017), Modern Dive: An Introduction to Statistical and Data Science via R, http://moderndive.com/. Online book that is a good source on how to do data manipulation tasks and basic statistical techniques.
Bill Venables, David Smith, and the R Core Team, (2013), An Introduction to R - Notes on R: A Programming Environment for Data Analysis and Graphics, available at http://cran.r-project.org/manuals.html and distributed along with R.
Hadley Wickham and Garrett Grolemund, R for Data Science, https://r4ds.had.co.nz/
Hadley Wickham, (2016), ggplot2: Elegant Graphics for Data Analysis, 2nd ed, Springer, ISBN: 331924275X, available in PDF from the Pitt Library.
Modelling and theory
James, Witten, Hastie, Tibshirani, (2013) An Introduction to Statistical Learning: with Applications in R (Springer Texts in Statistics) ISBN: 1461471370 https://www.statlearning.com/ Highly recommended. Very detailed, but focuses on practice in contrast to the Elements of Statistical Learning. Freely available for download.
Hastie, Tibshirani, Friedman (2009) The Elements of Statistical Learning, ISBN: 978-0-387-84857-0 http://statweb.stanford.edu/~tibs/ElemStatLearn/ Probably the best standard textbook. Tends toward the theoretical. You may want it for completeness or for research purposes, but Introduction to Statistical Learning is a better place to start.
Soft skills
Software tools
R
RStudio
Git Version Control
An introduction to these tools is provided in the software tools module. Installation instructions appear in install-software-tools.pdf on Canvas.
Assessment
Learning checkpoints, equally weighted 5% of grade
Three homeworks & quizzes, equally weighted 30% of grade
Competition & quiz 10% of grade
Midterm 15% of grade
Project & oral exam 40% of grade
Late penalties: a 2% penalty for work less than one hour late and a 5% penalty for work less than two days late. Work submitted any later receives no points except in extraordinary circumstances.
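As a rough illustration of how the weights above combine, the following R sketch computes a final percentage from component scores. It assumes each component is scored on a 0-100 scale and that components combine as a simple weighted sum; the helper name final_grade and the example scores are hypothetical and are not part of the official grading procedure. Late penalties are not modelled.

```r
# Component weights taken from the assessment list (percent of final grade).
weights <- c(
  checkpoints = 5,
  homeworks   = 30,
  competition = 10,
  midterm     = 15,
  project     = 40
)

# Hypothetical helper: combine component scores (each assumed to be on a
# 0-100 scale) into a final percentage using a simple weighted sum.
final_grade <- function(scores, weights) {
  # Every weighted component must have a score, and the weights must total 100.
  stopifnot(setequal(names(scores), names(weights)))
  stopifnot(sum(weights) == 100)
  sum(scores[names(weights)] * weights) / 100
}

# Made-up scores, for illustration only.
example_scores <- c(
  checkpoints = 100,
  homeworks   = 85,
  competition = 70,
  midterm     = 80,
  project     = 90
)

final_grade(example_scores, weights)
```

With these made-up scores the result is 85.5. Because the project and oral exam carry 40% of the grade, a 10-point change in that component moves the final percentage by 4 points.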
Learning checkpoints
Learning checkpoints are due one week after the corresponding lecture (with the exception of the software tools module). Unlike homeworks, the goal is not to assess students but to give them an opportunity to practice skills and check that they are following the lectures. They are also generally much shorter than homeworks. It is acceptable to look at answers before you submit. Generally, brief feedback will be provided for learning checkpoints. A good faith effort (more than 50% of questions attempted and most of the answers correct) will be considered complete and receive full points. Otherwise zero points will be awarded.
Competition
I will provide a dataset from an unknown source; your goal is to predict the outcome as well as possible. You will be scored based on the quality of your code and on whether you meet certain prediction performance thresholds.
Project
The projects will involve taking a real dataset and applying the skills that you have learnt in this course to solve a problem for a "stakeholder". You will be assessed through a series of presentations and an oral exam.
Collaboration and academic integrity
Collaboration and discussion between students is generally encouraged, with some restrictions. For the competition assignment, students will be assigned to teams. For the homeworks, students may discuss with each other, but final answers should be written independently. For projects, students should submit independent reports, but it is acceptable for students to use the same datasets and to discuss their work with each other. Remember that you are strongly encouraged to discuss homeworks and projects at office hours, where I am happy to offer advice.
AI usage: in this course I will allow AI usage, with the caveat that the majority of your grade will be assessed through quizzes, presentations, and the oral exam. In the oral exam for the project, you are expected to defend the claims made in your report and to discuss the decisions made in your code. Overreliance on AI or Stack Overflow may therefore substantially hurt your grade.
Please contact me if you have any questions and make sure you have read the academic integrity section below.