Applied Data Analytics (IE 2064) Spring 2024

Description

This is an introduction to applied data analytics. The first part of the course focuses on practical skills: data wrangling, visualization, data processing, exploratory data analysis and scoping projects. The second and main part of the course focuses on building predictive models for regression and classification: linear models, support vector machines, kernel methods, nearest neighbors, and tree-based models. The primary assessment is a project in which students apply their acquired skills to a real dataset. All course work will be done using R.

Prerequisites:

Probability and statistics
Computer programming

Note: it is suggested (but not mandatory) that students start working on module 1 ('software tools') before the start date of this course; the lectures for this module are recorded. By two weeks after the course starts, students will be expected to have completed all of module 1 and the associated learning checkpoints. Module 1 provides an introduction to the software tools that will be used in this course.

Modules

Title                                       Reference
Part 0 - Introduction to software tools     Chapters 1-4 of Modern Dive
Part 1 - Overview and general strategies    Chapters 1-5 of Applied Predictive Modeling
Part 2 - Regression models                  Chapters 6-8 of Applied Predictive Modeling
Part 3 - Classification models              Chapters 11-16 of Applied Predictive Modeling

Textbooks:

(APM) Max Kuhn, Kjell Johnson, (2013), Applied Predictive Modeling, Springer, ISBN: 978-1-4614-6850-9

This book is available online through the Pitt Library.

Supplements:

Coding

Chester Ismay and Albert Y. Kim, (2017), Modern Dive: An Introduction to Statistical and Data Sciences via R, http://moderndive.com/. An online book that is a good source on how to do data manipulation tasks and basic statistical techniques.
Bill Venables, David Smith, and the R Core Team, (2013), An Introduction to R - Notes on R: A Programming Environment for Data Analysis and Graphics, available at http://cran.r-project.org/manuals.html and distributed along with R
Hadley Wickham and Garrett Grolemund, R for Data Science, https://r4ds.had.co.nz/
Hadley Wickham, (2016), ggplot2: Elegant Graphics for Data Analysis, 2nd ed., Springer, ISBN: 331924275X, available in PDF from the Pitt Library

Modelling and theory

James, Witten, Hastie, Tibshirani, (2013), An Introduction to Statistical Learning: with Applications in R (Springer Texts in Statistics), ISBN: 1461471370, https://www.statlearning.com/. Highly recommended. Very detailed, but focuses on practice, in contrast to The Elements of Statistical Learning. Freely available for download.
Hastie, Tibshirani, Friedman, (2009), The Elements of Statistical Learning, ISBN: 978-0-387-84857-0, http://statweb.stanford.edu/~tibs/ElemStatLearn/. Probably the best standard textbook. Tends toward the theoretical. You may want it for completeness or for research purposes, but An Introduction to Statistical Learning is a better place to start.

Soft skills

Max Shron, (2014), Thinking with Data

Software tools

R
RStudio
Git Version Control

An introduction to these tools is provided in the software tools module. Installation instructions appear in install-software-tools.pdf on Canvas.

Assessment

Learning checkpoints, equally weighted: 5% of grade
Three homeworks, equally weighted: 30% of grade
Competition: 10% of grade
Midterm: 15% of grade
Project: 40% of grade
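
As an illustration of how the weights above combine, here is a minimal sketch in R. The names weights, final_grade and example_scores are hypothetical, the scores are made up, and the sketch assumes each component has already been scored on a 0-100 scale; it does not model late penalties or the complete/incomplete rule for learning checkpoints.

```r
# Assessment weights from the syllabus, as fractions of the final grade.
weights <- c(
  checkpoints = 0.05,
  homeworks   = 0.30,
  competition = 0.10,
  midterm     = 0.15,
  project     = 0.40
)

# The weights should account for the whole grade.
stopifnot(isTRUE(all.equal(sum(weights), 1)))

# Hypothetical helper: combine component scores (each on a 0-100 scale)
# into a final percentage. `scores` must be a named numeric vector with
# exactly one entry per assessment component.
final_grade <- function(scores, w = weights) {
  stopifnot(setequal(names(scores), names(w)))
  sum(scores[names(w)] * w)
}

# Made-up scores, for illustration only.
example_scores <- c(
  checkpoints = 100,
  homeworks   = 85,
  competition = 70,
  midterm     = 80,
  project     = 90
)

final_grade(example_scores)
```

With these made-up scores the weighted total is 85.5.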

Late penalties: work submitted less than one hour late receives a 2% penalty; work submitted less than two days late receives a 5% penalty. Anything later receives no points, except in extraordinary circumstances.

Learning checkpoints

Learning checkpoints are due one week after the corresponding lecture (with the exception of the software tools module). Unlike homeworks, the goal is not to assess students but to give them an opportunity to practice skills and check that they are following the lectures. They are also generally much shorter than homeworks. It is acceptable to look at the answers before you submit. Generally, brief feedback will be provided for learning checkpoints. A good faith effort (more than 50% of questions attempted and most of the answers correct) will be considered complete and receive full points. Otherwise zero points will be awarded.

Competition

I will provide a dataset from an unknown source; your goal is to predict the outcome as well as possible. You will be scored based on the quality of your code and on whether you meet certain prediction performance thresholds.

Project

The projects will involve taking a real dataset and applying the skills that you have learnt in this course to solve a problem for a 'stakeholder'.

Collaboration and academic integrity

Collaboration and discussion between students are generally encouraged, with some restrictions. For the competition assignment, students will be assigned to teams. For the homeworks, students may discuss with each other, but final answers should be written independently. For projects, students should submit independent reports, but it is acceptable for students to use the same datasets and to discuss their projects with one another. Remember that you are strongly encouraged to discuss homeworks and projects at office hours, where I am happy to offer advice.

If you copy code in any homework or project (e.g., from Stack Overflow), please make this clear and cite where you copied it from; otherwise there is a risk of being accused of plagiarism. Furthermore, please keep in mind that a correctly cited chunk of code will not be considered plagiarism, but it could detract from demonstrating that you understand the course material, causing you to lose points on an assignment. If you are unsure what to do for a particular assignment, I recommend discussing it with me at office hours.

Please contact me if you have any questions and make sure you have read the academic integrity section below.

lecture-schedule-IE-2064

Standard university policies

Academic Integrity

Students in this course will be expected to comply with the SSOE Policy on Academic Integrity and the Swanson School of Engineering Policy. Any student suspected of violating this obligation for any reason during the semester will be required to participate in the procedural process, initiated at the instructor level, as outlined in the University Guidelines on Academic Integrity and the Swanson School procedures. This may include, but is not limited to, the confiscation of the examination of any individual suspected of violating University Policy.

All students are expected to adhere to the standards of professional conduct and academic honesty. Any student engaged in cheating, plagiarism, or other acts of academic dishonesty would be subject to disciplinary action. Any student suspected of violating this obligation for any reason during the semester will be required to participate in the procedural process, initiated at the instructor level, as outlined in the SSOE Academic Integrity Policy found at: https://www.engineering.pitt.edu/Academic-Integrity-Guidelines/.

To learn more about Academic Integrity, visit the Academic Integrity Guide for an overview of the topic. For hands-on practice, complete the Understanding and Avoiding Plagiarism tutorial.

Disability Services

If you have a disability for which you are or may be requesting an accommodation, you are encouraged to contact both your instructor and Disability Resources and Services (DRS), 140 William Pitt Union, (412) 648-7890, drsrecep@pitt.edu, (412) 228-5347 for P3 ASL users, as early as possible in the term. DRS will verify your disability and determine reasonable accommodations for this course.

Religious observance

The observance of religious holidays (activities observed by a religious group of which a student is a member) and cultural practices are an important reflection of diversity. As your instructor, I am committed to providing equivalent educational opportunities to students of all belief systems. At the beginning of the semester, you should review the course requirements to identify foreseeable conflicts with assignments, exams, or other required attendance. If at all possible, please contact me within the first two weeks of the semester to allow time for us to discuss and make fair and reasonable adjustments to the schedule and/or tasks.

Diversity and Inclusion

The University of Pittsburgh does not tolerate any form of discrimination, harassment, or retaliation based on disability, race, color, religion, national origin, ancestry, genetic information, marital status, familial status, sex, age, sexual orientation, veteran status or gender identity or other factors as stated in the University’s Title IX policy. The University is committed to taking prompt action to end a hostile environment that interferes with the University’s mission. For more information about policies, procedures, and practices: https://www.diversity.pitt.edu/civil-rights-title-ix/policies-procedures-and-practices. I ask that everyone in the class strive to help ensure that other members of this class can learn in a supportive and respectful environment. If there are instances of the aforementioned issues, please contact the Title IX Coordinator, by calling 412-648-7860, or emailing titleixcoordinator@pitt.edu.

Communication to Instructor Pertaining to Illness

As in any situation regarding class absence (remote or in person), a student who becomes ill (whether COVID-19 related or not) is responsible for communicating with me regarding course absences. Please contact me and provide documentation when absences affect quizzes/exams. This should be done via email as soon as possible.