Applied Analytics: the Machine Learning Pipeline
Units: 12
Machine learning is a valued set of analytics techniques, a confluence of ideas from computer science, statistics, economics, physics, and other fields. Machine learning is transforming fields with new capabilities and new ways of understanding and visualizing data, and it is becoming a key driver in decision making. However, knowing when (and how) to apply appropriate machine learning techniques requires understanding of data, machine learning, and the problem domain. This class seeks to teach students how to address the entire machine learning pipeline, starting from messy data and provisional questions and ending with actionable interpretations and insights.
The course will cover discovery, planning, analysis, and interpretation. Discovery involves understanding the data at hand, determining what is and is not answerable, and generating questions. Planning involves contrasting the application of the desired machine learning method on ideal clean data with its application on the messy data at hand. Dealing with representation, handling missing data, and designing appropriate machine learning machinery are all part of planning. Analysis involves applying the machine learning method and checking model performance and assumptions in a principled and responsible manner. Interpretation involves the transformation of algorithm outputs into meaningful and actionable characterizations of the results. Each part of the pipeline is interconnected, and students will learn to anticipate and address limitations through understanding of the pipeline as a whole.
Throughout the course we will focus on one vertical, health care, recognizing that the methods developed will generalize to others. We will work with real, messy, structured and unstructured data--including databases, text, and images. We will contrast machine learning methods against what is currently used in health care analytics, and describe the advantages and promise of each.
This course will be a mixture of lectures, discussions and coding workshops. There will be a final project and no final exam.
- learn and adapt the mathematical formulations of machine learning methods for principled application
- perform end-to-end machine learning analysis, including: data exploration, preparation, cleaning, prediction, validation, visualization, and interpretation
- build working knowledge of a data science pipeline, e.g. the R tidyverse (which we will use in class) or Python with scikit-learn and pandas
- develop machine learning algorithms tailored to the data and to the business or research question
- understand the strengths and limitations of existing analytic strategies, including: randomized controlled trials, observational studies, Cox proportional hazards, logistic regression
- write a conference-style paper in LaTeX
- use GitHub for project code
Students should have completed or be concurrently taking Data Mining, Machine Learning for Problem Solving, ML 17-601, ML 17-401 or the equivalent. Previous exposure to R, Python or another programming language is highly recommended.