Unstructured Data Analytics

95-865

Units: 6

Description

Companies, governments, and other organizations now collect massive amounts of data such as text, images, audio, and video. How do we turn this heterogeneous mess of data into actionable insights? A common problem is that we often do not know ahead of time what structure underlies the data, which is why such data are often referred to as “unstructured”. This course takes a practical, two-step approach to unstructured data analysis:

  1. We first examine how to identify possible structure present in the data via visualization and other exploratory methods.
  2. Once we have clues for what structure is present in the data, we turn toward exploiting this structure to make predictions.
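As a rough illustration of these two steps (a minimal sketch, not taken from the course materials), the Python snippet below first explores a tiny text collection by counting which words are distinctive for each group, then uses those word counts to predict the group of a new text. The toy reviews, the labels, and the function names `tokenize`, `distinctive_words`, `cosine`, and `predict` are all made up for this example, and the labels are assumed to be known in advance, which is often not the case with real unstructured data.

```python
from collections import Counter
import math

# Hypothetical toy corpus: (text, label) pairs invented for illustration.
docs = [
    ("the battery drains fast and the screen cracked", "complaint"),
    ("screen cracked after one week and battery died", "complaint"),
    ("great battery life and a beautiful screen", "praise"),
    ("beautiful design, great camera, love it", "praise"),
]

def tokenize(text):
    # Lowercase, split on whitespace, strip basic punctuation.
    return [w.strip(".,!?") for w in text.lower().split()]

# Step 1 (explore): build word counts per label to look for structure.
counts = {}
for text, label in docs:
    counts.setdefault(label, Counter()).update(tokenize(text))

def distinctive_words(label, k=3):
    # Most frequent words that appear under this label and under no other.
    others = set()
    for other_label, other_counts in counts.items():
        if other_label != label:
            others.update(other_counts)
    own = [w for w, _ in counts[label].most_common() if w not in others]
    return own[:k]

# Step 2 (predict): assign a new text to the label whose word counts
# are most similar to it under cosine similarity.
def cosine(a, b):
    dot = sum(a[w] * b[w] for w in a)  # a Counter returns 0 for missing words
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

def predict(text):
    vec = Counter(tokenize(text))
    return max(counts, key=lambda label: cosine(vec, counts[label]))

print(distinctive_words("complaint"))
print(predict("the screen cracked"))
```

On real data, the exploration step would more typically rely on methods such as clustering, topic modeling, or manifold learning, and the prediction step on a trained model; the word counts here only stand in for that workflow.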

Many examples show how these methods help solve real problems faced by organizations. Along the way, we encounter many of the most popular methods for analyzing unstructured data, from modern classics in manifold learning, clustering, and topic modeling to some of the latest developments in deep neural networks for analyzing text, images, and time series, including the basics of large language models. We will write a lot of Python code and dabble a bit in GPU computing (Google Colab).

Note regarding GenAI and foundation models (such as large language models): As you are likely all aware, there are now technologies like (Chat)GPT, Gemini, Claude, Llama, etc., which will all keep getting better over time. If you use any of these in your homework, please cite them. For the purposes of the class, I will view these as external collaborators/resources. For exams, I want to make sure that you actually understand the material and are not just telling me what someone else or GPT/Gemini/etc. knows. This is important so that in the future, if you use AI technologies to assist you in your data analysis, you have enough background knowledge to check for yourself whether the AI is giving you a correct solution. For this reason, electronics are explicitly not allowed during exams in this class.

Learning Outcomes

By the end of the course, students are expected to have developed the following skills:

  • Recall and discuss common methods for exploratory and predictive analysis of unstructured data
  • Write Python code for exploratory and predictive data analysis that handles large datasets
  • Work with cloud computing using Google Colab
  • Apply unstructured data analysis techniques discussed in class to solve problems faced by governments and companies

Prerequisites Description

95888 or 90819 or 95898

Syllabus