Problem Setup

Gaussian Mixture Models

We'll be fitting something known as a Gaussian mixture model. Specifically, a Gaussian mixture model is a generalization of the normal (or Gaussian, hence the name) distribution. Instead of having a single normal distribution, we imagine a dataset created by mixing multiple Gaussian distributions with different frequencies.

This is important because, in the real world, there are often important latent variables—unseen information that affects the data you have—that we don't know about. Many real-world datasets are roughly normally distributed, or at least unimodal: having a single peak. So, if a dataset looks strongly multimodal, having several distinct peaks, that's a hint that there's more to the story. We should probably be hunting for hidden variables that explain our data better than our current models, and we should be suspicious of many machine learning techniques that assume normally distributed or unimodal data. Examples from the real world:

  • If response to a medical treatment is bimodal, perhaps the treatment works differently in people with a specific gene.
  • If customer behavior is multimodal, that might indicate different segments of a market that could be identified.
  • If political preferences are multimodal, then perhaps those modes represent different voting blocs that should be accounted for when polling.
plot

Problem Setup

Gaussian Mixture Models

We'll be fitting something known as a Gaussian mixture model. Specifically, a Gaussian mixture model is a generalization of the normal (or Gaussian, hence the name) distribution. Instead of having a single normal distribution, we imagine a dataset created by mixing multiple Gaussian distributions with different frequencies.

This is important because, in the real world, there are often important latent variables—unseen information that affects the data you have—that we don't know about. Many real-world datasets are roughly normally distributed, or at least unimodal: having a single peak. So, if a dataset looks strongly multimodal, having several distinct peaks, that's a hint that there's more to the story. We should probably be hunting for hidden variables that explain our data better than our current models, and we should be suspicious of many machine learning techniques that assume normally distributed or unimodal data. Examples from the real world:

  • If response to a medical treatment is bimodal, perhaps the treatment works differently in people with a specific gene.
  • If customer behavior is multimodal, that might indicate different segments of a market that could be identified.
  • If political preferences are multimodal, then perhaps those modes represent different voting blocs that should be accounted for when polling.
plot