Storytime: the Gaussian Mixture Model

Bayesian inference is fundamentally inductive: we describe how our data was created, and Bayes' Theorem fills in the missing details given the actual data. So in order to do Gaussian mixture modeling, we need a story for how this data came to be. That story goes something like this, and every variable that appears in it is something we'll want to figure out, given our actual data set.

To generate a single data point (petal and sepal width/length), we do the following (a code sketch of the whole process appears after the list):

  • We pick a cluster to put our new iris in. There are $n$ different clusters, numbered $0$ through $n - 1$. There are ways of telling this story where we don't have to pick $n$, but they make the math more difficult.¹ So, for now, we're going to assume that we're given $n$.
  • Picking a cluster is equivalent to selecting $i \in \{0, 1, 2, 3, \dots, n - 1\}$. These clusters can have different probabilities of being picked, according to some vector of probabilities $\langle p_0, p_1, \dots, p_{n-1} \rangle$ that adds up to 1.
  • We draw a random value from a normal distribution that depends on the cluster we picked earlier. Each cluster $i$ has some mean $\mu_i$ and some covariance $\Sigma_i$ that together determine what the "average" iris of cluster $i$ looks like and how spread out those irises are in every dimension.
  • That's it!
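
Before turning to inference, here is a minimal sketch of this generative story in NumPy. The function name `generate_iris`, its argument names, and the choice of NumPy itself are illustrative assumptions, not anything prescribed by the text.

```python
import numpy as np

def generate_iris(p, mus, covs, rng=None):
    """Draw one synthetic iris (petal/sepal measurements) from the mixture story.

    p    : length-n vector of cluster probabilities that adds up to 1
    mus  : list of n mean vectors mu_i (4-dimensional for the iris data)
    covs : list of n covariance matrices Sigma_i
    """
    rng = rng if rng is not None else np.random.default_rng()
    # Step 1: pick a cluster i in {0, ..., n-1} with probabilities p.
    i = rng.choice(len(p), p=p)
    # Step 2: draw the four measurements from that cluster's normal distribution.
    x = rng.multivariate_normal(mus[i], covs[i])
    return i, x
```

Repeating this draw many times (with fixed p, mus, and covs) produces a data set like the one the model assumes; at inference time we only see the measurements x, not the cluster labels i.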

Our goal is, given a bunch of outputs from this process, to estimate $n$ and $p_i, \mu_i, \Sigma_i$ for each cluster $i$. Using Bayesian inference, we're not just going to get a single guess for each, but rather a probability distribution over all of these values that will let us be as confident as we ought to be.
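
To make the inference side concrete as well, here is a hedged sketch of how this model could be expressed in a probabilistic programming library. The text doesn't commit to any particular tool, so the use of PyMC, the specific priors, and all names below (`fit_gmm`, `mu_i`, `chol_i`, and so on) are assumptions made purely for illustration.

```python
import numpy as np
import pymc as pm

def fit_gmm(X, n_clusters, draws=1000):
    """Sample a posterior over mixture weights, means, and covariances.

    X          : (N, D) array of observations (D = 4 for the iris data)
    n_clusters : the assumed number of clusters n
    """
    D = X.shape[1]
    with pm.Model():
        # Cluster probabilities p: a Dirichlet prior keeps them summing to 1.
        p = pm.Dirichlet("p", a=np.ones(n_clusters))
        comp_dists = []
        for i in range(n_clusters):
            # Per-cluster mean mu_i with a weakly informative prior.
            mu_i = pm.Normal(f"mu_{i}", mu=0.0, sigma=10.0, shape=D)
            # Per-cluster covariance Sigma_i via an LKJ Cholesky prior.
            chol_i, _, _ = pm.LKJCholeskyCov(
                f"chol_{i}", n=D, eta=2.0,
                sd_dist=pm.Exponential.dist(1.0), compute_corr=True,
            )
            comp_dists.append(pm.MvNormal.dist(mu=mu_i, chol=chol_i))
        # Each observed iris is drawn from one of the n multivariate normals.
        pm.Mixture("obs", w=p, comp_dists=comp_dists, observed=X)
        # MCMC returns samples, i.e. distributions over p, mu_i, and Sigma_i,
        # rather than single point estimates.
        return pm.sample(draws=draws)
```

The returned trace contains posterior samples for every parameter, so summarizing it yields credible intervals rather than single best guesses, which is the "as confident as we ought to be" part of the goal above.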


  1. Look up Dirichlet processes to see what this might look like.
