Storytime: the Gaussian Mixture Model

Bayesian inference is fundamentally inductive: we describe how our data was created, and Bayes' Theorem fills in the missing details given the actual data. So in order to do Gaussian mixture modeling, we need a story for how this data came to be. That story goes something like this, and every variable that appears in it is something we'll want to figure out, given our actual data set.

To generate a single data point (petal and sepal width/length), we do the following (a code sketch of the whole process appears after the list):

  • We pick a cluster to put our new iris in. There are $n$ different clusters, numbered $0$ through $n - 1$. There are ways of telling this story where we don't have to pick $n$, but they make the math more difficult.¹ So, for now, we're going to assume that we're given $n$.
  • Picking a cluster is equivalent to selecting $i \in \{0, 1, 2, 3, \dots, n - 1\}$. These clusters can have different probabilities of being picked, according to some vector of probabilities $\langle p_0, p_1, \dots, p_{n-1} \rangle$ that adds up to 1.
  • We draw a random value from a normal distribution that depends on the cluster we picked earlier. Each cluster $i$ has some mean $\mu_i$ and some covariance $\Sigma_i$ that together determine what the "average" iris of cluster $i$ looks like and how spread out those irises are in every dimension.
  • That's it!
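
Before turning to inference, here is a minimal sketch of this generative story in NumPy. The function name `generate_iris`, its argument names, and the choice of NumPy itself are illustrative assumptions, not anything prescribed by the text.

```python
import numpy as np

def generate_iris(p, mus, covs, rng=None):
    """Draw one synthetic iris (petal/sepal measurements) from the mixture story.

    p    : length-n vector of cluster probabilities that adds up to 1
    mus  : list of n mean vectors mu_i (4-dimensional for the iris data)
    covs : list of n covariance matrices Sigma_i
    """
    rng = rng if rng is not None else np.random.default_rng()
    # Step 1: pick a cluster i in {0, ..., n-1} with probabilities p.
    i = rng.choice(len(p), p=p)
    # Step 2: draw the four measurements from that cluster's normal distribution.
    x = rng.multivariate_normal(mus[i], covs[i])
    return i, x
```

Repeating this draw many times (with fixed p, mus, and covs) produces a data set like the one the model assumes; at inference time we only see the measurements x, not the cluster labels i.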

Our goal is, given a bunch of outputs from this process, to estimate $n$ and $p_i, \mu_i, \Sigma_i$ for each cluster $i$. Using Bayesian inference, we're not just going to get a single guess for each, but rather a probability distribution over all of these values that will let us be as confident as we ought to be.
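
To make the inference side concrete as well, here is a hedged sketch of how this model could be expressed in a probabilistic programming library. The text doesn't commit to any particular tool, so the use of PyMC, the specific priors, and all names below (`fit_gmm`, `mu_i`, `chol_i`, and so on) are assumptions made purely for illustration.

```python
import numpy as np
import pymc as pm

def fit_gmm(X, n_clusters, draws=1000):
    """Sample a posterior over mixture weights, means, and covariances.

    X          : (N, D) array of observations (D = 4 for the iris data)
    n_clusters : the assumed number of clusters n
    """
    D = X.shape[1]
    with pm.Model():
        # Cluster probabilities p: a Dirichlet prior keeps them summing to 1.
        p = pm.Dirichlet("p", a=np.ones(n_clusters))
        comp_dists = []
        for i in range(n_clusters):
            # Per-cluster mean mu_i with a weakly informative prior.
            mu_i = pm.Normal(f"mu_{i}", mu=0.0, sigma=10.0, shape=D)
            # Per-cluster covariance Sigma_i via an LKJ Cholesky prior.
            chol_i, _, _ = pm.LKJCholeskyCov(
                f"chol_{i}", n=D, eta=2.0,
                sd_dist=pm.Exponential.dist(1.0), compute_corr=True,
            )
            comp_dists.append(pm.MvNormal.dist(mu=mu_i, chol=chol_i))
        # Each observed iris is drawn from one of the n multivariate normals.
        pm.Mixture("obs", w=p, comp_dists=comp_dists, observed=X)
        # MCMC returns samples, i.e. distributions over p, mu_i, and Sigma_i,
        # rather than single point estimates.
        return pm.sample(draws=draws)
```

The returned trace contains posterior samples for every parameter, so summarizing it yields credible intervals rather than single best guesses, which is the "as confident as we ought to be" part of the goal above.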


  1. Look up Dirichlet processes to see what this might look like.
