Attempted Solution: Curve Fitting

Let's say we can make some educated guesses about P. For example, perhaps we have good reason to believe it does look pretty close to a normal distribution. One approach is then to treat it as a curve-fitting problem: we fit a specific distribution (say, a normal distribution) to unnormalized samples from P(x) through some kind of optimization. Luckily, there are optimization methods that can ignore the scaling constant, so we don't actually have to deal with it explicitly.[1] Then, we can sample from that distribution instead of the original.
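To make this concrete, here is a minimal sketch of that optimization in Python. Everything in it is illustrative: log_p_star is a made-up stand-in for the log of the unnormalized P^*(x), and the candidate is a one-dimensional normal distribution whose parameters we tune by minimizing the KL divergence up to the unknown constant, estimating the expectation with samples from the candidate itself.

```python
import numpy as np
from scipy.optimize import minimize

# Hypothetical unnormalized target: a stand-in for log P*(x), where we
# pretend not to know the normalizing constant.
def log_p_star(x):
    return -0.5 * ((x - 2.0) / 0.7) ** 2

# Fit q(x) = Normal(mu, sigma) by minimizing KL(q || P) up to the unknown
# additive constant, i.e. E_q[log q(x) - log P*(x)]. The expectation is
# estimated with a fixed batch of standard-normal draws (x = mu + sigma * eps),
# which keeps the objective deterministic and smooth in (mu, log_sigma).
rng = np.random.default_rng(0)
eps = rng.standard_normal(2000)

def kl_up_to_constant(params):
    mu, log_sigma = params
    sigma = np.exp(log_sigma)
    x = mu + sigma * eps
    log_q = -0.5 * ((x - mu) / sigma) ** 2 - np.log(sigma) - 0.5 * np.log(2 * np.pi)
    return np.mean(log_q - log_p_star(x))

result = minimize(kl_up_to_constant, x0=[0.0, 0.0])
mu_hat, sigma_hat = result.x[0], np.exp(result.x[1])
print(mu_hat, sigma_hat)  # should land near 2.0 and 0.7
```

With this particular target the optimizer should recover a mean near 2.0 and a standard deviation near 0.7; if the target weren't actually normal, the fit would instead settle on the closest normal approximation it can find.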

This family of approaches is broadly called variational Bayes[2], and it has some upsides and downsides:

  • We don't have to pick the boundaries the way we do when integrating numerically.
  • It scales very well with dimensionality: if we fit a 3D normal distribution, even allowing it to take any kind of weird shape in 3D (e.g., stretched in one direction but not another), we only have to fit 9 parameters. Storing the distribution on a grid instead, even with only 10 bins per axis, requires computing 1,000 parameters, one for each grid box (a quick parameter-count comparison follows this list).
  • It can incorporate extra knowledge well, allowing us to simplify the problem depending on how much we know and our tolerance for inaccuracy.
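As a rough illustration of the scaling argument in the second bullet, here is a small, hypothetical parameter-count comparison (the helper names are mine):

```python
# A full d-dimensional normal distribution needs d mean values plus
# d * (d + 1) / 2 covariance entries; a grid with `bins` cells per axis
# needs bins ** d values, one per grid box.
def gaussian_param_count(d):
    return d + d * (d + 1) // 2

def grid_cell_count(d, bins=10):
    return bins ** d

for d in (1, 2, 3, 6):
    print(d, gaussian_param_count(d), grid_cell_count(d))
# d = 3 gives 9 vs. 1,000; d = 6 gives 27 vs. 1,000,000.
```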

There's one big downside: it's hard to know if you've picked a good distribution to fit to P, and the more flexible you want to be in what you can represent, the harder it is to fit. There is always some inherent error here: you can never be 100% sure that your approximation accurately captures the true distribution.

Variational Bayes is used quite often—it can be a very quick way to get a rough solution. But we'll leave it for now and move on to the main course.


  1. Specifically, we minimize something called the Kullback-Leibler divergence, which has two very important properties that get this to work. First, we can estimate the KL divergence between two distributions even if we only have a set of samples and their probabilities under both distributions, which is important when we don't know the closed form of P^*(x). Second, the scaling constant c becomes a fixed additive penalty to the KL divergence, so it has no effect on the best fit: the distribution that fits P(x) the best also fits P^*(x) the best (see the short derivation after these notes). I'll explain this in more detail in the future.
  2. Variational, in this context, means optimizing functions rather than optimizing numbers—we're finding a function that fits our posterior distribution.
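For footnote 1, here is a minimal sketch of why the constant drops out, assuming the convention P^*(x) = c · P(x) and writing Q for the candidate distribution we are fitting:

$$
\mathrm{KL}(Q \parallel P)
= \mathbb{E}_{x \sim Q}\!\left[\log \frac{Q(x)}{P(x)}\right]
= \mathbb{E}_{x \sim Q}\!\left[\log \frac{Q(x)}{P^*(x)}\right] + \log c
$$

The log c term is the same for every candidate Q, so the Q that minimizes the left-hand side also minimizes the expectation on the right, and that expectation only requires evaluating the unnormalized P^*(x).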