Attempted Solution: Integration

One solution is to compute $\int_{x \in \mathbb{R}} P(x)\ dx$ using some numerical integration method: pure brute force. This can work, but it has some big problems:

  • Where do we start and end the integral? Remember $P$ is a black box: we don't know what we can reasonably restrict our search space to. How are you supposed to know that my axis limits here aren't misleading?
  • It scales very poorly with dimensionality. If we imagine integration as breaking up the area into a ton of rectangles, the number of rectangles we need is something like $n^D$, where $n$ is the number of rectangles we split each dimension into and $D$ is the number of dimensions. Exponential growth is vicious. It's not uncommon to see neural networks with ten million parameters. Even if we only split each dimension into two regions, that is $2^{10{,}000{,}000}$ rectangles: if we stored each result in a single atom, there wouldn't be enough space in the entire universe to hold our output.
  • Without knowing anything about $P$, it's hard to say how fine-grained our integration needs to be. Too fine-grained and it won't be possible to compute $P$ for all of the values we want; too coarse and we'll miss parts of the distribution. Imagine sampling just three points from $P$, at the left, middle, and right: you completely miss the interesting behavior.
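
To get a feel for the dimensionality problem, here is a rough sketch that counts grid cells on a log scale. The function name `grid_cells_log10` is made up for illustration, and the figure of about $10^{80}$ atoms in the observable universe is a commonly quoted order-of-magnitude estimate, used here as an assumption rather than a precise value.

```python
import math

# Assumption: ~10^80 atoms in the observable universe (a commonly quoted
# order-of-magnitude estimate, not an exact figure).
ATOMS_LOG10 = 80.0


def grid_cells_log10(n: int, d: int) -> float:
    """Return log10(n**d): the number of grid cells, on a log scale.

    n is the number of cells per dimension, d is the number of dimensions.
    Working in logs avoids building an astronomically large integer.
    """
    return d * math.log10(n)


if __name__ == "__main__":
    for d in (1, 2, 10, 100, 10_000_000):
        cells = grid_cells_log10(2, d)
        verdict = "exceeds" if cells > ATOMS_LOG10 else "fits within"
        print(f"D = {d:>10,}: about 10^{cells:,.1f} cells ({verdict} ~10^80 atoms)")
```

Even with only two cells per dimension, the count passes the atom estimate somewhere between a few hundred and a few thousand dimensions, long before we reach ten million.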

These problems will pop up again and again, and we'll become intimately familiar with them. I've been extremely optimistic in using a single-variable normal distribution as my imagined $P$: distributions with crazy spikes, very long tails, or separated "islands" will be much harder to deal with.
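
As a small illustration of how a coarse grid can miss a spike entirely, the sketch below integrates a very narrow normal density with the midpoint rule. Everything here is an assumption chosen for the example: the spike at 0.3 with standard deviation 0.01, the limits of -5 to 5, and the helper names `normal_pdf`, `midpoint_integral`, and `spike`. The real $P$ is a black box, so we would not know any of these in advance.

```python
import math


def normal_pdf(x: float, mu: float = 0.0, sigma: float = 1.0) -> float:
    """Density of a normal distribution with mean mu and std dev sigma."""
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))


def midpoint_integral(f, lo: float, hi: float, n: int) -> float:
    """Approximate the integral of f over [lo, hi] with n midpoint rectangles."""
    width = (hi - lo) / n
    return width * sum(f(lo + (i + 0.5) * width) for i in range(n))


def spike(x: float) -> float:
    """Hypothetical stand-in for P: a narrow spike that falls between coarse grid points."""
    return normal_pdf(x, mu=0.3, sigma=0.01)


if __name__ == "__main__":
    coarse = midpoint_integral(spike, -5.0, 5.0, 10)      # rectangles of width 1
    fine = midpoint_integral(spike, -5.0, 5.0, 100_000)   # rectangles of width 1e-4
    print(f"coarse grid estimate: {coarse:.3e}")
    print(f"fine grid estimate:   {fine:.6f}")
```

The coarse grid never evaluates the function anywhere near the spike, so it reports essentially zero mass, while the fine grid recovers a total close to 1. In one dimension the fine grid is cheap; in many dimensions it is exactly the grid we cannot afford.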
