Why Probability?

Despite occasional problems, this algorithm tends to work quite well for clustering datasets. There are a few practical issues, however:

  • You have to specify the number of clusters beforehand, and results can be quite odd if that choice is incorrect for your data.
  • Similarly, if the initialization does affect the final output, it's not clear how we'd go about figuring out which initialization produces the best results.

There are nonprobabilistic solutions to these problems, but statistics and probability gives us elegant, principled solutions for them. If we can compute the likelihood of observing the data we have under different models (say, with different amounts of clusters), we can pick the model that matches our data the best, and that provides a way of assessing our final model fit that has a strong theoretical foundation.

Probabilistic modeling also lets us reason about uncertainty, which is a serious benefit in a world that rarely gives us complete information. We can answer questions like "what is the probability this point belongs to cluster A as opposed to cluster B" and make decisions using that information without over- or under-promising how much we know.

This means we're going to be using a probabilistic version of k-means clustering to analyze some data. Let's get started!

Why Probability?

Despite occasional problems, this algorithm tends to work quite well for clustering datasets. There are a few practical issues, however:

  • You have to specify the number of clusters beforehand, and results can be quite odd if that choice is incorrect for your data.
  • Similarly, if the initialization does affect the final output, it's not clear how we'd go about figuring out which initialization produces the best results.

There are nonprobabilistic solutions to these problems, but statistics and probability gives us elegant, principled solutions for them. If we can compute the likelihood of observing the data we have under different models (say, with different amounts of clusters), we can pick the model that matches our data the best, and that provides a way of assessing our final model fit that has a strong theoretical foundation.

Probabilistic modeling also lets us reason about uncertainty, which is a serious benefit in a world that rarely gives us complete information. We can answer questions like "what is the probability this point belongs to cluster A as opposed to cluster B" and make decisions using that information without over- or under-promising how much we know.

This means we're going to be using a probabilistic version of k-means clustering to analyze some data. Let's get started!