Figuring Out What We Already Know

A common way of using this iris dataset is to train some machine learning model to predict an iris's species given its measurements. We're going to try something a bit more challenging. If we give a model just the measurements, without telling it about the species, can it infer that the irises are split up into three distinct groups and estimate what those groups are?

It's a lot more useful to find these hidden subgroups than it is to split data into subgroups you already know exist. If you're predicting how a new patient will respond to a drug, or how a new customer will respond to a sale, you probably don't have a perfect understanding of what different groups of people exist within the larger population. Perhaps specific genes are important for a drug, perhaps they aren't. A model that can take in a large population of people and then split them into groups we can go try to explain after the fact can be an invaluable tool.

This also helps us understand whether a group is important in the first place. If our model doesn't see two groups we thought were different as distinct, perhaps that difference isn't relevant to the problem at hand. If we care about these iris measurements and not other properties, is it worth drawing a line between versicolor and virginica flowers? That's not a question that a model we train specifically to tell them apart will be able to answer for us, but it's a very natural fit for a probabilistic model.

Figuring Out What We Already Know

A common way of using this iris dataset is to train some machine learning model to predict an iris's species given its measurements. We're going to try something a bit more challenging. If we give a model just the measurements, without telling it about the species, can it infer that the irises are split up into three distinct groups and estimate what those groups are?

It's a lot more useful to find these hidden subgroups than it is to split data into subgroups you already know exist. If you're predicting how a new patient will respond to a drug, or how a new customer will respond to a sale, you probably don't have a perfect understanding of what different groups of people exist within the larger population. Perhaps specific genes are important for a drug, perhaps they aren't. A model that can take in a large population of people and then split them into groups we can go try to explain after the fact can be an invaluable tool.

This also helps us understand whether a group is important in the first place. If our model doesn't see two groups we thought were different as distinct, perhaps that difference isn't relevant to the problem at hand. If we care about these iris measurements and not other properties, is it worth drawing a line between versicolor and virginica flowers? That's not a question that a model we train specifically to tell them apart will be able to answer for us, but it's a very natural fit for a probabilistic model.