People often make machine learning out to be something from a science fiction novel. It sounds like teaching a computer to become sentient or something even more outlandish.
In reality, machine learning is usually just statistics, but it's a very specific use of statistics. In fact, my graduate course on machine learning was called “statistical learning.” (I must say that “machine learning” is certainly better branding though.)
But how exactly is machine learning different from other statistics?
There are two key differences:
We care more about predicting an outcome and less about explaining how that outcome happens
We write computer programs to improve those predictions
More on those differences below...
Predicting versus Explaining Outcomes
Historically, we used statistics to make inferences. We wanted to learn how some variables impacted other variables. Since it was costly to gather data, we would hypothesize a relationship, conduct experiments or gather data, and then test the strength of that relationship to “infer” a cause-and-effect relationship in the real world.
The majority of academic research uses inferential statistics. So does pharmaceutical research.
Machine learning, however, uses predictive statistics. We care less about whether variables cause an outcome than whether they correlate with one.
It's really not that complicated. Imagine you're the CEO of a major convenience-store chain. You want to know why some stores have higher sales than others. Is it gas prices relative to the competition? Is it the weather? Is it traffic flow or population density?
That's an inferential question. You're trying to understand why an outcome happens. What you learn from that analysis may determine future store locations for your company.
Now let's say, as the CEO, you want to provide investors with projected sales for a new store opening. You ask your data scientist to build a model that predicts sales.
That's a predictive question. You're trying to use statistics to predict the future. And best of all, it matters less whether the variables in your model cause the outcome – only whether they consistently predict the outcome.
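To make that concrete, here's a minimal sketch of a predictive sales model: ordinary least squares fit with NumPy. The features, the sales figures, and the new-site numbers are all made up for illustration, not real store data.

```python
import numpy as np

# Hypothetical data: each row is an existing store, columns are
# (traffic count in thousands, population density in thousands/sq mi).
features = np.array([
    [12.0, 3.1],
    [18.5, 4.0],
    [9.0, 2.2],
    [21.0, 5.5],
    [15.0, 3.8],
])
# Annual sales for those stores, in millions of dollars (made up).
sales = np.array([1.9, 2.8, 1.4, 3.4, 2.4])

# Fit ordinary least squares: sales ≈ features @ coef + intercept.
# The column of ones absorbs the intercept term.
X = np.column_stack([features, np.ones(len(features))])
coef, *_ = np.linalg.lstsq(X, sales, rcond=None)

# Project sales for a proposed new store site (traffic 16k, density 4.2k).
new_site = np.array([16.0, 4.2, 1.0])
projected = new_site @ coef
print(projected)
```

Notice that nothing in the model claims traffic *causes* sales; the coefficients just capture a correlation that has held across existing stores.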
This is a subtle, but important, distinction to understand.
Now these two applications – inference and prediction – are not mutually exclusive. They can inform and support one another.
If your predictive models consistently show that being next to a highway exit correlates with higher store revenue, that helps form a hypothesis for an inferential statistical analysis.
And if past research shows that stores next to highway exits have higher store sales, a data scientist knows to include that variable when building a predictive model.
Improving Predictions with Computers and Big Data
The other key component of machine learning is the way we use computers to make and apply these predictions.
In the past, predictive models used smaller data sets. That’s changed in the last twenty years. Computer and mobile applications, along with the “internet of things,” have generated vast swathes of data. This has created an ever-expanding list of variables that could be used to make predictions.
Now we can start with a model with a single variable and add more until we’ve fully optimized its predictive capability. Or we can do the reverse: include all the variables, then take away one at a time until we find the best combination.
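Here's a rough sketch of that second approach, backward elimination, using ordinary least squares in NumPy. The function names are my own, and a real workflow would score each candidate model on held-out data rather than the training data.

```python
import numpy as np

def fit_and_score(X, y):
    """Fit OLS on X and return mean squared error on the same data.
    (A real workflow would score on held-out data instead.)"""
    Xb = np.column_stack([X, np.ones(len(X))])
    coef, *_ = np.linalg.lstsq(Xb, y, rcond=None)
    resid = y - Xb @ coef
    return float(np.mean(resid ** 2))

def backward_eliminate(X, y, keep=1):
    """Drop one variable at a time -- always the one whose removal
    hurts the fit least -- until only `keep` variables remain.
    Returns the indices of the surviving columns."""
    cols = list(range(X.shape[1]))
    while len(cols) > keep:
        # Score the model with each remaining column removed.
        scores = {c: fit_and_score(X[:, [k for k in cols if k != c]], y)
                  for c in cols}
        # Remove the column whose absence raises the error the least.
        cols.remove(min(scores, key=scores.get))
    return cols
```

Run on synthetic data where only the first column actually relates to the outcome, the procedure keeps that column and discards the noise.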
Obviously, great care and skill are needed to accomplish this. When these techniques are used like a blunt instrument, it devolves into something called "data dredging" or "p-hacking." That's where a researcher uses statistical methodology to find any "statistically significant" relationship – even ones that don't hold up to common sense.
A good data dredging example I saw on Wikipedia showed that deaths from venomous spiders correlated with the number of spelling bee winners. Obviously, these two have nothing to do with each other; any correlation is a matter of chance and is unlikely to hold up when making future predictions.
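It's easy to simulate how this happens. In the sketch below (all numbers are randomly generated; nothing here is real data), we create 1,000 variables that by construction have nothing to do with the outcome, and at least one of them still correlates with it fairly strongly by pure chance:

```python
import numpy as np

rng = np.random.default_rng(1)
outcome = rng.normal(size=20)             # 20 "yearly" observations
noise_vars = rng.normal(size=(1000, 20))  # 1,000 unrelated random variables

# Correlation of each random variable with the outcome.
corrs = np.array([np.corrcoef(v, outcome)[0, 1] for v in noise_vars])

# With enough candidate variables, some correlate strongly by chance alone.
print(np.max(np.abs(corrs)))
```

Test enough variables against a small sample and something will always "look significant." That's exactly why a dredged-up correlation tends to vanish on new data.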
We can improve a model’s predictions using something called “cross-validation”: we hold out part of the data, train on the rest, and test whether the model racks up false positives and false negatives on data it has never seen. This gives an honest estimate of the model’s accuracy and helps ensure that the variables selected continue to work in new environments.
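As a rough illustration, here's what k-fold cross-validation looks like for an ordinary-least-squares model in NumPy. The function name and the way the data is split are my own sketch, not a standard library API.

```python
import numpy as np

def kfold_mse(X, y, k=5, seed=0):
    """Estimate out-of-sample error with k-fold cross-validation:
    train an OLS model on k-1 folds, score it on the held-out fold,
    and average the held-out errors across all k folds."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(y))      # shuffle before splitting
    folds = np.array_split(idx, k)
    errors = []
    for i in range(k):
        test = folds[i]
        train = np.concatenate([folds[j] for j in range(k) if j != i])
        Xtr = np.column_stack([X[train], np.ones(len(train))])
        Xte = np.column_stack([X[test], np.ones(len(test))])
        coef, *_ = np.linalg.lstsq(Xtr, y[train], rcond=None)
        errors.append(float(np.mean((y[test] - Xte @ coef) ** 2)))
    return float(np.mean(errors))
```

Because every data point is scored by a model that never trained on it, a variable that only "worked" by chance tends to get exposed here rather than in production.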
That, in a nutshell, is how machine learning is different from traditional statistical analysis.
Machine learning isn’t science fiction mumbo jumbo. We’re not teaching computers to become sentient. Nonetheless, machine learning is still pretty damn cool and its impact on our lives will grow.
Machine learning techniques don’t detract from the usefulness of inferential statistics though. Researchers should not discard the tried-and-true methods for sexier-sounding data science methods. They must continue to apply statistics and solid research methodologies to uncover relationships between variables and outcomes to help us better understand the world.
If you want to learn how to use machine learning, I suggest the book An Introduction to Statistical Learning. It explains the theory and the practical steps used to build these models. Plus, it’s a far easier read than most statistics textbooks I’ve read.