Probability theory

The world we live in is inherently random. Even in a fully deterministic universe (quantum mechanics seems to contradict this view, at least for now), we would require complete knowledge of all relevant variables and factors in order to perform inference and prediction on a given problem. Given incomplete knowledge about the data-generating process, either in the form of missing variables or due to the mere fact that we can only draw samples from that process, we have to live with some form of randomness. Probability theory thus becomes an important tool for solving data-related problems.

Introductory courses on probability theory are mostly concerned with well-behaved random variables such as Bernoulli and Binomial variables and the famous Normal distribution. While this is certainly helpful for learning the basics or deriving meaningful theoretical results, we need to be aware that real-world data problems are often messy and our typical assumptions might break down completely. Outside of academia, empirical results outweigh theoretical justification, and even the most elegant model will quickly be dropped if it makes you lose money. This story is a somewhat cautionary tale about this issue.

Nevertheless, theory can guide us through the seemingly endless space of possible solutions for a given problem and help us exclude those options that clearly won't work. Since every problem is different, a data practitioner with strong mathematical and statistical knowledge can also develop new solutions that work exceptionally well for the particular task at hand. As so often in life, balance is key here.

With that being said, I hope that this section can provide helpful insights into the theory of probability. I certainly cannot cover everything, but I aim to give a reasonable number of sources for further reading. There will also be proofs, as a solid proof still outweighs every other argument when searching for the most performant approach. If you find that the sample average is a better solution than a \$10,000 Deep Bayesian Residual Neural Network trained with a hybrid MCMC algorithm, you had better be able to prove that to your hype-driven manager. A savvy sales team might still be able to somehow sell that as advanced Artificial Intelligence.

One final quick note: all content will, for now, be based on the usual notion of probability without a measure-theoretic background, as is typically done for non-mathematicians. While measure theory generalizes many aspects of probability theory and advanced Machine Learning research frequently taps into it, I found it too complex to discuss here as well (I also don't yet feel confident enough to provide learning material on it). For a measure-theoretic introduction to probability, I found this book by David Pollard to be a very good read.


  • Casella, George, and Roger L. Berger. Statistical inference. Vol. 2. Pacific Grove, CA: Duxbury, 2002.
  • Mittelhammer, Ron C. Mathematical statistics for economics and business. Vol. 78. New York: Springer, 1996.
  • Mohri, Mehryar, Afshin Rostamizadeh, and Ameet Talwalkar. Foundations of machine learning. MIT Press, 2018.