Apparently, it has become very popular to convince aspiring Machine Learning engineers that learning theory is rather overrated in practice. Authors of respective articles like to claim that they are able to do a good job without a deep understanding of the mathematical concepts.
Since my experience has been quite the opposite so far in most of my industry projects, I want to share a largely contrarian perspective. Hence, expect this piece to be influenced by my own personal biases on the topic. I am however sufficiently convinced of the arguments so hopefully you will bear with me.
First, let us start with a thought experiment:
A potential survivorship bias
Imagine a prospective ML engineer without any mathematical knowledge who is currently learning the basics to set foot in the field. Many of the articles they are reading online argue that a good understanding of the theory is more or less mandatory. Our prospect now faces two alternatives – aside from quitting:
• Follow the recommendations and build a solid foundation of mathematical knowledge. At the same time, ideally learn to code and work on personal projects
• Ignore the recommendations and focus only on the latter two
Let’s say they go with the second option and find a job. Everything goes well, and they manage to thrive in the industry. After four years, they find that there was no need for all those formulas at all. In fact, they were able to prove wrong all those snobs from academia who told them they would fail without the basics.
Now might be a good time to write an article on Medium to tell everyone how math and statistics are overrated in real-world Machine Learning.
Consider, on the contrary, a parallel universe where our ML apprentice had the exact opposite experience. Failed after a few months on the job because somehow things weren’t as easy without knowing some theory. Now, they may or may not decide that it’s time to catch up on their knowledge gaps.
Their motivation to write an article about this experience, however, will likely be much lower than in the first scenario. Who likes to talk about their personal failure in public, especially one that was caused by a slight amount of carelessness?
This fictitious scenario is definitely not intended to belittle anyone who has had such an experience! Rather, I want to encourage you to take personal stories about being successful in Machine Learning without theory with a grain of salt. Just because it worked for some people on the internet does not mean that the same approach will work for you or in general.
I would even go as far as stating that treating Machine Learning theory as an overrated subject can easily come back to bite you in the long term. To support this statement, I will now continue with some other examples and put them into a broader perspective. These scenarios are, to some extent, based on my own anecdotal experiences and biases.
I therefore recommend that you don’t blindly trust my own reasoning either.
Every problem looks like a nail
Let’s suppose you – believing that pre-packaged libraries are all you need – are working on a demand forecasting model trained on past items sold. There are plenty of examples and tutorials for time-series forecasting with RNNs and LSTMs available online. Hence, you only have to pick your favourite model and adapt it to your data.
Everything is working great, MSE is getting better every day and you are extra careful not to introduce any lookahead bias in your tests. Finally, you present your model to management and now comes the first bummer:
Management wants to see confidence intervals in their forecasts.
You do your research and learn that you can also use a Neural Network to fit a Normal distribution conditioned on your network’s inputs. That example obviously needs to be tailored to your recurrent model. Given the experience you now have running ML packages out-of-box, this is, of course, not a problem anymore.
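To make the idea a bit more concrete, here is a minimal sketch in plain Python of what fitting a Normal distribution means as a loss. It assumes the network outputs a mean mu and a standard deviation sigma for each time step; the function names and the example numbers are made up for illustration and do not come from any particular library.

```python
import math

def gaussian_nll(y, mu, sigma):
    """Negative log-likelihood of one observation y under Normal(mu, sigma^2)."""
    z = (y - mu) / sigma
    return 0.5 * math.log(2 * math.pi) + math.log(sigma) + 0.5 * z * z

def interval_95(mu, sigma):
    """Central 95% interval implied by the predicted mean and standard deviation."""
    z = 1.959964  # 97.5% quantile of the standard Normal
    return mu - z * sigma, mu + z * sigma

# Hypothetical network outputs for a single time step
mu, sigma = 120.0, 15.0
loss = gaussian_nll(y=130.0, mu=mu, sigma=sigma)  # value to minimise during training
low, high = interval_95(mu, sigma)                # interval to report alongside the forecast
```

In a real recurrent model you would average this loss over all time steps and let your framework differentiate it; the formula itself stays the same.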
The model gets updated, you give your next presentation, even management is happy and your model finally gets pushed to production. Everyone is happy for the next two weeks.
After a short period of tranquility, the first complaints arise from your procurement and sales departments. For some reason, your model predicts negative demand in some cases. This, in turn, causes increasing reluctance to trust a model that no one understands except for you and your team.
If only those confidence intervals weren’t a requirement, you could simply squash your forecast through the right activation function. A ReLU might just work well enough.
This is not possible here, so you go back to the lab once more and discover that you can truncate a Normal distribution to the positive real numbers. Since your problem is now fairly specific, you don’t find a suitable example online anymore and hence need to fiddle around yourself.
Finally, you succeed, the model is implemented and you are looking forward to leaving this project behind at last. Somehow, things were getting a little nerve-wracking over time.
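For the curious, truncating a Normal distribution to positive values boils down to renormalising its density by the probability mass that lies above zero. The following is a rough sketch using only the Python standard library; the function name is hypothetical, and it assumes mu / sigma is not so negative that the mass above zero underflows.

```python
import math
from statistics import NormalDist

STANDARD_NORMAL = NormalDist()

def truncated_normal_logpdf(y, mu, sigma):
    """Log-density of Normal(mu, sigma^2) truncated to y > 0."""
    if y <= 0:
        return -math.inf  # the truncated distribution puts no mass here
    z = (y - mu) / sigma
    log_density = -0.5 * math.log(2 * math.pi) - math.log(sigma) - 0.5 * z * z
    # Mass the untruncated Normal puts on (0, inf) equals Phi(mu / sigma)
    log_mass_above_zero = math.log(STANDARD_NORMAL.cdf(mu / sigma))
    return log_density - log_mass_above_zero
```

The negative of this quantity could serve as the training loss, which is exactly the kind of adaptation you will not find in a tutorial.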
Then, weeks after you started with your first model, someone reminds you that the number of items sold is not necessarily equal to actual demand. For obvious reasons, you cannot sell more items than your company has distributed in the first place. Therefore, what you see in the data might lie far below what people would have actually bought if enough quantities had been made available.
This brings you to the concept of censored data. The loss function you now have to derive completely on your own gives you some sense of discomfort and you start wondering what will happen next.
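As a rough illustration of what such a loss could look like, here is a sketch for a single sales record. For brevity it assumes a plain (untruncated) Normal demand distribution and that you know the available stock for each record; both the function and its argument names are my own invention.

```python
import math
from statistics import NormalDist

def censored_normal_nll(sold, stock, mu, sigma):
    """Negative log-likelihood of one record when demand ~ Normal(mu, sigma^2).

    sold < stock:  demand was observed exactly, so use the log-density.
    sold == stock: the item sold out, so we only know that demand >= stock.
    """
    if sold < stock:
        z = (sold - mu) / sigma
        return 0.5 * math.log(2 * math.pi) + math.log(sigma) + 0.5 * z * z
    prob_demand_at_least_stock = 1.0 - NormalDist(mu, sigma).cdf(stock)
    return -math.log(prob_demand_at_least_stock)

# A sold-out record no longer pulls the predicted mean down towards the sold quantity
loss_low_mean = censored_normal_nll(sold=50, stock=50, mu=50.0, sigma=10.0)
loss_high_mean = censored_normal_nll(sold=50, stock=50, mu=70.0, sigma=10.0)
```

Here the sold-out record is explained better by the higher mean, which is the behaviour a naive MSE on items sold cannot give you.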
Is this example constructed? To some extent yes, but I have actually worked on a very similar problem in practice and knowing a tad of theory was, to put it kindly, fairly helpful. There are two things to consider here:
1) How much time could you have saved in the above example had you been aware of the statistical properties of your problem in the first place?
Of course every project sooner or later comes to an end, hopefully a successful one. Along the way however, you can potentially skip a lot of trial and error if you are able to define your problem in a holistic manner. Had we realised early on that we were dealing with a continuous, positive, censored target, we could have found a suitable model much sooner.
Now let’s be honest: You could always encounter problems that lie way beyond your current knowledge. Nonetheless, a solid foundation can at least help you to steer your problem-solving process in the right direction.
2) Could you have done it with just the standard libraries?
Despite the example above, you will often still be able to achieve reasonable performance with pre-packaged models. However, think about this: If you know that your data follows a roughly quadratic relation, a plain linear model with quadratically transformed features will likely outperform any complex alternative.
A sophisticated Neural Network might need hundreds of examples for reasonable performance on this task. Our proposed linear model, on the other hand, could quickly converge to the generating function with only a fraction of the data. Now make the residual term follow approximately a Generalised Beta distribution and a custom model could suddenly perform much better than anything from sklearn.
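The quadratic example is easy to try out yourself. The sketch below generates a small, made-up dataset with a quadratic relation and fits an ordinary least-squares model on the features 1, x and x². The coefficients, noise level and sample size are arbitrary choices for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical data: quadratic relation plus a little noise, only 20 samples
x = rng.uniform(-3, 3, size=20)
y = 2.0 * x**2 - 1.0 * x + 0.5 + rng.normal(scale=0.1, size=x.size)

# Design matrix with columns [1, x, x^2]; the model stays linear in its coefficients
X = np.column_stack([np.ones_like(x), x, x**2])
coef, *_ = np.linalg.lstsq(X, y, rcond=None)

# Extrapolate to a point outside the training range
x_new = 4.0
y_pred = coef @ np.array([1.0, x_new, x_new**2])
```

With so little noise, the three fitted coefficients should land close to the generating values 0.5, -1 and 2, and the model extrapolates sensibly because it has the right functional form built in.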
Another, more practical and recent example would be COVID-19 modelling. If you want to make your life easy, you could just punch daily cases into an LSTM to forecast future infections. This procedure, however, would completely ignore potent infectious disease models like SIR and its offspring.
Such approaches have been around for decades, and it will be hard to outperform them with any ML algorithm alone. A fancy and creative solution might put ML on top of such established models. If you decide, at any point, to base your ML approach on SIR, you better be ready to dive into some formulas. Suddenly, differential equations aren’t so useless anymore…
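In case you have never seen it, the basic SIR model is just three coupled differential equations. Below is a minimal sketch that integrates them with a simple forward-Euler scheme; the parameter values are hypothetical and the step size is chosen for readability rather than accuracy.

```python
def simulate_sir(beta, gamma, s0, i0, r0, days, dt=0.1):
    """Integrate the SIR equations with forward Euler and return daily states.

    dS/dt = -beta * S * I / N
    dI/dt =  beta * S * I / N - gamma * I
    dR/dt =  gamma * I
    """
    n = s0 + i0 + r0
    s, i, r = float(s0), float(i0), float(r0)
    trajectory = [(s, i, r)]
    steps_per_day = round(1 / dt)
    for _ in range(days):
        for _ in range(steps_per_day):
            new_infections = beta * s * i / n * dt
            new_recoveries = gamma * i * dt
            s -= new_infections
            i += new_infections - new_recoveries
            r += new_recoveries
        trajectory.append((s, i, r))
    return trajectory

# Hypothetical parameters with beta / gamma = 3
trajectory = simulate_sir(beta=0.3, gamma=0.1, s0=9990, i0=10, r0=0, days=160)
peak_infected = max(state[1] for state in trajectory)
```

An ML component could then, for example, learn how beta changes over time instead of learning the whole epidemic curve from scratch.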
To summarise this section: Don’t get me wrong! Pre-packaged models are more often than not very performant out of the box. You should however keep in mind that you sometimes need to tailor a solution to your particular problem to get an actual edge. This is especially true when your data size is too small for ML to actually find any meaningful structure on its own:
In Machine Learning, dataset size matters
Consider another fictitious scenario:
You are sitting in a job interview, discussing with your future boss what problems your Machine Learning models might solve for them. Let’s say we are dealing with a producer of industry materials, and their goal is to detect damaged goods on their conveyor belt via computer vision.
By chance, you have worked hard on image classification projects for the past 6 months and your Kaggle score was soaring in that time. This gives you a sharp confidence boost and you tell your potential employer that they better hire you on the spot. Fast forward – you get the job and now you see, for the first time, the data they have collected so far.
Suddenly you feel the chills when you move your mouse to the data folder and find that there are only 100 images available in total. Those are clearly too few examples to train a reliable Neural Network. If you are lucky, you can just ask them to increase that amount by a factor of 100 before you even consider an import tensorflow as tf.
In case you are unlucky, your boss will tell you that creating new training data is too expensive.
The quality assurance team needs to conduct costly tests to determine if a single item is damaged or not. At this point, there might be no other choice left than trying to build a custom model and incorporate as much domain knowledge into it as possible. A solid foundation in mathematics now allows you to express complex concepts in your algorithms when pre-packaged solutions don’t exist anymore.
On a higher level: If the properties of your particular problem diverge too much from the norm, standard library algorithms might not work at all. At that point there is basically no other option than writing your own algorithm for which you should know some theory.
A question of speed
In case you work a lot with Python and R, you might have gotten to a point where you try to avoid for-loops as much as possible. Especially when your model runs through multiple loops per training epoch, performance quickly degrades. This is particularly inconvenient with modern models where GPUs and CPUs are frequently being pushed to their limits already.
Luckily, a lot of these methods rely on linear algebra. Thanks to the latter, we might be able to replace sluggish for-loops with faster matrix multiplications and even benefit from moving entire blocks to the GPU – see this article for an introductory example.
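A tiny example of what this replacement looks like in NumPy: both functions below compute the same linear predictions, once with nested Python loops and once as a single matrix-vector product. The data is random and only there to check that the results agree.

```python
import numpy as np

def predict_loop(X, w):
    """Linear predictions computed with explicit Python loops."""
    n_samples, n_features = X.shape
    out = np.zeros(n_samples)
    for row in range(n_samples):
        for col in range(n_features):
            out[row] += X[row, col] * w[col]
    return out

def predict_vectorised(X, w):
    """The same computation expressed as one matrix-vector product."""
    return X @ w

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
w = rng.normal(size=5)

same_result = np.allclose(predict_loop(X, w), predict_vectorised(X, w))
```

The vectorised version hands the work to optimised linear algebra routines, which is typically where the speed-up comes from.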
In the general case, we can fall back on a large body of literature about how to improve efficiency in computational linear algebra. The book Matrix Computations is arguably a very popular source about the topic.
In particular for Gaussian Process regression, massive performance improvements have recently been developed. Many of these solutions were made possible by modern computational linear algebra. You can find some references here, here and here.
Another case for the computational argument, aside from performance, is floating point precision. As memory is finite, some mathematical operations can cause valid calculations to produce mere garbage. My favourite example from statistics is Maximum Likelihood estimation.
I won’t go into details here but running Maximum Likelihood naively quickly becomes infeasible on a computer, because the product of many small probabilities underflows to zero. That is, unless we exploit two properties of the logarithm in this procedure: it is monotonic, so the maximiser stays the same, and it turns products into sums. You can find more extensive explanations about this topic here and a related problem here.
Of course you most likely don’t want to publish your models and solutions in academia, so why even bother? Well, a smart mathematical ‘trick’ could make the difference between running your model on a 10-instance GPU cluster and a single laptop. Being able to save companies and clients a lot of money that way will surely boost your market value as a Machine Learning engineer.
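You can see the problem in a few lines of plain Python. The sketch below evaluates a standard Normal likelihood on 2050 made-up observations, once as a raw product of densities and once as a sum of log-densities.

```python
import math

def normal_pdf(x, mu=0.0, sigma=1.0):
    """Density of Normal(mu, sigma^2) at x."""
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2 * math.pi))

# 2050 hypothetical observations between -2 and 2
data = [0.1 * k for k in range(-20, 21)] * 50

# Naive likelihood: a product of many numbers below 1 underflows to exactly 0.0
likelihood = 1.0
for x in data:
    likelihood *= normal_pdf(x)

# Log-likelihood: the product becomes a sum and stays a finite, usable number
log_likelihood = sum(math.log(normal_pdf(x)) for x in data)
```

Because the logarithm is monotonic, the parameters that maximise the log-likelihood also maximise the likelihood, so nothing is lost by switching.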
Needle in a haystack
Let us consider a final argument in favour of learning some theory. If you ever had to debug a complex computer program, you might have experienced how tiresome the process can be. Sometimes you need to spend hours to find a single issue in your code and realise afterwards how stupid the mistake actually was.
Now imagine you have to debug a program in a foreign programming language you have never seen before. If you are lucky, somebody has already solved the exact same problem on Stack Overflow and you can simply copy their solution.
When the program is too complex, however, you likely need to figure things out yourself with the handicap of working in a foreign language. At some point, you will probably find the right solution and move on. The process however might be much more daunting than if you knew the language in the first place. Somebody with more experience might need a few minutes to solve an issue that took you hours or days to even discover.
This could also happen to you when implementing a complex Deep Learning model without any prior knowledge about the theory. Modern neural architectures are looking more and more like actual computer programs. If you don’t believe me, I recommend having a glance at Neural Turing Machines or Differentiable Neural Computers.
Now, even if you find your favourite model in a GitHub repo that you can just copy and paste from, you might still get funky results once you feed it your particular dataset.
Gradients are exploding? Maybe you should standardise your data first. Your variational auto-encoder produces garbage? Consider transforming your variance output neurons to be strictly positive. Normal distributions don’t work with negative variance.
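Both fixes are short once you know what to look for. Here is a minimal NumPy sketch, with helper names of my own choosing: a softplus transformation that keeps a variance output strictly positive, and a column-wise standardisation of the inputs. The small eps floor is an assumption I added to guard against underflow for very negative inputs.

```python
import numpy as np

def softplus(x, eps=1e-6):
    """Map real-valued network outputs to strictly positive values.

    Computes log(1 + exp(x)) in a numerically stable way and adds a small floor.
    """
    return np.logaddexp(0.0, x) + eps

def standardise(X):
    """Shift each column to zero mean and scale it to unit standard deviation.

    Assumes no column is constant, otherwise the division by zero must be handled.
    """
    return (X - X.mean(axis=0)) / X.std(axis=0)

raw_variance_output = np.array([-1000.0, 0.0, 1000.0])
variance = softplus(raw_variance_output)

X = np.array([[1.0, 100.0], [2.0, 200.0], [3.0, 300.0]])
Z = standardise(X)
```

Neither function is complicated, but knowing that a variance must be positive or that badly scaled inputs can blow up gradients is what tells you to reach for them in the first place.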
While trial, error and a helpful online community might ultimately guide you to a solution, the process could be rather daunting and lengthy. If you cannot fall back on a holistic understanding of what is happening under the hood of tf.keras.Sequential, any bug you encounter could put you at the mercy of chance and other people’s helpfulness.
In the early days of your Machine Learning journey, this will certainly happen to you on a regular basis and is expected. At some point however, you might not want to justify a missed deadline with your five unanswered questions on Stack Overflow anymore.
Should a lack of mathematical and statistical knowledge stop you from becoming a Machine Learning engineer? Certainly not! As often in life, a balanced approach is likely the most sustainable option here. I definitely don’t believe that a PhD is necessary to become successful in the field. Let me nevertheless summarise my key points in favour of building a solid theoretical foundation:
- Chances increase that you find ways to boost both predictive and computational performance
- You have a broader toolbox for solving non-standard problems
- Development becomes faster and more efficient as you know exactly what is happening inside your models
- You can spot violations of the theory early on and avoid them causing any issues further downstream
Learning things from the ground up is definitely going to slow you down at first. Nonetheless, the long-term benefits will far outweigh the initial hassle.
Hopefully, this little piece has convinced you to begin looking a little deeper into the theoretical aspects of Machine Learning if you are not already doing so. Of course you should not neglect other aspects of the craft like programming, computer science and working on applied projects early on. In case you have just started, I wish you a lot of fun on this fascinating journey.
Finally, if you disagree with my assessment, I am looking forward to a nice discussion – either in the comments or through some other communication channel.