Introduction

What is Data Science?

Before learning how to do it, let's first define what Data Science actually is. I want to generalize the definition from Wikipedia and define Data Science as the process of extracting valid knowledge and insights from data of all forms. This definition doesn't specify any tools, or even the use of a computer. If you derived insights from data using only a ruler and the calculator on your mobile phone, but your results helped your company increase revenue by a seven-figure amount, who would care about the tools you used? If someone gave you a model that consistently forecasts stock prices to within 1%, would you reject it because it is just twenty cells of basic Excel formulas instead of 5,000 lines of C++ code mathematically based on Teichmüller spaces? Of course, these questions are no-brainers: as long as the results and implications of applied Data Science are correct, we shouldn't care how those results were derived.
However, in reality it is usually hard to prove the validity of our results and recommendations beforehand without the proper use of mathematics and algorithmic theory. Also, we don't want to spend weeks on repetitive Excel copy-pasting if a Python script could do the same task in seconds without being prone to mistakes due to inattention. 'Good' Data Science is therefore based on several building blocks that should be developed in order to consistently produce valuable results.

The building blocks

Several frameworks exist that describe which components are necessary to become the Unicorn Data Scientist every recruiter is dreaming of. I want to quickly present my view on this - not to add another two cents to this battle of the Venn diagrams, but to provide a reasonable structure for the outline of this tutorial. To me, good Data Science consists of:

  1. Quantitative Expertise (Math, Stats)
  2. Computer Science Expertise (Algorithms, Data Structures, Programming, ...)
  3. Domain Expertise
  4. Communication Expertise (Verbal, written, and visual, i.e. communicating via visualizations)

While the first three are commonly named, the last one sometimes seems not to receive the attention it deserves. Most of the time, there are several stakeholders that need to be addressed in a Data Science project. If you cannot communicate your findings in a way that everyone understands and can appropriately act on, little value is added. I'd even argue that a good Data Science communicator should be able to present complex results in such a way that even a ten-year-old child can understand the key aspects. This is of course seldom a God-given skill, but making abstract concepts tangible to everyone is definitely worth developing, as former students of Richard Feynman would testify.
In general, these four areas of Data Science should be developed in a balanced way. With much higher Quantitative Expertise than the others, you become more of a Mathematician or Statistician; with much higher Communication Expertise, you will soon be better suited as a Data Translator than as an actual Data Scientist - the list goes on. While it is certainly not bad to focus on a single area of expertise for some time, I think that one's aim as a Data Scientist should be to develop all four roughly equally.

How this section is structured

The core structure of this tutorial is given by the four building blocks from above, with a focus on application. That means that I will try to keep the number of formulas and the amount of math as low as possible, but as high as necessary to work with more advanced topics later on. Also, I aim to explain the fundamental topics in a way that makes them tangible to a high-school graduate - if you think the writing is still too complicated, feel free to write an email.
Finally, I am by no means perfect in any of the four areas, so contributions from other people to this section are highly appreciated - kudos included ;).