6/16/17

Beginning Data Science and Supervised Learning in R

By Thomas Mailund

Data science is a hot topic these days; you hear it mentioned all the time. But what is actually meant by the term data science tends to differ depending on who you ask to define it. I will give you my definition: Data science is the science of learning from data.

This is a very broad definition, almost too broad to be useful. I realise this. But then, I think data science is an incredibly general field, and I don’t have a problem with that. Of course, you could argue that any science is all about getting information out of data, and you might be right, although I would say that there is more to science than just transforming raw data into useful information. The sciences focus on answering specific questions about the world, while data science focuses on how to manipulate data efficiently and effectively. The primary focus is not which questions to ask of the data but how we can answer them, whatever they may be. In this way, it is more like computer science and mathematics than it is like the natural sciences. It isn’t so much about studying the natural world as it is about how to compute efficiently on data.

Data science also includes the design of experiments. With the right data, we can address the questions we are interested in. With a poorly designed experiment or a poor choice of which data to gather, this can be difficult. Study design might be the most important aspect of data science, but it is not the topic of this book. In this book I focus on the analysis of data once it has been gathered.

Computer science is also mainly the study of computations—as is hinted at in the name—but is a bit broader in this focus. The name “computer science” puts the focus on computation while using the name “data science” puts the focus on data. But of course, the fields overlap. If you are writing a sorting algorithm, are you then focusing on the computation or the data? Is that even a meaningful question to ask?

There is a huge overlap between computer science and data science, and naturally the skill sets you need overlap as well. To manipulate data efficiently you need the tools for doing so, so computer programming skills are a must, and some knowledge of algorithms and data structures usually is as well. For data science, though, the focus is always on the data. In a data analysis project, the focus is on how the data flows from its raw form through various manipulations until it is summarised in some useful form. Although the difference can be subtle, the focus is not on what operations a program performs during the analysis but on how the data flows and is transformed. It is also on why we do certain transformations of the data, what purpose those changes serve, and how they help us gain knowledge about the data. It is as much about deciding what to do with the data as it is about how to do it efficiently.

Statistics is of course also closely related to data science. So closely, in fact, that many consider data science just a fancy word for statistics that looks slightly more modern and sexy. I can’t say that I strongly disagree with this (data science does sound sexier than statistics), but just as data science is slightly different from computer science, it is also slightly different from statistics, though perhaps somewhat less different than it is from computer science.

A large part of doing statistics is building mathematical models for your data and fitting the models to the data in order to learn about it. That is also what we do in data science. As long as the focus is on the data, I am happy to call statistics data science. If the focus changes to the models and the mathematics, then we are drifting away from data science into something else, just as we drift from data science to computer science if the focus changes from the data to computations.

Data science is also related to machine learning and artificial intelligence, and again there are huge overlaps. This is perhaps not surprising, since machine learning has its home both in computer science and in statistics; when it focuses on data analysis, it is also at home in data science. To be honest, it has never been clear to me when a mathematical model changes from being a plain old statistical model to being machine learning anyway.

Download an excerpt of my book Beginning Data Science in R. This is an abridged excerpt of Chapter 6 concerning supervised learning.

About the Author

Thomas Mailund (@ThomasMailund) is an associate professor in bioinformatics at Aarhus University, Denmark. His background is in math and computer science but for the last decade his main focus has been on genetics and evolutionary studies, particularly comparative genomics, speciation, and gene flow between emerging species.