What Is Data Science?
This is a book about doing data science with Python, which immediately begs the question: what is data science? It’s a surprisingly hard definition to nail down, especially given how ubiquitous the term has become. Vocal critics have variously dismissed the term as a superfluous label – after all, what science doesn’t involve data? – or a simple buzzword which only exists to salt resumes and catch the eye of overzealous tech recruiters.
In my mind, these dismissals miss something important. Data science, despite its hype-laden veneer, is perhaps the best label we have for the cross-disciplinary set of skills that are becoming increasingly important in many applications across industry and academia. This cross-disciplinary piece is key: in my mind, the best extisting definition of data science is illustrated in the Data Science Venn Diagram created by Drew Conway in 2010:
While some of the intersection labels are a bit tongue-in-cheek, this diagram captures the essence of what I think people mean when they say “Data Science”: it is fundamentally an interdisciplinary subject. Data Science comprises three distinct and overlapping sets of skills: the skills of a statistician who knows how to model and summarize datasets which are growing ever larger; the skills of a computer scientist who can design and use algorithms to efficiently store, process, and visualize this data; and the domain expertise – what we might think of as “classical” training in a subject – necessary both to formulate the right questions and to put their answers in context.
With this in mind, I would encourage you to think of data science not as a new domain of knowledge to learn, but a new set of skills that you can apply within your current area of expertise. Whether you are reporting election results, forecasting stock returns, optimizing online ad clicks, identifying microorganisms in microscope photos, seeking new classes of astronomical objects, or working with data in any other field, my goal is that the content of this book would give you the ability to ask and answer new questions about your chosen subject area.
Who Is This Book For?
In my teaching both at the University and at various tech-focused conferences and meetups, one of the most common questions I have heard is this: “how should I learn Python?” The people asking are generally technically-minded students, developers, or researchers, often with an already strong background in writing code and using computational and numerical tools. Most of these folks don’t want to learn Python per se, but want to learn the language with the aim of using it as a tool for data-intensive and computational science. While a large patchwork of videos, blog posts, and tutorials for this audience is available online, I’ve long been frustrated by the lack of a single good answer to this question; that is what inspired this book.
This book is about how to do data-intensive and computational science and analysis using the open source ecosystem of tools available in Python. It is not meant to be an introduction to programming in general; I assume the reader has familiarity with some coding language, and has some level of experience in defining functions, assigning variables, controlling the flow of a program, and other basic computer science tasks. The first sections, however, do offer content which will be helpful to the reader who has never experienced the joys of programming in Python in particular.
I recognize that many readers will come to this book already experienced on one or another side of the Venn diagram. You as the reader know your own background, and should feel free to focus on the parts of the book that will be most useful to you: for example, a developer new to Python might focus their reading on the first few sections detailing the use of Python tools, while an experienced Python user new to data science might skim these chapters and instead focus on the latter sections which use these Python tools to explore statistical approaches to data. My hope is that as the reader gains expertise in these overlapping areas, the organization of this book will allow him or her to continue using it as a reference, pulling it off the shelf often to learn or recall the basics of different Python data analysis tools and patterns.
What to Expect from This Book
The mental model of data science advocated above consists of the overlap of three broad parts: the computational expertise, statistical expertise, and domain expertise required to tackle modern data-rich analysis. The first four sections of the book focus on the computational component: they are a walk-through of the extensive and mature ecosystem of data-focused tools available in the Python programming language. The remaining sections of this book tackle the statistical component: the fundamental concepts of statistics and mathematics useful in analyzing datasets of a variety of size. The goal is that by the end, readers will be poised to use these Python tools process, describe, model, and draw inferences from the various data they encounter.
The final component – the domain expertise – is necessarily left up to the reader. It is up to you to take the fundamental tools and concepts presented here and apply them to data in your own research, work, or field of interest, whether that data consists of server logs, click rates, detector output, economic indicators, election results, traffic volumes, stock prices, social media posts, or anything in between.
This book focuses on recipes and techniques for doing data science using the Python programming language. Python has emerged over the last couple decades as a first-class tool for scientific computing tasks, including the analysis and visualization of large datasets. This may have come as a surprise to early proponents of the Python language: the language itself was not specifically designed with data analysis or scientific computing in mind. The usefulness of Python for data science stems primarily from the large and active ecosystem of third-party packages: particularly NumPy for manipulation of homogeneous array-based data, Pandas for manipulation of heterogeneous and labeled data, SciPy for common scientific computing tasks, Matplotlib for publication-quality visualizations, IPython for interactive execution and sharing of code, Scikit-Learn for machine learning, and many more tools that will be mentioned in the following pages.
Python 2 vs Python 3
This book uses the syntax of Python 3, which contains language enhancements which are not compatible with the 2.x series of Python. Though Python 3.0 was first released in 2008, adoption has been relatively slow, particularly in the scientific and web development communities. This is primarily because it took some time for many of the essential packages and toolkits to be made compatible with the new language internals. Since early 2014, however, stable releases of the most important tools in the data science ecosystem have been fully-compatible with both Python 2 and 3, and so this book will use the newer Python 3 syntax. Even though that is the case, the vast majority of code snippets in this book will also work without modification in Python 2: in cases where a Py2-incompatible syntax is used, I will make every effort to note it explicitly.
In a few places through this book, shell commands are used. These shell commands are standard ones that you’ll see on Linux, Unix, Mac OSX, and other
Unfortunately, the default Windows shell is different from these; if you’re using a Windows system some of these commands may not work.
It has been my experience that because of this difference and others, most users of Python for data-intensive computing tend to avoid Windows PCs.
For readers using Windows, the vast majority of this book still applicable; just be aware that when the book uses shell commands (marked by “
!” in IPython), they may not work as advertized.