Chico Caramago, a postdoctoral researcher in data science at the Oxford Internet Institute, came to data science from a background in biology.
“Biology is big, messy and complex,” he told Built In, “so I was drawn towards tools that could help me make some sense out of that.”
Usually, humans make sense of the natural world’s complexity with our own natural tools: our brains and our senses. Data science augments those innate capacities, though, with algorithms and predictive models.
Caramago was especially drawn to unsupervised machine learning and natural language processing, which help humans with everything from detecting signs of metastasizing cancer to understanding foreign languages with Google Translate.
At this point, in fact, data science has gotten so sophisticated that it doesn’t just enhance our natural abilities — it mimics them.
Take deep learning, for example. It “uses multiple layers [of algorithms] to progressively extract higher-level features from raw input,” Caramago explained.
Human vision works in a similarly layered way. “The first layers of neurons in our visual system are responsible for identifying light and dark,” Caramago said, “while the deeper layers respond to patterns like curves and straight lines.” Ultimately, the “nth” layer of neurons recognizes the visual for what it is: “Aha, it’s a face!”
In a way, data science has become humanity’s sixth sense. Yet it’s also probably the sense the average person understands the least. So for anyone hoping to learn more, we asked three experts to recommend their favorite data science books. Our panel included:
The resulting reading list ranges from technical machine learning and math textbooks to sociological studies of how algorithms impact our daily lives.
GENERAL INTEREST BOOKS
Everybody Lies: Big Data, New Data, and What the Internet Can Tell Us About Who We Really Are by Seth Stephens-Davidowitz
CARAMAGO: This book is like Freakonomics in the age of data science. It’s 100 percent not a technical book. Every chapter tells some peculiar story illustrating a data science concept — like, there’s one chapter about Google searches, another about news, another about image data, etc. It’s a bunch of stories of people being creative and finding patterns in the most random things, because these random things actually reveal a lot. The book has that name because you can lie about what you eat and read, and you can lie about who you’re going to vote for — but if I have access to your search history, I can figure out the truth. It’s a book for people that are curious about what data science is and what it can do — especially when it comes to social data. The author finishes by saying the next Freud will be a data scientist, the next Foucault will be a data scientist, the next Marx will be a data scientist. I think that’s a bit much perhaps, because data science doesn’t answer every question ever. But it’s a fun book, to be read with a grain of salt.
Naked Statistics: Stripping the Dread from Data by Charles Wheelan
HERMAN: This book gives a lot of examples of how statistical concepts apply in the real world. Wheelan does not go into a lot of theory, but he has some pretty interesting examples and a kind of dry sense of humor. This is the only statistics book that’s ever made me laugh, and it’s the book that we recommend our incoming students at the Flatiron School read beforehand. Our students come from a wide variety of statistics backgrounds, but I’ve always gotten really positive feedback on it. It’s ideal for beginners, but I also think that if you’ve never read it and you’re in data science, it’s a great read.
Weapons of Math Destruction: How Big Data Increases Inequality and Threatens Democracy by Cathy O’Neil
CARAMAGO: The author of this book, Cathy O’Neil, used to be an academic mathematician. Then she went to Wall Street, then she went to Occupy Wall Street and now she’s an activist raising awareness of how algorithms rule our lives, and how they are not as neutral or unbiased as we like to believe. The book is a collection of stories of algorithms’ real-world applications, and a lot of them are about people who were classified as unworthy by an algorithm. Like, someone purchased an item at a particular shop and automatically got their credit card limit lowered, or a college student couldn’t get a job at a local grocery store because the algorithm said so.
She doesn’t just say “boo hoo, bad algorithm, bad machine!” though — she makes an effort to explain the mechanisms that might make an algorithm racist, for instance. So, why is a policing algorithm sending officers to black neighborhoods more often? Well, what happened in that case is that the algorithm was fed data on previous police patrols, which were more often in black neighborhoods. So the algorithm learned that those neighborhoods are the ones that receive more patrols. The algorithm simply reproduced what it was taught. The book makes you think a lot about how you can design algorithms and data science practices to deal with that.
Algorithms of Oppression by Safiya Noble
CARAMAGO: This book has a few stories, with very simple “data,” which the author explores in depth. I found it a very interesting read, because the author’s background is almost diametrically opposed to mine. She’s 100 percent qualitative, telling stories based on “small data” with a lot of context.
In one of these stories, the author, Safiya Noble, was organizing a party for her niece and other children, and she searched something like “black girls” on Google. To her surprise, she didn’t find pictures of children. She found websites like “HOT BLACK SINGLES IN YOUR AREA.” For other search terms, like “Latina girls” and “Asian girls,” she found the same stuff.
The reason this happened, she explained, is Google’s revenue model. The algorithm will serve whatever ad pays the most. And it becomes a troubling situation, because even though Google is an advertising company, we use it like a public library — like some sort of publicly accessible repository of information. I found it a very sobering read.
An Introduction to Statistical Learning: with Applications in R by Gareth James, Daniela Witten, Trevor Hastie and Robert Tibshirani
HERMAN: When I was first learning data science, most statistical textbooks were kind of unreadable. They went in-depth on theory and didn’t really show the application side. This book doesn’t go as deep statistically as a lot of other books, but it gives you enough knowledge to be successful as a data scientist, and it goes over the key machine learning algorithms. One of the issues people have with data science is that algorithms are these black boxes where you put data in and you get data out and you have no idea what happens in the middle. This book gives you enough statistical knowledge to understand what’s going on in that black box.
It’s geared toward people that don’t have any programming or statistics background. That being said, I’ve actually read this book multiple times. Even if you’re an experienced data scientist, a lot of statistical concepts, you kind of forget about them over time. As you work in a job, you’re not going to be using every single algorithm. You get comfortable. This book allows you to say, okay, maybe I should try this other algorithm.
Data Science from Scratch: First Principles with Python by Joel Grus
MILLER: This book is about how to write data science algorithms in Python. It’s a mix between a textbook and a normal book — a great entryway book, very appropriate for a layperson. So for instance, if I wanted to learn the machine learning algorithm Naive Bayes, this book says, “We’re going to literally program Naive Bayes as if it doesn’t exist in the world. We’re going to learn the math first and then write the code as part of that. We’ll build this algorithm together with nothing but Python.”
You probably want to know a little bit of Python and a little bit of statistics going in, but this book assumes almost no depth of knowledge. It’s not one of those books that’s like, “This is left to the reader because it’s easy.” And it will teach you all the standard machine learning algorithms, probably 10 or 15 different ones.
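To give a flavor of the from-scratch approach Miller describes, here is a minimal sketch of a Naive Bayes text classifier written with nothing but the Python standard library. It is not taken from the book; the function names, the tiny training set and the add-one smoothing are illustrative assumptions.

```python
import math
from collections import Counter, defaultdict


def train_naive_bayes(examples):
    """Count words per label. `examples` is a list of (text, label) pairs."""
    label_counts = Counter()
    word_counts = defaultdict(Counter)
    vocabulary = set()
    for text, label in examples:
        label_counts[label] += 1
        for word in text.lower().split():
            word_counts[label][word] += 1
            vocabulary.add(word)
    return label_counts, word_counts, vocabulary


def predict(model, text):
    """Return the label with the highest log-posterior for `text`."""
    label_counts, word_counts, vocabulary = model
    total_examples = sum(label_counts.values())
    best_label, best_score = None, -math.inf
    for label, count in label_counts.items():
        # Log prior: how common the label is in the training data.
        score = math.log(count / total_examples)
        total_words = sum(word_counts[label].values())
        for word in text.lower().split():
            if word not in vocabulary:
                continue  # ignore words never seen in training
            # Add-one (Laplace) smoothing avoids log(0) for unseen pairs.
            likelihood = (word_counts[label][word] + 1) / (total_words + len(vocabulary))
            score += math.log(likelihood)
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Hypothetical toy data, made up for illustration only.
training_data = [
    ("win cash prize now", "spam"),
    ("cheap prize click now", "spam"),
    ("meeting notes attached", "ham"),
    ("lunch meeting tomorrow", "ham"),
]
model = train_naive_bayes(training_data)
print(predict(model, "claim your prize now"))
print(predict(model, "notes from the meeting"))
```

Working in log-probabilities rather than multiplying raw probabilities is a common choice here, since products of many small numbers can underflow.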
Hands-On Machine Learning with Scikit-Learn, Keras and TensorFlow by Aurélien Géron
HERMAN: This book will teach you how to run predictive analytics. In the data science world, there are two main programming languages: Python and R. There are pros and cons to both, but this book is specifically for Python. Scikit-Learn, Keras and TensorFlow are all libraries of machine learning and deep learning functions within the Python programming language.
You have to be pretty good at these libraries to be a data scientist. When I was starting out, I would reference this book daily. To this day I probably look at it at least monthly as a reference, because he really goes deep into explaining how each algorithm works. A lot of algorithms have a lot of knobs or levers that you can turn — so depending on what the data is doing, you might change the algorithm a little bit. The author explains what those different knobs and levers are in a way that a beginner can understand, but someone with more experience can appreciate the level of detail that he goes into.
Think Stats by Allen B. Downey
MILLER: Data science is a mix of three different disciplines. One is programming and computer science; one is linear algebra, stats, very math-heavy analytics; and then one is machine learning and algorithms. The ideal data scientist is really good at all of them. But that doesn’t always happen, so this book is about building out that analytics, math and stats side of your data science knowledge. How do you do testing, how do you determine whether your solutions are working and the distributions are right, and how do you use that math stuff to solve business problems?
It’s textbook-y, but it isn’t a hardcore textbook. It also merges the statistical analysis with how you would write it in Python. Early in my career, I found statistics fairly easy, but making statistics into a program was more challenging. I found this very helpful for making that connection.
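One way to picture the step from statistics to code that Miller describes is a permutation test, sketched below in standard-library Python. It is not drawn from the book; the two groups of numbers are hypothetical, and the helper name and the number of shuffles are arbitrary choices.

```python
import random
from statistics import mean


def permutation_p_value(group_a, group_b, shuffles=10_000, seed=0):
    """Estimate how often a random relabeling gives a gap at least as large
    as the observed difference in means (two-sided)."""
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    observed = abs(mean(group_a) - mean(group_b))
    pooled = list(group_a) + list(group_b)
    at_least_as_extreme = 0
    for _ in range(shuffles):
        rng.shuffle(pooled)
        new_a = pooled[:len(group_a)]
        new_b = pooled[len(group_a):]
        if abs(mean(new_a) - mean(new_b)) >= observed:
            at_least_as_extreme += 1
    return at_least_as_extreme / shuffles


# Hypothetical measurements, e.g. minutes on a page under two designs.
design_a = [12.1, 11.8, 12.5, 12.0, 11.9, 12.3]
design_b = [13.0, 13.4, 12.9, 13.2, 13.1, 13.5]

p_value = permutation_p_value(design_a, design_b)
print(p_value)
```

A small value suggests the gap between the groups would rarely arise from random relabeling alone, which is the kind of check on "whether your solutions are working" that the book's analytics side is concerned with.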
Grokking Deep Learning by Andrew W. Trask
CARAMAGO: This book is an introductory textbook for the beginner who wants to go beyond usage and understand a bit of how deep learning works. People who develop deep learning tools are usually drawing from a lot of mathematics: multivariate calculus, linear algebra, optimisation, often some physics too. But you don’t need all these things to understand what deep learning is doing. In the author’s words, “If you’ve passed high school mathematics and hacked around in Python, you’re ready for this book.” It covers some very general and fundamental bits, such as gradient descent, backpropagation and regularization, which are used in so many advanced tools that you cannot progress without a decent understanding of them.
I think books like this are important because thanks to online tutorials, you can get to a point where you’re implementing complex stuff without actually understanding how it works — all you need is Python and an internet connection. And that is troublesome, sometimes. People can waste resources by using deep neural networks where a linear regression would do (using a bazooka to kill a fruit fly, in a sense) or by implementing algorithms that lead to decisions that harm people, without the programmers realizing that’s happening.
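As a rough illustration of two of the fundamentals Caramago mentions, gradient descent and regularization, the sketch below fits a straight line to a handful of points in plain Python. This is the kind of small problem where a linear regression would do. The data, learning rate, step count and L2 penalty are all made-up values chosen for illustration, not anything from the book.

```python
def fit_line(xs, ys, learning_rate=0.01, steps=5000, l2_penalty=0.0):
    """Fit y = slope * x + intercept by gradient descent on mean squared error."""
    slope, intercept = 0.0, 0.0
    n = len(xs)
    for _ in range(steps):
        # Prediction errors under the current parameters.
        errors = [slope * x + intercept - y for x, y in zip(xs, ys)]
        # Gradients of the mean squared error, plus an L2 term on the slope.
        grad_slope = 2 / n * sum(e * x for e, x in zip(errors, xs)) + 2 * l2_penalty * slope
        grad_intercept = 2 / n * sum(errors)
        # Step downhill, against the gradient.
        slope -= learning_rate * grad_slope
        intercept -= learning_rate * grad_intercept
    return slope, intercept


# Hypothetical data lying exactly on y = 2x + 1.
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [1.0, 3.0, 5.0, 7.0, 9.0]

slope, intercept = fit_line(xs, ys)
print(round(slope, 2), round(intercept, 2))

# With a penalty, the slope is pulled toward zero.
shrunk_slope, _ = fit_line(xs, ys, l2_penalty=1.0)
print(shrunk_slope < slope)
```

Deep learning libraries apply the same downhill-stepping idea to millions of parameters at once, with backpropagation supplying the gradients.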
Linear Algebra Done Right by Sheldon Axler
MILLER: This book is an undergraduate math textbook. It’s designed for a mid-level linear algebra course, which is something every data scientist can use. It’s not sexy. It’s not machine learning, it’s not flash programming. But the thing that I use more than anything else is my ability to take a matrix or a high-dimensional space and think about it. This is one of those books that, when you’re done, you will know inside and out how to do matrices and how to handle the vector space and how to do pure math about high-dimensional spaces. I wouldn’t say it’s for everybody, though. If this was your first math book, you would find it daunting. This is for a 200- or 300-level course.
MORE ADVANCED TEXTBOOKS
Pattern Recognition and Machine Learning by Christopher M. Bishop
MILLER: This book is definitely a textbook. If you take Data Science from Scratch and then turn the math level up to 11, that’s what this book is. It bases everything on what is known as a Bayesian viewpoint, and it says that it has an intro to Bayesian learning, which it technically does, but any beginner would be mortified by it about two pages in. When I talk to other data scientists who are as nerdy as me, though, this is the book that we always end up talking about.
As far as what pattern recognition means here: any machine learning is pattern recognition, right? Looking at how the stock market used to perform and then projecting how it should perform next, that’s pattern recognition. But similarly, looking at a bunch of signs and learning that this pattern means “stop,” that’s a similar thing. Machine learning is a big, fancy, shiny term, which basically just means using the old data to think about the data you haven’t seen before. This is probably the best book I’ve read on the subject, just in terms of depth and clarity of presentation. He’s not glossing over anything and he’s not making it super beginner-friendly. It’s just, this is how it works, and you can take it or leave it.
Deep Learning with Python by François Chollet
HERMAN: The author of this book is the creator of the library called Keras, which makes it a lot easier to build neural networks in Python. Usually, in deep learning, you’re using neural networks on unstructured data. So if you’re trying to predict if there’s a person in an image, or whether a review on Yelp is positive or negative, you would use a deep neural network. I remember when I was reading this, in the second chapter, you build a neural network for the first time. He writes out code in the book, and then you try it out for yourself on your computer, and you get 98 percent accuracy. The dataset is a bunch of handwritten numbers and you’re trying to predict what the number is, even though everyone’s handwriting is different. The ones the algorithm gets incorrect are ones that I would probably get incorrect. Being able to do that in the second chapter, I was like, “Okay, I’m definitely gonna be finishing this book.”
Designing Data-Intensive Applications by Martin Kleppmann
MILLER: This book isn’t a standard pick for a data science book because it’s very much in that data engineering, computer sciences corner of data science’s three pillars. It’s more about designing databases and making sure that your data can flow in and out of your system. If I wanted to build a system to store every Yelp review that’s ever existed, every Yelp user and all of that information — this book is about how you store that. How do you make sure that the data can go in and out? How do you make sure that the data is consistent and reliable? How do you make sure that your system doesn’t break when you get a million users instead of 100,000 users?
It’s not super data science-y, but I think it’s a piece of the puzzle that a lot of data scientists ignore, and it explains why your system should be this way very clearly. It doesn’t assume that you’re a data engineer or an admin. I would say anybody who’s a data scientist owes it to themselves to learn about how the systems they rely on work. But you probably aren’t going to sit down and read this one end to end. It’s more of a reference.
Data Science with Python and Dask by Jesse Daniel
HERMAN: The focus of this book is big data — specifically working on it with Dask.
Dask is a new library in Python and it’s this buzzword right now. I see it in pretty much every job description my students apply for, and I’m very fond of it. Most companies that work with big data use a library called Spark, but it has a huge learning curve. You have to learn essentially a new language to use it. Dask allows you to interact with massive datasets in libraries that you’re already comfortable with. In this book, I really liked seeing how concepts were applied. The author introduces a data set at the beginning — it’s 42 million parking tickets around New York City — and he’ll explain a concept and then apply it on that data set.
Responses have been condensed and edited.