What You Need to Learn to Become a Data Scientist
1 Introduction
This section covers the data science skills you’ll need to learn, along with the tools you’ll use to do your job.
Most data scientists use a combination of skills every day, some of which they have taught themselves on the job or otherwise. They also come from various backgrounds. There isn’t any one specific academic credential that data scientists
are required to have.
2 Data Science Skills
2.1 An Analytical Mind
Takeaway
You need to approach data science problems analytically to solve them.
You’ll need an analytical mindset to do well in data science.
A lot of data science involves solving problems. You’ll have to be adept at framing those problems and methodically applying logic to solve them.
2.2 Mathematics
Takeaway
Mathematics is an important part of data science. Make sure you know the basics of university math from calculus to linear algebra. The more math you know, the better.
When data gets large, it often gets unwieldy. You’ll have to use mathematics to process and structure the data you’re dealing with.
You won’t be able to get away without knowing calculus and linear algebra, so revisit those topics if you missed them in undergrad. You’ll need to understand how to manipulate matrices of data and get a general idea of the math behind the algorithms you use.
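As a quick illustration of the kind of matrix manipulation involved, here is a minimal Python sketch using the NumPy library; the data values and weights are invented for the example.

    import numpy as np

    # A tiny "dataset": 3 observations (rows) x 2 features (columns)
    X = np.array([[1.0, 2.0],
                  [3.0, 4.0],
                  [5.0, 6.0]])

    # A weight vector, as you might see in a simple linear model
    w = np.array([0.5, -0.25])

    # Matrix-vector multiplication produces one value per row of data
    predictions = X @ w
    print(predictions)  # [0.0, 0.5, 1.0]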
2.3 Statistics
Takeaway
You must know statistics to infer insights from smaller data sets onto larger populations. This is the fundamental law of data science.
You need to know statistics to play with data. Statistics allows you to slice and dice through data, extracting the insights you need to make reasonable conclusions.
Understanding inferential statistics allows you to make general conclusions about everybody in a population from a smaller sample.
To understand data science, you must know the basics of hypothesis testing and experiment design in order to understand the meaning and context of your data.
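As a small example of hypothesis testing in practice, here is a hedged Python sketch of a two-sample t-test using SciPy; the two groups of measurements are invented for the example.

    from scipy import stats

    # Invented sample data: checkout times (in seconds) for two website versions
    group_a = [12.1, 11.8, 13.0, 12.5, 11.9, 12.7]
    group_b = [11.2, 11.5, 10.9, 11.8, 11.1, 11.4]

    # A two-sample t-test asks whether the difference in means
    # could plausibly be due to chance alone
    t_stat, p_value = stats.ttest_ind(group_a, group_b)
    print(t_stat, p_value)  # a small p-value suggests a real difference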
2.4 Algorithms
Takeaway
Algorithms are the sets of rules or patterns a computer follows. Understanding how to use machines to do your work is essential to processing and analyzing data sets too large for the human mind to process.
You’ll want to know many different algorithms, and you’ll want to learn the fundamentals of machine learning. Machine learning is what allows Amazon to recommend products to you based on your purchase history without any direct human intervention. It is a set of algorithms that use machine power to unearth insights for you.
In order to deal with massive data sets, you’ll need to use machines to extend your thinking.
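As a rough sketch of the recommendation idea (not Amazon’s actual system), here is a toy "users who bought similar things" lookup using scikit-learn; the purchase matrix and user labels are invented for the example.

    import numpy as np
    from sklearn.neighbors import NearestNeighbors

    # Invented purchase matrix: rows are users, columns are products,
    # 1 means the user bought that product
    purchases = np.array([
        [1, 0, 1, 1],  # user 0
        [1, 0, 1, 0],  # user 1
        [0, 1, 0, 1],  # user 2
    ])

    # Find the user whose purchase history is most similar to user 0
    model = NearestNeighbors(n_neighbors=2, metric="cosine").fit(purchases)
    distances, indices = model.kneighbors(purchases[0:1])
    print(indices)  # the nearest neighbor after user 0 itself is user 1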
2.5 Data Visualization
Takeaway
Finishing your data analysis is only half the battle. To drive impact, you will have to convince others to believe and adopt your insights.
Human beings are wired to respond to visual cues. You’ll need to find a way to convey your insights accordingly.
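A minimal Python sketch of the sort of visual you might build, using matplotlib with made-up numbers:

    import matplotlib.pyplot as plt

    # Invented figures for the example
    quarters = ["Q1", "Q2", "Q3", "Q4"]
    revenue = [120, 135, 150, 180]

    # A simple bar chart is often more persuasive than a table of numbers
    plt.bar(quarters, revenue)
    plt.title("Revenue by Quarter")
    plt.ylabel("Revenue ($ thousands)")
    plt.show()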
2.6 Business Knowledge
Takeaway
Data means little without its context. You have to understand the business you’re analyzing.
Most companies depend on their data scientists not just to mine data sets, but also to communicate their results to various stakeholders and present recommendations that can be acted upon.
The best data scientists not only have the ability to work with large, complex data sets, but also understand the intricacies of the business or organization they work for.
Having general business knowledge allows them to ask the right questions, and come up with insightful solutions and recommendations that are actually feasible given any constraints that the business might impose.
2.7 Domain Expertise
Takeaway
As a data scientist, you should know the business you work for and the industry it lives in.
Beyond having deep knowledge of the company you work for, you’ll also have to understand its field for your insights to make sense. Data from a biology study can have a drastically different context than data gleaned from a well-designed psychology study. You should know enough to cut through industry jargon.
3 Data Science Tools
With your skill set developed, you’ll now need to learn how to use modern data science tools. Each tool has its strengths and weaknesses, and each plays a different role in the data science process. You can use just one of them, or you can use all of them. What follows is a broad overview of the most popular tools in data science, as well as the resources you’ll need to learn them properly if you want to dive deeper.
3.1 File Formats
Data can be stored in different file formats. Here are some of the most common, with a short Python sketch after the list showing how each can be read:
CSV
Comma-separated values. You may have opened this sort of file in Excel before. A CSV file separates data points with a delimiter, a piece of punctuation (most often a comma) that marks where one value ends and the next begins.
SQL
SQL, or Structured Query Language, works with data stored in relational tables. Reading across a row, each column gives you a different data point about the same entity (for example, a person will have a value in each of the AGE, GENDER, and HEIGHT columns).
JSON
JavaScript Object Notation is a lightweight data exchange format that is both human- and machine-readable. Data from a web server is often transmitted in this format.
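Here is the minimal Python sketch promised above, reading each of the three formats. The file names (players.csv, songs.db, response.json) and the table and column names are placeholders for the example.

    import csv
    import json
    import sqlite3

    # CSV: each line is one record, values separated by commas
    with open("players.csv", newline="") as f:
        for row in csv.reader(f):
            print(row)  # each row comes back as a list of strings

    # SQL: query a relational table stored in a SQLite database file
    conn = sqlite3.connect("songs.db")
    for row in conn.execute("SELECT title, artist FROM songs LIMIT 5"):
        print(row)
    conn.close()

    # JSON: nested keys and values, common for data returned by web servers
    with open("response.json") as f:
        data = json.load(f)
    print(data)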
3.2 Excel
Takeaway
Excel is often the gateway to data science, and something that every data scientist can benefit from learning.
Introduction to Excel
Excel lets you easily manipulate data with what is essentially a What You See Is What You Get editor: you can perform calculations on your data without writing any code at all. It is a handy tool for data analysts who want to get results without programming.
Benefits of Excel
Excel is easy to get started with, and it’s a program that anybody in analytics will intuitively grasp. It can be very useful for communicating data to people who may not have any programming skills: they should still be able to play with the data.
Who Uses This
Data analysts tend to use Excel.
Level of Difficulty
Beginner
Sample Project
Importing a small dataset of NBA player statistics and making a simple graph of the top scorers in the league.
3.3 SQL
Takeaway
SQL is the most popular programming language for retrieving data.
Introduction to SQL
Data science needs data. SQL is a programming language specially designed to extract data from databases.
Benefits of SQL
SQL is the most popular tool used by data scientists. Most data in the world is stored in tables that will require SQL to access. You’ll be able to filter and sort through the data with it.
Who Uses This
Data analysts and some data engineers tend to use SQL.
Level of Difficulty
Beginner
Sample Project
Using a SQL query to select the top ten most popular songs from a SQL database of the Billboard 100.
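A hedged sketch of what that query might look like, run from Python against a hypothetical billboard_100 table; the database file, table name, and weeks_on_chart column (used here as the popularity measure) are all assumptions for the example.

    import sqlite3

    # Hypothetical database and table for the example
    conn = sqlite3.connect("billboard.db")
    query = """
        SELECT song, artist
        FROM billboard_100
        ORDER BY weeks_on_chart DESC
        LIMIT 10;
    """
    for song, artist in conn.execute(query):
        print(song, artist)
    conn.close()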
3.4 Python
Takeaway
Python is a powerful, versatile programming language for data science.
Introduction to Python
Once you download Anaconda, an environment manager for Python, and get set up with IPython Notebook, you’ll quickly realize how intuitive Python is. A versatile programming language built for everything from building websites to gathering data from across the web, Python has many code libraries dedicated to making data science work easier.
Benefits of Python
Python is a versatile programming language with a simple syntax that is easy to learn.
The average salary for jobs with Python in their description is around $102,000. Python is the most popular programming language taught in universities: the community of Python programmers is only going to get larger in the years to come. The Python community is passionate about teaching Python, and about building useful tools that will save you time and allow you to do more with your data.
Many data scientists use Python to solve their problems: 40% of respondents to a data science survey conducted by O’Reilly used Python, which was more than the 36% who used Excel.
Who Uses This
Data engineers and data scientists will use Python for medium-size data sets.
Level of Difficulty
Intermediate
Sample Project
Using Python to source tweets from celebrities, then analyzing the most frequently used words by applying programming rules.
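A minimal Python sketch of the word-counting half of that project, assuming the tweets have already been collected (pulling real tweets would require access to the Twitter API); the example tweets below are invented.

    from collections import Counter

    # Placeholder tweets; in the real project these would come from the Twitter API
    tweets = [
        "Excited to announce our new album is out today",
        "Thank you to all the fans who came out tonight",
        "New tour dates announced today, see you out there",
    ]

    # Split each tweet into lowercase words and count how often each appears
    words = [word for tweet in tweets for word in tweet.lower().split()]
    print(Counter(words).most_common(5))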
3.5 R
Takeaway
R is a staple in the data science community because it is designed explicitly for data science needs. It is the most popular programming environment in data science with 43% of data professionals using it.
Introduction to R
R is a programming environment designed for data analysis. R shines when it comes to building statistical models and displaying the results.
Benefits of R
R is slightly more popular than Python in data science, with 43% of data scientists using it in their tool stack compared to the 40% who use Python.
It is an environment where a wide variety of statistical and graphing techniques can be applied.
The community contributes packages that, similar to Python, can extend the core functions of the R codebase so that it can be applied to very specific problems such as measuring financial metrics or analyzing climate data.
Who Uses This
Data engineers and data scientists will use R for medium-size data sets.
Level of Difficulty
Intermediate
Sample Project
Using R to graph stock market movements over the last five years.
3.6 Big Data Tools
Big data owes its existence to Moore’s Law, the observation that computing power doubles roughly every two years.
This has led to the rise of massive data sets generated by millions of computers. Imagine how much data Facebook has at any given time!
Any data set that is too large for conventional data tools such as SQL and Excel can be considered big data, according to McKinsey. The simplest definition is that big data is data that can’t fit onto your computer.
3.7 Hadoop
Takeaway
By using Hadoop, you can store your data on multiple servers while controlling it from one.
Introduction to Hadoop
The solution Hadoop offers is a technology called MapReduce. MapReduce is an elegant abstraction that treats a series of computers as if they were one central server. This allows you to store data on multiple computers, but process it through one.
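To make the abstraction concrete, here is a toy sketch of the MapReduce idea in plain Python (not Hadoop itself): a word count where the "map" step emits pairs and the "reduce" step combines them. The documents are invented for the example.

    from collections import defaultdict

    documents = ["big data is big", "data needs big tools"]

    # Map step: emit a (word, 1) pair for every word in every document.
    # On a real cluster, each machine would map its own slice of the data.
    mapped = [(word, 1) for doc in documents for word in doc.split()]

    # Shuffle/reduce step: group the pairs by key and sum the counts
    counts = defaultdict(int)
    for word, count in mapped:
        counts[word] += count

    print(dict(counts))  # {'big': 3, 'data': 2, 'is': 1, 'needs': 1, 'tools': 1}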
Benefits of Hadoop
Hadoop is an open-source ecosystem of tools that lets you apply MapReduce to your data and store enormous datasets across different servers. It allows you to manage much more data than you could on a single computer.
Who Uses This
Data engineers and data scientists will use Hadoop to handle big data sets.
Level of Difficulty
Advanced
Sample Project
Using Hadoop to store massive datasets that update in real time, such as the number of likes Facebook users generate.
3.8 NoSQL
Takeaway
NoSQL allows you to manage data without unneeded weight.
Introduction to NoSQL
Tables that bring all their data with them can become cumbersome. NoSQL includes a host of data storage solutions that separate out huge data sets into manageable chunks.
Benefits of NoSQL
NoSQL was a trend pioneered by Google to deal with the impossibly large amounts of data it was storing. Often structured in the JSON format popular with web developers, solutions like MongoDB provide databases that can be queried somewhat like SQL tables, but which store data with far less rigid structure.
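As a rough sketch of what working with a document store can look like, here is a Python example using the pymongo driver; it assumes a MongoDB server is running locally, and the database and collection names (social_app, users) are placeholders for the example.

    from pymongo import MongoClient

    # Connect to a locally running MongoDB server (placeholder names below)
    client = MongoClient("mongodb://localhost:27017/")
    users = client["social_app"]["users"]

    # Documents are JSON-like and don't need a fixed schema
    users.insert_one({"name": "Ada", "followers": 1523, "interests": ["data", "math"]})

    # Querying looks up documents by their fields
    print(users.find_one({"name": "Ada"}))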
Who Uses This
Data engineers and data scientists will use NoSQL for big data sets, often website databases for millions of users.
Level of Difficulty
Advanced
Sample Project
Storing data on users of a social media application that is deployed on the web.