an unofficial guide to learning data science
a list of free resources to help you start from scratch.
Data science careers are buzzing in the tech world. In my biased opinion as a data scientist, the role is the cooler younger sibling of the traditional software developer role. When I started college in 2017, data science wasn’t a major, and I didn’t know much about it as a career path. During my later years in undergrad, data science boomed: the Data Science Union was formed, a Data Theory major popped up, and people were thirsting for data science internships just as much as the coveted software engineering ones.
As I’ve talked with undergrads over the past year, many are curious about the field but don’t know where to begin. This week’s post is an unofficial guide to resources for learning data science, compiled from friends and mentors as well as the tools I’ve found for my own learning. Candidly, I still have a lot of data science to learn, especially when it comes to technical skills. It’s a journey, and these are resources I still actively reference day to day. What follows reflects my personal opinions and is aimed primarily at students still in undergrad. A lot of these guides can get pretty lengthy, so I’ll keep my lists brief and get right to it.
take stock of your skills
To learn anything, it’s always good to start with what you know and what you don’t. Assessing where you stand across the main skills of data science will help you focus your efforts.
The main skills to assess:
Python - most of data science uses packages written in Python, which are crucial for being technically adept
SQL - this one might be a hot topic, as a lot of data scientists claim not to use it, but I stand by it as a great skill for understanding databases
Statistics - specifically the measures that describe a set of numbers, like variance or the shape of a distribution
Data visualizations — both in terms of using packages, as well as understanding proper ways to display data
Machine learning - it could take years just to skim the surface here, so I’ll focus on the foundational specifics
Data science project skills — while projects are great places for you to put your end-to-end skills to the test, they also are great to talk about during interviews and get feedback on from your peers
Technical communication — a really cool model is great, but being able to explain a technical process to folks at any level is a huge skill
you’ll never learn it all
I sometimes get overwhelmed trying to learn something new in data science because I feel like I’ll simply never learn it all. And that’s the truth. Data science is a growing field, with academics, researchers, and developers adding new complexities to the existing landscape faster than anyone can keep up with.
Coming to terms with the fact that you won’t ever learn it all is a good mindset to start with. This is also why I believe junior data scientists should seek out a company that supports learning for its early-career team members.
now for the resources
The following resources are aimed at folks who have zero or minimal experience in a skill. These resources won’t make you a professional data scientist immediately, but they are a starting point for your journey there. I believe that covering these areas puts you on the path to success in completing a data science project, an internship, or a new-grad data science interview.
python
While any Python course will be useful, I’d recommend focusing on Python for Data Science. There is a plethora of packages and special use cases for data science that matter more than being an expert in all of Python. You will, however, need to learn the Python basics before getting comfortable with these packages.
getting started: In terms of free resources, I like Introduction to Python, or the courses on either DataCamp or Codecademy for something more interactive.
online practice:
I’ve spent hours practicing Python, particularly for interviewing, on HackerRank. They’ve got a massive set of sample questions that can help you learn the basics fast.
I’m also a huge fan of Kaggle for all things data science. Their notebooks take away the bother of having to set up a development environment, and having access to all their datasets makes practice a breeze.
youtube videos: One of my favorite YouTubers, Programming with Mosh, has an incredible 6-hour Python course. I spent a lot of time watching his Python series. The entire 6 hours is probably not necessary, but it’s not a bad way to spend a Saturday.
data science specific packages: Knowing Python is the start, but the packages built on Python really are the bread and butter of data science. Pandas is the tried-and-true tool for fast, simple data analysis and manipulation. I would spend a solid amount of time in the docs, learning all the things it can do. The other main packages, including scikit-learn and Plotly, are mentioned later on.
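To give a feel for what everyday Pandas looks like, here is a minimal sketch, assuming pandas is installed. The table, its column names, and its numbers are all made up for illustration; on Kaggle you would load a real dataset instead.

```python
import pandas as pd

# Made-up data for illustration only; the columns are hypothetical.
df = pd.DataFrame({
    "city": ["LA", "LA", "NYC", "NYC", "SF"],
    "rides": [120, 80, 200, 150, 90],
    "rainy": [False, True, False, True, False],
})

# Filter rows with a boolean mask: keep only the non-rainy days.
dry_days = df[~df["rainy"]]

# Group and aggregate: total and average rides per city.
summary = df.groupby("city")["rides"].agg(["sum", "mean"])

# Add a derived column: each row's share of all rides.
df["share_of_total"] = df["rides"] / df["rides"].sum()

print(dry_days)
print(summary)
```

Filtering, grouping, and deriving columns cover a surprising share of day-to-day analysis, so they are a reasonable first stop in the docs.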
sql
With SQL, while it is important to learn how to write logic in the language, there is also fundamental material to learn about database structures and storing tabular data. Starting with books like the one below gives good background before diving into the language.
books: Learning to code from books can feel counterintuitive, but I actually really like the method. I was given Learning SQL by a friend when I needed to, you know, learn SQL and I really liked it.
online resources: Folks tend to like SQL Easy, as it reads like an online textbook and hits the ground running on learning the syntax (it doesn’t cover much on databases and other SQL-adjacent topics, which the books do a better job of covering).
practice on: HackerRank is the answer again. I attribute all of my SQL knowledge to reading Learning SQL, then practicing there, and then practicing some more.
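If you want to practice SQL without installing a database, one option is the sqlite3 module that ships with Python. The sketch below is only an illustration: the customers and orders tables and their rows are invented, and SQLite’s dialect differs in small ways from what you may meet at work.

```python
import sqlite3

# An in-memory database: nothing is written to disk.
conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# Two hypothetical tables linked by customer_id.
cur.execute("CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT NOT NULL)")
cur.execute(
    "CREATE TABLE orders ("
    "id INTEGER PRIMARY KEY, "
    "customer_id INTEGER NOT NULL REFERENCES customers(id), "
    "amount REAL NOT NULL)"
)

cur.executemany(
    "INSERT INTO customers (id, name) VALUES (?, ?)",
    [(1, "Ana"), (2, "Ben"), (3, "Chloe")],
)
cur.executemany(
    "INSERT INTO orders (customer_id, amount) VALUES (?, ?)",
    [(1, 20.0), (1, 35.0), (2, 15.0)],
)
conn.commit()

# LEFT JOIN keeps customers with no orders; GROUP BY gives one row per customer.
rows = cur.execute(
    """
    SELECT c.name,
           COUNT(o.id) AS n_orders,
           COALESCE(SUM(o.amount), 0) AS total
    FROM customers AS c
    LEFT JOIN orders AS o ON o.customer_id = c.id
    GROUP BY c.id, c.name
    ORDER BY total DESC
    """
).fetchall()
conn.close()

for name, n_orders, total in rows:
    print(name, n_orders, total)
```

Joins and grouped aggregates like this one come up constantly in practice questions, and seeing the tables side by side helps the database-structure ideas from the books click.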
statistics
To be honest, this is where I need the most help. I’ll be sure to read over this section myself more often.
courses: This may seem obvious, but undergraduate statistics classes, along with classes covering probability or statistics for data science, are where a lot of data scientists build their understanding of the basics of stats. Stanford offers a free course here.
topics: The University of Virginia actually compiled a solid list of the statistics skills that data scientists should be familiar with. From their list, the skills I echo are:
Population: the source of data to be collected.
Sample: a portion of the population.
Variable: any data item that can be measured or counted.
Quantitative analysis: collecting and interpreting data with patterns and data visualization.
Descriptive statistics: summaries of the characteristics of a data set.
Inferential statistics: conclusions or predictions about a population drawn from a sample.
Central tendency (measures of the center): mean (average of all values), median (middle value of a sorted data set), and mode (the most frequent value in a data set).
Measures of the spread: range, variance, and standard deviation
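To make a few of the terms above concrete, here is a small sketch using only Python’s built-in statistics and random modules. The numbers are made up, and the simulated "heights" population is purely an assumption for illustration.

```python
import random
import statistics

# A tiny made-up data set: one variable measured eight times.
values = [2, 4, 4, 4, 5, 5, 7, 9]

# Central tendency
mean = statistics.mean(values)      # 5
median = statistics.median(values)  # 4.5
mode = statistics.mode(values)      # 4

# Spread
value_range = max(values) - min(values)         # 7
pop_variance = statistics.pvariance(values)     # 4, treats values as a whole population
pop_std = statistics.pstdev(values)             # 2.0
sample_variance = statistics.variance(values)   # 32/7, treats values as a sample (divides by n - 1)

# Population vs. sample: simulate a hypothetical population of heights,
# then draw a sample and use its mean to estimate the population mean.
random.seed(42)
population = [random.gauss(170, 10) for _ in range(10_000)]
sample = random.sample(population, 100)

print("population mean:", statistics.mean(population))
print("sample mean (an estimate of the above):", statistics.mean(sample))
```

The first half is descriptive statistics (summarizing the data you have); the last few lines hint at inferential statistics (using a sample to say something about a population you cannot fully measure).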
books: Higher-ed courses will typically recommend An Introduction to Statistical Learning, which is used in a lot of master’s courses. This book goes beyond the undergrad-level stats needed to start in data science, but I figured I’d share it as it is commonly recommended.
visualization
The visual side of data science is a skill you learn by doing.
skills to learn: Most data scientists will use Plotly or Seaborn for their visualizations. I’d recommend Plotly Express, the high-level interface to Plotly, as it tends to be a bit easier to dive into, while Seaborn is built on top of Matplotlib.
how to practice: Find a dataset on Kaggle and try out plots in both Plotly and Seaborn.
machine learning
where to start: Microsoft has an incredible Machine Learning For Beginners course that’s free on GitHub. I really enjoyed going through it, complete with quizzes and practice problems.
skills: Pretty much every introductory machine learning problem uses scikit-learn. Its docs are incredibly clear and helpful for learning how machine learning works.
books: Hands-On Machine Learning is a book I reference regularly. It does an incredible job of building your understanding of machine learning from the ground up.
online resources:
Kaggle Competitions are a great place for you to practice machine learning skills, but also to look at how other folks approach problems. Seeing how data scientists tackle problems can be incredibly enlightening for your own growth.
This Harvard course covers a lot of machine learning techniques and gives the solid mathematical background required for data science as well.
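For a taste of the scikit-learn workflow mentioned above, here is a minimal sketch, assuming scikit-learn is installed. The choice of the bundled iris dataset and a logistic regression model is mine, purely for illustration; none of the resources above prescribe it.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Load a small built-in dataset: flower measurements (X) and species labels (y).
X, y = load_iris(return_X_y=True)

# Hold out 25% of the rows so the model is scored on data it has not seen.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y
)

# Fit a simple classifier on the training rows only.
model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)

# Predict on the held-out rows and compare against the true labels.
predictions = model.predict(X_test)
accuracy = accuracy_score(y_test, predictions)
print(f"Test accuracy: {accuracy:.2f}")
```

The split, fit, predict, and score steps stay the same when you swap in other scikit-learn models, which makes this pattern a handy template for Kaggle practice.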
building a project
With Python, SQL, stats, ML, and visualization, you are in a great spot to try building something yourself. Kaggle is a great, hassle-free place to start. As you’re working on a dataset, building a model, and evaluating performance, work through the problems and bugs you hit, and treat them as encouragement to learn rather than as roadblocks. Perhaps take the next step and source your own data on something that interests you. Solving problems from your everyday life can be a practical way to activate your data science skills.
technical communication
I wish I had more resources to provide on this, but technical communication is really a skill you’ll learn over time. And you’ll only learn it if you put yourself in potentially uncomfortable situations, like having to explain some code over Zoom or taking a speaking role in a data science class project. Some tips? For anything you learn or any project you build, try explaining the concept or what you built to someone completely unfamiliar with data science. Practice truly does make all the difference in communication skills.
happy learning!
What skills did I miss? Are there areas of data science you’re still struggling to learn? Let me know. I’m always trying to find new resources, and I hope this list can grow as folks chime in.
Thanks for reading, see you next week!