Everyone wants to be a data scientist nowadays. It’s the “sexiest job of the 21st century,” according to Harvard Business Review. It's intellectually rewarding, pays well, and has consistently strong job growth.
The hard part, though, is landing your first data scientist job. It seems every company has that "two years' experience" requirement, which makes it a challenge for newcomers to break into the field.
It took me five years after graduating with my undergrad degree to land my first data scientist job. I know others who transitioned into the role sooner, but I had to network and grow my skill set through graduate school before I was ever seriously considered for the job.
Like most analytics jobs, data scientist roles require a solid foundation in technology, with the addition of statistical expertise. It simply doesn’t matter how good a strategist or communicator you are: you can’t succeed in a data scientist role without those skills.
Ironically, long-term success requires the opposite focus. After you have mastered the technical and statistical skills, you have to actively work to expand your soft skills. That will ensure your talents translate into a positive impact for your stakeholders.
There's a long list of things to learn, so it helps to narrow down what you should focus on. Below are six specific skill groups that will help you either land a data scientist job or improve in your existing data scientist role.
1. SQL (Structured Query Language)
SQL stands for structured query language. It's used to query and manage data stored in relational databases. Since data science is all about... well, data, it makes sense that you should learn the most common language for working with it!
There are several SQL database systems out there. Some of the more common ones include Oracle, Microsoft SQL Server, MySQL, PostgreSQL, and the database services on Amazon Web Services (AWS).
Don't worry about picking the right one to learn. The SQL syntax is very similar across these tools. There are some minor differences, but they're easy enough to figure out that it's not a big deal if you learn only one tool to get started.
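To give a feel for what that shared syntax looks like, here's a minimal sketch that runs a typical query using SQLite, a small database engine that ships with Python. The orders table, its columns, and its rows are all made up for illustration. The SELECT / GROUP BY / ORDER BY pattern is standard SQL and should look much the same in the other systems.

```python
import sqlite3

# In-memory SQLite database. The table and its rows are invented for this example.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (customer TEXT, region TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?, ?)",
    [
        ("Ana", "West", 120.0),
        ("Ben", "East", 80.0),
        ("Ana", "West", 50.0),
        ("Cho", "East", 200.0),
    ],
)

# Count orders and total revenue per region, largest total first.
query = """
    SELECT region, COUNT(*) AS n_orders, SUM(amount) AS total
    FROM orders
    GROUP BY region
    ORDER BY total DESC
"""
rows = conn.execute(query).fetchall()
for region, n_orders, total in rows:
    print(region, n_orders, total)

conn.close()
```

Running it prints one line per region: East 2 280.0, then West 2 170.0.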
You don't have to spend money to learn SQL either, because there are so many free resources out there for learning the language. My favorite is the W3Schools course. I took it when I started my first analytics job. It's practical and has lots of interactivity, as well as little quizzes along the way.
2. Statistical Programming Language
You would think SQL would be the only programming language you'd need to know, right? Sadly, that's not true. You'll need to learn a statistical programming language to become a data scientist.
The two big programming languages used for data science are R and Python. Both have their strengths and weaknesses.
R is my preferred language. It's more firmly rooted in statistics whereas Python is more rooted in computer engineering. If you want to learn R, download the beta version of my book, R Programming in Plain English.
Python is not a statistical programming language, per se, but it has a large collection of data science libraries. It also works better in production environments than R, something I'm told that R proponents are fighting hard to fix.
There's another programming language called SAS. SAS is more popular among old-school statisticians than it is among data scientists. Unlike Python and R, SAS costs money to use. It's mostly older institutions that use SAS, but these companies are major employers, so there's still value in learning it.
3. Advanced Statistics
It's hard to fathom data science without advanced statistics. To become a data scientist, you need to learn about linear and logistic regression, multivariate statistics, survival analysis, and more.
That's why so many data scientist jobs require a master's degree. You'll take more statistics courses in a graduate program than you would with a bachelor's degree.
Sadly, many people are led to believe this isn't true. They'll tell you to focus on programming and read about the statistics on the internet.
I can tell you from experience, though, that it's far, far better to learn statistics from an accredited master's program than it is to create your own curriculum and read blog posts on the internet.
The reason is that there's a lot of nuance to which methods work in which scenarios. That's hard to teach via blog posts on the internet.
If you want to get hands-on experience with analytics and data while you pursue a master's degree, you can take online programs at KU-Edwards or other universities. In the meantime, I suggest getting a job such as report or Tableau developer, data analyst, or data engineer.
4. Machine Learning
Many people make machine learning into something that it's not. It's not about creating sentient computers that will bring about the robot apocalypse, like in The Matrix or The Terminator. It's really a very specific application of the statistics you'll learn.
Most traditional research forms hypotheses about relationships and tests them with statistics. With machine learning, you use statistics to uncover those relationships to make predictions or learn something new about the data that you wouldn't have known otherwise.
It's a very subtle distinction, but it's an important one to understand.
Most machine learning methods use advanced statistics, but there are simpler methods that entry-level analysts can use now. The most popular of these simpler methods is the decision tree. You can build one in R or Python, and the result is very intuitive to use and interpret.
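To show why decision trees are so easy to interpret, here's a minimal sketch of one written in plain Python. It's a toy, not a production implementation: it assumes numeric features, picks splits by Gini impurity, and uses a made-up customer data set (tenure in months, number of support tickets) with invented "churn" and "stay" labels. In practice you'd more likely reach for an R or Python library instead of writing this yourself.

```python
from collections import Counter


def gini(labels):
    """Gini impurity of a list of class labels (0.0 means perfectly pure)."""
    n = len(labels)
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())


def best_split(rows, labels):
    """Return the (feature, threshold) that most reduces impurity, or None."""
    best, best_score = None, gini(labels)
    n = len(rows)
    for f in range(len(rows[0])):
        for t in sorted({r[f] for r in rows}):
            left = [y for r, y in zip(rows, labels) if r[f] <= t]
            right = [y for r, y in zip(rows, labels) if r[f] > t]
            if not left or not right:
                continue  # a split must put data on both sides
            score = (len(left) * gini(left) + len(right) * gini(right)) / n
            if score < best_score:
                best, best_score = (f, t), score
    return best


def build_tree(rows, labels, depth=0, max_depth=3):
    """Recursively split the data; a leaf is just the majority label."""
    split = best_split(rows, labels) if depth < max_depth else None
    if split is None:
        return Counter(labels).most_common(1)[0][0]
    f, t = split
    left = [(r, y) for r, y in zip(rows, labels) if r[f] <= t]
    right = [(r, y) for r, y in zip(rows, labels) if r[f] > t]
    return {
        "feature": f,
        "threshold": t,
        "left": build_tree([r for r, _ in left], [y for _, y in left],
                           depth + 1, max_depth),
        "right": build_tree([r for r, _ in right], [y for _, y in right],
                            depth + 1, max_depth),
    }


def predict(tree, row):
    """Walk from the root to a leaf and return that leaf's label."""
    while isinstance(tree, dict):
        side = "left" if row[tree["feature"]] <= tree["threshold"] else "right"
        tree = tree[side]
    return tree


# Invented data: (tenure_months, support_tickets) for eight customers.
rows = [(2, 5), (3, 4), (24, 1), (36, 0), (5, 6), (30, 2), (1, 1), (28, 5)]
labels = ["churn", "churn", "stay", "stay", "churn", "stay", "churn", "stay"]

tree = build_tree(rows, labels)
print(tree)
print(predict(tree, (4, 3)))
```

On this toy data the whole model boils down to a single rule: customers with a tenure of five months or less are predicted to churn, and everyone else is predicted to stay. That readability is the main appeal of the method.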
5. Cross Validation Techniques
Machine learning models often fail when applied to new observations. That's usually because the data used in development doesn't match what the model sees in production, or because the model fit the quirks of its training data too closely. In simpler terms, the model isn't accurate when it tries to make real-world predictions. This can damage the credibility of data scientists in an organization.
One real-world example I read about involved a group of data scientists that worked with a police department to flag officers at risk of using excessive force. Sadly, their model led to a large number of false positives. After a while, the police officers saw it as a badge of honor to be flagged because it was so common. It meant "you're one of us now!" It didn't matter whether the data scientists later improved the model – the damage was done.
Fortunately, cross validation techniques allow you to train a model on one subgroup of your data and test it on one or many other subgroups. This gives you a more honest estimate of how the model will perform on new observations in the future.
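Here's a minimal sketch of one common technique, k-fold cross validation, in plain Python. The data is split into k subgroups (folds); each fold takes a turn as the test set while the model is trained on the rest. The model here is a simple straight-line fit, and the data is synthetic, generated just for this example. The function names are my own, not from any library.

```python
import random


def k_fold_indices(n, k, seed=0):
    """Shuffle the row indices 0..n-1 and deal them into k folds."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]


def fit_line(xs, ys):
    """Ordinary least squares fit for y = slope * x + intercept."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def cross_validate(xs, ys, k=5):
    """Return the mean squared error on each held-out fold."""
    scores = []
    for test_idx in k_fold_indices(len(xs), k):
        held_out = set(test_idx)
        train_idx = [i for i in range(len(xs)) if i not in held_out]
        # Train only on the rows that are NOT in the current test fold.
        slope, intercept = fit_line([xs[i] for i in train_idx],
                                    [ys[i] for i in train_idx])
        # Score only on the rows the model has never seen.
        errors = [(ys[i] - (slope * xs[i] + intercept)) ** 2 for i in test_idx]
        scores.append(sum(errors) / len(errors))
    return scores


# Synthetic data: a straight line (y = 2x + 1) plus random noise.
rng = random.Random(42)
xs = list(range(20))
ys = [2 * x + 1 + rng.gauss(0, 1) for x in xs]

scores = cross_validate(xs, ys, k=5)
print("MSE per fold:", scores)
print("Average MSE:", sum(scores) / len(scores))
```

If the error on the held-out folds is much worse than the error on the training data, that's a warning sign that the model won't hold up on new observations.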
There are several cross validation techniques. If you want to learn more about this topic, I suggest the book Statistical Learning. It's a fairly straightforward book, and its chapter on cross validation is one of the best chapters I've read in any data science or statistics textbook.
6. People Skills
Many people in STEM lack people skills. This gets back to what I said earlier – to get a STEM job, you have to meet the basic skills requirements. For data science, that’s statistics and programming.
That means many people with great interpersonal skills are rejected in favor of people who know statistics and programming.
This isn’t a bad thing. If you’re going to build a sophisticated data science operation, you need to prioritize people who actually have the aptitude and skill set to deliver these services over people who are merely easy to work with.
The problem, though, is that stakeholders often don’t know what data science really means. They often fail to distinguish between sophisticated machine learning programs and more routine business intelligence solutions. They also don’t always grasp what the benefits are, or how their problems can be solved with data science.
Many who invest in data science programs do so because it’s a popular trend in the business community.
That means there’s still a huge burden placed on data science teams to bridge that gap – between the talent and the stakeholder – so that real value can be delivered.
It takes people skills to bridge that gap. And I worry not enough data scientists are making that effort. The more often the business community fails to get real value out of data science, the more likely it is to turn its back on the field.
In order to become a data scientist, focus on the technical and statistical skills required. Learn SQL, Python or R programming, higher-level statistics, machine learning concepts, and cross validation.
In the long run, improve your people skills to ensure that your work delivers value to the stakeholder.