Remember the Gold Rush of the 19th century? Well, maybe you don’t, most probably because it began more than a century and a half ago. During that period, people travelled vast distances, and some even migrated permanently, to try their hand at prospecting for gold. Their only motivation was to strike it rich and build a much better life for themselves. It wouldn’t be too wrong to say that, in the current day and age, data is that gold for almost all major corporations.
You won’t actually find corporate employees going around the streets hunting for data (actually, they do, and it’s called a survey). But there’s no denying that major companies have completely changed the way they work since they harnessed the power of data and analytics. Companies as diverse as a software giant and a fast food retail chain are making major decisions by applying the same data techniques. With all this and more, it is evident that data science is a great career choice to make right now. But with automation looming large over the industry, is it really a safe option?
The different disciplines that combine to create the field of data science
Look Ma, I’m a Scientist!
While ‘data science’ and ‘data scientist’ might be buzzwords, there is a general lack of awareness about what the role actually entails, or whether it is even an actual role at all. A broad definition of data science encompasses any kind of insight drawn from a large enough set of data. But that isn’t helpful for someone who wants to be a data scientist. Without knowing what the role involves, they would be flying hazy at best, blind at worst.
Mr. Arun Krishnan, Founder and CEO, nFactorial Analytical Sciences, says, “A data scientist is a combination of a computer scientist and a statistician. She has skills in Math, Statistics, Programming, Databases, Data Visualisation and Domain Knowledge, in addition to having good communication skills to tell a story with the data. A data analyst is probably someone with logical and statistical skills without the others mentioned above.”
Mr. Arun Krishnan, Founder and CEO, nFactorial Analytical Sciences
Automation is a tool like any other. Data scientists would still be required to bring everything together into a coherent whole and to enable interpretation and insights from the same data.
While the title ‘data scientist’ does not reveal much by itself, there are quite a few job roles that go by it. All of them might be called Data Scientists, but the roles can be quite different when compared. The following are not actual job titles and are only meant to give an idea of the possible tasks:
Data Analytics: While data science is well established as a separate field now, many companies still equate a data scientist with a data analyst. Typical responsibilities in this role might involve pulling data out of MySQL databases, becoming a master at Excel pivot tables, and producing basic data visualisations (for example, line and bar charts). This role is good for getting the hang of things without much risk, and even experimenting once in a while.
Data Wrangler: Small to medium companies that are switching to a better data infrastructure, because increased traffic has brought increased data, have a demand for a role like this. This role usually involves setting up the infrastructure along with providing analysis and is suitable for someone with a software engineering background. This one’s for the risk takers who want an opportunity to shine in this field.
Data Jedi: Data drives these companies and data is their product. Usually, in such an organisation, the degree of machine learning and statistics involved is high. As a result, this type of role is ideal for someone with a strong academic background in those fields and a strong desire to continue down the same road.
Data Consultant: Companies with a reasonable stake in data, but not data companies themselves, are usually on the lookout for this role. When joining a company like this as a data scientist, you will probably be a part of an established team of data scientists. While your data science capabilities are important, it is equally important here that you can deal with production code, analyse numbers, visualise information and so on.
The interesting thing is, all of these companies will post their job requirements as ‘Data Scientist’, even though they require different skill sets and experience levels. Depending on the company you are applying to, you might want to give more importance to certain skills compared to others.
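To give a flavour of the first role above, here is a minimal Python sketch of the kind of summary an Excel pivot table produces. The sales rows, the field names (region, month, sales) and the pivot_sum helper are all invented for illustration; in a real job the rows would come out of a database such as MySQL.

```python
from collections import defaultdict


def pivot_sum(rows, row_field, col_field, value_field):
    """Sum value_field for every (row_field, col_field) combination."""
    table = defaultdict(lambda: defaultdict(float))
    for row in rows:
        table[row[row_field]][row[col_field]] += row[value_field]
    # Convert the nested defaultdicts into plain dicts for easy printing.
    return {r: dict(cols) for r, cols in table.items()}


# Invented sample data, standing in for rows pulled from a database.
SALES = [
    {"region": "North", "month": "Jan", "sales": 120.0},
    {"region": "North", "month": "Feb", "sales": 80.0},
    {"region": "South", "month": "Jan", "sales": 200.0},
    {"region": "North", "month": "Jan", "sales": 30.0},
]

if __name__ == "__main__":
    for region, months in pivot_sum(SALES, "region", "month", "sales").items():
        print(region, months)
```

Each region becomes a row and each month a column, which is essentially what dragging fields into a pivot table does.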
With the different requirements under the broad definition of ‘data scientist’, it’s admittedly confusing whether one should focus on one particular skill or another. But broadly, there are six different areas where you need to focus:
Tools - Be it a data centric company or a consulting company, you will definitely be expected to know the popular software tools for data science, namely statistical programming languages like R and SAS, along with a database query language like SQL.
Statistics - This is where you go back to your basic stats class. Depending on the role, you may need up to an advanced level of knowledge of statistics for your job. It will definitely be a recurring topic in your interview, so it is better to be prepared. This is especially important for a role in a non-data company that is data driven, where you will be expected to present results, helping in decision making.
Multi-variable calculus and linear algebra- If you’re wondering why a data scientist would need to learn this when there are tools out there, programmed in R or other languages, that can take care of it, then you’re not alone. One reason is that you might face questions on these topics in your interview, since they form the basis for a lot of the techniques used in data science. Another is that if the company decides to build its own implementation in house, you will need this knowledge.
Data Visualisation- Often, the people who will be using the results of your work will not have in-depth technical knowledge of data science. Hence, it is really important to have visualisation skills, which will help you communicate your data in an understandable form to the decision makers. This is especially useful if you’re in a company that is new to the data science field.
Software Engineering- If you join a smaller company that is starting out with data analysis and hiring a data scientist for the first time, you can expect to take charge of other development work immediately and to be responsible for developing data driven products down the line. A background in software engineering will be highly useful here.
Machine Learning- In companies dealing with a large and complex data set, or where the product itself is data driven, a knowledge of machine learning is a big plus. But it isn’t exactly a deal breaker. Knowing when to use which technique is more important here than knowing how to program an implementation.
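To see why the calculus and linear algebra in the list above matter, consider fitting a straight line to data. Setting the derivatives of the squared error to zero gives a closed form for the slope and intercept, which the sketch below implements in plain Python. The function name fit_line and the sample numbers are made up for illustration; in practice a library would usually do this for you.

```python
def fit_line(xs, ys):
    """Fit y = slope * x + intercept by ordinary least squares."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need at least two (x, y) pairs of equal length")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of squared deviations of x, and of the x-y cross deviations.
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    if sxx == 0:
        raise ValueError("all x values are identical; the slope is undefined")
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


if __name__ == "__main__":
    # Made-up points that lie exactly on the line y = 2x + 1.
    slope, intercept = fit_line([1, 2, 3, 4], [3, 5, 7, 9])
    print(slope, intercept)
```

Knowing where these two formulas come from is what lets you judge when a straight line is the wrong model in the first place.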
If you are an experienced professional who has worked in related fields earlier, it may or may not work in your favour. As Mr. Arun puts it, “Anyone who has worked with data, in analysing, parsing and interpreting, can hope to get started. However, to become a data scientist, it does need additional IT skills or at least an overall awareness and knowledge of technologies which might be difficult to catch up on for very experienced people”.
It’s good to know that everybody is going the ‘data’ way and there is a genuinely high demand for professionals in that area. But before you jump on the bandwagon and start preparing yourself, you need to know what kind of company you want to work for, based on the information given earlier. Why is this important? With this one decision, you might entirely change the kind of work you do.
E-commerce is a segment where data scientists and data experts are in high demand. The work there involves logistics optimisation, customer preference analysis, monitoring campaigns, showcasing sales performance to vendors and more. Another segment where data plays a big role is live apps (apps that connect service providers to consumers live), like a cab aggregation service or a food ordering service. Such services usually have algorithms that are meant to balance the availability of the service against the profitability of providing it, and these algorithms are in turn built on the basis of data!
Even the more traditional field of advertising has been leveraging data ever since mobile advertising grew immensely in popularity. Don’t take our word for it. Mr. Sunil Nandihalli, Head of Data Sciences at AppLift, a mobile ad tech company, explains, “Data can be used in many ways to create a comprehensive advertising plan for a particular campaign. It enables more targeted advertising, which ensures that users are able to view ads that are most relevant to them. For advertisers, there is greater visibility on their return on investment, thus helping fuel future decisions with respect to marketing spend. It is famously said that we have moved from an era of ‘mad men’ (referring to the famous American drama series) to ‘math men’, and rightly so. The new digital marketing landscape is being driven by a growth of programmatic advertising, where data is the foundation of performance-driven marketing.”
Mr. Sunil Nandihalli, Head of Data Science at AppLift.
“There are a lot of opportunities to leverage data to improve business efficiencies and we have just scratched the surface of the huge realm of possibilities out there.”
Tools and their courses
While personal talents are essential to make it into the data science industry, strong academic know-how is important as well. According to a study by Burtch Works, almost 88% of data scientists have at least a Master’s degree and 46% even have a Ph.D. That being said, companies are desperately on the lookout for candidates with real world experience that can complement and even surpass the academic proficiency needed for the roles.
As in most fields now, if you don’t have the capacity or the time to complete a graduate or postgraduate certification in data sciences, you can go the self-guided way with MOOCs. As Mr. Sunil puts it, “Python and R should definitely be part of the toolkit. An important point to note is that depending solely on tools cannot help in developing the vital technical skills required in a data scientist. Understanding the underpinnings (mainly mathematics) of the techniques used is very important.”
While developing the understanding is up to you, this link talks about the essential technical and non-technical skills that you will need and some of the best online sources to learn.
When things are at the scale of the web, normal data analysis will probably not work and it has to be taken up a notch with more powerful tools. In fact, many data scientists work with distributed workloads that cannot be run on a single machine. And the keyword for that is Hadoop.
Hadoop is an open-source framework widely used for the storage and large scale processing of data sets on clusters of commodity hardware. Two names you will hear alongside it are MapReduce and Spark. MapReduce is the programming paradigm that allows for massive scalability across the servers in a Hadoop cluster: a job is split into a map step that runs in parallel on chunks of the data and a reduce step that combines the results.
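The MapReduce idea is easier to grasp with a toy example. The sketch below counts words in plain Python on a single machine; it only imitates the map, shuffle and reduce steps and does not use Hadoop itself. The function names and the sample sentences are invented for illustration.

```python
from collections import defaultdict


def map_phase(document):
    """Map step: emit a (word, 1) pair for every word in one document."""
    return [(word.lower(), 1) for word in document.split()]


def shuffle_phase(pairs):
    """Shuffle step: group all emitted values by their key (the word)."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(grouped):
    """Reduce step: collapse each key's list of values into one count."""
    return {key: sum(values) for key, values in grouped.items()}


def word_count(documents):
    # On a real cluster each document (or block of a file) would be mapped
    # on a different machine; here the "machines" are just loop iterations.
    pairs = []
    for document in documents:
        pairs.extend(map_phase(document))
    return reduce_phase(shuffle_phase(pairs))


if __name__ == "__main__":
    docs = ["data is the new gold", "gold rush for data"]
    print(word_count(docs))
```

Because each map call looks at only one document and each reduce call at only one word, both steps can be spread across many servers, which is where the scalability comes from.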
Apache Spark is more like Hadoop’s Swiss Army knife. It is a fast, general purpose data processing engine that can run on a Hadoop cluster and adds capabilities such as near real-time stream processing. While there are online resources available to learn these platforms, true proficiency can only come through experience. Thankfully, both are open-source and there are a lot of community resources available to get you started.
The big name in big data
Kaggle is an online community dedicated to data science. It hosts competitions and leaderboards for data science problem solving regularly. You can even run your code on the cloud and get community feedback on it. Mr. Sunil says, “Kaggle is an interesting website where one can solve problems (and learn whatever necessary along the way) to become proficient data scientists.” Other than that, it is always productive to have your own pet project as well. Go here for data.
Some of the most useful certifications in Data Science are:
Certified Analytics Professional at https://www.certifiedanalytics.org/
CCP: Data Scientist at https://dgit.in/CCPDSCloudera
EMC: Data Science Associate at https://dgit.in/DataScientistEMC
SAS Certified Predictive Modeler at https://dgit.in/SASGlblCrtfct
Being a data scientist might sound like a very cool job (which it definitely is) after all the information given earlier. But right now, the conclusion is yours to draw, not ours, and it is only one of the many conclusions to come. “Data Science is as much art as it is science,” Mr. Arun rightly states. With decisions being automated every day, a specialisation within data science is essential for long term job security. To stay relevant in your role, being curious and creative in the scientific context is as much of a necessity as the technical skills mentioned earlier.
Also, remember that your conclusions will affect how major decisions are taken and will need to be understood by people with little to no technical knowledge. You need to be effective at working with a team and have great communication skills. Who knows, being a storyteller might be much more useful than all that mathematical and programming knowledge. After all, you’re the one who’ll be writing your own success story here!
This article was first published in the October 2016 issue of Digit magazine.