In 2012, Harvard Business Review touted data scientist as the sexiest job of the 21st century. Since then, data science has received a lot of attention around the world, cutting across industries and fields. Many people wonder what the fuss is all about. At the same time, others have ventured into the field and found their calling.
Eight years later, the chatter about data science and data scientists continues to garner headlines and conversations. Especially with the current pandemic, suddenly data science is on everyone’s mind. But what does data science encompass?
With today's technology, organizations collect terabytes upon terabytes of data daily. From the websites we visit - how long, how often - to what we purchase and where we go, our digital footprint is an immense source of data for many businesses. Between our laptops, smartphones and tablets, almost everything we do translates into some form of data.
On its own, this raw data is of little use to anyone. Data science is the process that repackages the data to generate insights and answer business questions for the organization. By combining domain understanding, programming and analytical skills with business sense, data scientists convert existing data into actionable insights that drive business growth. It is the processed data that is worth its weight in gold. With data science, we can uncover insights and behavioural patterns in existing data, or even predict future trends.
Here is where our highly-sought-after data scientists come in.
The data scientist has a multifaceted role in an organization. They need a wide range of knowledge, as they must marry a plethora of methods, processes and algorithms with computer science, statistics and mathematics to process the data into a format that answers the critical business questions meaningfully. With these actionable insights, the company can make the plans most likely to advance its business goals.
To churn out the insights and knowledge that everyone needs these days, data science has become more of a craft than a science, despite its name. Data scientists need to be trained in mathematics yet have enough creative and business sense to find the answers they are looking for in the giant haystack of raw data. They are the ones responsible for helping to shape future business plans and goals.
It sounds like a mighty hefty job, doesn’t it? That is also why it is one of the most sought-after jobs these days. The field is evolving rapidly, and keeping up with the latest developments takes a lot of dedication and time.
The only constant in this realm of change is the data science project lifecycle. Below, we briefly discuss the critical stages of the lifecycle. It is natural to envision it as a strictly circular process, but there will be a lot of working back and forth within some phases to ensure that the project runs smoothly.
Stage One: Business Understanding
As a child, were you one of those children who always asked why? Even when the adults gave you an answer, did you follow up with another “why”? Those children have probably grown up to be data scientists, as their favourite question, it seems, is: Why?
By asking why, they get to know the problem that needs to be solved, and the critical question emerges. Once there is a clear understanding of the business problem and question, the work can begin. Data scientists want to ensure that the insights that come from this question are supported by data and will allow the business to achieve the desired results. Therefore, the foundation stone of any data science project is understanding the business.
Stage Two: Data Understanding
Once the problem and question have been confirmed, you need to lay out the objectives of the project by determining the variables to be predicted. You must know what you need from the data and what the data should address. You must collate all the information and data, which can be quite difficult. Agreement on the data sources and on the required characteristics of the data needs to be reached before moving forward.
Throughout this process, you need a clear understanding of how the data can and will be used for the project. Managing the data carefully at this stage is vital, as the data sourced here will define the project and how effective the solutions will be in the end.
Stage Three: Data Preparation
It has often been said that the bulk of a data scientist’s time is spent preparing the data for use. A 2016 report from CrowdFlower pegged the share of time spent on cleaning and organizing data at 60%. That is more than half their day!
Since data comes in various forms and from a multitude of sources, there will be no standardization or consistency throughout the data. Raw data needs to be managed and prepared - with all the incomplete values and attributes fixed, and all conflicting values in the data eliminated. This process requires human intervention, as you must be able to discern which data values are required to reach your end goal. If the data is not prepared according to the business understanding, the final result might not be suitable to address the issue.
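To make the preparation step more concrete, here is a minimal sketch in Python of the kind of clean-up described above: standardizing formats, resolving conflicting values and filling in missing ones. The records, the field names and the rule that a later non-missing value wins are all assumptions made up for illustration. In a real project, these choices would follow from the business understanding.

```python
from collections import defaultdict
from statistics import median

# Hypothetical raw records merged from two sources; None marks a missing value.
raw_records = [
    {"customer_id": 1, "age": 34, "country": "SG"},
    {"customer_id": 2, "age": None, "country": "MY"},
    {"customer_id": 3, "age": 29, "country": None},
    {"customer_id": 3, "age": 31, "country": "SG"},   # conflicts with the row above
    {"customer_id": 4, "age": 45, "country": "sg "},  # inconsistent formatting
]


def prepare(records):
    """Standardize text, resolve conflicting rows, and fill missing ages."""
    # 1. Standardize formatting so that equal values compare equal.
    cleaned = []
    for rec in records:
        country = rec["country"]
        cleaned.append({
            "customer_id": rec["customer_id"],
            "age": rec["age"],
            "country": country.strip().upper() if country else None,
        })

    # 2. Resolve conflicts: keep one row per customer, and let a later
    #    non-missing value overwrite an earlier one (an assumed rule).
    merged = defaultdict(dict)
    for rec in cleaned:
        row = merged[rec["customer_id"]]
        for key, value in rec.items():
            if value is not None:
                row[key] = value
            else:
                row.setdefault(key, None)

    # 3. Fill any age that is still missing with the median of the known ages.
    rows = list(merged.values())
    fallback_age = median(r["age"] for r in rows if r["age"] is not None)
    for r in rows:
        if r["age"] is None:
            r["age"] = fallback_age
    return rows


prepared = prepare(raw_records)
for row in prepared:
    print(row)
```

Each of the three steps encodes a judgment call, which is why this stage cannot be fully automated.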
Stage Four: Modeling
Once the tedious process of preparation is over, it is time to produce the results that the project requires. Various techniques can be used, ranging from decision trees to neural networks. You must decide which technique is best suited to the question that needs to be answered. If required, multiple modeling techniques can be used, with each one applied as a separate task. Generally, a modeling technique is applied more than once, and more than one technique is used per project.
With each technique, parameters must be set based on specific criteria. You, as the data scientist, must apply your knowledge to judge the success of the modeling and rank the models by their results, according to pre-set criteria.
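As a rough illustration of ranking models against a pre-set criterion, the sketch below fits two simple candidates and compares them on held-out data. The data is invented, the two models (a mean baseline and a least-squares line) stand in for whatever techniques a project would really use, and mean absolute error is just one possible criterion.

```python
from statistics import mean

# Hypothetical data: (advertising spend, sales), both in thousands.
train = [(1, 12), (2, 14), (3, 17), (4, 19), (5, 22)]
test = [(6, 24), (7, 27)]  # held out, never used for fitting


def fit_mean_baseline(data):
    """Model 1: always predict the average target seen in training."""
    avg = mean(y for _, y in data)
    return lambda x: avg


def fit_linear(data):
    """Model 2: ordinary least-squares line y = slope * x + intercept."""
    x_bar = mean(x for x, _ in data)
    y_bar = mean(y for _, y in data)
    slope = (sum((x - x_bar) * (y - y_bar) for x, y in data)
             / sum((x - x_bar) ** 2 for x, _ in data))
    intercept = y_bar - slope * x_bar
    return lambda x: slope * x + intercept


def mean_absolute_error(model, data):
    """The pre-set criterion: average absolute error on the given data."""
    return mean(abs(model(x) - y) for x, y in data)


candidates = {
    "mean baseline": fit_mean_baseline(train),
    "linear regression": fit_linear(train),
}

# Rank the models from best (lowest held-out error) to worst.
ranking = sorted(
    ((name, mean_absolute_error(model, test)) for name, model in candidates.items()),
    key=lambda item: item[1],
)
for name, error in ranking:
    print(f"{name}: MAE = {error:.2f}")
```

Fixing the criterion before looking at the results is the important part. It keeps the ranking honest, whichever techniques are being compared.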
Stage Five: Evaluation
Once the results are in, we need to refer back to the business question from Stage One and decide whether the results answer it, and whether the model and data meet the objectives that the data science project set out to address.
The evaluation can also unveil other results that are not related to the business question but are good pointers to future directions or to challenges that the organization might face. These results should be tabled for discussion and used for new data science projects.
Final Stage: Deployment
This is almost the finishing line!
Now, with the evaluated results, the team needs to sit down and have an in-depth discussion on what the data shows and what the business needs to do based on it. The project team should come up with a suitable deployment plan to address the issue. The deployment will still need to be monitored and assessed along the way to ensure that the project is a success, backed by data.
This assessment normally restarts the project lifecycle, bringing you full circle.
Data is everywhere
In this day and age, we are surrounded by a multitude of data science applications, as the field crosses all industries. We will focus on these five industries, where data science is making waves.
Banking & Finance
Financial institutions were the earliest adopters of data analytics, and they are all about data! From using data for fraud or anomaly detection in their banking transactions to risk analytics and algorithmic trading - one will find that data plays a key role at all levels of a financial institution.
Risk analytics is one of the key areas where data science is used, as financial institutions depend on it to make strategic decisions for the financial health of the business. They need to assess each risk to manage and optimize their costs.
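As a toy illustration of anomaly detection on transactions, the sketch below flags amounts that sit far from a customer's typical spending. It uses a modified z-score based on the median and the median absolute deviation. The transaction amounts are invented, and the 3.5 cut-off is a common rule of thumb rather than a standard that any particular bank uses. Real fraud systems are far more elaborate.

```python
from statistics import median

# Hypothetical card transactions for one customer, in dollars.
amounts = [42.0, 18.5, 63.0, 25.0, 51.2, 30.0, 2950.0, 47.3]


def flag_anomalies(values, threshold=3.5):
    """Return the values whose modified z-score exceeds the threshold.

    The score is built on the median and the median absolute deviation
    (MAD), so a single extreme value does not distort the baseline.
    """
    centre = median(values)
    mad = median(abs(v - centre) for v in values)
    if mad == 0:
        # At least half the values equal the median; the score is undefined.
        return []
    return [v for v in values if 0.6745 * abs(v - centre) / mad > threshold]


suspicious = flag_anomalies(amounts)
print(suspicious)
```

A flagged transaction is only a candidate for review. Deciding whether it is actually fraudulent still takes domain knowledge.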
Logistics & Transportation
The world of logistics is a complex one. In a production line, the raw materials for a single product sometimes come from all over the world. A delay in any one part will affect the production line and can reduce output drastically. If logistical delays can be predicted, the company can switch quickly to an alternative so that there is no gap in the supply chain and the production line keeps running at optimum efficiency.
Healthcare
2020 has been an interesting year. It has been the battle of a lifetime for many of us. Months have passed, and yet the virus still rages on, wreaking havoc on lives and economies. Many countries have turned to data science applications to help in their fight against COVID-19.
With so much data generated daily, people and governments need to know various things such as:
- Epidemiological clusters, so people can be quarantined to stop the spread of the virus
- Tracking of symptoms across thousands of patients, to understand how the virus transmits and mutates and to help find vaccines
- Solutions to mitigate transmission
Manufacturing
In manufacturing, millions can be on the line each day, as there are so many moving parts that can cause delays, production issues and more. Data science is primarily used to boost production rates, reduce costs (workforce or energy), predict maintenance needs and reduce risks on the production floor.
This allows the manufacturer to make plans to ensure that the production line is always operating at the optimum level, providing the best output at any given time.
Retail (Brick & Mortar, Online)
Have you ever wondered why some products in a shop are placed next to each other, or how discounts on items work? All of that is based on data science.
Retailers track people’s shopping routes, purchases and basket matching to work out details like where products should be placed, or what should go on sale and when, to drive up the sales of an item. And that is just for in-store purchases.
Online data tracks what you are buying and suggests what you might want to buy next based on past purchase histories, or even tells you what you might want to add to your cart. That’s how your online supermarket suggests you buy bread if you already have a jar of peanut butter in your cart.
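The peanut butter and bread suggestion can be sketched with simple co-occurrence counting: look at which items were bought together in the past and suggest the most frequent companions of what is already in the cart. The baskets below are invented, and a real recommender would be considerably more sophisticated. This is only meant to show the basic idea.

```python
from collections import Counter
from itertools import permutations

# Hypothetical past orders; each set is one customer's basket.
past_baskets = [
    {"peanut butter", "bread", "jam"},
    {"peanut butter", "bread"},
    {"bread", "butter", "milk"},
    {"peanut butter", "jam"},
    {"milk", "cereal"},
    {"peanut butter", "bread", "milk"},
]

# Count how often each ordered pair (item, companion) shared a basket.
pair_counts = Counter()
for basket in past_baskets:
    pair_counts.update(permutations(sorted(basket), 2))


def suggest(cart, top_n=2):
    """Suggest the items most often bought alongside the cart's contents."""
    scores = Counter()
    for (have, other), count in pair_counts.items():
        if have in cart and other not in cart:
            scores[other] += count
    # Sort by count, then by name, so that ties are broken deterministically.
    ranked = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
    return [item for item, _ in ranked[:top_n]]


print(suggest({"peanut butter"}))
```

With these made-up baskets, bread is the item most often bought with peanut butter, so it comes out as the first suggestion.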
As a data scientist, you must always remember that the power is in the data. You need to understand how the data can be used to find the desired results for your organization. The right questions must be asked, which is why data science has become more of an art than a science.
Image Source: Data science Life Cycle