Data plays a huge role in today’s tech world. All technologies are data-driven, and humongous amounts of data are produced on a daily basis. A data scientist is a professional who is able to analyze data sources, clean and process the data, understand why and how such data has been generated, take insights from it, and make changes such that they profit the organization. These days, everything revolves around data. This demand for data is also increasing the demand for data science courses. While there are a lot of options available online, ensure you look at the best data science course before signing up for one.
With that being said, let’s jump into the details about the types of big data and the role of statistics in data science.
- Data Cleaning: It deals with gathering the data and structuring it so that it becomes easy to pass this data as input to any machine learning algorithm. This way, redundant, irrelevant data and noise can also be eliminated.
- Data Analysis: This deals with understanding more about the data, why the data has yielded certain results, and what can be done to improve it. It also helps calculate certain numerical values like mean, variance, distributions, and the probability of a certain prediction.
How the basics of statistics will serve as a foundation to manipulate data in data science
The basics of statistics include the terminology and the methods for applying statistics in data science. Statistics is the most important tool for analyzing data: its concepts provide insights into the data and make quantitative analysis possible. In addition to this, as a foundation, a data science aspirant must also know the basics and working of linear regression and classification algorithms. Our data science with python online course will streamline your skill gaps with an industry-oriented curriculum. Make sure you check that.
Terminologies associated with statistics
- Population: It is an entire pool of data from where a statistical sample is extracted. It can be visualized as a complete data set of items that are similar in nature.
- Sample: It is a subset of the population, i.e. it is an integral part of the population that has been collected for analysis.
- Variable: A characteristic, such as a quantity, that can be measured and can take different values. An individual observed value of a variable is called a data point or data item.
- Distribution: The way the values of the sample data are spread over a specific range of values.
- Parameter: It is a value that is used to describe the attributes of a complete data set (also known as ‘population’). Example: Average, Percentage
- Quantitative analysis: It deals with specific characteristics of the data, summarizing some part of it with values such as its mean, variance, and so on.
- Qualitative analysis: This deals with generic information about the type of data, and how clean or structured it is.
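The relationship between a population, a sample, and a parameter can be sketched in a few lines of Python. This is only an illustration under assumed values: the simulated "heights", the population size of 10,000, and the sample size of 200 are hypothetical and not taken from any real dataset.

```python
import random
import statistics

random.seed(0)  # fixed seed so the sketch is reproducible

# Hypothetical population: 10,000 simulated heights in centimetres.
population = [random.gauss(170, 10) for _ in range(10_000)]

# Parameter: a value that describes the whole population.
population_mean = statistics.fmean(population)

# Sample: a subset drawn at random from the population.
sample = random.sample(population, 200)

# The sample mean is an estimate of the population mean.
sample_mean = statistics.fmean(sample)

print(f"population mean: {population_mean:.2f}")
print(f"sample mean:     {sample_mean:.2f}")
```

The two printed means should be close but not identical, which is the reason a sample has to represent the population well before conclusions are drawn from it.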
How does analyzing data using statistics help gain deep insights into data?
Statistics serve as a foundation while dealing with data and its analysis in data science. There are certain core concepts and basics which need to be thoroughly understood before jumping into advanced algorithms.
Not everyone understands the performance metrics of machine learning algorithms like f-score, recall, precision, accuracy, root mean squared error, and so on. Instead, visual representation of the data and the performance of the algorithm on the data serves as a good metric for the layperson to understand the same.
Also, visual representation helps identify outliers and specific patterns, and shows summary metrics such as the mean, median, and variance. These help in understanding the middlemost value and how an outlier affects the rest of the data.
Statistical Data Analysis
Statistical data analysis deals with the usage of certain statistical tools that need knowledge of statistics. Software can also help with this, but without understanding why something is happening, it is impossible to get considerable work done in statistics and data science.
Statistics deals with data that is either univariate or multivariate. Univariate data, as the name suggests, involves a single variable, whereas multivariate data involves multiple variables. Discriminant analysis and factor analysis can be performed on multivariate data. On the other hand, univariate analysis and tests such as the Z-test and F-test can be performed if we are dealing with univariate data.
Data associated with statistics is of many types. Some of them have been discussed below.
Categorical data represents characteristics of people, such as marital status, gender, food they like, and so on. It is also known as ‘qualitative data’, and when there are only two categories it is sometimes called ‘yes/no data’. The categories can be coded with numerical values like ‘1’, ‘2’, where these numbers indicate one or another type of characteristic. These numbers are not mathematically significant, which means arithmetic on them (adding or averaging the codes, for example) is not meaningful.
Continuous data deals with values that are measured rather than counted, and that can take any value within a range, such as height, weight, or temperature. Predictions from a linear regression are continuous in nature. The distribution of continuous data is described by a probability density function.
On the other hand, discrete values can be counted and are discontinuous, i.e. they take separate, distinct values. The class labels predicted by logistic regression are considered to be discrete in nature (the probabilities behind them are continuous). Discrete data is non-continuous, and the density concept doesn’t come into the picture here. Its distribution is described by a probability mass function.
The Best way to Learn Statistics for Data Science
The best way to learn anything is by implementing it, by working on it, by making mistakes and again learning from it. It is important to understand the concepts, either by going through standard books or well-known websites, before implementing them.
Before jumping into data science, the core statistics concepts such as regression, maximum likelihood, distributions, priors, posteriors, conditional probability, Bayes theorem and the basics of machine learning have to be understood clearly.
Core statistics concepts
- Descriptive statistics: As the name suggests, it describes the data with the help of graphs, plots, or numbers. It organizes the data into a structure, and helps think about the attributes that highlight the important parts of the data.
- Inferential statistics: It deals with drawing inferences/conclusions from the sample data set which is obtained from the population (entire data set), based on the relationship identified between data points in the sample. It helps in generalizing the relationship to the entire dataset. It is important to make sure that the sample drawn from the population is relevant and represents the population accurately.
- Regression: The term ‘regression’, which is a part of both statistics and machine learning, describes how a line can be fit to a given set of data points, and how the fitted line summarizes the relationship between the input and the output. In machine learning terms, the line is learnt from the data rather than being explicitly programmed. Once fit, the line can be extended to make predictions for inputs that were not in the data.
- Maximum likelihood: It is a method that helps in finding values of parameters for a specific model. The parameter values are chosen such that the likelihood, i.e. the probability of the data values that were actually observed under the model, is maximum. For many common models this goes hand in hand with a small difference between the actual and predicted values, thereby reducing the error and increasing the accuracy of the predictions.
Note: This concept is generally used with logistic regression, where the output is 0 or 1 (yes or no). Here the fitted model gives the probability that a data point belongs to class 1, and maximum likelihood chooses the parameters that make the observed 0/1 labels most probable.
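As a minimal sketch of maximum likelihood, consider the simplest yes/no model: every observation is 1 with some unknown probability p. The observations below are hypothetical, and the grid search over candidate values is chosen for readability; it is not how logistic regression is fitted in practice.

```python
import math

# Hypothetical observations: 1 = "yes", 0 = "no".
observations = [1, 0, 1, 1, 0, 1, 1, 1, 0, 1]


def log_likelihood(p, data):
    """Log-probability of the observed data if each outcome is 1 with probability p."""
    return sum(math.log(p) if x == 1 else math.log(1 - p) for x in data)


# Candidate parameter values strictly between 0 and 1.
candidates = [i / 100 for i in range(1, 100)]

# The maximum likelihood estimate is the candidate that makes the
# observed data most probable.
best_p = max(candidates, key=lambda p: log_likelihood(p, observations))

# For this simple model the estimate also has a closed form: the fraction of 1s.
closed_form = sum(observations) / len(observations)

print(best_p, closed_form)
```

Both values agree here because the grid happens to contain the exact fraction of 1s; with a coarser grid the search would only land near it.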
Bayesian thinking deals with using probability to model the process of sampling, and being able to quantify the uncertainty associated with the data that would be collected.
Prior probability describes the level of uncertainty that exists before the data is collected and analysed.
Posterior probability describes the uncertainty that remains after the data has been collected.
Machine learning algorithms are usually focussed on giving the best predictions as output with minimal errors, exact probabilities of specific events occurring and so on. Bayes theorem is a way of calculating the probability of a hypothesis (a situation, which might not have occurred in reality) based on our previous experiences and the knowledge we have gained by it. This is considered as a basic concept that needs to be known.
Bayes theorem can be stated as follows:
P(hypo | data) = (P(data | hypo) * P(hypo)) / P(data)
In the above equation,
P(hypo | data) is the probability of a hypothesis ‘hypo’ when data ‘data’ is given, which is also known as posterior probability.
P(data | hypo) is the probability of data ‘data’ when the specific hypothesis ‘hypo’ is known to be true.
P(hypo) is the probability of a hypothesis ‘hypo’ being true (irrespective of the data in hand), which is also known as prior probability of ‘hypo’.
P(data) is the probability of the data (irrespective of the hypothesis).
The idea here is to get the value of the posterior probability, given the data. The posterior probability is found for a variety of different hypotheses, and the hypothesis that has the highest posterior probability is selected. This is known as the most probable hypothesis, and is also known as the maximum a posteriori (MAP) hypothesis.
MAP(hypo) = argmax over hypo of P(hypo | data)
If the value of P(hypo | data) is replaced with the value we saw before, the equation would become:
MAP(hypo) = argmax over hypo of (P(data | hypo) * P(hypo)) / P(data)
P(data) is a normalizing term that makes the posterior probabilities sum to 1. Since it is the same for every hypothesis, it can be safely ignored when only the MAP hypothesis is needed.
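A small worked example may make Bayes theorem and the MAP hypothesis more concrete. All numbers are hypothetical: two hypotheses (‘spam’ and ‘not_spam’), assumed prior probabilities, and assumed probabilities of seeing the data (say, a message containing a particular word) under each hypothesis.

```python
# Hypothetical numbers, for illustration only.
priors = {"spam": 0.3, "not_spam": 0.7}        # P(hypo)
likelihoods = {"spam": 0.8, "not_spam": 0.1}   # P(data | hypo)

# Numerator of Bayes theorem for each hypothesis: P(data | hypo) * P(hypo)
unnormalised = {h: likelihoods[h] * priors[h] for h in priors}

# P(data): the sum of the numerators over all hypotheses.
evidence = sum(unnormalised.values())

# Posterior probabilities: P(hypo | data)
posteriors = {h: value / evidence for h, value in unnormalised.items()}

# MAP hypothesis: the hypothesis with the highest posterior probability.
# Dividing by P(data) does not change which hypothesis wins, so the
# unnormalised values are enough to pick it.
map_hypothesis = max(unnormalised, key=unnormalised.get)

print(posteriors)
print(map_hypothesis)
```

Picking the winner from the unnormalised values gives the same answer as picking it from the posteriors, which is why P(data) can be skipped when only the MAP hypothesis is needed.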
Naïve Bayes classifier
It is an algorithm that can be used with binary or multi-class classification problems. It is a simple algorithm wherein the probability for every hypothesis is simplified.
This is done in order to make the calculation more tractable. Instead of calculating the joint probability of all attribute values, P(data1, data2, …, datan | hypo), we assume that every attribute is independent of every other attribute when the respective output (class) is given.
This way, the equation becomes:
P(data1 | hypo) * P(data2 | hypo) * … * P(datan | hypo).
This classifier often performs quite well even in the real world with real data, where the assumption of attributes being independent of each other doesn’t hold good.
Once a Naïve Bayes classifier has learnt from the data, it stores a list of probabilities in a data structure. Probabilities such as ‘class probability’ and ‘conditional probability’ are stored. Training such a model is quick since only the probability of every class and of every attribute value given a class needs to be determined, and this doesn’t involve any optimization processes or tuning of coefficients to give better predictions.
- Class probability: It tells about the probability of every class that is present in the training dataset. It can be calculated by dividing the number of instances that belong to each class by the total number of instances.
- For two classes, class probability of class 0 = (number of instances of class 0) / (number of instances of class 0 + number of instances of class 1)
- Conditional probability: It is the probability of every input value given a class value. It can be calculated by finding how often an attribute value occurs among the instances of a given class, divided by the number of instances that have that class value.
- Conditional probability P(condition | result) = (number of instances with that condition and that result) / (number of instances with that result)
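The two kinds of probabilities that a Naïve Bayes classifier stores can be computed by simple counting. The sketch below assumes a tiny, made-up training set with a single attribute (weather) and two classes (‘play’ and ‘stay’); the names and values are hypothetical, and real implementations usually add smoothing for attribute values that were never seen with a class.

```python
from collections import Counter

# Hypothetical training data: (attribute value, class label)
rows = [
    ("sunny", "play"), ("sunny", "play"), ("rainy", "stay"),
    ("sunny", "stay"), ("rainy", "stay"), ("rainy", "play"),
    ("sunny", "play"), ("rainy", "stay"),
]

# Class probability: instances of each class / total number of instances.
class_counts = Counter(label for _, label in rows)
total = len(rows)
class_probability = {label: n / total for label, n in class_counts.items()}

# Conditional probability P(value | label):
# instances with that value and that label / instances with that label.
pair_counts = Counter(rows)
conditional_probability = {
    (value, label): n / class_counts[label]
    for (value, label), n in pair_counts.items()
}

print(class_probability)
print(conditional_probability)
```

No optimization step is involved: training is a single pass of counting over the data.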
Beyond the concepts themselves, once the user understands the way in which a data scientist needs to think, they will be able to focus on getting cleaner data and better insights, which lead to better analysis and, in turn, better results.
Introduction to Statistical Machine Learning
The methods used in statistics are important to prepare the data that is used to train and test a machine learning model. Some of these include outlier/anomaly detection, sampling of data, data scaling, variable encoding, dealing with missing values, and so on.
Statistics is also essential to evaluate the model that has been used, i.e. see how well the machine learning model performs on a test dataset, or on data that it has never seen before.
Statistics is essential in selecting the final and appropriate model to deal with that specific data in a predictive modelling situation.
It is also needed to show how well the model has performed, by taking various metrics and showing how the model has fared.
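Evaluating a model on a test dataset usually comes down to comparing its predictions with the actual values. As a minimal sketch, metrics such as accuracy, precision, recall, and F-score can be computed from counts of correct and incorrect predictions. The labels below are hypothetical test-set results for a binary classifier, where 1 is the positive class.

```python
# Hypothetical test-set labels (1 = positive class, 0 = negative class).
actual    = [1, 0, 1, 1, 0, 0, 1, 0, 1, 1]
predicted = [1, 0, 0, 1, 0, 1, 1, 0, 1, 0]

pairs = list(zip(actual, predicted))
tp = sum(1 for a, p in pairs if a == 1 and p == 1)  # true positives
tn = sum(1 for a, p in pairs if a == 0 and p == 0)  # true negatives
fp = sum(1 for a, p in pairs if a == 0 and p == 1)  # false positives
fn = sum(1 for a, p in pairs if a == 1 and p == 0)  # false negatives

accuracy = (tp + tn) / len(pairs)   # share of all predictions that are correct
precision = tp / (tp + fp)          # share of predicted positives that are correct
recall = tp / (tp + fn)             # share of actual positives that were found
f_score = 2 * precision * recall / (precision + recall)

print(f"accuracy={accuracy:.2f} precision={precision:.2f} "
      f"recall={recall:.2f} f-score={f_score:.2f}")
```

The sketch assumes at least one predicted positive and one actual positive; otherwise precision or recall would involve a division by zero and would need special handling.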
Metrics used in Statistics
Much real-world data can be approximated by a common pattern known as the Gaussian distribution or normal distribution. It is a bell-shaped curve that is fully described by two parameters, the mean and the standard deviation (or variance). The following measures are commonly used to summarize data:
- Mean: It is the arithmetic average of the data points, i.e. the sum of all values divided by the number of values. For a Gaussian distribution, it is also the central and most likely value.
- Mode: It can be understood as the data point that occurs the greatest number of times, i.e. the value with the highest frequency in the dataset.
- Median: It is a measure of central tendency of the data set. It is the middle number, that can be found by sorting all the data points in a dataset and picking the middle-most element. If the number of data points in a dataset is odd, one single middle value is picked up, whereas two middle values are picked and their mean is calculated if the number of data points in a dataset is even.
- Range: It refers to the value that is calculated by finding the difference between the largest and the smallest value in a dataset.
- Quartile: As the name suggests, quartiles are values that divide the data points in a dataset into quarters. It is calculated by sorting the elements in order and then dividing the dataset into 4 equal parts.
- Three quartiles are identified: The first quartile that is the 25th percentile, the second quartile which is the 50th percentile and the third quartile that is the 75th percentile. Each of these quartiles tells about the percentage of data that is smaller or larger in comparison to other percentiles of data.
Example: The 25th percentile is the value below which 25 percent of the data points fall, with the remaining 75 percent above it.
Quartiles help understand how the data is distributed around the median (which is the 50th percentile/second quartile).
There are other distributions as well, and it depends on the type of data we have and the insights we need from that data, but Gaussian is considered as one of the basic distributions.
- Variance: The average of the squared differences between every value and the mean of that specific distribution.
- Standard deviation: It is the square root of the variance, and indicates the dispersion of the data points around the mean, in the same units as the data.
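The measures described above can be computed with Python's built-in `statistics` module. The dataset is a hypothetical list of eight numbers, and the sketch uses the population forms of variance and standard deviation (dividing by the number of values); `statistics.variance` and `statistics.stdev` would give the sample forms instead.

```python
import statistics

# Hypothetical dataset.
data = [2, 4, 4, 5, 7, 9, 10, 12]

mean = statistics.mean(data)          # arithmetic average
median = statistics.median(data)      # middle value (mean of the two middle values here)
mode = statistics.mode(data)          # most frequent value
data_range = max(data) - min(data)    # largest value minus smallest value

# Three cut points that divide the sorted data into four parts.
q1, q2, q3 = statistics.quantiles(data, n=4)

variance = statistics.pvariance(data)  # average squared difference from the mean
std_dev = statistics.pstdev(data)      # square root of the variance

print(mean, median, mode, data_range)
print(q1, q2, q3)
print(variance, std_dev)
```

Different tools use slightly different conventions for computing quartiles, so the first and third quartile may not match exactly across libraries; the second quartile is always the median.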
In this post, we understood why and how statistics is important to understand and work with data science. We saw a few terminologies of statistics that are essential in understanding the insights which statistics would end up giving to a data scientist. We also saw a few basic algorithms that every data scientist needs to know, in order to learn other advanced algorithms.
If you wish to learn more about Data Science, Check out KnowledgeHut’s Data Science with Python Online Course on the page. We hope this gives you a fair idea of the topic and helps you with your next steps as a Data Scientist.
All the best for your Data Science journey!