Takeaways from this article
Data plays a huge role in today’s tech world. All technologies are data-driven, and humongous amounts of data are produced on a daily basis. A data scientist is a professional who is able to analyse data sources, clean and process the data, understand why and how such data has been generated, take insights from it, and make changes such that they profit the organization. These days, everything revolves around data.
The basics of statistics include terminologies, and methods of applying statistics in data science. In order to analyze the data, the important tool is statistics. The concepts involved in statistics help provide insights into the data to perform quantitative analysis on it. In addition to this, as a foundation, the basics and working of linear regression and classification algorithms must also be known to a data science aspirant.
Statistics serve as a foundation while dealing with data and its analysis in data science. There are certain core concepts and basics which need to be thoroughly understood before jumping into advanced algorithms.
Not everyone understand the performance metrics of machine learning algorithms like f-score, recall, precision, accuracy, root mean squared error, and so on. Instead, visual representation of the data and the performance of the algorithm on the data serves as a good metric for the layperson to understand the same.
Also, visual representation helps identify outliers, specific trivial patterns, and certain metric summary such as mean, median, variance, that helps in understanding the middlemost value, and how the outlier affects the rest of the data.
Statistical data analysis deals with the usage of certain statistical tools that need knowledge of statistics. Software can also help with this, but without understanding why something is happening, it is impossible to get considerable work done in statistics and data science.
Statistics deals with data variables that are either univariate or multivariate. Univariate, as the name suggests deals with single data values, whereas multivariate data deals with the multiple number of values. Discriminant data analysis, factor data analysis can be performed on multivariate data. On the other hand, univariate data analysis, Z-test, F-test can be performed if we are dealing with univariate data.
Data associated with statistics is of many types. Some of them have been discussed below.
Categorical data represents characteristics of people, such as marital status, gender, food they like, and so on. It is also known as ‘qualitative data’ or ‘yes/no data’. It takes numerical values like ‘1’, ‘2’, where these numbers indicate one or other type of characteristics. These numbers are not mathematically significant, which means it can’t be associated with each other.
Continuous data deals with data that is immeasurable, and can’t be counted, which basically continual forms of values are. Predictions from a linear regression are continuous in nature. It is a continuous distribution that is also known as probability density function.
On the other hand, discrete values can be measured, counted, and are discontinuous. Predictions from logistic regression are considered to be discrete in nature. Discrete data is non-continuous, and density concept doesn’t come into the picture here. The distribution is known as probability mass function.
The best way to learn anything is by implementing it, by working on it, by making mistakes and again learning from it. It is important to understand the concepts, either by going through standard books or well-known websites, before implementing them.
Before jumping into data science, the core statistics concepts like such as regression, maximum likelihood, distributions, priors, posteriors, conditional probability, Bayesian theorem and basics of machine learning have to be understood clearly.
Descriptive statistics: As the name suggests, it uses the data to give out more information about every aspect of the data with the help of graphs, plots, or numbers. It organizes the data into a structure, and helps think about the attributes that highlight the important parts of the data.
Note: This concept is generally used with Logistic regression when we are trying to find the output as 0 or 1, yes or no, wherein the maximum likelihood tells about how likely a data point is near to 0 or 1.
Bayesian thinking deals with using probability to model the process of sampling, and being able to quantify the uncertainty associated with the data that would be collected.
This is known as prior probability- which means the level of uncertainty that is associated with the data before it is collected to be analysed.
Posterior probability deals with the uncertainty that occurs after the data has been collected.
Machine learning algorithms are usually focussed on giving the best predictions as output with minimal errors, exact probabilities of specific events occurring and so on. Bayes theorem is a way of calculating the probability of a hypothesis (a situation, which might not have occurred in reality) based on our previous experiences and the knowledge we have gained by it. This is considered as a basic concept that needs to be known.
Bayes theorem can be stated as follows:
P(hypo | data) = (P(data | hypo) * P(hypo)) / P(data)
In the above equation,
P(hypo | data) is the probability of a hypothesis ‘hypo’ when data ‘data’ is given, which is also known as posterior probability.
P(data | hypo) is the probability of data ‘data’ when the specific hypothesis ‘hypo’ is known to be true.
P(hypo) is the probability of a hypothesis ‘hypo’ being true (irrespective of the data in hand), which is also known as prior probability of ‘hypo’.
P(data) is the probability of the data (irrespective of the hypothesis).
The idea here is to get the value of the posterior probability, given other data. The posterior probability for a variety of different hypotheses has to be found out, and the probability that has the highest value is selected. This is known as the maximum probable hypothesis, and is also known as the maximum a posteriori (MAP) hypothesis.
MAP(hypo) = max(P(hypo | data))
If the value of P(hypo | data) is replaced with the value we saw before, the equation would become:
MAP(hypo) = max((P(data | hypo) * P(hypo)) / P(data))
P(data) is considered as a normalizing term that helps in determining the probability. This value can be safely ignored when required, since it is a constant value.
It is an algorithm that can be used with binary or multi-class classification problems. It is a simple algorithm wherein the probability for every hypothesis is simplified.
This is done in order to make the data more traceable. Instead of calculating value of every attribute like P(data1, data2,..,datan|hypo), we assume that every data point is independent of every other data point in the data set when the respective output is given.
This way, the equation becomes:
P(data1 | hypo) * P(data2 |hypo) * … * P(data-n| hypo).
This way, the attributes would be independent of each other. This classifier performs quite well even in the real world with real data when the assumption of data points being independent of each other doesn’t hold good.
Once a Naïve Bayes classifier has learnt from the data, it stores a list of probabilities in a data structure. Probabilities such as ‘class probability’ and ‘condition probability’ are stored. Training such a model is quick since the probability of every class and its associated value needs to be determined, and this doesn’t involve any optimization processes or changing of coefficient to give better predictions.
Not just the concept, once the user understands the way in which a data scientist needs to think, they will be able to focus on getting cleaner data, with better insights that would lead to performing better analysis, which in turn would give great results.
The methods used in statistics are important to train and test the data that is used as input to the machine learning model. Some of these include outlier/anomaly detection, sampling of data, data scaling, variable encoding, dealing with missing values, and so on.
Statistics is also essential to evaluate the model that has been used, i.e. see how well the machine learning model performs on a test dataset, or on data that it has never seen before.
Statistics is essential in selecting the final and appropriate model to deal with that specific data in a predictive modelling situation.
It is also needed to show how well the model has performed, by taking various metrics and showing how the model has fared.
Most of the data can be fit to a common pattern that is known as Gaussian distribution or normal distribution. It is a bell-shaped curve that can be used to summarize the data with the below mentioned two parameters:
Example: 25th percentile suggests that 25 percent of the data set is smaller than the remaining 75 percent of the data set.
Quartile helps understand how the data is distributed around the median (which is the 50th percentile/second quartile).
There are other distributions as well, and it depends on the type of data we have and the insights we need from that data, but Gaussian is considered as one of the basic distributions.
In this post, we understood why and how statistics is important to understand and work with data science. We saw a few terminologies of statistics that are essential in understanding the insights which statistics would end up giving to data scientist. We also saw a few basic algorithms that every data scientist needs to know, in order to learn other advanced algorithms.
Your email address will not be published. Required fields are marked *
Statistics is a science concerned with collection,... Read More