
Install Python on Ubuntu

This article will help you install Python 3 on Ubuntu. You will learn the basics of configuring the environment to get started with Python.

Brief introduction to Python

Python is an interpreted programming language that is very popular due to its gentle learning curve and simple syntax. Python is used in many applications, including programming the backend code of websites. It is also very popular for data analysis across industries ranging from medical and scientific research to retail, finance, entertainment and media.

When writing a Python program, or a program in any other language, people usually use an IDE (Integrated Development Environment), which includes everything you need to write a program: a built-in text editor for writing code and a debugger for debugging it.

Latest version of Python

The latest major version of Python is Python 3, and the latest release at the time of writing is Python 3.9.0.

Installation links

To download Python and its documentation for Ubuntu, visit the official website (python.org) and go to the Downloads section, from where you can get the latest Python version for Ubuntu.


Key terms (pip, virtual environment, path etc.)

pip:
pip is a package manager that simplifies the installation of Python packages. To install pip, run the below command in the terminal:

sudo apt install python3-pip

Once the installation is done, you can install a package by running:

pip install <package-name>
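
For example, assuming you want to confirm pip is working and then install a specific package (the requests library is used here purely as an illustration):

pip3 --version          # confirm pip is installed and show its version
pip3 install requests   # install a specific package, here the requests library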

virtual environment:

The purpose of virtual environments is to have a separate space where you can install packages specific to a certain project. For example, if you have a lot of Flask or Django-based applications and not all of them use the same package versions, a virtual environment lets each project have its own versions.

In order to use a virtual environment, you need to be on Python 3.x. Let's understand how to create a virtual environment. Python 3 already ships with the built-in venv module as part of the standard installation, but the steps below use the virtualenv package.

If you don't have virtualenv installed, use this command to install it:

pip3 install virtualenv

To create a new virtual environment (named env here), run the below command:

virtualenv env

This will create a virtual environment and install a few standard packages into it as part of the creation.
To activate the virtual environment on Ubuntu, use the below command:

source env/bin/activate

To deactivate it, run the below command inside the environment:

deactivate
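
Putting the above together, a typical virtual environment session might look like this (a minimal sketch; the environment name env and the flask package are only examples):

pip3 install virtualenv        # install the virtualenv tool
virtualenv env                 # create a virtual environment named "env"
source env/bin/activate        # activate it
pip install flask              # packages now install only into this environment
deactivate                     # leave the environment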

Getting and installing Python:

To download Python, visit the official website and go to the Downloads section, where you can download the latest Python version for Ubuntu.


Download the tarball and untar it. After untarring, you will see a number of files. The one you will be interested in is the README file, which contains a set of instructions for installing Python on the Ubuntu machine.

Open the terminal, change into the untarred Python directory (cd ~/<python untarred folder>), and run the below commands:

Commands to build and install Python:

  • ./configure
  • make
  • make test
  • sudo make install

This will install Python as python3.
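
For reference, the whole source build can be run in one go, as in the sketch below (the folder name Python-3.9.0 is an assumption based on the release mentioned earlier; adjust it to match your untarred folder):

cd ~/Python-3.9.0      # change into the untarred source folder
./configure            # generate the build configuration
make                   # compile Python
make test              # optionally run the test suite
sudo make install      # install it system-wide as python3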

If you get an error such as "no compiler found" when running ./configure, install the below package to fix it:

sudo apt-get install build-essential

Similarly, if you get a "make: command not found" error, run the below command to install make:

sudo apt install make

Once you have installed build-essential and make, the build and install commands above should run without errors.

The other way of installing Python is by running apt-get commands, as below:

Open the terminal and run:

sudo apt-get update

This makes sure the package repositories on Ubuntu are up to date. Then install Python by running the below command:

sudo apt-get install python3
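
To confirm the installation worked, you can check the interpreter version:

python3 --version      # prints something like "Python 3.x.y"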

Setting the PATH

To see the existing system PATH set on your machine, run the below command:

echo $PATH

Now suppose you want your Python executable to be found on the PATH; you can use the export command and give it the directory path, like below:

export PATH=$PATH:<path to the directory containing the executable>

Running the export command by itself will not persist the change across terminals; if you close that terminal and open it again, the change is lost. To make it persistent, you need to add the above export command to the ~/.bashrc file in the home directory of your Ubuntu system.
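
For instance, the export line can be appended to ~/.bashrc as shown below (the directory /opt/python/bin is only a placeholder for wherever your executable actually lives):

echo 'export PATH=$PATH:/opt/python/bin' >> ~/.bashrc   # placeholder directory; use your own path
source ~/.bashrc                                        # reload the file in the current shell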

How to run Python code

To run Python code, just run the command:

python <pythonfile.py>
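
As a quick check, you could create a trivial script and run it (hello.py is just a hypothetical file name; python3 is used because plain python may not be available on newer Ubuntu releases):

echo 'print("Hello from Python")' > hello.py   # create a one-line script
python3 hello.py                               # prints: Hello from Python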

Installing Additional Python Packages:

If you want to see all the packages installed in the environment, run pip3 list, which lists the packages currently installed. To install another package in the environment, for instance the requests library, run pip3 install requests. Now run pip3 list again and you will see the requests library installed in this environment.
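
Run inside the environment, that sequence of commands would look like this:

pip3 list              # show the packages currently installed
pip3 install requests  # install the requests library
pip3 list              # requests now appears in the list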

Directory as package for distribution:

Inside your Python project directory, you should have a file called __init__.py. You can create this file with a simple touch command. The file does not need to contain any data; it only has to exist inside the directory for that directory to work as a package.
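
For example, a hypothetical package directory named mypackage could be created and imported like this (run from the parent directory):

mkdir mypackage
touch mypackage/__init__.py                      # empty marker file that makes the directory a package
python3 -c "import mypackage; print(mypackage)"  # verify the package can be imported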

Documentation links for Python

https://www.python.org/doc/

Conclusion

This article has provided stepwise instructions for installing Python on Ubuntu OS.
