Essential Steps to Mastering Machine Learning with Python

One of the world's most popular programming languages today, Python is a great tool for Machine Learning (ML) and Artificial Intelligence (AI). It is an open-source, general-purpose, object-oriented, interpreted language that encourages code reuse. Python's key design ideology is code readability, ease of use and high productivity. Interest in Python has grown significantly over the past five years, and it is the top choice for ML/AI enthusiasts compared to other programming languages.

Image source: Google Trends - comparing Python with other tools in the market

What makes Python a perfect recipe for Machine Learning? 

Python can be used to write Machine Learning algorithms from end to end, and its numerical libraries handle the heavy computation accurately and efficiently. Python's concise, readable syntax allows reliable code to be written very quickly. Another reason for its popularity is the availability of versatile, ready-to-use libraries.

It has an excellent library ecosystem and is a great tool for developing prototypes. Unlike R, Python is a general-purpose programming language that can also be used to build web applications and enterprise applications.

The Python community has developed libraries, each addressing a particular area of data science. For instance, there are libraries for handling arrays, performing numerical computation with matrices, statistical computing, machine learning, data visualization and many more. These libraries are highly efficient and make coding much easier, with fewer lines of code.

Let us have a brief look at some of the important Python libraries used for developing machine learning models; a short import check follows the list.

  • NumPy: One of the fundamental packages for numerical and scientific computing. It is a mathematical library for working with n-dimensional arrays in Python. 
  • Pandas: Provides the highly efficient, easy-to-use DataFrame for data manipulation and Exploratory Data Analysis (EDA). 
  • SciPy: A library for scientific and high-performance computing. It contains modules for optimization and for several statistical distributions and tests. 
  • Matplotlib: A comprehensive plotting package that provides both 2D and 3D plotting and can produce static as well as interactive plots. 
  • Seaborn: Built on top of Matplotlib; used to produce more elegant statistical visualizations. 
  • StatsModels: Provides functionality for estimating various statistical models and conducting statistical tests. 
  • Scikit-learn: Built on NumPy, SciPy and Matplotlib. Free to use, powerful, and provides a wide range of supervised and unsupervised machine learning algorithms. 
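All of the libraries above can be installed with pip (for example, pip install numpy pandas scipy matplotlib seaborn statsmodels scikit-learn). Here is a minimal sketch, assuming the packages are already installed, that imports each one and prints its version to confirm the environment is ready:

```python
# Minimal sanity check: import each library and print its version.
# Assumes the packages have been installed, e.g. with
#   pip install numpy pandas scipy matplotlib seaborn statsmodels scikit-learn
import numpy as np
import pandas as pd
import scipy
import matplotlib
import seaborn as sns
import statsmodels
import sklearn

for name, module in [("NumPy", np), ("Pandas", pd), ("SciPy", scipy),
                     ("Matplotlib", matplotlib), ("Seaborn", sns),
                     ("StatsModels", statsmodels), ("scikit-learn", sklearn)]:
    print(f"{name}: {module.__version__}")
```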

One should also take into account the importance of IDEs designed for Python and Machine Learning work.

The Jupyter Notebook - an open-source, web-based application that enables ML enthusiasts to create, share, document, visualize and live-code their projects.

There are various other IDEs and editors that can be used, such as PyCharm, Spyder, Vim and Visual Studio Code. For beginners, there is a simple online compiler available – Programiz.

Roadmap to Master Machine Learning Using Python 

  1. Learn Python: Learn Python from basic to advanced. Practice those features that are important for data analysis, statistical analysis and Machine Learning. Start from declaring variables, conditional statements, control flow statements, functions, collection objects, modules and packages. Deep dive into various libraries that are used for statistical analysis and building machine learning models. 
  2. Descriptive Analytics: Learn the concept of descriptive analytics, understand the data, learn to load structured data and perform Exploratory Data Analysis (EDA). Practice data filtering, ordering, grouping and joining multiple datasets. Handle missing values and prepare visualization plots in 2D or 3D (with libraries like seaborn and matplotlib) to find hidden information and insights. A short pandas and SciPy sketch covering this step and the next follows the list. 
  3. Take a break from Python and learn Stats - Learn the concept of the random variable and its important role in the field of analytics. Learn to draw insights from measures of central tendency and dispersion (mean, median, mode, quartiles, variance) and other statistical measures like confidence intervals and distribution functions. The next step is to understand probability and various probability distributions and their crucial role in analytics. Understand the concept of various hypothesis tests like the t-test, z-test, ANOVA (Analysis of Variance), ANCOVA (Analysis of Covariance), MANOVA (Multivariate Analysis of Variance), MANCOVA (Multivariate Analysis of Covariance) and the chi-square test. 
  4.  Understand Major Machine Learning Algorithms
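Before moving on to the algorithms, here is a minimal sketch of steps 2 and 3 above. It assumes a hypothetical CSV file named loans.csv with columns income, loan_amount and defaulted (0/1); the file name and columns are made up purely for illustration.

```python
# Minimal EDA + hypothesis-test sketch for steps 2 and 3.
# Assumes a hypothetical file "loans.csv" with columns
# "income", "loan_amount" and "defaulted" (0/1).
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from scipy import stats

df = pd.read_csv("loans.csv")

# Descriptive analytics / EDA
print(df.head())
print(df.describe())                                   # mean, quartiles, etc.
print(df.isna().sum())                                 # missing values per column
df["income"] = df["income"].fillna(df["income"].median())
print(df.groupby("defaulted")["loan_amount"].mean())   # grouping

sns.histplot(data=df, x="income", hue="defaulted")     # visualization
plt.show()

# Statistics: two-sample t-test on income of defaulters vs. non-defaulters
defaulters = df.loc[df["defaulted"] == 1, "income"]
non_defaulters = df.loc[df["defaulted"] == 0, "income"]
t_stat, p_value = stats.ttest_ind(defaulters, non_defaulters, equal_var=False)
print(f"t = {t_stat:.3f}, p = {p_value:.3f}")
```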

Image source

Different algorithms suit different tasks. It is advisable to understand the context and select the right algorithm for the right task. 

Types of ML problems, with a description and an example of each: 

  • Classification: pick one of N labels. Example: predict whether a loan will be defaulted or not. 
  • Regression: predict numerical values. Example: predict property price. 
  • Clustering: group similar examples. Example: group the most relevant documents. 
  • Association rule learning: infer likely association patterns in data (unsupervised). Example: if you buy butter you are likely to buy bread. 
  • Structured output: create complex output. Examples: natural language parse trees, bounding boxes in image recognition. 
  • Ranking: identify position on a scale or status. Example: search result ranking. 

Source

A. Regression (Prediction):  Regression algorithms are used for predicting numeric values. For example, predicting property price, vehicle mileage, stock prices and so on.   

Source

Linear Regression – predicting a numeric response variable using one or more features (predictor variables). The linear regression model is mathematically represented as y = β0 + β1x1 + β2x2 + … + βnxn + ε, where y is the response, x1 … xn are the features, β0 … βn are the coefficients and ε is the error term.  

Source

Various regression algorithms include: 

  • Linear Regression 
  • Polynomial Regression  
  • Exponential Regression 
  • Decision Tree 
  • Random Forest 
  • Neural Network 

As a note to new learners, it is suggested to understand the concepts of regression assumptions, the Ordinary Least Squares method, dummy variables (n-1 dummy encoding, one-hot encoding) and performance evaluation metrics (RMSE, MSE, MAD). 
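As a minimal illustration of these ideas, the sketch below fits an ordinary least squares model with scikit-learn on synthetic property data and reports the RMSE; the feature names and coefficient values are made up for illustration only.

```python
# Minimal linear regression sketch on synthetic property data (illustrative values only).
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

rng = np.random.default_rng(0)
area = rng.uniform(500, 3500, size=200)              # sq. ft. (made up)
bedrooms = rng.integers(1, 5, size=200)
price = 50_000 + 120 * area + 15_000 * bedrooms + rng.normal(0, 20_000, 200)

X = np.column_stack([area, bedrooms])
X_train, X_test, y_train, y_test = train_test_split(X, price, test_size=0.3, random_state=0)

model = LinearRegression().fit(X_train, y_train)     # ordinary least squares fit
pred = model.predict(X_test)

rmse = mean_squared_error(y_test, pred) ** 0.5       # RMSE evaluation metric
print("Coefficients:", model.coef_, "Intercept:", model.intercept_)
print("RMSE:", round(rmse, 2))
```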

B. Classification – We use classification algorithms to predict the class of an item or a categorical feature. For example, predicting loan default (yes/no) or predicting cancer (yes/no). 

Various classification algorithms include: 

  • Binomial Logistic Regression 
  • Fractional Binomial Regression 
  • Quasibinomial Logistic regression 
  • Decision Tree 
  • Random Forest 
  • Neural Networks 
  • K-Nearest Neighbor 
  • Support Vector Machines 

Some of the classification algorithms are explained here: 

  • K-Nearest Neighbors (K-NN) – a simple yet widely used classification algorithm. 
  • It is a non-parametric algorithm (it makes no assumption about the underlying data distribution). 
  • It memorizes the training instances rather than learning an explicit model. 
  • The output is a class membership. 
  • There are three key elements in this approach – a set of labelled objects (e.g. a set of stored records), a distance measure between objects, and the value of k, the number of nearest neighbours. 
  • The most common distance measure used by the K-NN algorithm is Euclidean distance (the square root of the sum of the squared differences between the new point and an existing point across all input attributes). 

Other distances include – Hamming distance, Manhattan distance, Minkowski distance  

Source

Example of K-NN classification: the test sample (green dot) should be classified as either a blue square or a red triangle. If k = 3 (solid-line circle), it is assigned to the red triangles because there are 2 triangles and only 1 square inside the inner circle; in other words, the triangles outnumber the squares. If k = 5 (dashed-line circle), it is assigned to the blue squares (3 squares vs. 2 triangles inside the outer circle). Note that to avoid tied votes in binary classification, the value of k should be odd rather than even.  
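A minimal scikit-learn sketch of the same idea, using made-up points arranged so that k = 3 and k = 5 give different answers, mirroring the figure:

```python
# Minimal K-NN sketch: the same test point gets a different label depending on k.
# The points below are made up so that k=3 favours class 1 and k=5 favours class 0.
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

X = np.array([[2.5, 2.8], [2.0, 2.2], [3.0, 2.0],    # class 0 ("blue squares")
              [2.6, 2.4], [2.3, 2.5], [5.0, 5.0]])   # class 1 ("red triangles")
y = np.array([0, 0, 0, 1, 1, 1])
new_point = np.array([[2.5, 2.3]])                   # the "green dot"

for k in (3, 5):
    knn = KNeighborsClassifier(n_neighbors=k, metric="euclidean").fit(X, y)
    print(f"k={k}: predicted class = {knn.predict(new_point)[0]}")
```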

  • Logistic Regression – A supervised algorithm used for binary classification. The basis of logistic regression is the logistic (sigmoid) function, which takes any real value and maps it to a value between 0 and 1. In other words, Logistic Regression returns a probability value for the class label.  

 Source

  • If the output of the sigmoid function is more than 0.5, we can classify the outcome as 1 (YES), and if it is less than 0.5, we can classify it as 0 (NO). 

  • For instance, let us take cancer prediction: if the output of the Logistic Regression model is 0.75, we can say in terms of probability that "there is a 75 percent chance that the patient will suffer from cancer." 
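A minimal sketch of both points: the sigmoid maps any real value into (0, 1), and a fitted LogisticRegression model returns that probability via predict_proba. The one-feature "tumour size" data below is synthetic and purely illustrative.

```python
# Minimal logistic regression sketch: the sigmoid maps any real value to (0, 1),
# and predict_proba returns the class probability (data below is synthetic).
import numpy as np
from sklearn.linear_model import LogisticRegression

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

print(sigmoid(np.array([-3.0, 0.0, 3.0])))    # all values lie between 0 and 1

# Toy binary problem: one hypothetical "tumour size" feature
X = np.array([[1.0], [1.5], [2.0], [2.5], [3.0], [3.5], [4.0], [4.5]])
y = np.array([0, 0, 0, 0, 1, 1, 1, 1])

clf = LogisticRegression().fit(X, y)
proba = clf.predict_proba([[3.2]])[0, 1]      # P(class = 1)
print(f"P(cancer) = {proba:.2f}, prediction = {int(proba > 0.5)}")
```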

Decision Tree – a supervised learning algorithm most commonly used for classification problems. Decision Tree algorithms can also be used for regression problems, i.e. to predict a numerical response variable. In other words, a Decision Tree works for both categorical and continuous input and output variables. 

  • Each branch node of the decision tree represents a choice between some alternatives and each leaf node represents a decision. 

Source

As an early learner, it is suggested to understand the concepts of the ID3 algorithm, Gini Index, Entropy, Information Gain, Standard Deviation and Standard Deviation Reduction. 
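A minimal sketch using scikit-learn's DecisionTreeClassifier on the bundled Iris dataset, used here only for illustration; criterion="entropy" selects splits by information gain, and export_text prints the branch and leaf structure.

```python
# Minimal decision tree sketch on scikit-learn's bundled Iris dataset;
# criterion="entropy" means splits are chosen by information gain.
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier, export_text

iris = load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.3, random_state=0)

tree = DecisionTreeClassifier(criterion="entropy", max_depth=3, random_state=0)
tree.fit(X_train, y_train)

print("Test accuracy:", round(tree.score(X_test, y_test), 3))
print(export_text(tree, feature_names=iris.feature_names))   # branch / leaf structure
```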

  • Random Forest – a collection of multiple decision trees. It is a supervised learning algorithm that can be used for both classification and regression problems. While algorithms like a single Decision Tree can suffer from overfitting, wherein a model performs well on training data but not on testing or unseen data, algorithms like Random Forest help avoid overfitting. 
    • It builds largely uncorrelated decision trees through bootstrapping (i.e. sampling with replacement) and feature randomness.  

Source

As a new learner it is important to understand the concept of bootstrapping.  
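A minimal sketch that first shows one bootstrap sample (sampling indices with replacement) and then fits a RandomForestClassifier on scikit-learn's bundled breast-cancer dataset; the parameter values are illustrative only.

```python
# Minimal random forest sketch: an ensemble of decision trees trained on
# bootstrap samples (sampling with replacement) with feature randomness.
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Bootstrapping illustrated directly: sample indices 0-9 with replacement
rng = np.random.default_rng(0)
print("One bootstrap sample of indices 0-9:", rng.choice(10, size=10, replace=True))

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

forest = RandomForestClassifier(
    n_estimators=200,       # number of trees (illustrative value)
    max_features="sqrt",    # feature randomness at each split
    bootstrap=True,         # each tree sees a bootstrap sample
    random_state=0,
).fit(X_train, y_train)
print("Test accuracy:", round(forest.score(X_test, y_test), 3))
```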

  • Support Vector Machine – a supervised learning algorithm, used for classification problems. Another flavour of Support Vector Machines (SVM) is Support Vector Regressor (SVR) which can be used for regression problems. 
    • In this, we plot each data item as a point in n-dimensional space 
    • n here represents the number of features 

Source

The value of each feature is the value of a particular coordinate.  

Classification is performed by finding the hyperplane that best separates the two classes.  

It is important to understand the concepts of margin, support vectors, hyperplanes and tuning of hyper-parameters (kernel, regularization, gamma, margin). Also get to know the various types of kernels, such as the linear, radial basis function and polynomial kernels. 
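A minimal sketch of an SVM classifier with an RBF kernel on scikit-learn's bundled breast-cancer dataset, showing where the kernel, regularization (C) and gamma hyper-parameters appear; the values below are defaults, not tuned.

```python
# Minimal SVM sketch: an RBF-kernel classifier with the usual
# hyper-parameters (kernel, C for regularization, gamma).
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

svm = make_pipeline(
    StandardScaler(),                        # SVMs are sensitive to feature scale
    SVC(kernel="rbf", C=1.0, gamma="scale")  # try "linear" or "poly" kernels too
).fit(X_train, y_train)

print("Support vectors per class:", svm.named_steps["svc"].n_support_)
print("Test accuracy:", round(svm.score(X_test, y_test), 3))
```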

  • Naive Bayes – a supervised learning classifier which assumes that features are independent of one another (no correlation between them). The idea behind the Naïve Bayes algorithm is Bayes' theorem; a minimal sketch follows below. 

 Source
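A minimal Gaussian Naive Bayes sketch on scikit-learn's bundled Iris dataset; GaussianNB is one of several Naive Bayes variants and is used here only for illustration.

```python
# Minimal Naive Bayes sketch: GaussianNB applies Bayes' theorem under the
# assumption that features are conditionally independent given the class.
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

nb = GaussianNB().fit(X_train, y_train)
print("Test accuracy:", round(nb.score(X_test, y_test), 3))
print("Class probabilities for one sample:", nb.predict_proba(X_test[:1]).round(3))
```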

C. Clustering 

Clustering algorithms are unsupervised algorithms that are used for dividing data points into groups such that the data points in each group are similar to each other and very different from other groups.  

Some of the clustering algorithms include: 

  • K-means – an unsupervised learning algorithm in which the items are grouped into k clusters (a minimal sketch follows below). 
    • The elements of a cluster are similar or homogeneous. 
    • Euclidean distance is used to calculate the distance between two data points. 
    • Each cluster has a centroid, and this centroid represents the cluster. 
    • The objective is to minimize the intra-cluster variation, i.e. the squared error function.

Source
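A minimal K-means sketch on synthetic blobs generated with scikit-learn; choosing 3 clusters matches how the data was generated and would be an assumption in a real problem.

```python
# Minimal K-means sketch on synthetic blobs: points are assigned to the
# nearest centroid and the within-cluster squared error (inertia) is minimized.
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=300, centers=3, random_state=0)

kmeans = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)
print("Centroids:\n", kmeans.cluster_centers_)
print("Within-cluster sum of squares (inertia):", round(kmeans.inertia_, 2))
print("First ten cluster labels:", kmeans.labels_[:10])
```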

Other types of clustering algorithms: 

  • DBSCAN 
  • Mean Shift 
  • Hierarchical 

D. Association 

Association algorithms, which form part of unsupervised learning algorithms, are for associating co-occurring items or events. Association algorithms are rule-based methods for finding out interesting relationships in large sets of data. For example, find out a relationship between products that are being bought together – say, people who buy butter also buy bread. 

Some of the association algorithms are: 

  • Apriori Rules - the most popular algorithm for mining strong associations between variables. To understand how this algorithm works, concepts like Support, Confidence and Lift should be studied; a small worked example follows this list. 
  • ECLAT - Equivalence Class Clustering and bottom-up Lattice Traversal. This is one of the popular algorithms that is used for association problems. This algorithm is an enhanced version of the Apriori algorithm and is more efficient. 
  • FP Growth (Frequent Pattern Growth) - another very efficient and scalable algorithm for mining associations between variables. 
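The article does not prescribe a library for association mining, so the sketch below simply computes Support, Confidence and Lift by hand for the rule {butter} -> {bread} on made-up baskets; on real data one would typically use a dedicated implementation (for example, the third-party mlxtend package).

```python
# Minimal by-hand illustration of support, confidence and lift for the
# rule {butter} -> {bread}; the baskets are made up for illustration.
baskets = [{"bread", "butter"},
           {"bread", "butter", "milk"},
           {"bread", "butter", "jam"},
           {"milk", "jam"},
           {"bread", "milk"}]
n = len(baskets)

support_butter = sum("butter" in b for b in baskets) / n
support_bread = sum("bread" in b for b in baskets) / n
support_both = sum({"bread", "butter"} <= b for b in baskets) / n

confidence = support_both / support_butter    # P(bread | butter)
lift = confidence / support_bread             # > 1 means positive association

print(f"support({{butter, bread}}) = {support_both:.2f}")
print(f"confidence(butter -> bread) = {confidence:.2f}")
print(f"lift(butter -> bread) = {lift:.2f}")
```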

E. Anomaly Detection 

We recommend the use of anomaly detection for discovering abnormal activities and unusual cases, such as fraud detection. 

An algorithm that can be used for anomaly detection: 

  • Isolation Forest - an unsupervised algorithm that can isolate anomalies from huge volumes of data, thereby enabling anomaly detection. 
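A minimal Isolation Forest sketch on synthetic data with a few injected outliers; the contamination value is an assumption about the expected share of anomalies.

```python
# Minimal Isolation Forest sketch: most points are normal, a few are
# injected outliers (synthetic data, purely for illustration).
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(0)
normal = rng.normal(loc=0.0, scale=1.0, size=(200, 2))
outliers = rng.uniform(low=6.0, high=9.0, size=(5, 2))
X = np.vstack([normal, outliers])

iso = IsolationForest(contamination=0.03, random_state=0).fit(X)
labels = iso.predict(X)                  # +1 = normal, -1 = anomaly
print("Flagged anomalies:", int((labels == -1).sum()))
```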

F. Sequence Pattern Mining 

We use sequential pattern mining to predict the next event in a sequence of data examples. 

  • Example: predicting the next dose of medicine for a patient. 

G. Dimensionality Reduction

Dimensionality reduction is used to reduce the dimension of the original data. The idea is to reduce the set of features by obtaining a set of principal components or features. The key thing to understand is that the components retain or represent some meaningful properties of the original data. Dimensionality reduction can be divided into feature extraction and feature selection. 

Algorithms that can be used for dimensionality reduction are: 

Source

Principal Component Analysis (PCA) - a dimensionality reduction algorithm used to reduce the number of dimensions or variables in large datasets with a very high number of variables. It should be noted that although PCA transforms a very large set of features or variables into a smaller set, it retains most of the information in the dataset. While the reduction of dimensions may come at a small cost in model accuracy, the idea is to bring simplicity into the model by reducing the number of variables or dimensions.  
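A minimal PCA sketch on scikit-learn's bundled breast-cancer dataset: its 30 features are projected onto 2 principal components and the retained variance is reported.

```python
# Minimal PCA sketch: project 30 features down to 2 principal components
# and report how much of the original variance they retain.
from sklearn.datasets import load_breast_cancer
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

X, _ = load_breast_cancer(return_X_y=True)
X_scaled = StandardScaler().fit_transform(X)    # PCA is sensitive to feature scale

pca = PCA(n_components=2)
X_reduced = pca.fit_transform(X_scaled)

print("Original shape:", X.shape, "-> reduced shape:", X_reduced.shape)
print("Variance retained:", round(pca.explained_variance_ratio_.sum(), 3))
```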

H. Recommendation Systems 

Recommender Systems are used to build recommendation engines. Recommender algorithms are used in many business areas: online stores such as Amazon use them to recommend the right product to buyers, video and music services such as Netflix and Amazon Prime Music use them for content recommendation, and social media platforms such as Facebook and Twitter use them as well.   

Source

Recommender Engines can be broadly categorized into the following types: 

  • Content-based methods — recommend items to a user based on their profile history; they revolve around the customer's taste and preferences. 
  • Collaborative filtering methods — can be further subdivided into two categories: 
    • Model-based — user and item representations are learned from the interaction matrix. 
    • Memory-based — unlike model-based methods, these rely directly on similarities between users or between items (a minimal sketch follows the examples below). 
  • Hybrid methods — combine content-based and collaborative filtering approaches. 

Examples: 

  1. Movie recommendation system 
  2. Food recommendation system 
  3. E-commerce recommendation system 
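A minimal memory-based collaborative filtering sketch: a tiny, made-up user-item rating matrix, user-user cosine similarity, and a recommendation taken from the most similar user. Real systems work on much larger, sparse matrices; the users, movies and ratings below are entirely hypothetical.

```python
# Minimal memory-based collaborative filtering sketch: recommend an item to
# a user based on the most similar user (the ratings matrix is made up).
import pandas as pd
from sklearn.metrics.pairwise import cosine_similarity

ratings = pd.DataFrame(
    [[5, 4, 0, 1],
     [4, 5, 0, 0],
     [1, 0, 5, 4],
     [0, 1, 4, 5]],
    index=["user_a", "user_b", "user_c", "user_d"],
    columns=["movie_1", "movie_2", "movie_3", "movie_4"],
)

sim = cosine_similarity(ratings)                      # user-user similarity
sim_df = pd.DataFrame(sim, index=ratings.index, columns=ratings.index)

target = "user_b"
neighbour = sim_df[target].drop(target).idxmax()      # most similar user
mask = (ratings.loc[target] == 0) & (ratings.loc[neighbour] > 0)
unseen = list(ratings.columns[mask.to_numpy()])       # items the neighbour liked
print(f"Most similar to {target}: {neighbour}; recommend: {unseen}")
```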

5. Choose the algorithm — several machine learning models can be used in a given context. The model is chosen depending on the type of data (images, numerical values, text, sound) and its distribution. 

6. Train the model — training the model is the process in which the machine learns from historical data and produces a mathematical model that can be used for prediction. Different algorithms use different computational methods to compute the weights for each of the variables. Some algorithms, such as Neural Networks, initialize the weights of the variables at random. These weights are adjusted during training to reduce the gap between the actual and the predicted values.  

7. Evaluation metrics to evaluate the model — the evaluation process comprises understanding the model output and evaluating the model's accuracy. There are various metrics to evaluate model performance. Regression problems use metrics like MSE, RMSE, MAD and MAPE as key evaluation metrics, while classification problems use metrics like the Confusion Matrix, Accuracy, Sensitivity (True Positive Rate), Specificity (True Negative Rate), AUC (Area under the ROC Curve), Kappa value and so on. 

It is only after evaluation that the model can be improved or fine-tuned to get more accurate predictions. It is important to know a few more concepts:  

  • True Positive  
  • True Negative  
  • False Positive  
  • False Negative  
  • Confusion Matrix  
  • Recall (R) 
  • F1 Score 
  • ROC 
  • AUC 
  • Log loss 

When we talk about regression the most commonly used regression metrics are: 

  • Mean Absolute Error (MAE) 
  • Mean Squared Error (MSE) 
  • Root Mean Squared Error (RMSE) 
  • Root Mean Squared Logarithmic Error (RMSLE) 
  • Mean Percentage Error (MPE) 
  • Mean Absolute Percentage Error (MAPE) 

We must know when to use which metric. It depends on the kind of data and the target variable you have. 
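A minimal sketch of several of the metrics above, computed with scikit-learn on small synthetic label and prediction arrays (the numbers are made up purely for illustration):

```python
# Minimal sketch of common evaluation metrics for a classifier and a regressor
# (synthetic predictions, purely for illustration).
import numpy as np
from sklearn.metrics import (accuracy_score, confusion_matrix, f1_score,
                             mean_absolute_error, mean_squared_error,
                             recall_score, roc_auc_score)

# Classification: true labels vs. predicted labels / scores
y_true = np.array([1, 0, 1, 1, 0, 1, 0, 0])
y_pred = np.array([1, 0, 1, 0, 0, 1, 1, 0])
y_score = np.array([0.9, 0.2, 0.8, 0.4, 0.3, 0.7, 0.6, 0.1])

print("Confusion matrix:\n", confusion_matrix(y_true, y_pred))
print("Accuracy:", accuracy_score(y_true, y_pred))
print("Recall (sensitivity):", recall_score(y_true, y_pred))
print("F1 score:", round(f1_score(y_true, y_pred), 3))
print("ROC AUC:", round(roc_auc_score(y_true, y_score), 3))

# Regression: true values vs. predicted values
r_true = np.array([100.0, 150.0, 200.0, 250.0])
r_pred = np.array([110.0, 140.0, 195.0, 270.0])
mse = mean_squared_error(r_true, r_pred)
print("MAE:", mean_absolute_error(r_true, r_pred))
print("MSE:", mse, "RMSE:", round(mse ** 0.5, 2))
print("MAPE:", round(float(np.mean(np.abs((r_true - r_pred) / r_true))) * 100, 2), "%")
```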

8. Tweaking the model, or hyperparameter tuning - with great models comes the great problem of optimizing hyperparameters to build an improved and accurate ML model. Tuning certain parameters (called hyperparameters) is important to ensure improved performance. The hyperparameters vary from algorithm to algorithm, and it is important to learn the relevant hyperparameters for each algorithm.  

9. Making predictions - the final step. With all the aforementioned steps followed, one can tackle real-life problems with advanced Machine Learning models.  

Steps to remember while building the ML model: 

  • Data assembling or data collection - gathering the data and representing it in the form of a dataset.  
  • Data preparation - understanding the problem statement. This includes data wrangling for building or training models, data cleaning, removing duplicates, checking for missing values, data visualization for understanding the relationships between variables, checking for imbalanced (biased) data, and other exploratory data analysis. It also includes splitting the data into train and test sets. 
  • Choosing the model  -  the ML model which answers the problem statement. Different algorithms serve different purposes. 
  • Training the model - the idea is to train the model so that its predictions are accurate as often as possible. 
  • Model evaluation — evaluation metric to measure the performance of the model. How does the model perform against the previously unseen data? The train/test splitting ratio — (70:30) or (80:20), depending on the dataset. There is no exact rule to split the data by (80:20) or (70:30); it depends on the data and the target variable. Some of the data scientists use a range of 60% to 80% for training and the rest for testing the model. 
  • Parameter tuning - ensuring improved performance by controlling the model's learning process. The hyperparameters have to be tuned so that the model can optimally solve the machine learning problem. For parameter tuning, we either specify a grid of parameters (known as grid search) or randomly sample combinations of parameters (known as random search); a minimal tuning sketch appears after this list.
    • GridSearchCV — the process of searching for the best combination of parameters over the grid. For instance, n_estimators could be 100, 250, 350 or 500; max_depth could be 2, 5, 11 or 15; and the criterion could be gini or entropy. Though this doesn't look like a lot of parameters, imagine the scenario if the dataset is very large: the grid search has to loop over every combination and calculate the score on the validation set. 
    • RandomizedSearchCV (random search) — we randomly sample combinations of parameters and then calculate the cross-validation score. It usually computes faster than grid search. 

Note: Cross-validation is one of the most essential steps when it comes to building ML models. If the cross-validation score is good, we can say that the validation data is representative of the training and real-world data. 

  • Finally, making predictions — using the test data, of how the model will perform in real-world cases. 
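As a minimal end-to-end tuning sketch, the code below splits scikit-learn's bundled breast-cancer data, runs GridSearchCV over the example grid mentioned above (n_estimators, max_depth, criterion) and contrasts it with RandomizedSearchCV; the grid values are illustrative and the search can take a little while to run.

```python
# Minimal hyperparameter-tuning sketch for a random forest, using the grid
# values mentioned above (n_estimators, max_depth, criterion); illustrative only.
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV, train_test_split

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

param_grid = {"n_estimators": [100, 250, 350, 500],
              "max_depth": [2, 5, 11, 15],
              "criterion": ["gini", "entropy"]}

# Exhaustive search over every combination (4 x 4 x 2 = 32 candidates per CV fold)
grid = GridSearchCV(RandomForestClassifier(random_state=0), param_grid, cv=3)
grid.fit(X_train, y_train)
print("Grid search best params:", grid.best_params_)

# Randomized search: sample only a few combinations, usually much faster
rand = RandomizedSearchCV(RandomForestClassifier(random_state=0), param_grid,
                          n_iter=8, cv=3, random_state=0)
rand.fit(X_train, y_train)
print("Random search best params:", rand.best_params_)

# Final check on the held-out test set
print("Test accuracy:", round(grid.best_estimator_.score(X_test, y_test), 3))
```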

Conclusion

Python has an extensive catalogue of modules and frameworks. Development in Python is fast and less complex, which saves development time and cost. Its readability makes programs easy to follow, particularly for novice users. These features make Python an ideal recipe for Machine Learning.  

Both Machine Learning and Deep Learning involve complex algorithms and multiple workflows. When using Python, the developer can worry less about the coding and focus more on finding the solution. Python is open-source and has an abundance of available resources and step-by-step documentation. It also has an active community of developers who are open to knowledge sharing and networking. These benefits and the ease of coding make Python the go-to choice for developers. We saw how Python has an edge over other programming tools, and why knowledge of Python is essential for ML right now.  

Summing up, we saw the benefits of Python, the way ahead for beginners, and finally the steps required in a machine learning project. This article can be considered a roadmap to your mastery of Machine Learning. 

KnowledgeHut

KnowledgeHut

Author

KnowledgeHut is an outcome-focused global ed-tech company. We help organizations and professionals unlock excellence through skills development. We offer training solutions under the people and process, data science, full-stack development, cybersecurity, future technologies and digital transformation verticals.
Website : https://www.knowledgehut.com

Join the Discussion

Your email address will not be published. Required fields are marked *

Suggested Blogs

How to Install Python on Mac

This article will help you in the installation of Python 3  on macOS. You will learn the basics of configuring the environment to get started with Python.Brief introduction to PythonPython is an Interpreted programming language that is very popular these days due to its easy learning curve and simple syntax. Python finds use in many applications and for programming the backend code of websites. It is also very popular for data analysis across industries ranging from medical/scientific research purposes to retail, finances, entertainment, media and so on.When writing a python program or program in any other language, people usually use something called an IDE or Integrated Development Environment that includes everything you need to write a program. It has an inbuilt text editor to write the program and a debugger to debug the programs as well. PyCharm is a well-known IDE for writing python programs.Latest version of pythonThe latest version of python is python3 and the latest release is python3.9.0.Installation linksFor downloading python and the documentation for MacOS, visit the official website https://www.python.org and go to the downloads section, from where you can download the latest python version for MacOS.Key terms (pip, virtual environment, path etc.)pip:pip is a package manager to simplify the installation of python packages. To install pip, run the below command on the terminal:curl https://bootstrap.pypa.io/get-pip.py -o get-pip.py.If you install python by using brew which is a package management to simplify installation of software on macOs, it installs other dependent  packages as well along with python3  like pip etc.virtual environment:The purpose of virtual environments is to have a separate space where you can install packages which are specific to a certain project. For example if you have a lot of flask or Django-based applications and not all the applications are using the same version, we use virtual env wherein each project will have its own version.In order to use a virtual environment you need to be on the python 3.x version. Let’s understand how to create the virtual environment. You do not need any library as it comes along with standard python installation.So to create a new virtual env, run the below command:python3 -m venv demo-m expects a module name which is venv in this case, so with this python searches your sys.path and executes the module as the main module.Venv expects an environment name that you must create.Now you should have a new environment called demo. Let’s activate this virtual env by running the below command:source demo/bin/activateAfter running this, the environment is activated and you can see the environment name in the terminal. Another way to check if the env is activated is by running which python. You will see the python that you are using with this project env, and the version that it will use is the same that you used to create the environment.Getting and Installing MacPython:For MacOS, python usually comes pre-installed, so to check if python is installed open the terminal in the mac and use `python --version` to confirm the same. You can also see what is the default python version installed, which is usually python2.x by default. However, Python2.x is going to get deprecated soon, and with everyone moving to python3.x ,we will go with the latest python3 installation.Installation stepsFor downloading python, visit the official website https://www.python.org and go to the downloads section. 
You can download the latest python version for MacOS as shown below:It will download a pkg file. Click on that file to start the installation wizard. You can continue with default settings. If you want to change the install location, however,  you can change it, and then continue and finish the installation with the rest of the default settings.Once the installation is finished it would have created the python 3.x directory in the application folder. Just open the application folder and verify the same.Now you have python3.x installed.To verify it from the terminal, go to the terminal and check the version of python by using `python --version` command. So you will still see it is showing the old default python version, Now instead if you use python3 explicitly like `python3 –version, you can see the version that you have installed with python3 version.Once the installation is finished it would have created a python3.x directory in the application folder. Open the application folder and verify the same.You can also install python3 on mac by using brew which is a package management to simplify installation of software on macOs.brew install python3brew will install other dependent  packages as well along with python3  like pip etcSetting pathSuppose you have installed a new python 3  version but when you type python it still shows the default python2 version which comes by default in mac os. To solve this, add an alias by runningalias python=python3Add this line in the file called .bash_profile present in home directory. In case this file is not present, you can create it, save the changes and restart the terminal by closing it. Next, open the terminal and run python and hit enter. You should see the latest python3 that you have installed.Sometimes when you type python or python3 explicitly, it does not work even if you have installed the python. You get the message, “command is not found”. This means the command is not present in the directories used by the machine for lookup. Let’s  check the directories where the machine is looking for commands by runningecho $PATHIt will list all your directories where the machine looks for commands. This will vary from machine to machine. If the command that you are trying is not under the directory path listed by echo, that command will not work. It will throw an error saying command is not present, until you provide the full path of the directory where it's installed.Now let’s open the file  .bash_profile and add the directory path where python is installed to the current path env variableFor example  let’s add the following lines in that bash_profile file which will add the below directory to the current env variable. This can vary from machine to machine based on the installed location.PATH=”/Library/Frameworks/Python.framework/Versions/3.7/bin:${PATH}”export PATHSave the changes and restart the terminal. Open the terminal now and run echo $PATH again and see the above path that you added for python3. When you now type python3 command, you should see it working.  Also, if you are trying to import a package that you have installed and it says that it cannot find that package, this means pip install is installing the packages in the different version of python directory. Make sure the location of the package is in the site-packages directory of the version of the python that you are using. 
You can see the location of the package that you are trying to import by running  pip show The above command will have a location field in which you can see and cross verify the path.9. How to run python codeTo run python code just run the commandpython Installing Additional Python Packages:If you want to see what all packages are installed in the env, run the command pip3 list which will list down the current packages installed. Let’s say you want to install request library. You can just install it by running pip3 install requests. Now try running pip3 list again, to see this requests lib installed in this env.Directory as package for distribution:Inside the python project or directory you should have a file called __init__.py. You can create this file by a simple touch command, and this file does not need to have any data inside it, All it has to do is to exist inside the directory, for that to work as a package.Documentation links for pythonhttps://www.python.org/doc/ConclusionThis article will help you with stepwise instructions on the installation of python on mac.
4446
How to Install Python on Mac

This article will help you in the installation of ... Read More

What Is Data Science(with Examples), It's Lifecycle and Who exactly is a Data Scientist

Oh yes, Science is everywhere. A while ago, when children embarked on the journey of learning everyday science in school, the statement that always had a mention was “Science is everywhere”. The situation is more or less the same even in present times. Science has now added a few feathers to its cap. Yes, the general masses sing the mantra “Data Science” is everywhere. What does it mean when I say Data Science is everywhere? Let us take a look at the Science of Data. What are those aspects that make this Science unique from everyday Science?The Big Data Age as you may call it has in it Data as the object of study.Data Science for a person who has set up a firm could be a money spinnerData Science for an architect working at an IT consulting company could be a bread earnerData Science could be the knack behind the answers that come out from the juggler’s hatData Science could be a machine imported from the future, which deals with the Math and Statistics involved in your lifeData science is a platter full of data inference, algorithm development, and technology. This helps the users find recipes to solve analytically complex problems.With data as the core, we have raw information that streams in and is stored in enterprise data warehouses acting as the condiments to your complex problems. To extract the best from the data generated, Data Science calls upon Data Mining. At the end of the tunnel, Data Science is about unleashing different ways to use data and generate value for various organizations.Let us dig deeper into the tunnel and see how various domains make use of Data Science.Example 1Think of a day without Data Science, Google would not have generated results the way it does today.Example 2Suppose you manage an eatery that churns out the best for different taste buds. To model a product in the pipeline, you are keen on knowing what the requirements of your customers are. Now, you know they like more cheese on the pizza than jalapeno toppings. That is the existing data that you have along with their browsing history, purchase history, age and income. Now, add more variety to this existing data. With the vast amount of data that is generated, your strategies to bank upon the customers’ requirements can be more effective. One customer will recommend your product to another outside the circle; this will further bring more business to the organization.Consider this image to understand how an analysis of the customers’ requirements helps:Example 3Data Science plays its role in predictive analytics too.I have an organization that is into building devices that will send a trigger if a natural calamity is soon to occur. Data from ships, aircraft, and satellites can be accumulated and analyzed to build models that will not only help with weather forecasting but also predict the occurrence of natural calamities. The model device that I build will send triggers and save lives too.Consider the image shown below to understand how predictive analytics works:Example 4A lot many of us who are active on social media would have come across this situation while posting images that show you indulging in all fun and frolic with your friends. 
You might miss tagging your friends in the images you post but the tag suggestion feature available on most platforms will remind you of the tagging that is pending.The automatic tag suggestion feature uses the face recognition algorithm.Lifecycle of Data ScienceCapsulizing the main phases of the Data Science Lifecycle will help us understand how the Data Science process works. The various phases in the Data Science Lifecycle are:DiscoveryData PreparationModel PlanningModel BuildingOperationalizingCommunicating ResultsPhase 1Discovery marks the first phase of the lifecycle. When you set sail with your new endeavor,it is important to catch hold of the various requirements and priorities. The ideation involved in this phase needs to have all the specifications along with an outline of the required budget. You need to have an inquisitive mind to make the assessments – in terms of resources, if you have the required manpower, technology, infrastructure and above all time to support your project. In this phase, you need to have a business problem laid out and build an initial hypotheses (IH) to test your plan. Phase 2Data preparation is done in this phase. An analytical sandbox is used in this to perform analytics for the entire duration of the project. While you explore, preprocess and condition data, modeling follows suit. To get the data into the sandbox, you will perform ETLT (extract, transform, load and transform).We make use of R for data cleaning, transformation, and visualization and further spot the outliers and establish a relationship between the variables. Once the data is prepared after cleaning, you can play your cards with exploratory analytics.Phase 3In this phase of Model planning, you determine the methods and techniques to pick on the relationships between variables. These relationships set the base for the algorithms that will be implemented in the next phase.  Exploratory Data Analytics (EDA) is applied in this phase using various statistical formulas and visualization tools.Subsequently, we will look into the various models that are required to work out with the Data Science process.RR is the most commonly used tool. The tool comes with a complete set of modeling capabilities. This proves a good environment for building interpretive models.SQL Analysis Services SQL Analysis services has the ability to perform in-database analytics using basic predictive models and common data mining functions.SAS/ACCESS  SAS/ACCESS helps you access data from Hadoop. This can be used for creating repeatable and reusable model flow diagrams.You have now got an overview of the nature of your data and have zeroed in on the algorithms to be used. In the next stage, the algorithm is applied to further build up a model.Phase 4This is the Model building phase as you may call it. Here, you will develop datasets for training and testing purposes. You need to understand whether your existing tools will suffice for running the models that you build or if a more robust environment (like fast and parallel processing) is required. The various tools for model building are SAS Enterprise Miner, WEKA, SPCS Modeler, Matlab, Alpine Miner and Statistica.Phase 5In the Operationalize phase, you deliver final reports, briefings, code and technical documents. Moreover, a pilot project may also be implemented in a real-time production environment on a small scale. 
This helps users get a clear picture of the performance and other related constraints before full deployment.

Phase 6
The communicate results phase is the conclusion. Here you evaluate whether you have met the goal you planned in the initial phase. It is in this phase that the key findings emerge and are communicated to the stakeholders, and the project is judged a success or a failure.

Why Do We Need Data Science?
Data Science, to be precise, is an amalgamation of infrastructure, software, statistics and various data sources. To really understand big data, it helps to look at the historical background. Gartner's definition, circa 2001, is still the go-to definition: big data is data that contains greater variety, arriving in increasing volumes and with ever-higher velocity. These are known as the three Vs.

In simple terms, big data is humongous: larger, more complex data sets multiplied by the addition of new data sources. When data sets reach such high volumes, traditional data processing software cannot manage them, just as you cannot expect a humble typewriter to do the job of a computer (it cannot even do a Ctrl+C / Ctrl+V for you). The data that holds the solutions to your business problems is massive, and Data Science plays the key role in processing it.

The concept of big data may sound relatively new, but the origins of large data sets can be traced back to the 1960s and '70s, when the world of data was just getting started with the first data centers and the development of the relational database.

Around 2005, Facebook, YouTube and other online services started gaining immense popularity. The more people used these platforms, the more data they generated, and processing it involved a great deal of Data Science. The amassed data had to be stored and analysed later, and Hadoop, an open-source framework for storing and analysing big data sets, was developed as an answer. NoSQL databases also gained popularity during this time.

With the advent of big data, the need for storage also grew, and data storage was a major challenge for enterprises until around 2010. Hadoop, Spark and other frameworks have since mitigated the challenge to a large extent; although the volume of big data is skyrocketing, the focus has shifted to processing the data, thanks to these efficient frameworks, and Data Science once again takes the limelight.

Is it only users who generate huge amounts of data? No. It is not only humans generating data but also the devices they use. Delving into the Internet of Things (IoT) gives us clarity here: as more objects and devices are connected to the Internet, data is gathered not just from their use but also from usage patterns and product performance.

The Three Vs of Big Data
Data Science helps in the extraction of knowledge from the accumulated data.
While big data has come far with the accumulation of users' data, its usefulness is only just beginning. The three properties that define big data are:
Volume
Velocity
Variety

Volume
The amount of data is the crucial factor here. Big data requires processing high volumes of low-density, unstructured data, which may be of unknown value, such as clickstreams on a web page or mobile app, or Twitter data feeds. The volume differs from organization to organization: for some it might be tens of terabytes of data, for others hundreds of petabytes. Consider the major social media platforms: Facebook records around 2 billion users, YouTube about 1 billion, Twitter roughly 350 million and Instagram around 700 million. Billions of images, posts and tweets are exchanged on these platforms; imagine the sheer storage the users contribute to. Mind-boggling, is it not? This insanely large amount of data is generated every minute of every hour.

Velocity
Velocity is the fast rate at which data is received and acted upon. Normally data is written to disk, but the highest-velocity data streams directly into memory. With advances in technology, we now have far more Internet-connected devices across industries, and the data they generate in real time or near real time may call for real-time evaluation and action. Sticking to our social media example, on a daily basis Facebook sees roughly 900 million photo uploads, Twitter handles around 500 million tweets, Google serves about 3.5 billion searches and YouTube receives around 400,000 hours of video uploads. The bundled amount of data is staggering.

Variety
The data generated by users comes in many different types. Traditionally, data types were structured and organized in a relational database. Today, texts, tweets, videos and uploaded photos add semi-structured varieties on the Internet, while voicemails, emails, ECG readings, audio recordings and much more form the unstructured varieties.

Who is a Data Scientist?
A curious brain and solid training are all that you need to become a Data Scientist; not as easy as it may sound. Deep thinking and intense intellectual curiosity are common traits among data scientists. The more questions you ask, the more discoveries you make and the richer your learning experience, the easier it becomes to tread the path of Data Science. A factor that differentiates a data scientist from the average professional is an obsession with creativity and ingenuity: the motivator for a data scientist is the ability to solve analytical problems with a pinch of curiosity and creativity. Data scientists are always on a treasure hunt, hunting for the best from the trove. If you think you need a degree in the sciences or a PhD in math to become a legitimate data scientist, you are carrying a misconception. A natural propensity for these areas will certainly add to your profile, but you can be an expert data scientist without a degree in them.
Data Science becomes far easier with solid knowledge of programming and sound business acumen. It is a discipline gaining colossal prominence of late, and educational institutions are yet to come up with comprehensive Data Science degree programs. A data scientist can never claim to have completed all the required schooling: learning the right skills, guided by self-determination, is a never-ending process.

Because Data Science is multidisciplinary, many people find it hard to differentiate between a Data Scientist and a Data Analyst. Data analytics is one component of Data Science; analytics helps in understanding an organization's data, and the output is used to solve problems and bring in business insights.

The Basic Differences between a Data Scientist and a Data Analyst
Scientists and analysts are not exactly synonymous, nor are the roles mutually exclusive, but they do differ considerably. Some of the basic differences:

Goal
Data Scientist: an inquisitive nature and strong business acumen help them arrive at solutions.
Data Analyst: they focus on data analysis and sourcing.

Tasks
Data Scientist: adept at data insight mining, preparation and analysis to extract information.
Data Analyst: gathers, arranges, processes and models both structured and unstructured data.

Substantive expertise
Data Scientist: required. Data Analyst: not required.

Non-technical skills
Data Scientist: required. Data Analyst: not required.

What Skills Are Required to Become a Data Scientist?
The fundamental skills required to become a Data Scientist are:
Proficiency in mathematics
Technology know-how and the knack to hack
Business acumen

Proficiency in Mathematics
A Data Scientist needs a quantitative lens: the ability to view data quantitatively. Before a data product is finally built, it calls for a tremendous amount of data insight mining. Data has textures, dimensions and correlations, and a mathematical perspective helps in finding solutions and arriving at an end product. With a knack for math, finding solutions from data becomes far easier, drawing on heuristics and quantitative techniques. The path to solving major business problems is a tedious one; it involves building analytical models, and Data Scientists need to identify the underlying nuts and bolts to build them successfully.

Data Science carries the misconception that it is all about statistics. Statistics is crucial, but it is only part of the mathematics that matters. Statistics has two offshoots, the classical and the Bayesian; when people talk about stats they usually mean classical stats, but Data Scientists need to draw on both to arrive at solutions. Moreover, the mix of inferential techniques and machine learning algorithms leans on knowledge of linear algebra: many popular methods in Data Science rely on matrix math that has rather little to do with classical stats.

Technology Know-how and the Knack to Hack
On a lighter note, a disclaimer: you are not being asked to learn hacking so you can crash computers. As a "hacker" here, you need to combine creativity and ingenuity.
You are expected to use the right technical skills to build models and find solutions to complex analytical problems. Why does the world of Data Science value your hacking ability? The answer lies in how Data Scientists use technology: mindset, training and the right technology, put together, can squeeze the best out of mammoth data sets. Solving complex analytical problems requires more sophisticated tools than Excel alone. Data scientists need the ability to code, to prototype quick solutions and to integrate with complex data systems. SQL, Python, R and SAS are the core languages associated with Data Science, and knowledge of Java, Scala, Julia and other languages also helps. However, knowing language fundamentals alone does not suffice in the quest to extract the best from enormous data sets; a hacker needs creativity to sail through technical waters and bring the code safely to shore.

Business Acumen
Strong business acumen is a must-have in the portfolio of any Data Scientist. You need to make tactical moves and extract from the data what no one else can. Translating your observations into shared knowledge carries a lot of responsibility and leaves no room for fallacy. With the right business acumen, a Data Scientist finds it easy to present a story: the narration of a problem or of a solution. To put your ideas and solutions across the table, you need business acumen alongside your prowess for tech and algorithms. Data, math and tech will not always be enough on their own; you also need strong business influence, which in turn is built on strong business acumen.

Companies Using Data Science
To address the issues associated with managing complex and expanding work environments, IT organizations use data to identify new sources of value, exploit future opportunities and expand their operations. What makes the difference is the knowledge extracted from the repository of data: the biggest and best companies use analytics to arrive at the best business models. The following are a few top companies that use Data Science to expand their services and increase their productivity:
Google
Amazon
Procter & Gamble
Netflix

Google
Google has always topped the list when hiring top-notch data scientists, and the company is by and large driven by data science, artificial intelligence and machine learning. When you are there, you get the best when you give the best of your data expertise.

Amazon
Amazon, the global e-commerce and cloud computing giant, hires data scientists on a big scale. It uses Data Science to understand customer mindsets and to enhance the geographical outreach of both its cloud and e-commerce businesses, among other business-driven goals.

Procter & Gamble and Netflix
Big data is a major component of Data Science and holds answers to a range of business problems, from customer experience to analytics. Netflix and Procter & Gamble use big data to anticipate customer demand as part of product development. They use predictive analytics, an offshoot of Data Science, to build models for services in their pipeline, and this modelling contributes to their commercial success.
A significant contribution to P&G's commercial success is its use of data and analytics from test markets, social media and early store rollouts to plan, produce and launch final products, a strategy that often earns an overwhelming response.

The Final Component of the Big Data Story
When speed multiplied with storage capability, the final component of the big data story evolved: the generation and collection of the data itself. If we still had massive, room-sized calculators working as computers, we would not see the humongous amount of data we see today. With the advancement of technology came ubiquitous devices, and with more devices, more data is generated. We generate data at our own pace, from our own space, through the devices we use from our comfort zones: here I tweet, there you post, while someone uploads a video from the corner of the room you are seated in. The more you tell people about what you are doing in your life, the more data you write. When I share a quote on Facebook expressing my feelings, I am contributing to more data. This is how enormous amounts of data are generated. The Internet-connected devices we use keep writing data: anything you engage with in the digital world, the websites you browse, the apps you open on your phone, can all be logged in a database miles away from you.

Writing data and storing it is no longer an arduous task. At times companies simply push the value of the data to the back burner; at some point, that data is fetched and put to use when the need arises. There are different ways to cash in on billions of data points, and Data Science puts the data into categories to get a clear picture.

On a Final Note
If you are an organization looking to expand its horizons, being data-driven will take you miles. Applying an amalgam of infrastructure, software, statistics and various data sources is the formula for arriving at key business solutions. The future belongs to Data Science. Today, data is all around us, and this new age sounds the bugle for more opportunities in the field; very soon, the world is expected to need around one million Data Scientists. If you are keen on donning the hat of a Data Scientist, be your own architect when it comes to solving analytical problems: you need to be a highly motivated problem solver to overcome the toughest analytical challenges. Master Data Science with our in-depth online courses. Explore them now!
Bagging and Random Forest in Machine Learning

In today's world, innovations happen on a daily basis, rendering all the previous versions of a product, service or skill set outdated and obsolete. In such a dynamic and chaotic space, how can we make an informed decision without getting carried away by plain hype? To make the right decisions, we follow a set of processes: investigate the current scenario, chart down our expectations, collect reviews from others, explore the options, select the best solution after weighing the pros and cons, make a decision and take the requisite action. For example, if you are looking to purchase a computer, it is highly unlikely that you would simply walk up to a store and pick any laptop or notebook. You would probably search on Amazon, browse a few web portals where people have posted reviews, and compare different models for features, specifications and prices. You would probably also ask your friends and colleagues for their opinion. In short, you would not jump straight to a conclusion, but would make a decision after considering the opinions and reviews of other people as well.

Ensemble models in machine learning operate in a similar manner: they combine the decisions from multiple models to improve the overall performance. The objective of this article is to introduce the concept of ensemble learning and to understand algorithms like bagging and random forest, which use this technique.

What is Ensemble Learning?
Ensemble methods aim to improve the predictive performance of a given statistical learning or model fitting technique. The general principle of ensemble methods is to construct a linear combination of several fits of some model fitting method, instead of using a single fit. An ensemble is itself a supervised learning algorithm, because it can be trained and then used to make predictions. Ensemble methods can, for example, combine several decision tree classifiers to produce better predictive performance than a single decision tree classifier. The main principle behind an ensemble model is that a group of weak learners come together to form a strong learner, thereby increasing the accuracy of the model.

When we try to predict a target variable using any machine learning technique, the main causes of the difference between actual and predicted values are noise, variance and bias. An ensemble helps to reduce these factors, except noise, which is the irreducible error: the noise-related error comes from noise in the training data and cannot be removed, but the errors due to bias and variance can be reduced. The total error can be expressed as follows:

Total Error = Bias + Variance + Irreducible Error

A measure such as the mean squared error (MSE) captures all of these errors for a continuous target variable and can be represented as follows:

MSE = E[(Y − f̂(x))²]

where E stands for the expectation, Y represents the actual target values and f̂(x) the predicted values of the target variable. It can be broken down into its bias, variance and noise components:

E[(Y − f̂(x))²] = (E[f̂(x)] − f(x))² + E[(f̂(x) − E[f̂(x)])²] + σ²
               = Bias² + Variance + Irreducible Error

where f(x) denotes the true underlying function and σ² the variance of the noise.

Using techniques like bagging and boosting helps to decrease the variance and increase the robustness of the model. Combining multiple classifiers decreases variance, especially in the case of unstable classifiers, and may produce a more reliable classification than a single classifier. A short sketch illustrating this variance-reduction effect follows below.
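To make the variance-reduction idea concrete, here is a minimal sketch (the noisy sine-wave data and the choice of 50 trees are illustrative assumptions, not part of the original article) that compares a single fully grown regression tree with the average of many trees fit on bootstrap resamples:

# Minimal sketch: averaging many high-variance trees usually beats a single tree.
import numpy as np
from sklearn.tree import DecisionTreeRegressor
from sklearn.metrics import mean_squared_error

rng = np.random.default_rng(0)
X = np.sort(rng.uniform(0, 6, size=300)).reshape(-1, 1)
y = np.sin(X).ravel() + rng.normal(0, 0.3, size=300)      # noisy sine wave
X_train, X_test = X[::2], X[1::2]
y_train, y_test = y[::2], y[1::2]

# A single fully grown tree: low bias, high variance.
single = DecisionTreeRegressor(random_state=0).fit(X_train, y_train)

# An "ensemble": 50 trees, each fit on a bootstrap resample, predictions averaged.
preds = []
for i in range(50):
    idx = rng.integers(0, len(X_train), size=len(X_train))   # sample rows with replacement
    tree = DecisionTreeRegressor(random_state=i).fit(X_train[idx], y_train[idx])
    preds.append(tree.predict(X_test))
ensemble_pred = np.mean(preds, axis=0)

print("single tree MSE   :", mean_squared_error(y_test, single.predict(X_test)))
print("averaged trees MSE:", mean_squared_error(y_test, ensemble_pred))

On most runs the averaged predictions have a noticeably lower test error, which is exactly the variance reduction that bagging exploits.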
Ensemble Algorithms
The goal of ensemble algorithms is to combine the predictions of several base estimators built with a given learning algorithm in order to improve generalizability and robustness over a single estimator. Two families of ensemble methods are usually distinguished:

Averaging methods: the driving principle is to build several estimators independently and then average their predictions. On average, the combined estimator is usually better than any single base estimator because its variance is reduced. Examples: bagging methods, forests of randomized trees.
Boosting methods: base estimators are built sequentially, and each one tries to reduce the bias of the combined estimator. The motivation is to combine several weak models to produce a powerful ensemble. Examples: AdaBoost, Gradient Tree Boosting.

Advantages of Ensemble Algorithms
Ensembling is a proven method for improving the accuracy of a model and works in most cases.
Ensembling makes the model more robust and stable, ensuring decent performance on test cases in most scenarios.
Ensembles can capture both simple linear relationships and complex nonlinear relationships in the data, for example by combining two different models into a single ensemble.

Disadvantages of Ensemble Algorithms
Ensembling reduces model interpretability and makes it very difficult to draw crucial business insights at the end.
It is time-consuming and thus might not be the best idea for real-time applications.
Selecting the models for an ensemble is an art that is genuinely hard to master.

Basic Ensemble Techniques
Max voting: one of the simplest ways of combining predictions from multiple machine learning algorithms. Each base model makes a prediction and casts a vote for each sample, and the class with the most votes becomes the final prediction. It is mainly used for classification problems.
Averaging: usually used for regression problems, though it can also be applied to estimated probabilities in classification tasks. Predictions are extracted from multiple models and their average is used as the final prediction.
Weighted averaging: like averaging, but each base learner is assigned a weight representing the importance of that model in the prediction. It is used for regression tasks and can also be applied to estimated probabilities in classification problems.

A minimal sketch of these three techniques is shown below.
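As a quick illustration of max voting, averaging and weighted averaging, here is a minimal sketch; the predicted labels, numeric predictions and weights below are made up purely for illustration and are not from the original article:

# Minimal sketch of max voting, averaging and weighted averaging.
import numpy as np

# Classification: predicted labels from three hypothetical base models for 5 samples.
pred_a = np.array([0, 1, 1, 0, 1])
pred_b = np.array([0, 1, 0, 0, 1])
pred_c = np.array([1, 1, 1, 0, 0])
votes = np.vstack([pred_a, pred_b, pred_c])

# Max voting: the label with the most votes per sample wins.
max_vote = np.array([np.bincount(col).argmax() for col in votes.T])
print("max voting:", max_vote)                 # -> [0 1 1 0 1]

# Regression: numeric predictions from the same three hypothetical models.
reg_a = np.array([10.2, 11.0, 9.8])
reg_b = np.array([10.8, 10.5, 9.5])
reg_c = np.array([9.9, 11.2, 10.1])

print("simple average  :", np.mean([reg_a, reg_b, reg_c], axis=0))

# Weighted average: weights reflect how much we trust each model (hypothetical values).
weights = np.array([0.5, 0.3, 0.2])
print("weighted average:", np.average([reg_a, reg_b, reg_c], axis=0, weights=weights))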
Ensemble Methods
Ensemble methods became popular as a relatively simple device to improve the predictive performance of a base procedure, though for different reasons: the bagging procedure turns out to be a variance reduction scheme, at least for some base procedures, whereas boosting methods primarily reduce the (model) bias of the base procedure. This already indicates that bagging and boosting are very different ensemble methods. From the perspective of prediction, random forests are about as good as boosting, and often better than bagging.

Bootstrap Aggregation, or Bagging, fits similar learners on small resampled populations and then takes the mean of all the predictions. It combines bootstrapping and aggregation to form one ensemble model, reducing the variance error and helping to avoid overfitting. Bagging algorithms include:
Bagging meta-estimator
Random forest

Boosting refers to a family of algorithms that convert weak learners into strong learners. Boosting is a sequential process, where each subsequent model attempts to correct the errors of the previous model. Boosting focuses on reducing bias, which can make boosting algorithms prone to overfitting; to avoid this, parameter tuning plays an important role. Some examples of boosting algorithms are:
AdaBoost
GBM
XGBM
Light GBM
CatBoost

Why use ensemble models?
Ensemble models help improve both the accuracy and the robustness of a model. Both bagging and boosting should be familiar to data scientists and machine learning engineers, especially those planning to attend data science or machine learning interviews. Ensemble learning can use hundreds to thousands of models of the same algorithm that work hand in hand to find the correct classification. You may also consider the fable of the blind men and the elephant: each blind man found one feature of the elephant and thought it was something entirely different, but had they worked together and discussed their findings, they might have figured out what it was. Using techniques like bagging and boosting leads to increased robustness of statistical models and decreased variance. The question then becomes: between these different "B" words, which is better?

Which is better, Bagging or Boosting?
There is no single correct answer; it depends on the data, the simulation and the circumstances. Bagging and boosting both decrease the variance of a single estimate, as they combine several estimates from different models, so the result may be a model with higher stability. If the problem is that the single model gets very low performance, bagging will rarely produce a better bias, but boosting could generate a combined model with lower errors, as it optimizes the advantages and reduces the pitfalls of the single model. By contrast, if the difficulty of the single model is overfitting, then bagging is the better option; boosting, for its part, does not help avoid overfitting and in fact faces this problem itself. For this reason, bagging is effective more often than boosting. In this article we will discuss bagging and will cover boosting in the next post. But first, let us look at the very important concept of bootstrapping.

Bootstrap Sampling
Sampling is the process of selecting a subset of observations from the population with the purpose of estimating some parameters about the whole population. Resampling methods, on the other hand, are used to improve the estimates of those population parameters. In machine learning, the bootstrap method refers to random sampling with replacement; such a sample is referred to as a resample. It allows the model or algorithm to get a better understanding of the various biases, variances and features that exist in the data. Taking a sample of the data allows the resample to contain different characteristics than the data might have had as a whole: each resampled population has different pieces, and none are identical. This affects the overall mean, standard deviation and other descriptive metrics of a data set, and in turn can lead to more robust models. Bootstrapping is also great for small data sets that have a tendency to overfit; in fact, we recommended it to one company who was concerned because their data sets were far from "Big Data".
Bootstrapping can be a solution in this case, because algorithms that utilize bootstrapping can be more robust and can handle new data sets better, depending on the methodology chosen (boosting or bagging). The reason for using the bootstrap method is that it can test the stability of a solution: by using multiple sample data sets and then testing multiple models, it can increase robustness. Perhaps one sample data set has a larger mean than another, or a different standard deviation; this might break a model that was overfit and never tested on data sets with different variations. One of the many reasons bootstrapping has become so common is the increase in computing power, which allows many more permutations to be done with different resamples than was previously possible. Bootstrapping is used in both bagging and boosting.

Let us assume we have a sample of n values (x) and we would like to estimate the mean of the sample:

mean(x) = 1/n * sum(x)

Consider a sample of 100 values; we can calculate the mean directly from the sample as mean(x) = 1/100 * sum(x). We know that our sample is small and that this mean has an error in it. We can improve the estimate of the mean using the bootstrap procedure:

Create many (e.g. 1,000) random sub-samples of the data set with replacement (meaning we can select the same value multiple times).
Calculate the mean of each sub-sample.
Calculate the average of all of the collected means and use that as the estimated mean for the data.

For example, suppose we used 3 resamples and got the mean values 2.3, 4.5 and 3.3. Taking the average of these, we could take the estimated mean of the data to be 3.367. The same process can be used to estimate other quantities, like the standard deviation, and even quantities used in machine learning algorithms, such as learned coefficients.

While using Python, we do not have to implement the bootstrap method manually. The scikit-learn library provides an implementation that creates a single bootstrap sample of a dataset. The resample() scikit-learn function can be used for sampling: it takes as arguments the data array, whether or not to sample with replacement, the size of the sample, and the seed for the pseudorandom number generator used prior to the sampling. For example, let us create a bootstrap sample with replacement with 4 observations, using a value of 1 for the pseudorandom number generator:

boot = resample(data, replace=True, n_samples=4, random_state=1)

As the bootstrap API does not make it easy to gather the out-of-bag observations that could be used as a test set to evaluate a fitted model, in the univariate case we can gather the out-of-bag observations using a simple Python list comprehension:

# out of bag observations
oob = [x for x in data if x not in boot]

Let us look at a small example and execute it:

# scikit-learn bootstrap
from sklearn.utils import resample
# data sample
data = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6]
# prepare bootstrap sample
boot = resample(data, replace=True, n_samples=4, random_state=1)
print('Bootstrap Sample: %s' % boot)
# out of bag observations
oob = [x for x in data if x not in boot]
print('OOB Sample: %s' % oob)

The output includes the observations in the bootstrap sample and those in the out-of-bag sample:

Bootstrap Sample: [0.6, 0.4, 0.5, 0.1]
OOB Sample: [0.2, 0.3]
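To tie the resample() call back to the mean-estimation procedure described above, here is a minimal sketch; the toy data list mirrors the example above, while the choice of 1,000 resamples and the NumPy import are my own additions for illustration:

# Minimal sketch of the bootstrap estimate of the mean described above.
import numpy as np
from sklearn.utils import resample

data = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6]

boot_means = []
for i in range(1000):                                   # many random sub-samples
    boot = resample(data, replace=True, n_samples=len(data), random_state=i)
    boot_means.append(np.mean(boot))                    # mean of each sub-sample

print('Direct sample mean     : %.3f' % np.mean(data))
print('Bootstrap mean of means: %.3f' % np.mean(boot_means))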
Bagging
Bootstrap Aggregation, also known as Bagging, is a powerful ensemble method that was proposed by Leo Breiman in 1994 to prevent overfitting. The concept behind bagging is to combine the predictions of several base learners to create a more accurate output. Bagging is the application of the bootstrap procedure to a high-variance machine learning algorithm, typically decision trees. Suppose there are N observations and M features:

A sample of observations is selected randomly with replacement (bootstrapping).
A subset of features is selected to create a model with the sampled observations and the subset of features.
The feature from the subset that gives the best split on the training data is selected.
This is repeated to create many models, and every model is trained in parallel.
The final prediction is based on the aggregation of the predictions from all the models.

This approach can be used with machine learning algorithms that have high variance, such as decision trees. A separate model is trained on each bootstrap sample of the data, and the average output of those models is used to make predictions. This technique is called bootstrap aggregation, or bagging for short. Variance means that an algorithm's performance is sensitive to the training data: high variance suggests that the more the training data is changed, the more the performance of the algorithm will vary. The performance of high-variance machine learning algorithms like unpruned decision trees can be improved by training many trees and taking the average of their predictions, and the results are often better than a single decision tree.

What bagging does is help reduce variance from models that might be very accurate, but only on the data they were trained on; this is also known as overfitting. Overfitting is when a function fits the data too well, typically because the actual relationship is much too complicated to account for every data point and outlier. Bagging gets around this by creating its own variation among the data, sampling with replacement while it tests multiple hypotheses (models). In turn, this reduces the noise by utilizing multiple samples that would most likely have data with various attributes (median, average, etc.).

Once each model has developed a hypothesis, the models use voting for classification or averaging for regression. This is where the "Aggregating" in "Bootstrap Aggregating" comes into play. Each hypothesis has the same weight as all the others (when we later discuss boosting, this is one of the places where the two methodologies differ). Essentially, all these models run at the same time and vote on which hypothesis is the most accurate, which helps decrease variance, i.e. reduce the overfit.

Advantages
Bagging takes advantage of ensemble learning, wherein multiple weak learners outperform a single strong learner.
It helps reduce variance and thus helps us avoid overfitting.

Disadvantages
There is a loss of interpretability of the model.
There can be a problem of high bias if the ensemble is not modeled properly.
While bagging gives us more accuracy, it is computationally expensive and may not be desirable depending on the use case.

There are many bagging algorithms, of which perhaps the most prominent is Random Forest. A minimal sketch of bagging with scikit-learn's off-the-shelf bagging meta-estimator is shown below.
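The bagging meta-estimator mentioned above is available off the shelf in scikit-learn. Here is a minimal sketch, assuming a synthetic dataset from make_classification (my choice, not the article's), that compares a single decision tree with a bagged ensemble of trees:

# Minimal sketch of bagging with scikit-learn's BaggingClassifier on synthetic data.
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

X, y = make_classification(n_samples=1000, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

# A single deep decision tree versus 100 trees, each trained on a bootstrap sample.
tree = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)
bag = BaggingClassifier(DecisionTreeClassifier(), n_estimators=100,
                        bootstrap=True, random_state=0).fit(X_train, y_train)

print('single tree accuracy :', accuracy_score(y_test, tree.predict(X_test)))
print('bagged trees accuracy:', accuracy_score(y_test, bag.predict(X_test)))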
Decision Trees
Decision trees are simple but intuitive models. Using a top-down approach, a root node creates binary splits until a particular criterion is fulfilled. This binary splitting of nodes produces a predicted value on the basis of the interior nodes leading to the terminal (final) nodes. For a classification problem, a decision tree outputs a predicted target class for each terminal node produced. We have covered the decision tree algorithm in detail, for both classification and regression, in another article.

Limitations of Decision Trees
Decision trees tend to have high variance when they utilize different training and test sets of the same data, since they tend to overfit on the training data. This leads to poor performance on new and unseen data, which limits the use of plain decision trees in predictive modeling. However, using ensemble methods, models that utilize decision trees as a foundation can produce powerful results.

Bootstrap Aggregating Trees
Using bootstrap aggregating (bagging), as discussed above, we can create an ensemble (forest) of trees, where multiple training sets are generated by sampling data instances with replacement. Once the training sets are created, a CART model can be trained on each subsample.

Features of Bagged Trees
They reduce variance by averaging the ensemble's results.
The resulting model uses the entire feature space when considering node splits.
Bagging allows the trees to grow without pruning; each deep tree has high variance but low bias, and averaging them can help improve predictive power.

Limitations of Bagged Trees
The main limitation of bagging trees is that the entire feature space is used when creating splits. If a few variables in the feature space dominate the predictions, there is a risk of growing a forest of highly correlated trees, which limits the variance reduction that bagging is supposed to deliver.

Why is a Forest Better than One Tree?
The main objective of a machine learning model is to generalize properly to new and unseen data. When we have a very flexible model, overfitting takes place. A flexible model is said to have high variance because the learned parameters (such as the structure of the decision tree) vary with the training data. On the other hand, an inflexible model is said to have high bias, as it makes assumptions about the training data, and it may not even have the capacity to fit the training data. In both cases, high variance and high bias, the model is not able to generalize well to new and unseen data. You can go through our article on one of the foundational concepts in machine learning, the bias-variance tradeoff, which will help you understand the balance between a model that is so flexible that it memorizes the training data and an inflexible model that cannot even learn the training data.

The main reason a decision tree is prone to overfitting when we do not limit its maximum depth is that it has unlimited flexibility: it keeps growing until there is one leaf node for every single observation. Instead of limiting the depth of the tree, which reduces variance at the cost of increased bias, we can combine many decision trees into a single ensemble model known as the random forest.

What is the Random Forest Algorithm?
Random forest combines the bootstrapping idea with the decision tree (CART) model. Suppose we have 1,000 observations in the complete population, with 10 variables. Random forest will build multiple CART models on different samples with different initial variables: it might take a random sample of 100 observations and 5 randomly chosen initial variables to build one CART model, repeat the process, say, about 10 times, and then make a final prediction for each observation. The final prediction is a function of the individual predictions; it can simply be their mean. A minimal hand-rolled sketch of this procedure follows below.
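As a hand-rolled sketch of the procedure just described (bootstrap samples of the rows plus a per-tree random subset of features, with the final prediction taken as the mean), here is one way it could look. The synthetic regression data, the 10 trees and the 5-feature subsets are illustrative assumptions; note also that scikit-learn's actual random forest draws random features at every split rather than once per tree.

# Minimal hand-rolled sketch of the random-forest-style procedure described above.
import numpy as np
from sklearn.datasets import make_regression
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
X, y = make_regression(n_samples=1000, n_features=10, noise=10.0, random_state=0)

trees, feature_sets = [], []
for _ in range(10):                                        # build 10 CART models
    rows = rng.integers(0, len(X), size=100)               # bootstrap sample of 100 rows
    cols = rng.choice(X.shape[1], size=5, replace=False)   # 5 randomly chosen features
    tree = DecisionTreeRegressor().fit(X[np.ix_(rows, cols)], y[rows])
    trees.append(tree)
    feature_sets.append(cols)

# Final prediction: the mean of the individual tree predictions.
preds = np.mean([t.predict(X[:, cols]) for t, cols in zip(trees, feature_sets)], axis=0)
print("predictions for the first three observations:", preds[:3])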
The random forest is a model made up of many decision trees. Rather than just averaging the predictions of the trees (which we could call a "forest"), this model uses two key concepts that give it the name random:

Random sampling of training data points when building trees
Random subsets of features considered when splitting nodes

How the Random Forest Algorithm Works
The basic steps involved in performing the random forest algorithm are:

Pick N random records from the dataset.
Build a decision tree based on these N records.
Choose the number of trees you want in your algorithm and repeat steps 1 and 2.
For a regression problem, each tree in the forest predicts a value for Y (the output) for a new record, and the final value is calculated by taking the average of all the values predicted by all the trees in the forest. For a classification problem, each tree in the forest predicts the category to which the new record belongs, and the new record is assigned to the category that wins the majority vote.

Using Random Forest for Regression
Here we have a problem where we have to predict the gas consumption (in millions of gallons) in 48 US states based on petrol tax (in cents), per capita income (dollars), paved highways (in miles) and the proportion of the population with a driving licence. We will use the random forest algorithm via the scikit-learn Python library to solve this regression problem. First we import the necessary libraries and our dataset:

import pandas as pd
import numpy as np
dataset = pd.read_csv('/content/petrol_consumption.csv')
dataset.head()

   Petrol_tax  Average_income  Paved_Highways  Population_Driver_licence(%)  Petrol_Consumption
0  9.0         3571            1976            0.525                         541
1  9.0         4092            1250            0.572                         524
2  9.0         3865            1586            0.580                         561
3  7.5         4870            2351            0.529                         414
4  8.0         4399            431             0.544                         410

You will notice that the values in our dataset are not very well scaled. We will scale them down before training the algorithm.

Preparing Data for Training
We will perform two tasks to prepare the data. First we divide the data into 'attributes' and 'label' sets; the result is then divided into training and test sets.

X = dataset.iloc[:, 0:4].values
y = dataset.iloc[:, 4].values

Now let us divide the data into training and testing sets:

from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

Feature Scaling
The dataset is not yet scaled: the Average_income field has values in the range of thousands while Petrol_tax has values in the range of tens. It is better to scale the data, and we will use scikit-learn's StandardScaler class to do so.

# Feature Scaling
from sklearn.preprocessing import StandardScaler
sc = StandardScaler()
X_train = sc.fit_transform(X_train)
X_test = sc.transform(X_test)

Training the Algorithm
Now that we have scaled our dataset, let us train the random forest algorithm to solve this regression problem.

from sklearn.ensemble import RandomForestRegressor
regressor = RandomForestRegressor(n_estimators=20, random_state=0)
regressor.fit(X_train, y_train)
y_pred = regressor.predict(X_test)

The RandomForestRegressor class is used to solve regression problems via random forest. Its most important parameter is n_estimators, which defines the number of trees in the random forest. Here we start with n_estimators=20 and check the performance of the algorithm.
You can find details of all the parameters of RandomForestRegressor in the scikit-learn documentation.

Evaluating the Algorithm
Let us evaluate the performance of the algorithm. For regression problems, the metrics used to evaluate an algorithm are mean absolute error (MAE), mean squared error (MSE) and root mean squared error (RMSE).

from sklearn import metrics
print('Mean Absolute Error:', metrics.mean_absolute_error(y_test, y_pred))
print('Mean Squared Error:', metrics.mean_squared_error(y_test, y_pred))
print('Root Mean Squared Error:', np.sqrt(metrics.mean_squared_error(y_test, y_pred)))

Mean Absolute Error: 51.76500000000001
Mean Squared Error: 4216.166749999999
Root Mean Squared Error: 64.93201637097064

With 20 trees, the root mean squared error is 64.93, which is greater than 10 percent of the average petrol consumption (576.77). This may indicate, among other things, that we have not used enough estimators (trees). If we change the number of estimators to 200, the results are as follows:

Mean Absolute Error: 48.33899999999999
Mean Squared Error: 3494.2330150000003
Root Mean Squared Error: 59.112037818028234

Plotting the root mean squared error (RMSE) against the number of estimators shows that the error decreases as the number of estimators increases. You may consider 200 a good value for n_estimators, as the rate of decrease in error diminishes beyond that point, and you may try playing around with other parameters to figure out a better result. A short sketch of how such a sweep over n_estimators could be produced is shown below.
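Here is a rough sketch of such a sweep. It assumes the X_train, X_test, y_train and y_test arrays prepared earlier in this walkthrough, and the list of tree counts is an arbitrary illustrative choice:

# Minimal sketch: sweep n_estimators and watch the RMSE shrink.
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error

for n_trees in [20, 50, 100, 200, 300]:
    model = RandomForestRegressor(n_estimators=n_trees, random_state=0)
    model.fit(X_train, y_train)
    rmse = np.sqrt(mean_squared_error(y_test, model.predict(X_test)))
    print('n_estimators=%3d  RMSE=%.2f' % (n_trees, rmse))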
Using Random Forest for Classification
Now let us consider a classification problem: predicting whether a bank currency note is authentic or not based on four attributes, namely the variance, skewness and kurtosis of the wavelet-transformed image, and the entropy of the image. We will use a random forest classifier to solve this binary classification problem. Let's get started.

import pandas as pd
import numpy as np
dataset = pd.read_csv('/content/bill_authentication.csv')
dataset.head()

   Variance  Skewness  Kurtosis  Entropy   Class
0  3.62160   8.6661    -2.8073   -0.44699  0
1  4.54590   8.1674    -2.4586   -1.46210  0
2  3.86600   -2.6383   1.9242    0.10645   0
3  3.45660   9.5228    -4.0112   -3.59440  0
4  0.32924   -4.4552   4.5718    -0.98880  0

Similar to the data we used for the regression problem, this data is not scaled. Let us prepare it for training.

Preparing Data for Training
The following code divides the data into attributes and labels:

X = dataset.iloc[:, 0:4].values
y = dataset.iloc[:, 4].values

The following code divides the data into training and testing sets:

from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

Feature Scaling
We do the same thing as we did for the previous problem:

# Feature Scaling
from sklearn.preprocessing import StandardScaler
sc = StandardScaler()
X_train = sc.fit_transform(X_train)
X_test = sc.transform(X_test)

Training the Algorithm
Now that we have scaled our dataset, let us train the random forest algorithm to solve this classification problem.

from sklearn.ensemble import RandomForestClassifier
classifier = RandomForestClassifier(n_estimators=20, random_state=0)
classifier.fit(X_train, y_train)
y_pred = classifier.predict(X_test)

For classification we use the RandomForestClassifier class of the sklearn.ensemble library. It takes n_estimators as a parameter, which defines the number of trees in our random forest. As in the regression problem, we start with 20 trees. You can find details of all the parameters of RandomForestClassifier in the scikit-learn documentation.

Evaluating the Algorithm
For classification problems, the metrics used are accuracy, the confusion matrix, precision, recall and F1 values.

from sklearn.metrics import classification_report, confusion_matrix, accuracy_score
print(confusion_matrix(y_test, y_pred))
print(classification_report(y_test, y_pred))
print(accuracy_score(y_test, y_pred))

The output will look something like this:

[[155   2]
 [  1 117]]

              precision    recall  f1-score   support
           0       0.99      0.99      0.99       157
           1       0.98      0.99      0.99       118
    accuracy                           0.99       275
   macro avg       0.99      0.99      0.99       275
weighted avg       0.99      0.99      0.99       275

0.9890909090909091

The accuracy achieved by our random forest classifier with 20 trees is 98.90%. Let us change the number of trees to 200:

from sklearn.ensemble import RandomForestClassifier
classifier = RandomForestClassifier(n_estimators=200, random_state=0)
classifier.fit(X_train, y_train)
y_pred = classifier.predict(X_test)

The output is the same as above: unlike the regression problem, changing the number of estimators made no difference to the results here. An accuracy of 98.9% is pretty good, and in this case there is not much improvement when the number of trees is increased. You may try playing around with other parameters of the RandomForestClassifier class and see if you can improve on these results.

Advantages and Disadvantages of Using Random Forest
As with any algorithm, there are advantages and disadvantages to using it. Let us look at the pros and cons of using random forest for classification and regression.

Advantages
The random forest algorithm is less prone to bias from any single training sample, as there are multiple trees and each tree is trained on a subset of the data.
The random forest algorithm is very stable: introducing new data into the dataset does not affect it much, since the new data may impact one tree but is unlikely to impact all the trees.
The random forest algorithm works well when you have both categorical and numerical features.
The random forest algorithm performs well even with missing values in the dataset.

Disadvantages
A major disadvantage of random forests lies in their complexity: they require more computational resources, since a large number of decision trees are joined together.
Due to this complexity, training time is longer than for many other algorithms.

Summary
In this article we covered what ensemble learning is and discussed basic ensemble techniques. We also looked at bootstrap sampling, which involves iteratively resampling a dataset with replacement, allowing a model or algorithm to get a better understanding of the various features. We then moved on to bagging, followed by random forest. We also implemented random forest in Python for both regression and classification, and concluded that increasing the number of trees or estimators does not always make a difference in a classification problem, whereas in regression it does have an impact. We have covered most of the topics related to algorithms in our series of machine learning blogs; click here. If you are inspired by the opportunities provided by machine learning, enrol in our Data Science and Machine Learning courses for more lucrative career options in this landscape. Build your own projects using Machine Learning with Python. Practice with our industry experts on our live workshops now.