Machine Learning Tutorial

By KnowledgeHut.



Decision Trees

A decision tree is the building block of the random forest algorithm and one of the most popular algorithms in machine learning. It is most often used for classification, although it can also be used for regression.

One way to picture it: a decision tree works through a sequence of questions about the data, much as a person weighs a series of conditions before making a decision on the task at hand.

The idea behind a decision tree is to divide the input dataset into smaller and smaller subsets based on specific feature values until every subset contains samples of a single target class. Each split is chosen to give the maximum information gain at that step.

Every decision tree begins with a root node, which is where the first split is made. A way is needed to decide how each node should split the data. This is where Gini comes into the picture.

Gini is one of the most commonly used measures of inequality. Inequality (also called impurity) here refers to how mixed the target classes are within the subset of data at a node. The Gini value is therefore calculated after every split, and information gain is defined by how much the Gini value changes from a node to its children.

How is Gini value calculated?

For the subset of data at a node, the probability of each class is taken and squared, the squares are summed, and this sum is subtracted from 1. If the subset is pure, meaning it contains just one class, the Gini value is 0, since the probability of finding that class is 1.

A pure node means that a leaf, the lowermost node of its branch, has been reached: there is nothing left to split. Once every branch ends in a leaf, the decision tree has been built.

Instead of the Gini value, another measure known as ‘entropy’ can be used to calculate the inequality of classes. Gini and entropy serve the same purpose but differ slightly in scale.

The Gini value of each subset depends on the split that has been chosen, so it changes from node to node. Information gain can be defined as the difference between the Gini value of the parent node and the weighted average of the Gini values of its child nodes.

The decision tree considers all possible splits of the data at a node and chooses the one with the highest information gain.
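The calculation described above can be sketched in a few lines of Python. This is a minimal illustration, not how any particular library implements it; the function name `gini` and the sample labels are assumptions made for the example.

```python
from collections import Counter


def gini(labels):
    """Gini value of a subset: 1 minus the sum of squared class proportions."""
    n = len(labels)
    if n == 0:
        return 0.0  # an empty subset is treated as pure (assumption for this sketch)
    counts = Counter(labels)
    return 1.0 - sum((count / n) ** 2 for count in counts.values())


# Illustrative labels (hypothetical data)
pure = gini(["dog", "dog", "dog"])          # only one class present
mixed = gini(["dog", "cat", "dog", "cat"])  # two classes, evenly split
print(pure, mixed)
```

The pure subset gives a Gini value of 0, while the evenly mixed two-class subset gives 0.5, the largest value possible with two classes.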
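To make the difference in scale concrete, the sketch below computes both measures on the same labels. The helper names `gini` and `entropy` and the sample labels are assumptions for illustration; entropy is computed here with a base-2 logarithm.

```python
import math
from collections import Counter


def class_proportions(labels):
    """Fraction of the subset that belongs to each class."""
    n = len(labels)
    return [count / n for count in Counter(labels).values()]


def gini(labels):
    """1 minus the sum of squared class proportions."""
    return 1.0 - sum(p ** 2 for p in class_proportions(labels))


def entropy(labels):
    """Sum of -p * log2(p) over the classes present in the subset."""
    return sum(-p * math.log2(p) for p in class_proportions(labels))


even_split = ["dog", "cat", "dog", "cat"]  # hypothetical labels
single_class = ["dog", "dog", "dog"]

print(gini(even_split), entropy(even_split))
print(gini(single_class), entropy(single_class))
```

Both measures are 0 for a pure subset. For two evenly mixed classes, Gini peaks at 0.5 while entropy peaks at 1.0, which is the scale difference mentioned above.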
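The split search can be sketched for a single numeric feature as follows. This is a simplified illustration under stated assumptions: the helper names (`gini`, `information_gain`, `best_threshold`) are hypothetical, candidate thresholds are taken as midpoints between adjacent sorted values, and the weights and labels are sample data chosen for the example.

```python
from collections import Counter


def gini(labels):
    """1 minus the sum of squared class proportions."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())


def information_gain(parent, left, right):
    """Parent Gini minus the size-weighted average Gini of the two children."""
    n = len(parent)
    weighted = (len(left) / n) * gini(left) + (len(right) / n) * gini(right)
    return gini(parent) - weighted


def best_threshold(values, labels):
    """Try each midpoint between adjacent sorted values; keep the highest gain."""
    pairs = sorted(zip(values, labels))
    best_gain, best_t = 0.0, None
    for i in range(1, len(pairs)):
        low, high = pairs[i - 1][0], pairs[i][0]
        if low == high:
            continue  # identical values cannot be separated by a threshold
        t = (low + high) / 2
        left = [label for value, label in pairs if value <= t]
        right = [label for value, label in pairs if value > t]
        gain = information_gain(labels, left, right)
        if gain > best_gain:
            best_gain, best_t = gain, t
    return best_t, best_gain


# Hypothetical sample: animal weights with their class labels
weights = [8, 50, 7.9, 15, 8.9]
labels = ["dog", "dog", "cat", "dog", "cat"]
threshold, gain = best_threshold(weights, labels)
print(threshold, gain)
```

A full decision tree repeats this search over every feature at every node and then recurses into the two resulting subsets.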

Implementing a simple Decision Tree

Let us look at how a simple decision tree can be implemented with the help of a code example:

from sklearn.tree import DecisionTreeClassifier
import pandas as pd
# Matrix of the input dataset is created: [weight, height, label]
data = [[8,8.68,'dog'],[50,41,'dog'],[7.9,9,'cat'],[15,13,'dog'],[8.9,9.8,'cat']]
# A dataframe is generated
df = pd.DataFrame(data, columns = ['weight','height','label'])
# The predictors are defined
X = df[['weight','height']]
# The target variable is defined and mapped to integers: 'dog' -> 1, 'cat' -> 0
y = df['label'].map({'dog':1, 'cat':0})
# The model is instantiated with default parameters
tree = DecisionTreeClassifier()
# The model is fit on the data
model = tree.fit(X,y)

A dataframe was built and the model was fit on it. From the code, a few observations need to be made:

The DecisionTreeClassifier was instantiated without providing any parameters to it. When the input dataset is large, the tree has to be kept from growing too deep and overfitting. This is what the ‘max_depth’ parameter is for: it sets the maximum depth of the tree, that is, the longest chain of splits allowed from the root to a leaf. The ‘max_features’ parameter limits the number of predictors considered when searching for the best split. The ‘criterion’ parameter can be set to ‘entropy’ instead of the default ‘gini’ to change the inequality measure used.

Consider the code example below, which draws the fitted tree:
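As a rough sketch of how these parameters might be passed, the example below fits a restricted tree. The data values, the already-encoded labels (1 and 0), and the specific parameter values are assumptions chosen for illustration, not recommended settings.

```python
import pandas as pd
from sklearn.tree import DecisionTreeClassifier

# Small illustrative dataset; the label column is already encoded as 1 / 0
df = pd.DataFrame(
    [[8, 8.68, 1], [50, 41, 1], [7.9, 9, 0], [15, 13, 1], [8.9, 9.8, 0]],
    columns=["weight", "height", "label"],
)
X = df[["weight", "height"]]
y = df["label"]

limited_tree = DecisionTreeClassifier(
    criterion="entropy",  # use entropy instead of the default 'gini'
    max_depth=2,          # at most two levels of splits below the root
    max_features=1,       # consider one randomly chosen predictor per split
    random_state=0,       # make the random choice of predictor repeatable
)
limited_tree.fit(X, y)

# get_depth() reports the depth the fitted tree actually reached
print(limited_tree.get_depth())
```

Because `max_features` picks predictors at random, fixing `random_state` keeps the fitted tree the same between runs.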

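# sklearn.externals.six is no longer available in recent scikit-learn releases;
# the standard library provides StringIO
from io import StringIO
from sklearn.tree import export_graphviz
import pydotplus
from IPython.display import Image
# Buffer that receives the tree in Graphviz DOT format
dot_data = StringIO()
export_graphviz(
    model,
    out_file=dot_data,
    filled=True, rounded=True, proportion=False,
    special_characters=True,
    feature_names=list(X.columns),
    class_names=["cat", "dog"]  # in order of the encoded classes: 0 = cat, 1 = dog
)
# Convert the DOT description to a PNG (requires the Graphviz binaries)
graph = pydotplus.graph_from_dot_data(dot_data.getvalue())
# Display the image in a notebook
Image(graph.create_png())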

This draws the fitted decision tree, which differentiates between the ‘cat’ and ‘dog’ classes.
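If Graphviz is not available, the fitted tree can also be inspected as plain text. The sketch below uses scikit-learn's `export_text` for this; the small dataset and the encoded labels are assumptions chosen for illustration.

```python
from sklearn.tree import DecisionTreeClassifier, export_text

# Illustrative data: [weight, height] with labels encoded as 1 and 0
X = [[8, 8.68], [50, 41], [7.9, 9], [15, 13], [8.9, 9.8]]
y = [1, 1, 0, 1, 0]

model = DecisionTreeClassifier(random_state=0).fit(X, y)

# export_text returns the split rules as an indented string
rules = export_text(model, feature_names=["weight", "height"])
print(rules)
```

Each indented line is a split condition on one feature, and the lines beginning with `class:` are the leaves.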

Advantages of decision trees

Easy to interpret
Can cope with noisy and incomplete data
Can be used for classification as well as regression
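The regression case can be sketched with scikit-learn's `DecisionTreeRegressor`. The data below is made up for illustration, and `max_depth=1` is chosen only to keep the tree to a single split.

```python
from sklearn.tree import DecisionTreeRegressor

# Hypothetical one-feature data with two clearly separated groups of targets
X = [[1], [2], [3], [10], [11], [12]]
y = [1.0, 1.2, 0.9, 5.0, 5.1, 4.9]

# A single split: each leaf predicts the mean target of its training samples
reg = DecisionTreeRegressor(max_depth=1).fit(X, y)

low, high = reg.predict([[2], [11]])
print(low, high)
```

Instead of a class, each leaf of a regression tree predicts a number, here the average target value of the training samples that fall into it.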

Disadvantages of decision trees

They can be unstable, i.e. a small change in the data can make a big difference in the model
They tend to overfit, with low bias and high variance: the model may fit the training data well but not perform well on never-before-seen data.
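The overfitting behaviour can be observed by comparing training and test accuracy. The sketch below is one way to do this under assumptions: it uses a synthetic dataset from `make_classification` with some label noise, and the sizes, noise level, and depth limit are arbitrary choices for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic data with 20% of the labels flipped at random
X, y = make_classification(
    n_samples=500, n_features=10, n_informative=3, flip_y=0.2, random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0
)

# An unrestricted tree keeps splitting until every leaf is pure
deep = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)
# A depth-limited tree for comparison
shallow = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X_train, y_train)

deep_train = deep.score(X_train, y_train)
deep_test = deep.score(X_test, y_test)
shallow_train = shallow.score(X_train, y_train)
shallow_test = shallow.score(X_test, y_test)

print("unrestricted: train", deep_train, "test", deep_test)
print("max_depth=3:  train", shallow_train, "test", shallow_test)
```

The unrestricted tree fits the training set perfectly, so the gap between its training and test accuracy indicates how much of what it learned is noise.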

Conclusion

In this post, we looked at what decision trees are, why they matter, and their advantages and disadvantages, with the help of code examples.


Comments

Vinu

After reading your article, I was amazed. I know that you explain it very well. And I hope that other readers will also experience how I feel after reading your article. Thanks for sharing.

Johnson M

Good and informative article.

Vinu

I enjoyed reading your articles. This is truly a great read for me. Keep up the good work!

Vinu

Awesome blog. I enjoyed reading this article. This is truly a great read for me. Keep up the good work!

best data science courses in India

Thanks for sharing this article!! Machine learning is a branch of artificial intelligence (AI) and computer science that focus on the uses of data and algorithms. I came to know a lot of information from this article.
