Machine Learning Tutorial

Decision Trees

A decision tree is the building block of the random forest algorithm and is one of the most popular algorithms in machine learning. It is used mainly for classification, though it can also be used for regression.

One way to picture it: a decision tree works through a series of questions, much as a person does before making a decision on the task at hand.

The idea behind a decision tree is to divide the input dataset into smaller subsets based on specific feature values, until all the target values in each subset fall under a single category. Each split is chosen to get the maximum information gain at that step.

Every decision tree begins with a root node, which is where the first split is made. A systematic way is needed to decide how each node should split the data. This is where Gini comes into the picture.

Gini is the most commonly used measure of inequality, also known as impurity. Inequality here refers to how mixed the target classes are within the subset of data at a node. The Gini value is therefore calculated after every split, and information gain is defined by how the Gini value changes from a node to its children.

How is Gini value calculated? 

For each class, the probability of finding that class in the node is squared; these squares are summed, and the sum is subtracted from 1. If a subset is pure, meaning it contains just one class, its Gini value is 0, since the probability of finding that class is 1.

This means a leaf, the lowermost node of a branch, has been reached. A pure node cannot usefully be split any further. Once every branch ends in a leaf, the decision tree has been built.
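The Gini value is one minus the sum of the squared class probabilities in a node. A minimal sketch of that calculation follows; the function name `gini` and the sample labels are assumptions made for illustration, and this is not how scikit-learn implements it internally.

```python
from collections import Counter

def gini(labels):
    """Gini value of a collection of class labels: 1 - sum(p_k ** 2)."""
    n = len(labels)
    if n == 0:
        return 0.0
    counts = Counter(labels)
    return 1.0 - sum((c / n) ** 2 for c in counts.values())

pure = ['abc', 'abc', 'abc']          # only one class present
mixed = ['abc', 'abc', 'xyz', 'xyz']  # two classes, evenly split

print(gini(pure))   # 0.0
print(gini(mixed))  # 0.5
```

A pure subset gives 0, and an even two-class split gives 0.5, the largest value possible with two classes.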

Instead of the Gini value, another measure known as ‘entropy’ can be used to quantify the inequality of classes. Gini and entropy serve the same purpose but differ slightly in scale: for two classes, Gini ranges from 0 to 0.5, while entropy (measured in bits) ranges from 0 to 1.
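To see how the two measures compare, the sketch below evaluates both for a two-class node as the share of one class varies. The helper names `gini` and `entropy` are assumptions for this example, and entropy is computed with a base-2 logarithm.

```python
import math

def gini(probs):
    # 1 minus the sum of squared class probabilities
    return 1.0 - sum(p * p for p in probs)

def entropy(probs):
    # Terms with p == 0 contribute nothing, so they are skipped
    return sum(-p * math.log2(p) for p in probs if p > 0)

for p in (0.0, 0.1, 0.3, 0.5):
    probs = (p, 1.0 - p)
    print(f"p={p:.1f}  gini={gini(probs):.3f}  entropy={entropy(probs):.3f}")
```

Both measures are 0 for a pure node and peak at an even split, which is why they usually lead to similar trees.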

Depending on the splitting strategy chosen, different Gini values are obtained for each subset of the data, and the value changes from node to node. Information gain is defined as the difference between the Gini value of the parent node and the weighted average of the Gini values of its child nodes.

The decision tree considers all candidate splits of the data at a node and chooses the one with the highest information gain.
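The search for the best split can be sketched for a single numeric feature. This is a simplified illustration rather than scikit-learn's actual algorithm: the helper names, the choice of midpoints between adjacent values as candidate thresholds, and the sample weights and labels are all assumptions.

```python
from collections import Counter

def gini(labels):
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def information_gain(parent, left, right):
    """Parent Gini minus the size-weighted average Gini of the children."""
    n = len(parent)
    weighted = len(left) / n * gini(left) + len(right) / n * gini(right)
    return gini(parent) - weighted

def best_threshold(values, labels):
    """Try a threshold midway between each pair of adjacent sorted values."""
    pairs = sorted(zip(values, labels))
    best_gain, best_t = 0.0, None
    for i in range(1, len(pairs)):
        lo, hi = pairs[i - 1][0], pairs[i][0]
        if lo == hi:
            continue  # no threshold can separate equal values
        t = (lo + hi) / 2
        left = [lab for v, lab in pairs if v <= t]
        right = [lab for v, lab in pairs if v > t]
        gain = information_gain(labels, left, right)
        if gain > best_gain:
            best_gain, best_t = gain, t
    return best_t, best_gain

# Illustrative values for one feature and the matching class labels
weights = [8, 50, 7.9, 15, 8.9]
labels = ['abc', 'abc', 'xyz', 'abc', 'xyz']

threshold, gain = best_threshold(weights, labels)
print(threshold, round(gain, 4))
```

A full tree repeats this search over every feature at every node, then recurses into the two resulting subsets.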

Implementing a simple Decision Tree 

Let us look at how a simple decision tree can be implemented with the help of a code example: 

from sklearn.tree import DecisionTreeClassifier
import pandas as pd

# Create the input dataset as a list of rows: weight, height, label
data = [[8, 8.68, 'abc'], [50, 41, 'abc'], [7.9, 9, 'xyz'], [15, 13, 'abc'], [8.9, 9.8, 'xyz']]
# Build a dataframe from the rows
df = pd.DataFrame(data, columns=['weight', 'height', 'label'])
# Define the predictors
X = df[['weight', 'height']]
# Define the target variable, mapping 'abc' to 1 and 'xyz' to 0
y = df['label'].map({'abc': 1, 'xyz': 0})
# Instantiate the model with its default parameters
tree = DecisionTreeClassifier()
# Fit the model on the data
model = tree.fit(X, y)

A dataframe was built and the model was fit on it. A few observations about the code:

The DecisionTreeClassifier was instantiated without any parameters, so the defaults were used. When the input dataset is large, the tree should be kept from growing too deep and overfitting. The ‘max_depth’ parameter does this by limiting the maximum depth of the tree, that is, the number of successive splits between the root and any leaf. The ‘max_features’ parameter limits the number of predictors considered when looking for the best split. The ‘criterion’ parameter can be set to ‘entropy’ instead of the default ‘gini’ to change the inequality measure used.
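As a sketch of how these three parameters might be set together, the example below fits a constrained tree on a small dataset. The data values and the specific parameter choices are illustrative assumptions, not recommendations.

```python
from sklearn.tree import DecisionTreeClassifier
import pandas as pd

# Small illustrative dataset: two predictors and a 0/1 label
df = pd.DataFrame(
    [[8, 8.68, 1], [50, 41, 1], [7.9, 9, 0], [15, 13, 1], [8.9, 9.8, 0]],
    columns=['weight', 'height', 'label'],
)
X = df[['weight', 'height']]
y = df['label']

constrained = DecisionTreeClassifier(
    criterion='entropy',  # use entropy instead of the default 'gini'
    max_depth=2,          # no leaf may be more than 2 splits from the root
    max_features=1,       # consider 1 feature at a time when searching for a split
    random_state=0,       # fix the random feature choice for repeatability
)
constrained.fit(X, y)

print(constrained.get_depth())     # never more than max_depth
print(constrained.get_n_leaves())
```

Setting `random_state` matters here because `max_features` makes the choice of candidate features random.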

The fitted tree can be visualized with the following code:

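# sklearn.externals.six has been removed from scikit-learn; the standard library provides StringIO
from io import StringIO
from sklearn.tree import export_graphviz
import pydotplus
from IPython.display import Image

# Write the tree to an in-memory buffer in Graphviz DOT format
dot_data = StringIO()
export_graphviz(
    model,
    out_file=dot_data,
    filled=True, rounded=True, proportion=False,
    special_characters=True,
    feature_names=list(X.columns),
    # Names are listed in ascending order of the class values: 0, then 1
    class_names=["xyz", "abc"]
)
# Render the DOT source as a PNG and display it in the notebook
graph = pydotplus.graph_from_dot_data(dot_data.getvalue())
Image(graph.create_png())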

This renders an image of the fitted decision tree, which separates the ‘abc’ and ‘xyz’ classes.
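Where Graphviz or pydotplus is not available, a plain-text view of the same kind of tree can be obtained with scikit-learn's `export_text`. The sketch below is an alternative to the image rendering, not part of the original example; the data values are illustrative and the labels are assumed to be already encoded as 0 and 1.

```python
from sklearn.tree import DecisionTreeClassifier, export_text

# Illustrative predictors (weight, height) and 0/1 labels
X = [[8, 8.68], [50, 41], [7.9, 9], [15, 13], [8.9, 9.8]]
y = [1, 1, 0, 1, 0]

model = DecisionTreeClassifier(random_state=0).fit(X, y)

# Each line of the report is one split rule or one leaf with its predicted class
rules = export_text(model, feature_names=['weight', 'height'])
print(rules)
```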

Advantages of decision trees 

  • Easy to interpret 
  • Deal well with noisy and incomplete data 
  • Can be used for both classification and regression

Disadvantages of decision trees 

  • They can be unstable, i.e., a small change in the data can make a big difference in the model.
  • They tend to overfit, with low bias and high variance: a tree may fit the training data well but perform poorly on never-before-seen data.
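The overfitting point can be checked by comparing training and test accuracy for an unrestricted tree and a depth-limited one. This is a sketch on synthetic data; the dataset settings, the split ratio, and the depth of 3 are arbitrary assumptions chosen for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic data with some label noise (flip_y); all settings are illustrative
X, y = make_classification(n_samples=500, n_features=10, n_informative=4,
                           flip_y=0.1, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0)

full = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)
shallow = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X_train, y_train)

for name, clf in [('unrestricted', full), ('max_depth=3', shallow)]:
    print(name,
          'train:', round(clf.score(X_train, y_train), 3),
          'test:', round(clf.score(X_test, y_test), 3))
```

The unrestricted tree fits the training set perfectly, noise included, so the gap between its training and test scores gives a rough sense of how much it has overfit.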

Conclusion 

In this post, we looked at what decision trees are, why they matter, and their advantages and disadvantages, with the help of code examples.
