Entropy is a measure of the disorder or randomness of a system. In information theory, it quantifies the uncertainty or unpredictability of a message; in thermodynamics, it quantifies the disorder of a physical system.
Understanding Entropy in the Context of a Decision Tree
In decision trees, entropy measures the disorder or impurity of a node, i.e. how mixed the class labels in that node are. A node with high entropy holds a mix of classes with no clear majority; a node with low entropy is nearly pure, with one class clearly dominating.
How to Calculate Entropy?
The entropy of a node can be calculated using the following formula, where p is the proportion of examples belonging to each class and the sum runs over all classes. The logarithm is base 2, so entropy is measured in bits:
entropy = -sum(p * log2(p))
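As a quick sanity check of the formula: a node split 50/50 between two classes has entropy -(0.5*log2(0.5) + 0.5*log2(0.5)) = 1 bit, the maximum for two classes, while a pure node has entropy 0. The short NumPy snippet below reproduces both values:

import numpy as np

# A perfectly mixed binary node: half the examples in each class
p = np.array([0.5, 0.5])
print(-np.sum(p * np.log2(p)))  # 1.0 -- maximum entropy for two classes

# A pure node: every example belongs to one class
p = np.array([1.0])
print(-np.sum(p * np.log2(p)))  # -0.0 -- no uncertainty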
Decision trees use entropy to decide how to split the data. The goal of a decision tree is a tree where each leaf node is pure, meaning that all of the data in the leaf belongs to the same class. To get there, the tree splits the data at the point that yields the largest information gain, that is, the largest reduction in entropy from the parent node to the (weighted) child nodes. This drives the data toward the purest possible nodes.
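Here is a minimal sketch of that split criterion, assuming binary labels and reusing the entropy function implemented later in this post; the information gain of a candidate split is the parent's entropy minus the weighted average of the children's entropies:

def information_gain(y, left_y, right_y):
    '''Entropy reduction from splitting the labels y into left_y and right_y.'''
    n = len(y)
    # Child entropies, weighted by the fraction of samples in each child
    child_entropy = (len(left_y) * entropy(left_y)
                     + len(right_y) * entropy(right_y)) / n
    return entropy(y) - child_entropy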
Entropy Use Case
For example, let’s say we have a dataset of patients, some of whom have cancer. We want to build a decision tree to predict whether a patient has cancer or not. The dataset contains the following features:
- Age
- Gender
- Smoking status
- Family history of cancer
The entropy of the root node of the decision tree will be high, because the data is very mixed: there are patients of all ages, genders, smoking statuses, and family histories in the dataset. The decision tree then splits the data wherever the information gain is highest, i.e. where the split reduces entropy the most. For example, the decision tree might split the data based on age: patients under 50 years old go into one node, and patients 50 and over go into another.
The entropy of each of the child nodes is then calculated. Any child node that is still impure is split again, and the process continues recursively until all of the leaf nodes are pure (or no further split improves them). A worked example with hypothetical numbers follows below.
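To make the numbers concrete, here is a small sketch using purely hypothetical class counts (say 40 of 100 patients have cancer, and the age-50 split sends 50 patients each way); none of these figures come from a real dataset:

import numpy as np

def H(counts):
    '''Entropy from raw class counts.'''
    p = np.array(counts, dtype=float) / sum(counts)
    p = p[p > 0]
    return -np.sum(p * np.log2(p))

root = H([40, 60])                     # ~0.971 bits: the root is very mixed
under_50 = H([10, 40])                 # hypothetical counts for patients under 50
over_50 = H([30, 20])                  # hypothetical counts for patients 50 and over
gain = root - (50 * under_50 + 50 * over_50) / 100
print(gain)                            # entropy reduction achieved by the age split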
Entropy is a central concept in decision trees: it measures the purity of a node and drives the choice of splits. By repeatedly choosing the split with the highest information gain, a decision tree can become very accurate at predicting the class of a new data point.
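The code below is a minimal, self-contained Python sketch of these ideas: an entropy function for binary labels and a toy recursive tree builder. It is meant to illustrate the mechanics rather than serve as a production implementation, which would add depth limits, pruning, and better stopping rules.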
import numpy as np
def entropy(y_target):
    '''
    Calculates the entropy of a list of (binary) target variables.
    '''
    # Initialize the entropy
    ent = 0
    # Count the occurrences of each unique class label in y_target
    _, counts = np.unique(y_target, return_counts=True)
    # Probabilities (proportions) of each class label
    prob = counts / len(y_target)
    # Accumulate -p * log2(p) over all the class probabilities
    for p in prob:
        if p != 0:
            ent -= p * np.log2(p)
    return ent
def decision_tree(X, y):
    '''
    Recursively builds a decision tree from the given data.
    '''
    # If the node is pure (entropy 0), return a leaf holding its class label
    if entropy(y) == 0:
        return {"leaf": y[0]}
    # Search every feature and candidate threshold for the split with the
    # highest information gain, i.e. the largest reduction in entropy
    best_gain, feature, threshold = 0.0, None, None
    for f in range(X.shape[1]):
        for t in np.unique(X[:, f]):
            mask = X[:, f] <= t
            left_y, right_y = y[mask], y[~mask]
            # Skip splits that leave one side empty
            if len(left_y) == 0 or len(right_y) == 0:
                continue
            child = (len(left_y) * entropy(left_y)
                     + len(right_y) * entropy(right_y)) / len(y)
            gain = entropy(y) - child
            if gain > best_gain:
                best_gain, feature, threshold = gain, f, t
    # If no split reduces entropy, return a leaf with the majority class
    if feature is None:
        values, counts = np.unique(y, return_counts=True)
        return {"leaf": values[np.argmax(counts)]}
    # Split the data based on the chosen feature and threshold
    mask = X[:, feature] <= threshold
    # Recursively build the decision tree for the left and right child nodes
    left_tree = decision_tree(X[mask], y[mask])
    right_tree = decision_tree(X[~mask], y[~mask])
    # Add the feature and threshold to the tree
    return {feature: {"threshold": threshold, "left": left_tree, "right": right_tree}}
if __name__ == "__main__":
    # Generate some random binary data: 100 samples, 3 binary features
    X = np.random.randint(0, 2, (100, 3))
    y = np.random.randint(0, 2, 100)
    # Build the decision tree
    tree = decision_tree(X, y)
    # Print the tree
    print(tree)
This code first defines a function that calculates the entropy of a list of binary labels. It then defines a function that builds a decision tree recursively: at each node it searches every feature and threshold for the split with the highest information gain, recurses on the two halves, and returns a leaf once a node is pure or no split reduces entropy. Finally, it generates some random data, builds a decision tree from it, and prints the resulting nested dictionary.
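As a usage sketch, a new point can be classified by walking that nested dictionary until a leaf is reached; the keys "leaf", "threshold", "left", and "right" below match the structure that decision_tree returns:

def predict(tree, x):
    '''Walks the tree for a single sample x and returns the leaf's label.'''
    while "leaf" not in tree:
        feature = next(iter(tree))  # the feature index this node splits on
        node = tree[feature]
        tree = node["left"] if x[feature] <= node["threshold"] else node["right"]
    return tree["leaf"]

# For example, inside the main block above:
# print(predict(tree, X[0]))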