Entropy is a measure of the disorder or randomness of a system. In information theory, entropy measures the uncertainty or unpredictability of a message; in thermodynamics, it measures the disorder of a physical system.

Understanding Entropy in the Context of a Decision Tree

In the context of decision trees, entropy measures the impurity of a node. A node with high entropy contains a mix of classes with no clear majority, while a node with low entropy is dominated by a single class; a node whose samples all belong to one class has an entropy of zero.

How to Calculate Entropy?

The entropy of a node can be calculated using the following formula, where p is the proportion of samples belonging to a given class and the sum runs over all classes (using log base 2 gives the entropy in bits):

entropy = -sum(p * log2(p))

Decision trees use entropy to decide how to split the data. The goal of a decision tree is to create a tree where each leaf node is pure, meaning that all of the data in the leaf node belongs to the same class. To do this, the decision tree splits the data on the feature and threshold that produce the largest reduction in entropy, known as the information gain. This ensures that the data is split into the purest possible child nodes.
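As a quick illustration of the formula, here is a minimal NumPy sketch (separate from the main example later in this post) that computes the entropy of a pure node, a perfectly mixed node, and a mostly-pure node:

import numpy as np

def node_entropy(labels):
    # Proportion of samples in each class
    _, counts = np.unique(labels, return_counts=True)
    probs = counts / counts.sum()
    # entropy = -sum(p * log2(p))
    return -np.sum(probs * np.log2(probs))

print(node_entropy([1, 1, 1, 1]))        # 0.0 bits (NumPy may print -0.0) -> pure node
print(node_entropy([0, 0, 1, 1]))        # 1.0 bit -> maximally mixed node
print(node_entropy([0, 1, 1, 1, 1, 1]))  # ~0.65 bits -> mostly one class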

Entropy Use Case

For example, let’s say we have a dataset of patients with cancer. We want to build a decision tree to predict whether a patient has cancer or not. The dataset contains the following features:

  • Age
  • Gender
  • Smoking status
  • Family history of cancer

The entropy of the root node of the decision tree will be high, because the data is very mixed: there are patients of all ages, genders, smoking statuses, and family histories in the dataset. The decision tree will then split the data on the feature and threshold that give the highest information gain, i.e., the largest drop in entropy. For example, the decision tree might split the data based on age: patients under 50 years old go into one node, and patients 50 or older go into another node.

The entropy of each of the child nodes is then calculated. Any child node that is still impure is split again in the same way, and the process continues until all of the leaf nodes are pure (or no further split reduces the entropy).
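To see how a split is chosen, here is a minimal sketch with made-up ages and labels (the numbers are purely illustrative, not real patient data) that computes the information gain of the age-at-50 split described above:

import numpy as np

def node_entropy(labels):
    _, counts = np.unique(labels, return_counts=True)
    probs = counts / counts.sum()
    return -np.sum(probs * np.log2(probs))

# Hypothetical toy data: patient ages and labels (1 = cancer, 0 = no cancer)
ages = np.array([25, 32, 41, 47, 52, 58, 63, 70])
has_cancer = np.array([0, 0, 0, 1, 1, 1, 1, 1])

# Split on age < 50
left = has_cancer[ages < 50]    # patients under 50
right = has_cancer[ages >= 50]  # patients 50 or older

# Information gain = parent entropy - weighted average entropy of the children
parent_entropy = node_entropy(has_cancer)
children_entropy = (len(left) * node_entropy(left) +
                    len(right) * node_entropy(right)) / len(has_cancer)

print("Information gain:", parent_entropy - children_entropy)  # roughly 0.55 bits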

Entropy is a very important concept in decision trees. It is used to measure the purity of a node and to decide how to split the data. By using entropy, decision trees can be built that are accurate at predicting the class of a new data point.
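The following Python code puts these ideas together: it implements the entropy formula with NumPy and then uses it to build a simple decision tree.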

import numpy as np

def entropy(y_target):
    '''
    Calculates the entropy of a list of (binary) target variables
    '''

    # Initialize the entropy
    entropy = 0

    # Count the occurrences of each unique class label in y_target
    _, counts = np.unique(y_target, return_counts=True)

    # Probabilities of each class label
    prob = counts / len(y_target)

    # Calculate the entropy involving all the unique elements
    for p in prob:
        if p != 0:
            entropy -= p * np.log2(p)

    return entropy

def decision_tree(X, y):
    '''
    Builds a decision tree from the given data
    '''

    # Initialize the tree
    tree = {}

    # Get the entropy of the current node
    entropy_node = entropy(y)

    # If the node is already pure, it becomes a leaf and the tree is complete
    if entropy_node == 0:
        return tree

    # Find the feature and threshold whose split gives the highest information gain
    best_gain, best_feature, best_threshold = 0.0, None, None
    for feature in range(X.shape[1]):
        for threshold in np.unique(X[:, feature]):
            left_mask = X[:, feature] <= threshold
            right_mask = ~left_mask

            # Skip splits that leave one child empty
            if left_mask.all() or right_mask.all():
                continue

            # Weighted average entropy of the two child nodes
            entropy_children = (left_mask.sum() * entropy(y[left_mask]) +
                                right_mask.sum() * entropy(y[right_mask])) / len(y)

            # Information gain = parent entropy - weighted child entropy
            gain = entropy_node - entropy_children
            if gain > best_gain:
                best_gain, best_feature, best_threshold = gain, feature, threshold

    # If no split reduces the entropy, return a leaf
    if best_feature is None:
        return tree

    # Split the data based on the best feature and threshold
    left_mask = X[:, best_feature] <= best_threshold
    left_X, right_X = X[left_mask], X[~left_mask]
    left_y, right_y = y[left_mask], y[~left_mask]

    # Recursively build the decision tree for the left and right child nodes
    left_tree = decision_tree(left_X, left_y)
    right_tree = decision_tree(right_X, right_y)

    # Add the feature and threshold to the tree
    tree[best_feature] = {"threshold": best_threshold, "left": left_tree, "right": right_tree}

    return tree

if __name__ == "__main__":
    # Generate some data
    X = np.random.randint(0, 2, (100, 3))
    y = np.random.randint(0, 2, 100)

    # Build the decision tree
    tree = decision_tree(X, y)

    # Print the tree
    print(tree)

This code first defines a function to calculate the entropy of a list of binary target variables. It then defines a function that builds a decision tree recursively: at each node it evaluates every candidate feature-threshold split, picks the one with the highest information gain, and stops when a node is pure or no split reduces the entropy. Finally, it generates some random binary data, builds a decision tree from it, and prints the resulting nested dictionary.
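For comparison, libraries such as scikit-learn provide a production-ready decision tree that uses the same idea. The following is a minimal sketch (assuming scikit-learn is installed), where criterion="entropy" tells the classifier to choose splits by information gain:

import numpy as np
from sklearn.tree import DecisionTreeClassifier

# The same kind of random binary data as above
X = np.random.randint(0, 2, (100, 3))
y = np.random.randint(0, 2, 100)

# criterion="entropy" selects splits by information gain, as in the hand-rolled version
clf = DecisionTreeClassifier(criterion="entropy", random_state=0)
clf.fit(X, y)

print(clf.predict(X[:5]))  # predicted classes for the first five samples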

By Pankaj
