5 Feb 2024, 17:53 • Science & Technology

Chapter 2: Decision Tree

A Decision Tree is a tool that aids decision-making by using a tree-like graph structure, which makes the results easy for humans to interpret. It can be used for classification in both kinds of problems:
  • Binary class problems
  • Multi-class problems
It can also be used to represent database structures for fast searching, or for regression analysis.
However, this chapter discusses only the use of decision trees for classification problems.
2.1 Objectives
After finishing this chapter, the reader should be able to:
  • Understand the procedure for building a decision tree
  • Understand entropy and its applications
  • Write a program that solves a simple problem using the ID3 algorithm
  • Understand how to deal with overfitting
2.2 Components of a Decision Tree
A decision tree consists of three main components:
1. Node: specifies an attribute (feature) name.
2. Branch: specifies one value of the attribute in the node above it; a branch is sometimes called an edge or a link.
3. Leaf: specifies the target value or class of the problem.
as shown in the figure below
Example of Decision Tree
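To make these components concrete, here is a minimal sketch of how such a tree could be stored as a MATLAB struct; the field names (isLeaf, attribute, branchValues, children) and the color example are illustrative assumptions, not a structure defined in this chapter.
% Minimal illustrative sketch (assumed field names): an internal node stores
% an attribute name and its branches; a leaf stores only its class label.
leafPos = struct('isLeaf', true, 'label', 'positive');
leafNeg = struct('isLeaf', true, 'label', 'negative');
node = struct('isLeaf', false, ...
              'attribute', 'color', ...                  % node: attribute name
              'branchValues', {{'green', 'blue'}}, ...   % branches: attribute values
              'children', {{leafPos, leafNeg}});         % subtree or leaf for each branch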
2.3 Examples of Decision Trees
Decision trees can be used for many types of problems and data. Categorized by the attribute values used, the data may contain:
  • Continuous values
  • Discrete values
  • Missing values
  • Weighted values
Categorized by the target:
  • Discrete target values are used for classification
  • Continuous target values are used for regression analysis
Decision trees can also tolerate noisy training data. In addition, a tree can act as a compact representation of a large amount of data and can be rewritten as if-then rules that are easy for humans to interpret.
In this section, examples of decision trees for classification tasks are given.
Remark
Classification and regression generally differ only in the type of target, i.e.,
  • Discrete target values: classification
  • Continuous target values: regression
Therefore, most classifiers, such as decision trees, artificial neural networks, and support vector machines, can generally also be used for regression analysis.
2.3.1 Boolean Decision Tree
This type of decision tree supports binary answers, i.e., its leaves hold only two classes, such as plus or minus.
as shown in Figure below
Example of Boolean Decision Tree
Notice that we can write this tree as rules in Disjunctive Normal Form (DNF):
(color = green) ∧ (shape = square) → positive
(color = blue) ∧ (shape = circle) → positive
(color = blue) ∧ (shape = square) → positive
Or it can equivalently be described as a disjunction of conjunctions:
((color = green) ∧ (shape = square)) ∨
((color = blue) ∧ (shape = circle)) ∨
((color = blue) ∧ (shape = square))
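As a sketch of how these rules could be evaluated in code, the following MATLAB function checks the disjunction of conjunctions above; the function name classifyBoolean and the string-valued attributes are assumptions for illustration.
% Sketch: evaluating the DNF rules above (the function name and
% string-valued attributes are illustrative assumptions).
function label = classifyBoolean(color, shape)
    isPositive = (strcmp(color, 'green') && strcmp(shape, 'square')) || ...
                 (strcmp(color, 'blue')  && strcmp(shape, 'circle')) || ...
                 (strcmp(color, 'blue')  && strcmp(shape, 'square'));
    if isPositive
        label = 'positive';
    else
        label = 'negative';
    end
end
For example, classifyBoolean('blue', 'circle') returns 'positive'.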
2.3.2 Multi-class Decision Tree
This type of decision tree supports multi-class answers, i.e., its leaves can take two or more class values, as shown in the figure below
Multi-class Decision Problem
which leads to a decision tree as shown in the figure below
Multi-class Decision Tree
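Since the figures are not reproduced here, the following is only a generic sketch of the idea: the tree is still a set of nested decisions, but its leaves may return any of several class labels. The attribute names and classes below are hypothetical.
% Generic sketch (hypothetical attributes and class labels): the leaves of a
% multi-class tree may return any of several classes, not just two.
function label = classifyMulti(objSize, color)
    if strcmp(objSize, 'small')
        label = 'class1';
    elseif strcmp(color, 'red')
        label = 'class2';
    else
        label = 'class3';
    end
end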
2.3.3 Real-Value Decision Tree
Usually, branching at a decision tree node requires discrete values, so if an attribute takes real-number (continuous) values, it must first be discretized, as shown in the figure below.
Real-Value Decision Problem
We can see that X and Y are attributes whose values lie continuously on the real axis, so we need to divide those values into intervals; how to find appropriate split points will be discussed later.
The decision tree can then be built using the interval boundaries, as shown in the figure.
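As a small illustration of this discretization step, the following MATLAB sketch bins a continuous attribute into intervals; the threshold values here are made up, since in practice they come from the split-selection procedure discussed later.
% Sketch: turning a continuous attribute into interval indices.
% The interval boundaries (edges) are illustrative assumptions.
x = [0.3 1.7 2.9 4.2 5.6];        % continuous attribute values
edges = [-Inf 2 4 Inf];           % assumed interval boundaries
bins = discretize(x, edges);      % interval index of each value -> 1 1 2 3 3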
2.4 Procedure for Creating Inductive Decision Trees
A decision tree can be built by induction from a set of training examples. If the sample is large enough and covers the problem space widely enough, it will yield a good tree. In general, a good tree means a small tree, both in the number of nodes and leaves, and one whose branches have high purity.
Remark
The goal of building a small tree follows Occam's razor, which states that "entities must not be multiplied beyond necessity"; this principle is used to find the best solution in many disciplines.
2.5 Selecting Attributes for Node Formation
From the algorithm above, choosing which attribute to use as a node requires a selection criterion, because selecting different attributes leads to different trees. Consider the data in the table below.
Table 2.1. Binary Data Example
We can create two different trees by choosing either A or B as the first node:
Decision trees created by selecting different first nodes from the data in Table 2.1.
If attribute A is selected as the first node (left figure), the resulting tree is smaller than the one obtained by selecting attribute B as the first node (right figure). Following Occam's razor from the previous section, we prefer the tree with A as the first node.
From this example, we conclude that we should first select the attribute that separates the data most completely. We therefore need a quantity that measures the purity of this separation.
2.6 The Iterative Dichotomiser 3 (ID3)
The Iterative Dichotomiser 3 (ID3) is a decision tree algorithm proposed by Quinlan (J. R. Quinlan, 1986). It uses Information Gain to select the most suitable feature at each node, applying the concept of entropy from information theory to measure the amount of information in the data.
2.6.1 Entropy
Entropy is a measure of the uncertainty, disorder, or impurity of data; in other words, higher entropy implies more information. The entropy H(X) can be calculated using the formula
H(X) = -\sum_{i=1}^{n} p(x_i) \log_b p(x_i)
where p(x_i) is the probability of event x_i and n is the total number of events. The equation quantifies the amount of information in the dataset: the higher the entropy, the more information there is.
Observations:
If we use b = n (the number of events), the maximum entropy does not exceed 1, similar to the maximum value of a probability. In general, however, we use b = 2, so the entropy can exceed 1 and its unit is bits.
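To illustrate this observation (an example added here, not from the chapter), consider more than two equally likely events, e.g. a fair six-sided die:
% With base-2 logs and more than two equally likely outcomes, entropy
% exceeds 1 bit; a fair six-sided die gives log2(6) ≈ 2.585 bits.
p = ones(1, 6) / 6;            % six equally likely outcomes
H = -sum(p .* log2(p));        % ≈ 2.585
fprintf('Entropy of a fair die: %.3f bits\n', H);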
Here is an example calculation of the entropy of a regular (fair) coin toss, whose sample space has two events: head (h) and tail (t). The calculation is
H = -(0.5 \log_2 0.5 + 0.5 \log_2 0.5) = 1 bit
which is the maximum value, because a regular coin has the highest uncertainty: the probability of getting heads equals that of getting tails (0.5). If the coin is biased, however, the probabilities of heads and tails change, and the effect on the entropy can be observed using the following set of commands.
% Generate a range of probabilities for heads
Ph = 0:0.01:1;
% Calculate the corresponding probabilities for tails
Pt = 1 - Ph;
% Calculate the entropy for each probability
E = -Ph .* log2(Ph) - Pt .* log2(Pt);
% Set any NaN values to 0
E(isnan(E)) = 0;
% Plot the entropy as a function of the probability of heads
plot(Ph, E, 'LineWidth', 2);
title('Coin Toss Entropy');
xlabel('Probability of Heads');
ylabel('Entropy');
grid on;
Observations:
When substituting values into the formula, the case log(0) can occur, which is undefined. Considering the definition of entropy, however, a probability of 0 means the event is impossible, so its contribution to the entropy should be 0. In the calculation this case produces NaN, which is handled by using isnan(E) to find those entries and replace them with 0. The result is illustrated in the graph shown in the figure below.
Entropy of a Single Coin Toss Event with Various Coin Biases
In the figure above, the entropy of a single coin toss is shown for different coin biases. As the probability of getting heads approaches either extreme, the entropy approaches zero, signifying that there is very little information at those points because the outcome is already (nearly) certain.
2.6.2 Information Gain (IG)
The entropy discussed above is now used to calculate Information Gain (IG). IG is used to choose the best feature a from the set of all features A over the dataset S. It is defined as
IG(S, a) = H(S) - \sum_{v \in Values(a)} \frac{|S_v|}{|S|} H(S_v)
where Values(a) is the set of possible values of feature a, and S_v is the subset of S in which feature a takes the value v.
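As a rough sketch of this definition in code (not taken from the chapter), the following computes the information gain of one candidate feature on a made-up toy dataset; the labels, feature values, and the helper entropyOf are illustrative assumptions.
% Sketch: information gain of one candidate feature on a toy dataset.
% The labels, feature values, and helper name entropyOf are made up.
labels  = {'+','+','+','-','-','-'};   % class of each example
feature = {'a','a','b','b','b','a'};   % value of the candidate feature
% Base-2 entropy of a cell array of class labels
entropyOf = @(y) -sum(cellfun(@(c) mean(strcmp(y, c)) * log2(mean(strcmp(y, c))), unique(y)));
HS = entropyOf(labels);                % H(S), entropy of the whole set
vals = unique(feature);
HCond = 0;
for k = 1:numel(vals)
    idx = strcmp(feature, vals{k});    % subset S_v where the feature has value v
    HCond = HCond + sum(idx) / numel(labels) * entropyOf(labels(idx));
end
IG = HS - HCond;                       % IG(S,a) = H(S) - sum_v |S_v|/|S| * H(S_v)
fprintf('Information gain = %.3f bits\n', IG);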