In data mining and machine
learning, a decision tree is a predictive model; that is, a mapping
from observations about an item to conclusions about its target value. More
descriptive names for such tree models are classification tree or regression
tree. In these tree structures, leaves represent classifications and
branches represent conjunctions of features that lead to those classifications
[1]. The machine learning technique for inducing a decision tree from data is
called decision tree learning, or (colloquially) decision
trees.
Decision trees are widely used in data mining and machine learning applications to predict or classify samples.
Suppose the data set below is given:
snow | weather | season | physical condition | go skiing |
sticky | foggy | low | rested | no |
fresh | sunny | low | injured | no |
fresh | sunny | low | rested | yes |
fresh | sunny | high | rested | yes |
fresh | sunny | mid | rested | yes |
frosted | windy | high | tired | no |
sticky | sunny | low | rested | yes |
frosted | foggy | mid | rested | no |
fresh | windy | low | rested | yes |
fresh | windy | low | rested | yes |
fresh | foggy | low | rested | yes |
fresh | foggy | low | rested | yes |
sticky | sunny | mid | rested | yes |
frosted | foggy | low | injured | no |
Several applications can generate a decision tree. The implementation presented here is an Oracle-based approach written in PL/SQL; you can examine the code at the end of this article. For this application, the table data must first be loaded into Oracle:
drop table t;
create table t(
snow VARCHAR2(8),
weather VARCHAR2(8),
season VARCHAR2(8),
physical_condition VARCHAR2(8),
go_skiing VARCHAR2(3)
);
insert into t (SNOW, WEATHER, SEASON, PHYSICAL_CONDITION, GO_SKIING)
values ('sticky', 'foggy', 'low', 'rested', 'no');
insert into t (SNOW, WEATHER, SEASON, PHYSICAL_CONDITION, GO_SKIING)
values ('fresh', 'sunny', 'low', 'injured', 'no');
insert into t (SNOW, WEATHER, SEASON, PHYSICAL_CONDITION, GO_SKIING)
values ('fresh', 'sunny', 'low', 'rested', 'yes');
insert into t (SNOW, WEATHER, SEASON, PHYSICAL_CONDITION, GO_SKIING)
values ('fresh', 'sunny', 'high', 'rested', 'yes');
insert into t (SNOW, WEATHER, SEASON, PHYSICAL_CONDITION, GO_SKIING)
values ('fresh', 'sunny', 'mid', 'rested', 'yes');
insert into t (SNOW, WEATHER, SEASON, PHYSICAL_CONDITION, GO_SKIING)
values ('frosted', 'windy', 'high', 'tired', 'no');
insert into t (SNOW, WEATHER, SEASON, PHYSICAL_CONDITION, GO_SKIING)
values ('sticky', 'sunny', 'low', 'rested', 'yes');
insert into t (SNOW, WEATHER, SEASON, PHYSICAL_CONDITION, GO_SKIING)
values ('frosted', 'foggy', 'mid', 'rested', 'no');
insert into t (SNOW, WEATHER, SEASON, PHYSICAL_CONDITION, GO_SKIING)
values ('fresh', 'windy', 'low', 'rested', 'yes');
insert into t (SNOW, WEATHER, SEASON, PHYSICAL_CONDITION, GO_SKIING)
values ('fresh', 'windy', 'low', 'rested', 'yes');
insert into t (SNOW, WEATHER, SEASON, PHYSICAL_CONDITION, GO_SKIING)
values ('fresh', 'foggy', 'low', 'rested', 'yes');
insert into t (SNOW, WEATHER, SEASON, PHYSICAL_CONDITION, GO_SKIING)
values ('fresh', 'foggy', 'low', 'rested', 'yes');
insert into t (SNOW, WEATHER, SEASON, PHYSICAL_CONDITION, GO_SKIING)
values ('sticky', 'sunny', 'mid', 'rested', 'yes');
insert into t (SNOW, WEATHER, SEASON, PHYSICAL_CONDITION, GO_SKIING)
values ('frosted', 'foggy', 'low', 'injured', 'no');
commit;
commit;
First, the gain is calculated for each attribute. As you can see, all attributes here are categorical, so the gain can be computed with some statistical and mathematical calculations; Oracle's analytic functions make these calculations much easier (the details can be examined in the code).
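The PL/SQL below computes this information measure with analytic SQL; as a cross-check of the underlying idea, here is a minimal Python sketch of the standard ID3 information gain on the ski data set. Note that the gain values printed by the article's application use a different scaling, so the absolute numbers differ, but the relative comparison across attributes is what drives the split.

```python
from collections import Counter
from math import log2

# The 14-row ski data set from the article:
# (snow, weather, season, physical_condition, go_skiing)
ROWS = [
    ("sticky", "foggy", "low", "rested", "no"),
    ("fresh", "sunny", "low", "injured", "no"),
    ("fresh", "sunny", "low", "rested", "yes"),
    ("fresh", "sunny", "high", "rested", "yes"),
    ("fresh", "sunny", "mid", "rested", "yes"),
    ("frosted", "windy", "high", "tired", "no"),
    ("sticky", "sunny", "low", "rested", "yes"),
    ("frosted", "foggy", "mid", "rested", "no"),
    ("fresh", "windy", "low", "rested", "yes"),
    ("fresh", "windy", "low", "rested", "yes"),
    ("fresh", "foggy", "low", "rested", "yes"),
    ("fresh", "foggy", "low", "rested", "yes"),
    ("sticky", "sunny", "mid", "rested", "yes"),
    ("frosted", "foggy", "low", "injured", "no"),
]
ATTRS = ["snow", "weather", "season", "physical_condition"]

def entropy(labels):
    """info(class): -sum p * log2(p) over the class frequencies."""
    n = len(labels)
    return -sum(c / n * log2(c / n) for c in Counter(labels).values())

def gain(rows, col):
    """Information gain of the attribute in column index `col`."""
    labels = [r[-1] for r in rows]
    split_info = 0.0
    for v in {r[col] for r in rows}:            # partition by attribute value
        subset = [r[-1] for r in rows if r[col] == v]
        split_info += len(subset) / len(rows) * entropy(subset)
    return entropy(labels) - split_info         # info(class) - info(attribute)

gains = {a: gain(ROWS, i) for i, a in enumerate(ATTRS)}
best = max(gains, key=gains.get)
print(best, round(gains[best], 4))   # snow has the highest gain, ~0.4329
```

Under the standard gain, SNOW also wins the first split, matching the tree the application builds.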
The application is invoked as follows:
BEGIN
  dm.GenerateDecisionTree('t', 'go_skiing', NULL, 0, ',');
END;
The output of this run is:
INFO>Gain for Categorical Attribute "SNOW" is 5,8331
INFO>Gain for Categorical Attribute "WEATHER" is 5,5184
INFO>Gain for Categorical Attribute "SEASON" is 5,4819
INFO>Gain for Categorical Attribute "PHYSICAL_CONDITION" is 5,803
INFO>MAX Gain for Categorical Attribute "SNOW" is MAX
SNOW='fresh'
INFO>Gain for Categorical Attribute "WEATHER" is 1,1493
INFO>Gain for Categorical Attribute "SEASON" is 1,0674
INFO>Gain for Categorical Attribute "PHYSICAL_CONDITION" is 1,5549
INFO>MAX Gain for Categorical Attribute "PHYSICAL_CONDITION" is MAX
PHYSICAL_CONDITION='injured'.
PHYSICAL_CONDITION='rested'.
SNOW='frosted'.
SNOW='sticky'
INFO>Gain for Categorical Attribute "WEATHER" is 1,3082
INFO>Gain for Categorical Attribute "SEASON" is ,9749
INFO>Gain for Categorical Attribute "PHYSICAL_CONDITION" is ,3899
INFO>MAX Gain for Categorical Attribute "WEATHER" is MAX
WEATHER='foggy'.
WEATHER='sunny'.
The summarized decision tree for the data set above is:
SNOW='fresh'
PHYSICAL_CONDITION='injured'.
PHYSICAL_CONDITION='rested'.
SNOW='frosted'.
SNOW='sticky'
WEATHER='foggy'.
WEATHER='sunny'.
With this decision tree you can make predictions; for example, if SNOW is 'frosted', then GO_SKIING is predicted to be 'no' (every 'frosted' row in the data set has class 'no', while 'sticky' splits further on WEATHER).
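To make such predictions programmatically, the summarized tree can be encoded as a small structure and walked from the root. The nested-tuple encoding below is my own, and the leaf labels are filled in from the data set rows (the printed program output above truncates them), so treat this as an illustrative sketch rather than part of the PL/SQL application.

```python
# The tree from the article, hand-encoded: an internal node is a
# (attribute, {value: subtree}) pair; a leaf is a plain class label.
TREE = ("snow", {
    "fresh":   ("physical_condition", {"injured": "no", "rested": "yes"}),
    "frosted": "no",
    "sticky":  ("weather", {"foggy": "no", "sunny": "yes"}),
})

def classify(tree, sample):
    """Walk the tree, following the sample's attribute values, until a leaf."""
    while not isinstance(tree, str):
        attr, branches = tree
        tree = branches[sample[attr]]
    return tree

print(classify(TREE, {"snow": "sticky", "weather": "foggy"}))   # no
```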
Suppose the tuple below is added to the data set:
fresh | sunny | mid | rested | no |
Let's analyze what happens; the regenerated tree is:
SNOW='fresh'
PHYSICAL_CONDITION='injured'.
PHYSICAL_CONDITION='rested'
SEASON='high'.
SEASON='low'.
SEASON='mid'
WEATHER='sunny'
SNOW='frosted'.
SNOW='sticky'
WEATHER='foggy'.
WEATHER='sunny'.
If your data set contains continuous (numeric) attributes, some additional calculations have to be done. Suppose the data set below, which contains a numeric attribute:
A | Class |
15 | C1 |
20 | C2 |
25 | C1 |
30 | C1 |
35 | C2 |
25 | C1 |
15 | C2 |
20 | C2 |
Generating a decision tree for the data set above produces:
INFO>Gain for Numerical Attribute "A<=15" is 1,5
INFO>Gain for Numerical Attribute "A<=20" is 2,5661
INFO>Gain for Numerical Attribute "A<=25" is 1,5
INFO>Gain for Numerical Attribute "A<=30" is 1,01
INFO>Gain for Numerical Attribute "A<=35" is 0
A<=20
A>20
So, 20 can be considered the threshold value for the sample above.
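The threshold search can be sketched in a few lines: try every distinct value of A as a candidate cut point, compute the gain of the binary split A<=t versus A>t, and keep the best. The article's printed gains use a different scaling, but under the standard gain the chosen threshold also comes out at 20.

```python
from collections import Counter
from math import log2

# Numeric data set from the article: (A, Class)
DATA = [(15, "C1"), (20, "C2"), (25, "C1"), (30, "C1"),
        (35, "C2"), (25, "C1"), (15, "C2"), (20, "C2")]

def entropy(labels):
    n = len(labels)
    return -sum(c / n * log2(c / n) for c in Counter(labels).values())

def split_gain(data, t):
    """Gain of the binary split A <= t vs A > t."""
    labels = [c for _, c in data]
    left = [c for a, c in data if a <= t]
    right = [c for a, c in data if a > t]
    cond = (len(left) / len(data)) * entropy(left) \
         + (len(right) / len(data)) * entropy(right)
    return entropy(labels) - cond

# Try every distinct value as a candidate threshold, keep the best.
best_t = max({a for a, _ in DATA}, key=lambda t: split_gain(DATA, t))
print(best_t)   # 20
```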
Now suppose a data set that contains both categorical and continuous attributes:
A | B | C | Class |
15 | 1 | A | C1 |
20 | 3 | B | C2 |
25 | 2 | A | C1 |
30 | 4 | A | C1 |
35 | 2 | B | C2 |
25 | 4 | A | C1 |
15 | 2 | B | C2 |
20 | 3 | B | C2 |
The generated decision tree is:
INFO>Gain for Numerical Attribute "A<=15" is 1,5
INFO>Gain for Numerical Attribute "A<=20" is 2,5661
INFO>Gain for Numerical Attribute "A<=25" is 1,5
INFO>Gain for Numerical Attribute "A<=30" is 1,01
INFO>Gain for Numerical Attribute "A<=35" is 0
INFO>Gain for Numerical Attribute "B<=1" is 1,01
INFO>Gain for Numerical Attribute "B<=2" is 2
INFO>Gain for Numerical Attribute "B<=3" is 2,0375
INFO>Gain for Numerical Attribute "B<=4" is 0
INFO>Gain for Categorical Attribute "C" is 4
INFO>MAX Gain for Categorical Attribute "C" is MAX
C='A'.
C='B'.
So, attribute C can be considered the dominant attribute.
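Under the standard information gain it is easy to see why C dominates: every row with C='A' belongs to class C1 and every row with C='B' to C2, so conditioning on C removes all uncertainty and its gain equals the full class entropy. A quick sketch:

```python
from collections import Counter
from math import log2

# Mixed data set from the article: (A, B, C, Class)
DATA = [(15, 1, "A", "C1"), (20, 3, "B", "C2"), (25, 2, "A", "C1"),
        (30, 4, "A", "C1"), (35, 2, "B", "C2"), (25, 4, "A", "C1"),
        (15, 2, "B", "C2"), (20, 3, "B", "C2")]

def entropy(labels):
    n = len(labels)
    return -sum(c / n * log2(c / n) for c in Counter(labels).values())

labels = [r[-1] for r in DATA]
# Conditional entropy of the class given C (column index 2).
cond = sum(
    len(sub) / len(DATA) * entropy(sub)
    for v in {r[2] for r in DATA}
    for sub in [[r[-1] for r in DATA if r[2] == v]]
)
gain_c = entropy(labels) - cond
print(gain_c)   # 1.0: C separates the classes perfectly
```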
Let's make a prediction with the tree above. Suppose the test samples below:
A | B | C | Class | Prediction | Correct |
10 | 2 | A | C2 | C1 | N |
20 | 1 | B | C1 | C2 | N |
30 | 3 | A | C2 | C1 | N |
40 | 2 | B | C2 | C2 | Y |
15 | 1 | B | C1 | C2 | N |
As shown above, only 20% of the samples were classified correctly. With more training samples, it would be possible to make more accurate predictions.
The application code is below (updated 29-Apr-2007):
PROCEDURE GenerateDecisionTree(pis_TableName IN VARCHAR2,
pis_ClassFieldName IN VARCHAR2,
pis_WhereStatement IN VARCHAR2,
pin_Level IN NUMBER DEFAULT 1,
pis_PassedAttributes IN VARCHAR2 DEFAULT ',',
pib_IsDebugMode IN BOOLEAN DEFAULT FALSE)
/**************************************************************************************
* Author : Mennan Tekbir
* Date : 06-April-2007
* Location :
* Notes :
* -------------------------------------------------------------------------------------
* Purpose :
* Parameters :
* Return :
* Exceptions :
* -------------------------------------------------------------------------------------
* History :
| Author | Date | Purpose
|------- |----------- |----------------------------------------------
| MTE | 06-Apr-2007 | Function creation.
| MTE | 29-Apr-2007 | Revised; some comments added.
**************************************************************************************/
IS
vs_SqlInfoClass VARCHAR2(1024);
vs_SqlInfoAttribute VARCHAR2(1024);
vs_SqlClassCount VARCHAR2(1024);
vs_SqlClassValue VARCHAR2(1024);
vn_InfoClass NUMBER;
vn_InfoClassInn NUMBER;
vt_ColumnNames t_StringTab;
vt_DistinctColumnValues t_StringTab;
vt_DistinctColumnValuesInn t_StringTab;
vn_AttributeCount NUMBER;
vn_InfoAttribute NUMBER;
vn_InfoAttributeInn NUMBER;
vn_Gain NUMBER;
vt_GainList t_NumberTab;
vn_MaxGain NUMBER;
vn_MaxGainIndex NUMBER;
vn_GainInn NUMBER;
vt_GainListInn t_NumberTab;
vn_MaxGainInn NUMBER;
vn_MaxGainIndexInn NUMBER;
vn_Count NUMBER;
vs_GainedWhere VARCHAR2(64);
vs_GainedWhere2 VARCHAR2(64);
vs_GainedWhere3 VARCHAR2(64);
vb_IsNumericColumn BOOLEAN;
vs_ClassValue VARCHAR2(4000);
BEGIN
vt_ColumnNames := GetColumnNames(pis_TableName);
vn_AttributeCount := vt_ColumnNames.COUNT;
vt_GainList := t_NumberTab();
vt_GainList.EXTEND(vn_AttributeCount);
--Use the SQL template below to calculate the information (entropy) of the class attribute;
--<CLASS>, <TABLE> and <WHERE> are placeholders filled in by the REPLACE calls
vs_SqlInfoClass := '
SELECT SUM(-1 * p * log(2, p)) i
  FROM (SELECT DISTINCT COUNT(*) over(PARTITION BY <CLASS>) / COUNT(*) over() p
          FROM <TABLE>
         WHERE <WHERE>)';
vs_SqlInfoClass := REPLACE(vs_SqlInfoClass, '<CLASS>', pis_ClassFieldName);
vs_SqlInfoClass := REPLACE(vs_SqlInfoClass, '<TABLE>', pis_TableName);
vs_SqlInfoClass := REPLACE(vs_SqlInfoClass, '<WHERE>', nvl(pis_WhereStatement, '1 = 1'));
EXECUTE IMMEDIATE vs_SqlInfoClass INTO vn_InfoClass;
vn_MaxGain := -1;
vn_MaxGainIndex := -1;
vb_IsNumericColumn := FALSE;
-- for all columns(attributes)
FOR i IN 1 .. vn_AttributeCount LOOP
--If column is not class field(class field is not considered for decision tree)
IF upper(pis_ClassFieldName) != vt_ColumnNames(i) THEN
--If column is not processed before(if processed ignore it)
IF instr(pis_PassedAttributes, ',' || vt_ColumnNames(i) || ',') = 0 THEN
--Determine whether the column is numeric or categorical (the two are handled differently)
vb_IsNumericColumn := IsNumneriColumn(pis_TableName, vt_ColumnNames(i));
--If the column is categorical (e.g. a VARCHAR2 data type)
IF vb_IsNumericColumn = FALSE THEN
--Calculate the attribute gain (a generic calculation built with "EXECUTE IMMEDIATE" and "REPLACE"s)
--Use the SQL statement template below
vs_SqlInfoAttribute := '
SELECT SUM((b / d) * ((a / b) * log(2, (a / b)) * -1)) i
  FROM (SELECT DISTINCT COUNT(*) over(PARTITION BY <ATTRIBUTE>, <CLASS>) a
              ,COUNT(*) over(PARTITION BY <ATTRIBUTE>) b
              ,COUNT(*) over(ORDER BY NULL) d
          FROM <TABLE>
         WHERE <WHERE>)';
vs_SqlInfoAttribute := REPLACE(vs_SqlInfoAttribute, '<ATTRIBUTE>', vt_ColumnNames(i));
vs_SqlInfoAttribute := REPLACE(vs_SqlInfoAttribute, '<CLASS>', pis_ClassFieldName);
vs_SqlInfoAttribute := REPLACE(vs_SqlInfoAttribute, '<TABLE>', pis_TableName);
vs_SqlInfoAttribute := REPLACE(vs_SqlInfoAttribute, '<WHERE>', nvl(pis_WhereStatement, '1 = 1'));
EXECUTE IMMEDIATE vs_SqlInfoAttribute INTO vn_InfoAttribute;
--Subtract the attribute information from the class information calculated before to get the gain
vn_Gain := vn_InfoClass - vn_InfoAttribute;
--If in debug mode, print out the values
IF pib_IsDebugMode = TRUE THEN
dbms_output.put_line(lpad(' ', pin_Level * 1) || 'INFO>Gain for Categorical Attribute "' || vt_ColumnNames(i) || '" is ' || trunc(vn_Gain, 4));
END IF;
--If the column is numeric (e.g. a NUMBER data type)
ELSE
--Use the template below to calculate the class information over the current subset
vs_SqlInfoClass := '
SELECT SUM(-1 * p * log(2, p)) i
  FROM (SELECT DISTINCT COUNT(*) over(PARTITION BY <CLASS>) / COUNT(*) over() p
          FROM <TABLE>
         WHERE <WHERE>)';
vs_SqlInfoClass := REPLACE(vs_SqlInfoClass, '<CLASS>', pis_ClassFieldName);
vs_SqlInfoClass := REPLACE(vs_SqlInfoClass, '<TABLE>', pis_TableName);
vs_SqlInfoClass := REPLACE(vs_SqlInfoClass, '<WHERE>', nvl(pis_WhereStatement, '1 = 1'));
EXECUTE IMMEDIATE vs_SqlInfoClass INTO vn_InfoClassInn;
--Get distinct values of numeric attribute.
vt_DistinctColumnValuesInn := GetDistinctColumnValues(pis_TableName, vt_ColumnNames(i), pis_WhereStatement);
--Assign initially some values
vn_MaxGainInn := -1;
vn_MaxGainIndexInn := -1;
--Create an in-memory table to hold the gain for every distinct value
vt_GainListInn := t_NumberTab();
vt_GainListInn.EXTEND(vt_DistinctColumnValuesInn.COUNT);
-- for all distinct value for attr.
FOR k IN 1 .. vt_DistinctColumnValuesInn.LAST LOOP
--Use the template below to calculate the information for a numeric attribute.
--The inner query derives a binary split flag p from a "<=" test on the candidate value.
vs_SqlInfoAttribute := '
SELECT SUM(((a / d) * log(2, (a / b)) * -1)) s
  FROM (SELECT DISTINCT COUNT(*) over(PARTITION BY p, <CLASS>) a
              ,COUNT(*) over(PARTITION BY p ORDER BY NULL) b
              ,COUNT(*) over(ORDER BY NULL) d
          FROM (SELECT <CLASS>
                      ,CASE WHEN <ATTRIBUTE> <= <VALUE> THEN 1 ELSE 0 END p
                  FROM <TABLE>
                 WHERE <WHERE>))';
vs_SqlInfoAttribute := REPLACE(vs_SqlInfoAttribute, '<CLASS>', pis_ClassFieldName);
vs_SqlInfoAttribute := REPLACE(vs_SqlInfoAttribute, '<ATTRIBUTE>', vt_ColumnNames(i));
vs_SqlInfoAttribute := REPLACE(vs_SqlInfoAttribute, '<VALUE>', vt_DistinctColumnValuesInn(k));
vs_SqlInfoAttribute := REPLACE(vs_SqlInfoAttribute, '<TABLE>', pis_TableName);
vs_SqlInfoAttribute := REPLACE(vs_SqlInfoAttribute, '<WHERE>', nvl(pis_WhereStatement, '1 = 1'));
EXECUTE IMMEDIATE vs_SqlInfoAttribute INTO vn_InfoAttributeInn;
--Subtract from the class information calculated before to get the gain for this split value
vn_GainInn := vn_InfoClassInn - vn_InfoAttributeInn;
--Assign information gain to table.
vt_GainListInn(k) := vn_GainInn;
--If debug mode, print out
IF pib_IsDebugMode = TRUE THEN
dbms_output.put_line(lpad(' ', pin_Level * 1) || 'INFO>Gain for Numerical Attribute "' || vt_ColumnNames(i) || '<=' || vt_DistinctColumnValuesInn(k) || '" is ' || trunc(vn_GainInn, 4));
END IF;
--If newly calculated gain is higher, then assign it as new Maximum gain
IF vn_GainInn > vn_MaxGainInn THEN
vn_MaxGainIndexInn := k;
vn_MaxGainInn := vn_GainInn;
vn_Gain := vn_GainInn;
vs_GainedWhere2 := vt_DistinctColumnValuesInn(vn_MaxGainIndexInn);
END IF;
END LOOP;-- for all distinct value for attr.
--Now, the max gain for numeric attribute is stored into "vn_Gain" and
-- max value index for gain is stored into "vn_MaxGainIndexInn" and
-- "vs_GainedWhere2" contains max gained value. This will use later
END IF;--If column is a numeric such as consist of NUMBER data type
--"vn_Gain" contains gain for attribute either attribute is categoric or numeric
vt_GainList(i) := vn_Gain;
--If newly calculated gain is higher, then assign it as new Maximum gain
IF vn_Gain > vn_MaxGain THEN
vn_MaxGainIndex := i;
vn_MaxGain := vn_Gain;
vs_GainedWhere3 := vs_GainedWhere2;
END IF;
END IF; --If column is not processed before(if processed ignore it)
END IF;--If column is not class field(class field is not considered for decision tree)
END LOOP;--for all columns(attributes)
--If at least one column was processed, "vn_MaxGainIndex" will be different from "-1"
IF vn_MaxGainIndex != -1 THEN
--Is it a numeric or a categorical column? The two are processed differently
vb_IsNumericColumn := IsNumneriColumn(pis_TableName, vt_ColumnNames(vn_MaxGainIndex));
--If it is a categoric attribute(column)
IF vb_IsNumericColumn = FALSE THEN
IF pib_IsDebugMode = TRUE THEN
dbms_output.put_line(lpad(' ', pin_Level * 1) || 'INFO>MAX Gain for Categorical Attribute "' || vt_ColumnNames(vn_MaxGainIndex) || '" is MAX');
END IF;
--Get the distinct values of the column that has the maximum gain.
--These values will be used to find out which value is most suitable for classification
vt_DistinctColumnValues := GetDistinctColumnValues(pis_TableName, vt_ColumnNames(vn_MaxGainIndex), pis_WhereStatement);
-- for all distinct values of the column with the maximum gain
FOR i IN 1 .. vt_DistinctColumnValues.LAST LOOP
--Use the template below to count the distinct class values left for this attribute value
--(a count of 1 means the branch is pure)
vs_SqlClassCount := '
SELECT COUNT(*)
  FROM (SELECT DISTINCT <CLASS>
          FROM <TABLE>
         WHERE <WHERE> AND <GAINED>)';
vs_SqlClassCount := REPLACE(vs_SqlClassCount, '<CLASS>', pis_ClassFieldName);
vs_SqlClassCount := REPLACE(vs_SqlClassCount, '<TABLE>', pis_TableName);
vs_SqlClassCount := REPLACE(vs_SqlClassCount, '<WHERE>', nvl(pis_WhereStatement, '1 = 1'));
vs_GainedWhere := vt_ColumnNames(vn_MaxGainIndex) || '=''' || vt_DistinctColumnValues(i) || '''';
vs_SqlClassCount := REPLACE(vs_SqlClassCount, '<GAINED>', vs_GainedWhere);
EXECUTE IMMEDIATE vs_SqlClassCount INTO vn_Count;
--If the count is 1, classification is finished for this value of the attribute
IF vn_Count = 1 THEN
--Use the template below to find the class value of the finished (pure) branch
vs_SqlClassValue := '
SELECT DISTINCT <CLASS>
  FROM <TABLE>
 WHERE <WHERE> AND <GAINED>';
vs_SqlClassValue := REPLACE(vs_SqlClassValue, '<CLASS>', pis_ClassFieldName);
vs_SqlClassValue := REPLACE(vs_SqlClassValue, '<TABLE>', pis_TableName);
vs_SqlClassValue := REPLACE(vs_SqlClassValue, '<WHERE>', nvl(pis_WhereStatement, '1 = 1'));
vs_SqlClassValue := REPLACE(vs_SqlClassValue, '<GAINED>', vs_GainedWhere);
EXECUTE IMMEDIATE vs_SqlClassValue INTO vs_ClassValue;
--Print out information
dbms_output.put_line(lpad(' ', pin_Level * 2, ' ') || vs_GainedWhere || ' =>>> ' || vs_ClassValue);
--If the count is more than 1, classification is not finished for this value of the attribute:
--this value cannot distinguish the instances, so make a recursive call to build
--a subtree over the new, reduced data set (specified by the extended WHERE condition)
ELSIF vn_Count > 1 THEN
dbms_output.put_line(lpad(' ', pin_Level * 2, ' ') || vs_GainedWhere);
GenerateDecisionTree(pis_TableName,
pis_ClassFieldName,
nvl(pis_WhereStatement, '1 = 1') || ' AND ' || vs_GainedWhere,
pin_Level + 1,
pis_PassedAttributes || vt_ColumnNames(vn_MaxGainIndex) || ',',
pib_IsDebugMode);
END IF;
END LOOP;--for all distinct values of the column with the maximum gain
--If it is a numeric attribute(column)
ELSE
--Only 2 iterations are done: there is a single threshold value, tested as "<=" and then ">"
FOR m IN 1 .. 2 LOOP
--Use the template below to count the distinct class values for the numeric split
vs_SqlClassCount := '
SELECT COUNT(*)
  FROM (SELECT DISTINCT <CLASS>
          FROM <TABLE>
         WHERE <WHERE> AND <GAINED>)';
IF m = 1 THEN
vs_GainedWhere := vt_ColumnNames(vn_MaxGainIndex) || '<=' || vs_GainedWhere3;
ELSE
vs_GainedWhere := vt_ColumnNames(vn_MaxGainIndex) || '>' || vs_GainedWhere3;
END IF;
vs_SqlClassCount := REPLACE(vs_SqlClassCount, '<CLASS>', pis_ClassFieldName);
vs_SqlClassCount := REPLACE(vs_SqlClassCount, '<TABLE>', pis_TableName);
vs_SqlClassCount := REPLACE(vs_SqlClassCount, '<WHERE>', nvl(pis_WhereStatement, '1 = 1'));
vs_SqlClassCount := REPLACE(vs_SqlClassCount, '<GAINED>', vs_GainedWhere);
EXECUTE IMMEDIATE vs_SqlClassCount INTO vn_Count;
--If the count is 1, classification is finished for this side of the numeric split
IF vn_Count = 1 THEN
--Use the template below to find the class value of the finished (pure) branch
vs_SqlClassValue := '
SELECT DISTINCT <CLASS>
  FROM <TABLE>
 WHERE <WHERE> AND <GAINED>';
vs_SqlClassValue := REPLACE(vs_SqlClassValue, '<CLASS>', pis_ClassFieldName);
vs_SqlClassValue := REPLACE(vs_SqlClassValue, '<TABLE>', pis_TableName);
vs_SqlClassValue := REPLACE(vs_SqlClassValue, '<WHERE>', nvl(pis_WhereStatement, '1 = 1'));
vs_SqlClassValue := REPLACE(vs_SqlClassValue, '<GAINED>', vs_GainedWhere);
EXECUTE IMMEDIATE vs_SqlClassValue INTO vs_ClassValue;
dbms_output.put_line(lpad(' ', pin_Level * 2, ' ') || vs_GainedWhere || ' =>>> ' || vs_ClassValue);
--If the count is more than 1, classification is not finished for this side of the split:
--this value cannot distinguish the instances, so make a recursive call to build
--a subtree over the new, reduced data set (specified by the extended WHERE condition)
ELSIF vn_Count > 1 THEN
dbms_output.put_line(lpad(' ', pin_Level * 2, ' ') || vs_GainedWhere);
GenerateDecisionTree(pis_TableName,
pis_ClassFieldName,
nvl(pis_WhereStatement, '1 = 1') || ' AND ' || vs_GainedWhere,
pin_Level + 1,
pis_PassedAttributes || vt_ColumnNames(vn_MaxGainIndex) || ',',
pib_IsDebugMode);
END IF;
END LOOP;--only 2 iterations: "<=" and ">"
END IF;
END IF;--if at least one column was processed, "vn_MaxGainIndex" is different from "-1"
END GenerateDecisionTree;