19 April 2007

Data Minig :: Generating Decision Trees On Oracle

As wikipedia says;


In data mining and machine
learning
, a decision tree is a predictive model; that is, a mapping
from observations about an item to conclusions about its target value. More
descriptive names for such tree models are classification tree or reduction
tree
. In these tree structures, leaves represent classifications and
branches represent conjunctions of features that lead to those classifications
[1]. The machine learning technique for inducing a decision tree from data is
called decision tree learning, or (colloquially) decision
trees
.



Decision trees is widely used in data mining and/or machine
learning applications in order predict or classify samples. The prediction or
classification is done by the help of decison trees.


Suppose data set below is given:




snow




weather




season




physical
condition





go
skiing





sticky




foggy




low




rested




no




fresh




sunny




low




injured




no




fresh




sunny




low




rested




yes




fresh




sunny




high




rested




yes




fresh




sunny




mid




rested




yes




frosted




windy




high




tired




no




sticky




sunny




low




rested




yes




frosted




foggy




mid




rested




no




fresh




windy




low




rested




yes




fresh




windy




low




rested




yes




fresh




foggy




low




rested




yes




fresh




foggy




low




rested




yes




sticky




sunny




mid




rested




yes




frosted




foggy




low




injured




no







To be able to generate a decision tree, we can use some
applications. The way that i implemented, is an Oracle based approach, PL/SQL
coded application. You can analyze the code at the end of this article. For
this application, table data must be loaded into Oracle:




 
drop table t;
create table t(
  snow               VARCHAR2(8),
  weather               VARCHAR2(8),
  season             VARCHAR2(8),
  physical_condition VARCHAR2(8),
  go_skiing          VARCHAR2(3)
);
insert into HR.T (SNOW, WEATHER, SEASON, PHYSICAL_CONDITION, GO_SKIING)
values ('sticky', 'foggy', 'low', 'rested', 'no');
 
insert into HR.T (SNOW, WEATHER, SEASON, PHYSICAL_CONDITION, GO_SKIING)
values ('fresh', 'sunny', 'low', 'injured', 'no');
 
insert into HR.T (SNOW, WEATHER, SEASON, PHYSICAL_CONDITION, GO_SKIING)
values ('fresh', 'sunny', 'low', 'rested', 'yes');
 
insert into HR.T (SNOW, WEATHER, SEASON, PHYSICAL_CONDITION, GO_SKIING)
values ('fresh', 'sunny', 'high', 'rested', 'yes');
 
insert into HR.T (SNOW, WEATHER, SEASON, PHYSICAL_CONDITION, GO_SKIING)
values ('fresh', 'sunny', 'mid', 'rested', 'yes');
 
insert into HR.T (SNOW, WEATHER, SEASON, PHYSICAL_CONDITION, GO_SKIING)
values ('frosted', 'windy', 'high', 'tired', 'no');
 
insert into HR.T (SNOW, WEATHER, SEASON, PHYSICAL_CONDITION, GO_SKIING)
values ('sticky', 'sunny', 'low', 'rested', 'yes');
 
insert into HR.T (SNOW, WEATHER, SEASON, PHYSICAL_CONDITION, GO_SKIING)
values ('frosted', 'foggy', 'mid', 'rested', 'no');
 
insert into HR.T (SNOW, WEATHER, SEASON, PHYSICAL_CONDITION, GO_SKIING)
values ('fresh', 'windy', 'low', 'rested', 'yes');
 
insert into HR.T (SNOW, WEATHER, SEASON, PHYSICAL_CONDITION, GO_SKIING)
values ('fresh', 'windy', 'low', 'rested', 'yes');
 
insert into HR.T (SNOW, WEATHER, SEASON, PHYSICAL_CONDITION, GO_SKIING)
values ('fresh', 'foggy', 'low', 'rested', 'yes');
 
insert into HR.T (SNOW, WEATHER, SEASON, PHYSICAL_CONDITION, GO_SKIING)
values ('fresh', 'foggy', 'low', 'rested', 'yes');
 
insert into HR.T (SNOW, WEATHER, SEASON, PHYSICAL_CONDITION, GO_SKIING)
values ('sticky', 'sunny', 'mid', 'rested', 'yes');
 
insert into HR.T (SNOW, WEATHER, SEASON, PHYSICAL_CONDITION, GO_SKIING)
values ('frosted', 'foggy', 'low', 'injured', 'no');
 
commit;
 
 
Firstly, gain is calculated for each attributes. As you see, all atributes are categorical. So we can calculate gain with some statistical and matematical calculations. With Oracle's analytical functions, it will be more easy to make calculations.(calculations can be investigated within code )
 
The call spec for this application is shown below:
 
BEGIN
  dm.GenerateDecisionTree('t', 'go_skiing',NULL,0,',' );
END;
 
 
Print-out of this application is:
 
INFO>Gain for Categorical Attribute "SNOW" is 5,8331
INFO>Gain for Categorical Attribute "WEATHER" is 5,5184
INFO>Gain for Categorical Attribute "SEASON" is 5,4819
INFO>Gain for Categorical Attribute "PHYSICAL_CONDITION" is 5,803
INFO>MAX Gain for Categorical Attribute "SNOW" is MAX
SNOW='fresh'
  INFO>Gain for Categorical Attribute "WEATHER" is 1,1493
  INFO>Gain for Categorical Attribute "SEASON" is 1,0674
  INFO>Gain for Categorical Attribute "PHYSICAL_CONDITION" is 1,5549
  INFO>MAX Gain for Categorical Attribute "PHYSICAL_CONDITION" is MAX
    PHYSICAL_CONDITION='injured'.
    PHYSICAL_CONDITION='rested'.
SNOW='frosted'.
SNOW='sticky'
  INFO>Gain for Categorical Attribute "WEATHER" is 1,3082
  INFO>Gain for Categorical Attribute "SEASON" is ,9749
  INFO>Gain for Categorical Attribute "PHYSICAL_CONDITION" is ,3899
  INFO>MAX Gain for Categorical Attribute "WEATHER" is MAX
    WEATHER='foggy'.
    WEATHER='sunny'.

 
Summarization of data set above is:
 
SNOW='fresh'
    PHYSICAL_CONDITION='injured'.
    PHYSICAL_CONDITION='rested'.
SNOW='frosted'.
SNOW='sticky'
    WEATHER='foggy'.
    WEATHER='sunny'.

 
With this decision tree, you can predict If SNOW is sticky then go_skiing would be NO.
 
Suppose tuple below will be added to data set.:
 
fresh
sunny
mid
rested
No
 
 
Let's analyze what will happen
 
SNOW='fresh'
  PHYSICAL_CONDITION='injured'.
  PHYSICAL_CONDITION='rested'
    SEASON='high'.
    SEASON='low'.
    SEASON='mid'
      WEATHER='sunny'
SNOW='frosted'.
SNOW='sticky'
  WEATHER='foggy'.
  WEATHER='sunny'.
 
 
If your data set contains continous(numeric ) attributes, some additional calculations have to be done. Suppose data set below that contains numerical attributes:
 
 
A
Class
15
C1
20
C2
25
C1
30
C1
35
C2
25
C1
15
C2
20
C2
 
If you generate decision tree of data set above
 
 
INFO>Gain for Numerical Attribute "A<=15" is 1,5
INFO>Gain for Numerical Attribute "A<=20" is 2,5661
INFO>Gain for Numerical Attribute "A<=25" is 1,5
INFO>Gain for Numerical Attribute "A<=30" is 1,01
INFO>Gain for Numerical Attribute "A<=35" is 0


A<=20
A>20

 
So, 20 can be considered as a treshold value for sample above.
 
Suppose data set above that contains both categorical and continous samples:
 
A
B
C
Class
15
1
A
C1
20
3
B
C2
25
2
A
C1
30
4
A
C1
35
2
B
C2
25
4
A
C1
15
2
B
C2
20
3
B
C2
 
Generated decision tree will be
 
INFO>Gain for Numerical Attribute "A<=15" is 1,5
INFO>Gain for Numerical Attribute "A<=20" is 2,5661
INFO>Gain for Numerical Attribute "A<=25" is 1,5
INFO>Gain for Numerical Attribute "A<=30" is 1,01
INFO>Gain for Numerical Attribute "A<=35" is 0
INFO>Gain for Numerical Attribute "B<=1" is 1,01
INFO>Gain for Numerical Attribute "B<=2" is 2
INFO>Gain for Numerical Attribute "B<=3" is 2,0375
INFO>Gain for Numerical Attribute "B<=4" is 0
INFO>Gain for Categorical Attribute "C" is 4
INFO>MAX Gain for Categorical Attribute "C" is MAX


C='A'.
C='B'.

 
 
So, C attribute can be a dominat attribute.
 
Let's make a prediction with data set above. Suppose test sample is below:
 
 
A
B
C
D
Prediction
Truth
10
2
A
C2
C1
N
20
1
B
C1
C2
N
30
3
A
C2
C1
N
40
2
B
C2
C2
Y
15
1
B
C1
C2
N
 
As it shown below only %20 of samples classified correctly. If we have more sample, it will be possible to make more accurately prediction.
 
 
 
 
 
 
Application code is below(Updated at 29-Apr-07):
 
 
  PROCEDURE GenerateDecisionTree(pis_TableName        IN VARCHAR2,
                                 pis_ClassFieldName   IN VARCHAR2,
                                 pis_WhereStatement   IN VARCHAR2,
                                 pin_Level            IN NUMBER DEFAULT 1,
                                 pis_PassedAttributes IN VARCHAR2 DEFAULT ',',
                                 pib_IsDebugMode      IN BOOLEAN DEFAULT FALSE)
  /**************************************************************************************
    * Author     : Mennan Tekbir
    * Date       : 06-April-2007
    * Location   :
    * Notes      :
    * -------------------------------------------------------------------------------------
    * Purpose    :
    * Parameters :
    * Return     :
    * Exceptions :
    * -------------------------------------------------------------------------------------
    * History    :
     | Author        | Date                 | Purpose
     |-------        |-----------           |----------------------------------------------
     | MTE           | 06-Apr-2007          | Funciton creation.
     | MTE           | 29-Apr-2007          | Revised.Some comments added.
    **************************************************************************************/
   IS
 
    vs_SqlInfoClass            VARCHAR2(1024);
    vs_SqlInfoAttribute        VARCHAR2(1024);
    vs_SqlClassCount           VARCHAR2(1024);
    vs_SqlClassValue           VARCHAR2(1024);
    vn_InfoClass               NUMBER;
    vn_InfoClassInn            NUMBER;
    vt_ColumnNames             t_StringTab;
    vt_DistinctColumnValues    t_StringTab;
    vt_DistinctColumnValuesInn t_StringTab;
    vn_AttributeCount          NUMBER;
    vn_InfoAttribute           NUMBER;
    vn_InfoAttributeInn        NUMBER;
    vn_Gain                    NUMBER;
    vt_GainList                t_NumberTab;
    vn_MaxGain                 NUMBER;
    vn_MaxGainIndex            NUMBER;
    vn_GainInn                 NUMBER;
    vt_GainListInn             t_NumberTab;
    vn_MaxGainInn              NUMBER;
    vn_MaxGainIndexInn         NUMBER;
    vn_Count                   NUMBER;
    vs_GainedWhere             VARCHAR2(64);
    vs_GainedWhere2            VARCHAR2(64);
    vs_GainedWhere3            VARCHAR2(64);
    vb_IsNumericColumn         BOOLEAN;
    vs_ClassValue              VARCHAR2(4000);
 
  BEGIN
 
    vt_ColumnNames    := GetColumnNames(pis_TableName);
    vn_AttributeCount := vt_ColumnNames.COUNT;
    vt_GainList       := t_NumberTab();
    vt_GainList.EXTEND(vn_AttributeCount);
 
    vs_SqlInfoClass := '
     SELECT SUM(-1 * p * log(2, p)) i
      FROM (SELECT COUNT(*) over(PARTITION BY <> ORDER BY NULL) / COUNT(*) over(ORDER BY NULL) p
              FROM <> WHERE <>)';
 
    vs_SqlInfoClass := REPLACE(vs_SqlInfoClass, '<>', pis_TableName);
    vs_SqlInfoClass := REPLACE(vs_SqlInfoClass, '<>', pis_ClassFieldName);
    vs_SqlInfoClass := REPLACE(vs_SqlInfoClass, '<>', nvl(pis_WhereStatement, ' 1=1 '));
  
    EXECUTE IMMEDIATE vs_SqlInfoClass INTO vn_InfoClass;
 
    vn_MaxGain         := -1;
    vn_MaxGainIndex    := -1;
    vb_IsNumericColumn := FALSE;
 
    -- for all columns(attributes)
    FOR i IN 1 .. vn_AttributeCount LOOP
  
      --If column is not class field(class field is not considered for decision tree)
      IF upper(pis_ClassFieldName) != vt_ColumnNames(i) THEN
      
        --If column is not processed before(if processed ignore it)
        IF instr(pis_PassedAttributes, ',' || vt_ColumnNames(i) || ',') = 0 THEN
        
          --Take out whether column is numerical(nominal) or categoric.(Categoric and numeric columns are calculated differently)
          vb_IsNumericColumn := IsNumneriColumn(pis_TableName, vt_ColumnNames(i));
        
          --If column is a categoric such as consist of VARCHAR data type
          IF vb_IsNumericColumn = FALSE THEN
          
            --Calculate attribute gain(this a generic calculation with "EXECUTE IMMEDIATE" and "REPLACE"s)
            --Use template of SQL statement
            vs_SqlInfoAttribute := '
               SELECT SUM((b / d) * ((a / b) * log(2, (a / b)) * -1)) i
                FROM (SELECT DISTINCT COUNT(*) over(PARTITION BY <>, <> ORDER BY NULL) a
                                     ,COUNT(*) over(PARTITION BY <> ORDER BY NULL) b
                                     ,COUNT(*) over(ORDER BY NULL) d
                       FROM <> WHERE <>)';
        
            vs_SqlInfoAttribute := REPLACE(vs_SqlInfoAttribute, '<>',pis_TableName);
            vs_SqlInfoAttribute := REPLACE(vs_SqlInfoAttribute, '<>',pis_ClassFieldName);
            vs_SqlInfoAttribute := REPLACE(vs_SqlInfoAttribute, '<>',vt_ColumnNames(i));
            vs_SqlInfoAttribute := REPLACE(vs_SqlInfoAttribute, '<>',nvl(pis_WhereStatement, ' 1=1 '));
        
            EXECUTE IMMEDIATE vs_SqlInfoAttribute INTO vn_InfoAttribute;
        
            --Extract attribute gain from class gain that is calculated before
            vn_Gain := vn_InfoClass - vn_InfoAttribute;
          
            --If this debug mode, print out values.
            IF pib_IsDebugMode = TRUE THEN
              dbms_output.put_line(lpad(' ', pin_Level * 1) || 'INFO>Gain for Categorical Attribute "' || vt_ColumnNames(i) || '" is ' || trunc(vn_Gain, 4));
            END IF;
        
          --If column is a numeric such as consist of NUMBER data type
          ELSE
            --Use template below to calculate gain for numeric attribute
            vs_SqlInfoClass := '
             SELECT SUM(-1 * p * log(2, p)) i
              FROM (SELECT COUNT(*) over(PARTITION BY <> ORDER BY NULL) / COUNT(*) over(ORDER BY NULL) p
                      FROM <> WHERE <>)';
        
            vs_SqlInfoClass := REPLACE(vs_SqlInfoClass, '<>', pis_TableName);
            vs_SqlInfoClass := REPLACE(vs_SqlInfoClass, '<>', pis_ClassFieldName);
            vs_SqlInfoClass := REPLACE(vs_SqlInfoClass, '<>', nvl(pis_WhereStatement, ' 1=1 '));
          
            EXECUTE IMMEDIATE vs_SqlInfoClass INTO vn_InfoClassInn;
          
            --Get distinct values of numeric attribute.
            vt_DistinctColumnValuesInn := GetDistinctColumnValues(pis_TableName, vt_ColumnNames(i), pis_WhereStatement);
        
            --Assign initially some values
            vn_MaxGainInn      := -1;
            vn_MaxGainIndexInn := -1;
          
            --Create a in-memory table to hold every gain for distinct values
            vt_GainListInn     := t_NumberTab();
            vt_GainListInn.EXTEND(vt_DistinctColumnValuesInn.COUNT);
        
            -- for all distinct value for attr.
            FOR k IN 1 .. vt_DistinctColumnValuesInn.LAST LOOP
            
              --Use template below to calculate gain for numeric attribute
              --The template consist of a CASE statement that includes a "<=" and ">" operators
              vs_SqlInfoAttribute := '
                SELECT SUM(((a / d) * log(2, (a / b)) * -1)) s
                  FROM (SELECT COUNT(*) over(PARTITION BY p, <> ORDER BY NULL) a
                              ,COUNT(*) over(PARTITION BY p ORDER BY NULL) b
                              ,COUNT(*) over(ORDER BY NULL) d
                          FROM (SELECT <>
                               ,CASE WHEN <> <= <> THEN 1 ELSE 0 END p
                                  FROM <>
                         WHERE <>))';
          
              vs_SqlInfoAttribute := REPLACE(vs_SqlInfoAttribute, '<>', pis_TableName);
              vs_SqlInfoAttribute := REPLACE(vs_SqlInfoAttribute, '<>',pis_ClassFieldName);
              vs_SqlInfoAttribute := REPLACE(vs_SqlInfoAttribute, '<>',vt_ColumnNames(i));
              vs_SqlInfoAttribute := REPLACE(vs_SqlInfoAttribute, '<>', vt_DistinctColumnValuesInn(k));
              vs_SqlInfoAttribute := REPLACE(vs_SqlInfoAttribute, '<>',nvl(pis_WhereStatement, ' 1=1 '));
          
              EXECUTE IMMEDIATE vs_SqlInfoAttribute INTO vn_InfoAttributeInn;
          
              --Extract gain for a value of attribute from class gain that is calculated before
              vn_GainInn := vn_InfoClassInn - vn_InfoAttributeInn;
              --Assign information gain to table.
              vt_GainListInn(k) := vn_GainInn;
              --If debug mode, print out
              IF pib_IsDebugMode = TRUE THEN
                dbms_output.put_line(lpad(' ', pin_Level * 1) || 'INFO>Gain for Numerical Attribute "' ||  vt_ColumnNames(i) || '<=' || vt_DistinctColumnValuesInn(k) || '" is ' || trunc(vn_GainInn, 4));
              END IF;
            
              --If newly calculated gain is higher, then assign it as new Maximum gain
              IF vn_GainInn > vn_MaxGainInn THEN
                vn_MaxGainIndexInn := k;
                vn_MaxGainInn      := vn_GainInn;
                vn_Gain            := vn_GainInn;
                vs_GainedWhere2    := vt_DistinctColumnValuesInn(vn_MaxGainIndexInn);
              END IF;
          
            END LOOP;-- for all distinct value for attr.
          
          
            --Now, the max gain for numeric attribute is stored into "vn_Gain" and
            -- max value index for gain is stored into "vn_MaxGainIndexInn" and
            -- "vs_GainedWhere2" contains max gained value. This will use later
          
          
          END IF;--If column is a numeric such as consist of NUMBER data type
        
          --"vn_Gain" contains gain for attribute either attribute is categoric or numeric
          vt_GainList(i) := vn_Gain;
        
          --If newly calculated gain is higher, then assign it as new Maximum gain
          IF vn_Gain > vn_MaxGain THEN
            vn_MaxGainIndex := i;
            vn_MaxGain      := vn_Gain;
            vs_GainedWhere3 := vs_GainedWhere2;
          END IF;
      
        END IF; --If column is not processed before(if processed ignore it)
    
      END IF;--If column is not class field(class field is not considered for decision tree)
  
    END LOOP;--for all columns(attributes)
  
    --If at least one column is processed "vn_MaxGainIndex" will be differnt than "-1"
    IF vn_MaxGainIndex != -1 THEN
  
      --Is it a numeric or categoric column?Both of two will be processed differently
      vb_IsNumericColumn := IsNumneriColumn(pis_TableName, vt_ColumnNames(vn_MaxGainIndex));
    
      --If it is a categoric attribute(column)
      IF vb_IsNumericColumn = FALSE THEN
     
        IF pib_IsDebugMode = TRUE THEN
          dbms_output.put_line(lpad(' ', pin_Level * 1) || 'INFO>MAX Gain for Categorical Attribute "' || vt_ColumnNames(vn_MaxGainIndex) || '" is MAX');
        END IF;
      
        --Get distinct values of column that has maimum gain value.
        --These values will be used in order to find out which value is most suitable for classification
        vt_DistinctColumnValues := GetDistinctColumnValues(pis_TableName, vt_ColumnNames(vn_MaxGainIndex),  pis_WhereStatement);
    
        -- for all distinct values for column that has maimum gain value.
        FOR i IN 1 .. vt_DistinctColumnValues.LAST LOOP
      
          --Use template below to calculate count(number of instances or rows) for distinct value.
          vs_SqlClassCount := '
             SELECT COUNT(*)
              FROM (SELECT DISTINCT <>
                      FROM <>
                     WHERE <> AND <>)';
                               
          vs_SqlClassCount := REPLACE(vs_SqlClassCount, '<>', pis_TableName);
          vs_SqlClassCount := REPLACE(vs_SqlClassCount, '<>', pis_ClassFieldName);
          vs_SqlClassCount := REPLACE(vs_SqlClassCount, '<>', nvl(pis_WhereStatement, ' 1=1 '));
          vs_GainedWhere   := vt_ColumnNames(vn_MaxGainIndex) || '=''' || vt_DistinctColumnValues(i) || '''';
          vs_SqlClassCount := REPLACE(vs_SqlClassCount,'<>', vs_GainedWhere);       
      
          EXECUTE IMMEDIATE vs_SqlClassCount INTO vn_Count;
        
      
          --If count is 1 then classification has finished over this value of attribute
          IF vn_Count = 1 THEN
        
            --Use template below to find out instance(row) value for classification has done.
            vs_SqlClassValue := '
              SELECT DISTINCT <>
                          FROM <>
                         WHERE <> AND <> ';
            vs_SqlClassValue := REPLACE(vs_SqlClassValue, '<>', pis_TableName);
            vs_SqlClassValue := REPLACE(vs_SqlClassValue, '<>', pis_ClassFieldName);
            vs_SqlClassValue := REPLACE(vs_SqlClassValue, '<>', nvl(pis_WhereStatement, ' 1=1 ')); 
            vs_SqlClassValue := REPLACE(vs_SqlClassValue, '<>', vs_GainedWhere);
          
            EXECUTE IMMEDIATE vs_SqlClassValue INTO vs_ClassValue;   
            
            --Print out information
            dbms_output.put_line(lpad(' ', pin_Level * 2, ' ') ||  vs_GainedWhere || ' =>>> ' || vs_ClassValue);
       
          --If count is more than 1 then classification has not finished over this value of attribute.
          --This value of attribute can not distingusih instances.So make a recursive call in order to
          -- find again a decision tree with new dataset(that is specified with where condition)
          ELSIF vn_Count > 1 THEN
            dbms_output.put_line(lpad(' ', pin_Level * 2, ' ') || vs_GainedWhere);
            GenerateDecisionTree(pis_TableName,
                                 pis_ClassFieldName,
                                 nvl(pis_WhereStatement, '1 = 1') || ' AND ' || vs_GainedWhere,
                                 pin_Level + 1,
                                 pis_PassedAttributes || vt_ColumnNames(vn_MaxGainIndex) || ',',
                                 pib_IsDebugMode);
          END IF;
      
        END LOOP;--for all distinct values for column that has maimum gain value.
      
      --If it is a numeric attribute(column)
      ELSE
      
        --Only 2 iteration is done.There is only one value but it can be either "<=" or ">"
        FOR m IN 1 .. 2 LOOP
        
          --Use template below to calculate count(number of instances or rows) for numeric value.
          vs_SqlClassCount := '
                  SELECT COUNT(*)
                    FROM (SELECT DISTINCT <>
                            FROM <>
                           WHERE <> AND <>)';
                     
          IF m = 1 THEN
            vs_GainedWhere := vt_ColumnNames(vn_MaxGainIndex) || '<=' || vs_GainedWhere3;
          ELSE
            vs_GainedWhere := vt_ColumnNames(vn_MaxGainIndex) || '>' ||  vs_GainedWhere3;
          END IF;
        
          vs_SqlClassCount := REPLACE(vs_SqlClassCount, '<>', vs_GainedWhere);        
          vs_SqlClassCount := REPLACE(vs_SqlClassCount,  '<>', pis_TableName);
          vs_SqlClassCount := REPLACE(vs_SqlClassCount,  '<>', pis_ClassFieldName);
          vs_SqlClassCount := REPLACE(vs_SqlClassCount, '<>', nvl(pis_WhereStatement, ' 1=1 '));        
        
      
          EXECUTE IMMEDIATE vs_SqlClassCount  INTO vn_Count;
      
          --If count is 1 then classification has finished over this value of numeric attribute
          IF vn_Count = 1 THEN
        
            --Use template below to find out instance(row) value for classification has done.
            vs_SqlClassValue := '
              SELECT DISTINCT <>
                          FROM <>
                         WHERE <> AND <> ';
            vs_SqlClassValue := REPLACE(vs_SqlClassValue, '<>', pis_TableName);
            vs_SqlClassValue := REPLACE(vs_SqlClassValue, '<>', pis_ClassFieldName);
            vs_SqlClassValue := REPLACE(vs_SqlClassValue, '<>', pis_WhereStatement);
            vs_SqlClassValue := REPLACE(vs_SqlClassValue, '<>', vs_GainedWhere);
          
            EXECUTE IMMEDIATE vs_SqlClassValue INTO vs_ClassValue;
          
            dbms_output.put_line(lpad(' ', pin_Level * 2, ' ') || vs_GainedWhere || ' =>>> ' || vs_ClassValue);
          
          --If count is more than 1 then classification has not finished over this value of attribute.
          --This value of attribute can not distingusih instances.So make a recursive call in order to
          -- find again a decision tree with new dataset(that is specified with where condition)
          ELSIF vn_Count > 1 THEN
            dbms_output.put_line(lpad(' ', pin_Level * 2, ' ') || vs_GainedWhere);
            GenerateDecisionTree(pis_TableName,
                                 pis_ClassFieldName,
                                  nvl(pis_WhereStatement, '1 = 1') || ' AND ' || vs_GainedWhere,
                                 pin_Level + 1,
                                 pis_PassedAttributes || vt_ColumnNames(vn_MaxGainIndex) || ',',
                                 pib_IsDebugMode);
          END IF;
      
        END LOOP;--Only 2 iteration is done.There is only one value but it can be either "<=" or ">"
      END IF;
    END IF;--If at least one column is processed "vn_MaxGainIndex" will be differnt than "-1"
 
  END GenerateDecisionTree;

No comments: