Oracle Experience: April 2007

19 April 2007

Data Minig :: Generating Decision Trees On Oracle

As wikipedia says;

In data mining and machine
learning, a decision tree is a predictive model; that is, a mapping
from observations about an item to conclusions about its target value. More
descriptive names for such tree models are classification tree or reduction
tree. In these tree structures, leaves represent classifications and
branches represent conjunctions of features that lead to those classifications
[1]. The machine learning technique for inducing a decision tree from data is
called decision tree learning, or (colloquially) decision
trees.

Decision trees is widely used in data mining and/or machine
learning applications in order predict or classify samples. The prediction or
classification is done by the help of decison trees.

Suppose data set below is given:

snow	weather	season	physical condition	go skiing
sticky	foggy	low	rested	no
fresh	sunny	low	injured	no
fresh	sunny	low	rested	yes
fresh	sunny	high	rested	yes
fresh	sunny	mid	rested	yes
frosted	windy	high	tired	no
sticky	sunny	low	rested	yes
frosted	foggy	mid	rested	no
fresh	windy	low	rested	yes
fresh	windy	low	rested	yes
fresh	foggy	low	rested	yes
fresh	foggy	low	rested	yes
sticky	sunny	mid	rested	yes
frosted	foggy	low	injured	no

To be able to generate a decision tree, we can use some
applications. The way that i implemented, is an Oracle based approach, PL/SQL
coded application. You can analyze the code at the end of this article. For
this application, table data must be loaded into Oracle:

drop table t;
create table t(
snow               VARCHAR2(8),
weather               VARCHAR2(8),
season             VARCHAR2(8),
physical_condition VARCHAR2(8),
go_skiing          VARCHAR2(3)
);
insert into HR.T (SNOW, WEATHER, SEASON, PHYSICAL_CONDITION, GO_SKIING)
values ('sticky', 'foggy', 'low', 'rested', 'no');

insert into HR.T (SNOW, WEATHER, SEASON, PHYSICAL_CONDITION, GO_SKIING)
values ('fresh', 'sunny', 'low', 'injured', 'no');

insert into HR.T (SNOW, WEATHER, SEASON, PHYSICAL_CONDITION, GO_SKIING)
values ('fresh', 'sunny', 'low', 'rested', 'yes');

insert into HR.T (SNOW, WEATHER, SEASON, PHYSICAL_CONDITION, GO_SKIING)
values ('fresh', 'sunny', 'high', 'rested', 'yes');

insert into HR.T (SNOW, WEATHER, SEASON, PHYSICAL_CONDITION, GO_SKIING)
values ('fresh', 'sunny', 'mid', 'rested', 'yes');

insert into HR.T (SNOW, WEATHER, SEASON, PHYSICAL_CONDITION, GO_SKIING)
values ('frosted', 'windy', 'high', 'tired', 'no');

insert into HR.T (SNOW, WEATHER, SEASON, PHYSICAL_CONDITION, GO_SKIING)
values ('sticky', 'sunny', 'low', 'rested', 'yes');

insert into HR.T (SNOW, WEATHER, SEASON, PHYSICAL_CONDITION, GO_SKIING)
values ('frosted', 'foggy', 'mid', 'rested', 'no');

insert into HR.T (SNOW, WEATHER, SEASON, PHYSICAL_CONDITION, GO_SKIING)
values ('fresh', 'windy', 'low', 'rested', 'yes');

insert into HR.T (SNOW, WEATHER, SEASON, PHYSICAL_CONDITION, GO_SKIING)
values ('fresh', 'windy', 'low', 'rested', 'yes');

insert into HR.T (SNOW, WEATHER, SEASON, PHYSICAL_CONDITION, GO_SKIING)
values ('fresh', 'foggy', 'low', 'rested', 'yes');

insert into HR.T (SNOW, WEATHER, SEASON, PHYSICAL_CONDITION, GO_SKIING)
values ('fresh', 'foggy', 'low', 'rested', 'yes');

insert into HR.T (SNOW, WEATHER, SEASON, PHYSICAL_CONDITION, GO_SKIING)
values ('sticky', 'sunny', 'mid', 'rested', 'yes');

insert into HR.T (SNOW, WEATHER, SEASON, PHYSICAL_CONDITION, GO_SKIING)
values ('frosted', 'foggy', 'low', 'injured', 'no');

commit;

Firstly, gain is calculated for each attributes. As you see, all atributes are categorical. So we can calculate gain with some statistical and matematical calculations. With Oracle's analytical functions, it will be more easy to make calculations.(calculations can be investigated within code )

The call spec for this application is shown below:

BEGIN
dm.GenerateDecisionTree('t', 'go_skiing',NULL,0,',' );
END;

Print-out of this application is:

INFO>Gain for Categorical Attribute "SNOW" is 5,8331
INFO>Gain for Categorical Attribute "WEATHER" is 5,5184
INFO>Gain for Categorical Attribute "SEASON" is 5,4819
INFO>Gain for Categorical Attribute "PHYSICAL_CONDITION" is 5,803
INFO>MAX Gain for Categorical Attribute "SNOW" is MAX
SNOW='fresh'
  INFO>Gain for Categorical Attribute "WEATHER" is 1,1493
  INFO>Gain for Categorical Attribute "SEASON" is 1,0674
  INFO>Gain for Categorical Attribute "PHYSICAL_CONDITION" is 1,5549
  INFO>MAX Gain for Categorical Attribute "PHYSICAL_CONDITION" is MAX
   PHYSICAL_CONDITION='injured'.
    PHYSICAL_CONDITION='rested'.
SNOW='frosted'.
SNOW='sticky'
  INFO>Gain for Categorical Attribute "WEATHER" is 1,3082
  INFO>Gain for Categorical Attribute "SEASON" is ,9749
  INFO>Gain for Categorical Attribute "PHYSICAL_CONDITION" is ,3899
  INFO>MAX Gain for Categorical Attribute "WEATHER" is MAX
   WEATHER='foggy'.
    WEATHER='sunny'.

Summarization of data set above is:

SNOW='fresh'
    PHYSICAL_CONDITION='injured'.
    PHYSICAL_CONDITION='rested'.
SNOW='frosted'.
SNOW='sticky'
    WEATHER='foggy'.
    WEATHER='sunny'.

With this decision tree, you can predict If SNOW is sticky then go_skiing would be NO.

Suppose tuple below will be added to data set.:

fresh

sunny

mid

rested

Let's analyze what will happen

SNOW='fresh'
PHYSICAL_CONDITION='injured'.
PHYSICAL_CONDITION='rested'
    SEASON='high'.
    SEASON='low'.
    SEASON='mid'
      WEATHER='sunny'
SNOW='frosted'.
SNOW='sticky'
WEATHER='foggy'.
WEATHER='sunny'.

If your data set contains continous(numeric ) attributes, some additional calculations have to be done. Suppose data set below that contains numerical attributes:

A	Class
15	C₁
20	C₂
25	C₁
30	C₁
35	C₂
25	C₁
15	C₂
20	C₂

If you generate decision tree of data set above

INFO>Gain for Numerical Attribute "A<=15" is 1,5
INFO>Gain for Numerical Attribute "A<=20" is 2,5661
INFO>Gain for Numerical Attribute "A<=25" is 1,5
INFO>Gain for Numerical Attribute "A<=30" is 1,01
INFO>Gain for Numerical Attribute "A<=35" is 0

A<=20
A>20

So, 20 can be considered as a treshold value for sample above.

Suppose data set above that contains both categorical and continous samples:

A	B	C	Class
15	1	A	C₁
20	3	B	C₂
25	2	A	C₁
30	4	A	C₁
35	2	B	C₂
25	4	A	C₁
15	2	B	C₂
20	3	B	C₂

Generated decision tree will be

INFO>Gain for Numerical Attribute "A<=15" is 1,5
INFO>Gain for Numerical Attribute "A<=20" is 2,5661
INFO>Gain for Numerical Attribute "A<=25" is 1,5
INFO>Gain for Numerical Attribute "A<=30" is 1,01
INFO>Gain for Numerical Attribute "A<=35" is 0
INFO>Gain for Numerical Attribute "B<=1" is 1,01
INFO>Gain for Numerical Attribute "B<=2" is 2
INFO>Gain for Numerical Attribute "B<=3" is 2,0375
INFO>Gain for Numerical Attribute "B<=4" is 0
INFO>Gain for Categorical Attribute "C" is 4
INFO>MAX Gain for Categorical Attribute "C" is MAX

C='A'.
C='B'.

So, C attribute can be a dominat attribute.

Let's make a prediction with data set above. Suppose test sample is below:

A	B	C	D	Prediction	Truth
10	2	A	C₂	C₁	N
20	1	B	C₁	C₂	N
30	3	A	C₂	C₁	N
40	2	B	C₂	C₂	Y
15	1	B	C₁	C₂	N

As it shown below only %20 of samples classified correctly. If we have more sample, it will be possible to make more accurately prediction.

Application code is below(Updated at 29-Apr-07):

PROCEDURE GenerateDecisionTree(pis_TableName        IN VARCHAR2,
                                 pis_ClassFieldName   IN VARCHAR2,
                                 pis_WhereStatement   IN VARCHAR2,
                                 pin_Level            IN NUMBER DEFAULT 1,
                                 pis_PassedAttributes IN VARCHAR2 DEFAULT ',',
                                 pib_IsDebugMode      IN BOOLEAN DEFAULT FALSE)
/**************************************************************************************
    * Author     : Mennan Tekbir
    * Date       : 06-April-2007
    * Location   :
    * Notes      :
    * -------------------------------------------------------------------------------------
    * Purpose    :
    * Parameters :
    * Return     :
    * Exceptions :
    * -------------------------------------------------------------------------------------
    * History    :
     | Author        | Date                 | Purpose
     |-------        |-----------           |----------------------------------------------
     | MTE           | 06-Apr-2007          | Funciton creation.
     | MTE           | 29-Apr-2007          | Revised.Some comments added.
    **************************************************************************************/
   IS

    vs_SqlInfoClass            VARCHAR2(1024);
    vs_SqlInfoAttribute        VARCHAR2(1024);
    vs_SqlClassCount           VARCHAR2(1024);
    vs_SqlClassValue           VARCHAR2(1024);
    vn_InfoClass               NUMBER;
    vn_InfoClassInn            NUMBER;
    vt_ColumnNames             t_StringTab;
    vt_DistinctColumnValues    t_StringTab;
    vt_DistinctColumnValuesInn t_StringTab;
    vn_AttributeCount          NUMBER;
    vn_InfoAttribute           NUMBER;
    vn_InfoAttributeInn        NUMBER;
    vn_Gain                    NUMBER;
    vt_GainList                t_NumberTab;
    vn_MaxGain                 NUMBER;
    vn_MaxGainIndex            NUMBER;
    vn_GainInn                 NUMBER;
    vt_GainListInn             t_NumberTab;
    vn_MaxGainInn              NUMBER;
    vn_MaxGainIndexInn         NUMBER;
    vn_Count                   NUMBER;
    vs_GainedWhere             VARCHAR2(64);
    vs_GainedWhere2            VARCHAR2(64);
    vs_GainedWhere3            VARCHAR2(64);
    vb_IsNumericColumn         BOOLEAN;
    vs_ClassValue              VARCHAR2(4000);

BEGIN

    vt_ColumnNames    := GetColumnNames(pis_TableName);
    vn_AttributeCount := vt_ColumnNames.COUNT;
    vt_GainList       := t_NumberTab();
    vt_GainList.EXTEND(vn_AttributeCount);

    vs_SqlInfoClass := '
     SELECT SUM(-1 * p * log(2, p)) i
      FROM (SELECT COUNT(*) over(PARTITION BY <> ORDER BY NULL) / COUNT(*) over(ORDER BY NULL) p
              FROM <> WHERE <>)';

    vs_SqlInfoClass := REPLACE(vs_SqlInfoClass, '<>', pis_TableName);
    vs_SqlInfoClass := REPLACE(vs_SqlInfoClass, '<>', pis_ClassFieldName);
    vs_SqlInfoClass := REPLACE(vs_SqlInfoClass, '<>', nvl(pis_WhereStatement, ' 1=1 '));

    EXECUTE IMMEDIATE vs_SqlInfoClass INTO vn_InfoClass;

    vn_MaxGain         := -1;
    vn_MaxGainIndex    := -1;
    vb_IsNumericColumn := FALSE;

    -- for all columns(attributes)
    FOR i IN 1 .. vn_AttributeCount LOOP

      --If column is not class field(class field is not considered for decision tree)
      IF upper(pis_ClassFieldName) != vt_ColumnNames(i) THEN

        --If column is not processed before(if processed ignore it)
        IF instr(pis_PassedAttributes, ',' || vt_ColumnNames(i) || ',') = 0 THEN

          --Take out whether column is numerical(nominal) or categoric.(Categoric and numeric columns are calculated differently)
          vb_IsNumericColumn := IsNumneriColumn(pis_TableName, vt_ColumnNames(i));

          --If column is a categoric such as consist of VARCHAR data type
          IF vb_IsNumericColumn = FALSE THEN

            --Calculate attribute gain(this a generic calculation with "EXECUTE IMMEDIATE" and "REPLACE"s)
            --Use template of SQL statement
            vs_SqlInfoAttribute := '
               SELECT SUM((b / d) * ((a / b) * log(2, (a / b)) * -1)) i
                FROM (SELECT DISTINCT COUNT(*) over(PARTITION BY <>, <> ORDER BY NULL) a
                                     ,COUNT(*) over(PARTITION BY <> ORDER BY NULL) b
                                     ,COUNT(*) over(ORDER BY NULL) d
                       FROM <> WHERE <>)';

            vs_SqlInfoAttribute := REPLACE(vs_SqlInfoAttribute, '<>',pis_TableName);
            vs_SqlInfoAttribute := REPLACE(vs_SqlInfoAttribute, '<>',pis_ClassFieldName);
            vs_SqlInfoAttribute := REPLACE(vs_SqlInfoAttribute, '<>',vt_ColumnNames(i));
            vs_SqlInfoAttribute := REPLACE(vs_SqlInfoAttribute, '<>',nvl(pis_WhereStatement, ' 1=1 '));

            EXECUTE IMMEDIATE vs_SqlInfoAttribute INTO vn_InfoAttribute;

            --Extract attribute gain from class gain that is calculated before
            vn_Gain := vn_InfoClass - vn_InfoAttribute;

            --If this debug mode, print out values.
            IF pib_IsDebugMode = TRUE THEN
              dbms_output.put_line(lpad(' ', pin_Level * 1) || 'INFO>Gain for Categorical Attribute "' || vt_ColumnNames(i) || '" is ' || trunc(vn_Gain, 4));
            END IF;

          --If column is a numeric such as consist of NUMBER data type
          ELSE
            --Use template below to calculate gain for numeric attribute
            vs_SqlInfoClass := '
             SELECT SUM(-1 * p * log(2, p)) i
              FROM (SELECT COUNT(*) over(PARTITION BY <> ORDER BY NULL) / COUNT(*) over(ORDER BY NULL) p
                      FROM <> WHERE <>)';

            vs_SqlInfoClass := REPLACE(vs_SqlInfoClass, '<>', pis_TableName);
            vs_SqlInfoClass := REPLACE(vs_SqlInfoClass, '<>', pis_ClassFieldName);
            vs_SqlInfoClass := REPLACE(vs_SqlInfoClass, '<>', nvl(pis_WhereStatement, ' 1=1 '));

            EXECUTE IMMEDIATE vs_SqlInfoClass INTO vn_InfoClassInn;

            --Get distinct values of numeric attribute.
            vt_DistinctColumnValuesInn := GetDistinctColumnValues(pis_TableName, vt_ColumnNames(i), pis_WhereStatement);

            --Assign initially some values
            vn_MaxGainInn      := -1;
            vn_MaxGainIndexInn := -1;

            --Create a in-memory table to hold every gain for distinct values
            vt_GainListInn     := t_NumberTab();
            vt_GainListInn.EXTEND(vt_DistinctColumnValuesInn.COUNT);

            -- for all distinct value for attr.
            FOR k IN 1 .. vt_DistinctColumnValuesInn.LAST LOOP

              --Use template below to calculate gain for numeric attribute
              --The template consist of a CASE statement that includes a "<=" and ">" operators
              vs_SqlInfoAttribute := '
                SELECT SUM(((a / d) * log(2, (a / b)) * -1)) s
                  FROM (SELECT COUNT(*) over(PARTITION BY p, <> ORDER BY NULL) a
                              ,COUNT(*) over(PARTITION BY p ORDER BY NULL) b
                              ,COUNT(*) over(ORDER BY NULL) d
                          FROM (SELECT <>
                               ,CASE WHEN <> <= <> THEN 1 ELSE 0 END p
                                  FROM <>
                         WHERE <>))';

              vs_SqlInfoAttribute := REPLACE(vs_SqlInfoAttribute, '<>', pis_TableName);
              vs_SqlInfoAttribute := REPLACE(vs_SqlInfoAttribute, '<>',pis_ClassFieldName);
              vs_SqlInfoAttribute := REPLACE(vs_SqlInfoAttribute, '<>',vt_ColumnNames(i));
              vs_SqlInfoAttribute := REPLACE(vs_SqlInfoAttribute, '<>', vt_DistinctColumnValuesInn(k));
              vs_SqlInfoAttribute := REPLACE(vs_SqlInfoAttribute, '<>',nvl(pis_WhereStatement, ' 1=1 '));

              EXECUTE IMMEDIATE vs_SqlInfoAttribute INTO vn_InfoAttributeInn;

              --Extract gain for a value of attribute from class gain that is calculated before
              vn_GainInn := vn_InfoClassInn - vn_InfoAttributeInn;
              --Assign information gain to table.
              vt_GainListInn(k) := vn_GainInn;
              --If debug mode, print out
              IF pib_IsDebugMode = TRUE THEN
                dbms_output.put_line(lpad(' ', pin_Level * 1) || 'INFO>Gain for Numerical Attribute "' || vt_ColumnNames(i) || '<=' || vt_DistinctColumnValuesInn(k) || '" is ' || trunc(vn_GainInn, 4));
              END IF;

              --If newly calculated gain is higher, then assign it as new Maximum gain
              IF vn_GainInn > vn_MaxGainInn THEN
                vn_MaxGainIndexInn := k;
                vn_MaxGainInn      := vn_GainInn;
                vn_Gain            := vn_GainInn;
                vs_GainedWhere2    := vt_DistinctColumnValuesInn(vn_MaxGainIndexInn);
              END IF;

            END LOOP;-- for all distinct value for attr.


            --Now, the max gain for numeric attribute is stored into "vn_Gain" and
            -- max value index for gain is stored into "vn_MaxGainIndexInn" and
            -- "vs_GainedWhere2" contains max gained value. This will use later


          END IF;--If column is a numeric such as consist of NUMBER data type

          --"vn_Gain" contains gain for attribute either attribute is categoric or numeric
          vt_GainList(i) := vn_Gain;

          --If newly calculated gain is higher, then assign it as new Maximum gain
          IF vn_Gain > vn_MaxGain THEN
            vn_MaxGainIndex := i;
            vn_MaxGain      := vn_Gain;
            vs_GainedWhere3 := vs_GainedWhere2;
          END IF;

        END IF; --If column is not processed before(if processed ignore it)

      END IF;--If column is not class field(class field is not considered for decision tree)

    END LOOP;--for all columns(attributes)

    --If at least one column is processed "vn_MaxGainIndex" will be differnt than "-1"
    IF vn_MaxGainIndex != -1 THEN

      --Is it a numeric or categoric column?Both of two will be processed differently
      vb_IsNumericColumn := IsNumneriColumn(pis_TableName, vt_ColumnNames(vn_MaxGainIndex));

      --If it is a categoric attribute(column)
      IF vb_IsNumericColumn = FALSE THEN

        IF pib_IsDebugMode = TRUE THEN
          dbms_output.put_line(lpad(' ', pin_Level * 1) || 'INFO>MAX Gain for Categorical Attribute "' || vt_ColumnNames(vn_MaxGainIndex) || '" is MAX');
        END IF;

        --Get distinct values of column that has maimum gain value.
        --These values will be used in order to find out which value is most suitable for classification
        vt_DistinctColumnValues := GetDistinctColumnValues(pis_TableName, vt_ColumnNames(vn_MaxGainIndex), pis_WhereStatement);

        -- for all distinct values for column that has maimum gain value.
        FOR i IN 1 .. vt_DistinctColumnValues.LAST LOOP

          --Use template below to calculate count(number of instances or rows) for distinct value.
          vs_SqlClassCount := '
             SELECT COUNT(*)
              FROM (SELECT DISTINCT <>
                      FROM <>
                     WHERE <> AND <>)';

          vs_SqlClassCount := REPLACE(vs_SqlClassCount, '<>', pis_TableName);
          vs_SqlClassCount := REPLACE(vs_SqlClassCount, '<>', pis_ClassFieldName);
          vs_SqlClassCount := REPLACE(vs_SqlClassCount, '<>', nvl(pis_WhereStatement, ' 1=1 '));
          vs_GainedWhere   := vt_ColumnNames(vn_MaxGainIndex) || '=''' || vt_DistinctColumnValues(i) || '''';
          vs_SqlClassCount := REPLACE(vs_SqlClassCount,'<>', vs_GainedWhere);

          EXECUTE IMMEDIATE vs_SqlClassCount INTO vn_Count;


          --If count is 1 then classification has finished over this value of attribute
          IF vn_Count = 1 THEN

            --Use template below to find out instance(row) value for classification has done.
            vs_SqlClassValue := '
              SELECT DISTINCT <>
                          FROM <>
                         WHERE <> AND <> ';
            vs_SqlClassValue := REPLACE(vs_SqlClassValue, '<>', pis_TableName);
            vs_SqlClassValue := REPLACE(vs_SqlClassValue, '<>', pis_ClassFieldName);
            vs_SqlClassValue := REPLACE(vs_SqlClassValue, '<>', nvl(pis_WhereStatement, ' 1=1 '));
            vs_SqlClassValue := REPLACE(vs_SqlClassValue, '<>', vs_GainedWhere);

            EXECUTE IMMEDIATE vs_SqlClassValue INTO vs_ClassValue;

            --Print out information
            dbms_output.put_line(lpad(' ', pin_Level * 2, ' ') || vs_GainedWhere || ' =>>> ' || vs_ClassValue);

          --If count is more than 1 then classification has not finished over this value of attribute.
          --This value of attribute can not distingusih instances.So make a recursive call in order to
          -- find again a decision tree with new dataset(that is specified with where condition)
          ELSIF vn_Count > 1 THEN
            dbms_output.put_line(lpad(' ', pin_Level * 2, ' ') || vs_GainedWhere);
            GenerateDecisionTree(pis_TableName,
                                 pis_ClassFieldName,
                                 nvl(pis_WhereStatement, '1 = 1') || ' AND ' || vs_GainedWhere,
                                 pin_Level + 1,
                                 pis_PassedAttributes || vt_ColumnNames(vn_MaxGainIndex) || ',',
                                 pib_IsDebugMode);
          END IF;

        END LOOP;--for all distinct values for column that has maimum gain value.

      --If it is a numeric attribute(column)
      ELSE

        --Only 2 iteration is done.There is only one value but it can be either "<=" or ">"
        FOR m IN 1 .. 2 LOOP

          --Use template below to calculate count(number of instances or rows) for numeric value.
          vs_SqlClassCount := '
                  SELECT COUNT(*)
                    FROM (SELECT DISTINCT <>
                            FROM <>
                           WHERE <> AND <>)';

          IF m = 1 THEN
            vs_GainedWhere := vt_ColumnNames(vn_MaxGainIndex) || '<=' || vs_GainedWhere3;
          ELSE
            vs_GainedWhere := vt_ColumnNames(vn_MaxGainIndex) || '>' || vs_GainedWhere3;
          END IF;

          vs_SqlClassCount := REPLACE(vs_SqlClassCount, '<>', vs_GainedWhere);
          vs_SqlClassCount := REPLACE(vs_SqlClassCount, '<>', pis_TableName);
          vs_SqlClassCount := REPLACE(vs_SqlClassCount, '<>', pis_ClassFieldName);
          vs_SqlClassCount := REPLACE(vs_SqlClassCount, '<>', nvl(pis_WhereStatement, ' 1=1 '));


          EXECUTE IMMEDIATE vs_SqlClassCount INTO vn_Count;

          --If count is 1 then classification has finished over this value of numeric attribute
          IF vn_Count = 1 THEN

            --Use template below to find out instance(row) value for classification has done.
            vs_SqlClassValue := '
              SELECT DISTINCT <>
                          FROM <>
                         WHERE <> AND <> ';
            vs_SqlClassValue := REPLACE(vs_SqlClassValue, '<>', pis_TableName);
            vs_SqlClassValue := REPLACE(vs_SqlClassValue, '<>', pis_ClassFieldName);
            vs_SqlClassValue := REPLACE(vs_SqlClassValue, '<>', pis_WhereStatement);
            vs_SqlClassValue := REPLACE(vs_SqlClassValue, '<>', vs_GainedWhere);

            EXECUTE IMMEDIATE vs_SqlClassValue INTO vs_ClassValue;

            dbms_output.put_line(lpad(' ', pin_Level * 2, ' ') || vs_GainedWhere || ' =>>> ' || vs_ClassValue);

          --If count is more than 1 then classification has not finished over this value of attribute.
          --This value of attribute can not distingusih instances.So make a recursive call in order to
          -- find again a decision tree with new dataset(that is specified with where condition)
          ELSIF vn_Count > 1 THEN
            dbms_output.put_line(lpad(' ', pin_Level * 2, ' ') || vs_GainedWhere);
            GenerateDecisionTree(pis_TableName,
                                 pis_ClassFieldName,
                                  nvl(pis_WhereStatement, '1 = 1') || ' AND ' || vs_GainedWhere,
                                 pin_Level + 1,
                                 pis_PassedAttributes || vt_ColumnNames(vn_MaxGainIndex) || ',',
                                 pib_IsDebugMode);
          END IF;

        END LOOP;--Only 2 iteration is done.There is only one value but it can be either "<=" or ">"
      END IF;
    END IF;--If at least one column is processed "vn_MaxGainIndex" will be differnt than "-1"

END GenerateDecisionTree;

Establishing Refential Integrity With Materialized Views In Oracle

RI(Refential Integrity ) is established with FK(foreign key) and PK(primary key) in Oracle. If you do not have a primary key on a column you are not allowed to create a FK. If your table has a non-unique column, how can you create a FK? Example below shows how to establish RI with MVs:

SQL> create table a(i number);

Table created

SQL> insert into a values(1);

1 row inserted

SQL> insert into a values(1);

1 row inserted

SQL> create table b(k number);

Table created

SQL> alter table b add constraint fk_ba foreign key(k) references a(i);

alter table b add constraint fk_ba foreign key(k) references a(i)

ORA-02270: no matching unique or primary key for this column-list

SQL> create materialized view mv_a

2 parallel

3 build immediate as

4 select distinct i from a;

Materialized view created

SQL> alter materialized view mv_a add constraint pk_mv_a primary key(i);

Materialized view altered

SQL> alter table b add constraint fk_ba foreign key(k) references mv_a(i);

Table altered

SQL> insert into b values(1);

1 row inserted

SQL> insert into b values(2);

insert into b values(2)

ORA-02291: integrity constraint (HR.FK_BA) violated - parent key not found

SQL>

02 April 2007

Workaround for ORA-00283, ORA-01610, ORA-00279, ORA-00339 Errors When Duplicating Database

In my earlier posts i described how to clon an oracle database.Today when i clon database for development environment, i struggled some oracle errors. I want to share this experince by this entry:

After creatimg control files....

oracle@aur:/data/oradata> sqlplus / as sysdba

SQL*Plus: Release 10.1.0.5.0 - Production on Fri Dec 8 16:38:06 2006

Copyright (c) 1982, 2005, Oracle. All rights reserved.

Connected to an idle instance.

SQL> startup nomount
ORACLE instance started.

Total System Global Area 293601280 bytes
Fixed Size                  1302824 bytes
Variable Size              94117592 bytes
Database Buffers          197132288 bytes
Redo Buffers                1048576 bytes
SQL> alter database recover database until cancel;
alter database recover database until cancel
*
ERROR at line 1:
ORA-01507: database not mounted

SQL> alter database mount;

Database altered.

SQL> alter database recover database until cancel;
alter database recover database until cancel
*
ERROR at line 1:
ORA-00283: recovery session canceled due to errors
ORA-01610: recovery using the BACKUP CONTROLFILE option must be done

SQL> recover database using backup controlfile until cancel;
ORA-00279: change 9058471059996 generated at 12/08/2006 14:18:39 needed for
thread 1
ORA-00289: suggestion :
/data06/app/oracle/product/10.1/dbs/arch1_1_608653118.dbf
ORA-00280: change 9058471059996 for thread 1 is in sequence #1

Specify log: {=suggested | filename | AUTO | CANCEL}
/data/oradata/test/redo01.log
ORA-00339: archived log does not contain any redo
ORA-00334: archived log: '/data/oradata/test/redo01.log'

ORA-01547: warning: RECOVER succeeded but OPEN RESETLOGS would get error below
ORA-01194: file 1 needs more recovery to be consistent
ORA-01110: data file 1: '/data/oradata/test/system01.dbf'

SQL> recover database using backup controlfile until cancel;
ORA-00279: change 9058471059996 generated at 12/08/2006 14:18:39 needed for
thread 1
ORA-00289: suggestion :
/data06/app/oracle/product/10.1/dbs/arch1_1_608653118.dbf
ORA-00280: change 9058471059996 for thread 1 is in sequence #1

Specify log: {=suggested | filename | AUTO | CANCEL}
/data/oradata/test/redo02.log
ORA-00339: archived log does not contain any redo
ORA-00334: archived log: '/data/oradata/test/redo02.log'

ORA-01547: warning: RECOVER succeeded but OPEN RESETLOGS would get error below
ORA-01194: file 1 needs more recovery to be consistent
ORA-01110: data file 1: '/data/oradata/test/system01.dbf'

SQL> recover database using backup controlfile until cancel;
ORA-00279: change 9058471059996 generated at 12/08/2006 14:18:39 needed for
thread 1
ORA-00289: suggestion :
/data06/app/oracle/product/10.1/dbs/arch1_1_608653118.dbf
ORA-00280: change 9058471059996 for thread 1 is in sequence #1

Specify log: {RET=suggested | filename | AUTO | CANCEL}
/data/oradata/test/redo03.log
Log applied.
Media recovery complete.
SQL> RECOVER DATABASE
ORA-00283: recovery session canceled due to errors
ORA-01610: recovery using the BACKUP CONTROLFILE option must be done

SQL> ALTER DATABASE OPEN RESETLOGS;

Database altered.

SQL> ALTER TABLESPACE TEMP ADD TEMPFILE '/data/oradata/test/temp01.dbf'
2       SIZE 1470M REUSE AUTOEXTEND ON NEXT 655360 MAXSIZE 32767M;

Tablespace altered.

SQL> select * from dual;

D
-
X

SQL>

Solution of ORA-06502

When you work on associative arrays in Oracle you must consider key lengths. When you create an associative arrays with 4 chars length, it means that you canuse key names up to 4 letters.If you convert keys expilicitly with to_char built-in function, you must trim to eliminate spaces; otherwise you are a potential person that get ORA-06502 error as i demonstrated below:

Connected to Oracle Database 10g Express Edition Release 10.2.0.1.0
Connected as hr

SQL> set serverout on 5000
SQL>
SQL> DECLARE
2    i NUMBER;
3    k VARCHAR2(3);
4    TYPE t_StringHashTable IS TABLE OF VARCHAR2(256) INDEX BY VARCHAR2(3);
5    vt_List t_StringHashTable;
6 BEGIN
7    i := 1;
8    k := to_char(i, '000');
9    vt_List(k) := 'Test item...';
10    dbms_output.put_line(vt_List(k));
11 END;
12 /

DECLARE
i NUMBER;
k VARCHAR2(3);
TYPE t_StringHashTable IS TABLE OF VARCHAR2(256) INDEX BY VARCHAR2(3);
vt_List t_StringHashTable;
BEGIN
i := 1;
k := to_char(i, '000');
vt_List(k) := 'Test item...';
dbms_output.put_line(vt_List(k));
END;

ORA-06502: PL/SQL: numeric or value error: character string buffer too small
ORA-06512: at line 8

SQL> DECLARE
2    i NUMBER;
3    k VARCHAR2(1000);
4    TYPE t_StringHashTable IS TABLE OF VARCHAR2(256) INDEX BY VARCHAR2(3);
5    vt_List t_StringHashTable;
6 BEGIN
7    i := 1;
8    k := to_char(i, '000');
9    dbms_output.put_line(length(k));
10    dbms_output.put_line('k is "' || k || '"');
11    vt_List(k) := 'Test item...';
12    dbms_output.put_line(vt_List(k));
13 END;
14 /

4
k is " 001"

DECLARE
i NUMBER;
k VARCHAR2(1000);
TYPE t_StringHashTable IS TABLE OF VARCHAR2(256) INDEX BY VARCHAR2(3);
vt_List t_StringHashTable;
BEGIN
i := 1;
k := to_char(i, '000');
dbms_output.put_line(length(k));
dbms_output.put_line('k is "' || k || '"');
vt_List(k) := 'Test item...';
dbms_output.put_line(vt_List(k));
END;

ORA-06502: PL/SQL: numeric or value error: associative array key violates its type constraints
ORA-06512: at line 11

SQL> DECLARE
2    i NUMBER;
3    k VARCHAR2(3);
4    TYPE t_StringHashTable IS TABLE OF VARCHAR2(256) INDEX BY VARCHAR2(3);
5    vt_List t_StringHashTable;
6 BEGIN
7    i := 1;
8    k := TRIM(to_char(i, '000'));
9    vt_List(k) := 'Test item...';
10    dbms_output.put_line(vt_List(k));
11 END;
12 /

Test item...

PL/SQL procedure successfully completed

SQL> DECLARE
2    i NUMBER;
3    TYPE t_StringHashTable IS TABLE OF VARCHAR2(256) INDEX BY VARCHAR2(3);
4    vt_List t_StringHashTable;
5 BEGIN
6    i := 1;
7    vt_List(i) := 'Test item...';
8    dbms_output.put_line(vt_List(i));
9 END;
10 /

Test item...

PL/SQL procedure successfully completed

SQL>

Pages

19 April 2007

Data Minig :: Generating Decision Trees On Oracle

Establishing Refential Integrity With Materialized Views In Oracle

02 April 2007

Workaround for ORA-00283, ORA-01610, ORA-00279, ORA-00339 Errors When Duplicating Database

Solution of ORA-06502