Decision Tree Algorithm in pl/sql

articles: 

Decision Tree Algorithm in pl/sql
Author JP Vijaykumar
Date Sept 14th 2019

This article/script is provided for educational purpose ONLY.
The readers are advised to make necessary changes to the script as may be required for their use in
their respective environments. Pls test this script thoroughly in a lower environment, ONLY if you
are satisfied with the tests' results, then ONLY use it in any higher environments. It is advisable
to monitor the load on the db/server, while you are testing this script in your environment.This
script was generated/tested in Oracle 12C environment. This script may not work in other versions
of Oracle, pls modify the script as may be needed using your own discretion/ingenuity.
I had the following data, and want to predict output using decision tree algorithm.
I am using pl/sql scripting to process the data. This can also be processed using oracle data miner.

select * from xtern_decision_jp;

OUTLOOK TEMPERATUR HUMIDITY WIND PLAY_BALL
---------- ---------- ---------- ---------- ----------
Sunny Hot High Weak No
Sunny Hot High Strong No
Overcast Hot High Weak Yes
Rain Mild High Weak Yes
Rain Cool Normal Weak Yes
Rain Cool Normal Strong No
Overcast Cool Normal Strong Yes
Sunny Mild High Weak No
Sunny Cool Normal Weak Yes
Rain Mild Normal Weak Yes
Sunny Mild Normal Strong Yes
Overcast Mild High Strong Yes
Overcast Hot Normal Weak Yes
Rain Mild High Strong No

I can generate the rules with the attribute conditions from the above table using decision tree algorithm
in python. I executed python script from gitbash console.

python /c/jpscripts/datamining/python/dtreeentropy.py
Rule: OUTLOOK -> Overcast -> PLAYBALL -> Yes -> 4
Rule: OUTLOOK -> Rain -> WIND -> Strong -> PLAYBALL -> No -> 2
Rule: OUTLOOK -> Rain -> WIND -> Weak -> PLAYBALL -> Yes -> 3
Rule: OUTLOOK -> Sunny -> HUMIDITY -> High -> PLAYBALL -> No -> 3
Rule: OUTLOOK -> Sunny -> HUMIDITY -> Normal -> PLAYBALL -> Yes -> 2

Can I produce a similar output, on the attribute conditions from the above table using decision tree
algorithm in pl/sql? Tough task! Strain the brain.

Implementing the complex entropy calculations in pl/sql are challenging. Python is rich with advanced
features / libraries, and comparable features/libraries are not available in pl/sql. It took me quite
long time, working off and on, to formulate the logic and implement the Decision tree algorithm in pl/sql.
I used the below mentioned urls, which were of great help, in understanding the complex algorithm and code it.
There are few other ways in pl/sql to solve this problem. I used entropy method in my script. Besides
scripting, I love literature, this is one of my favourite idioms “leaving no stone unturned”.

Scripting is fun.

Setting up the environment.

export ORACLE_SID=orcl
export ORACLE_HOME=c:/app/oracle/product/12.2.0/dbhome_1
sqlplus
/ as sysdba

alter session set container=orclpdb;
alter database orclpdb open;

In any db, I prefer not to create objects in SYS/SYSTEM schemas.

Created a user and granted required privileges to create an external table and execute my procedure.

connect veeksha/saketh

create a directory to access the csv file.

create or replace directory dp_dir as 'c:/jpscripts';

create an external table.

drop table xtern_decision_jp;

create table xtern_decision_jp (
outlook varchar2(10),
temperature varchar2(10),
humidity varchar2(10),
wind varchar2(10),
play_ball varchar2(10))
organization external
( default directory dp_dir
access parameters
( records delimited by newline
fields terminated by '|'
)
location ('decisiontree_data.txt')
);

During the script’s execution, if table or view does not exist error or the following error keeps occurring,
then create temp_vw_jp as shown in the following sql statement.
ORA-06502: PL/SQL: numeric or value error: Bulk Bind: Truncated Bind
create or replace view temp_vw_jp as select outlook cl1,humidity cl2,play_ball cl3 from xtern_decision_jp;
Run the script from sqlplus prompt as shown below (pls note the script’s location may change as per your
environment/file location).

Entropy .9403
Root Node OUTLOOK .2468
OUTLOOK -> Overcast(4) -> PLAY_BALL -> Yes(4)
Rain .971
Sunny .971
OUTLOOK -> Sunny HUMIDITY -> Normal(2) PLAY_BALL -> Yes (2)
OUTLOOK -> Sunny HUMIDITY -> High(3) PLAY_BALL -> No (3)
Rain .971
Rain .971
OUTLOOK -> Rain WIND -> Strong(2) PLAY_BALL -> No (2)
OUTLOOK -> Rain WIND -> Weak(3) PLAY_BALL -> Yes (3)

PL/SQL procedure successfully completed.

Formatting the output between python script and pl/sql script is a bit different. But the decision tree is
implemented on the given data using pl/sql.

Pls note: I tested this script on a small dataset in 12c db. On a larger dataset, the script may take longer
time and generate heavy system overhead. Readers’ discretion is advised.

Pls note: This script may not work on datasets with more attributes and needs modifications.
All I was interested is to see, how can I predict outcome from a given dataset using pl/sql.

Pls note: I highlighted only comments/notes/directions, leaving commands/scripts to differentiate.
This comment is intended for pdf file only

Happy scripting.

References:
References:
https://www.codeproject.com/Articles/259241/%2FArticles%2F259241%2FID3-Decision-Tree-Algorithm-Part-1 --ENTROPY EXPLAINED
Decision tree Naïve bayes www1.se.cuhk.edu.hk/~seem4630/tuto/Tutorial02.pptx --ENTROPY CALCULATIONS
https://www.youtube.com/watch?v=Zze7SKuz9QQ&t=185s --GINI INDEX CALCULATIONS
http://www.morganslibrary.org/hci/hci015.html
https://www.scribd.com/document/426303825/Decision-Tree-Algorithm-in-Pl-Sql --COMPLETE ARTICLE WITH SCREEN SHOTS IN PDF FORMAT

SCRIPT
set serverout on size 1000000 timing on
declare
v_table varchar2(30):='XTERN_DECISION_JP';
v_classifier varchar2(30):='PLAY_BALL';
--v_table varchar2(30):='DECISION_LP_JP';
--v_classifier varchar2(30):='R';
v_view varchar2(30):='TEMP_VW_JP';
v_numrows number(10,4);
v_entropy number(10,4);
v_centropy number(10,4);
v_gain number(10,4):=0;
v_gainm number(10,4):=0;
v_rootnode varchar2(30):='';
v_rootvalue varchar2(30):='';
v_createview varchar2(4000):='create or replace view temp_vw_jp as select ';
v_predicate varchar2(4000);
begin
execute immediate 'select count(*) from '||v_table||' ' into v_numrows;
execute immediate
'select round((-1)*(sum((val)*(log(2,(val))))),4) '||
' from ( '||
' select '||v_classifier||',count(*)/'||v_numrows||' val '||
'from '||v_table||' group by '||v_classifier||')' into v_entropy;
dbms_output.put_line('Entropy '||v_entropy);
for c1 in (select column_name col from all_tab_columns where 1=1
and table_name=v_table and column_name not in (
SELECT regexp_substr(v_classifier ,'[^,]+', 1, level)
FROM dual CONNECT BY regexp_substr(v_classifier ,'[^,]+', 1, level) IS NOT NULL
) order by column_id) loop
execute immediate 'with t1 as (select '||c1.col||' cl1 ,count(*) val from '||v_table||
' group by '||c1.col||'), t2 as (select '||c1.col||' cl1,'||v_classifier||
' cl2,count(*) val from '||v_table||' group by '||c1.cOl||','||v_classifier||
'), t3 as (select t1.cl1,t1.val val1 ,t2.cl2,t2.val val2 from t1,t2 where t1.cl1 = t2.cl1 ),
t4 as (select cl1,(t3.val1/'||v_numrows||')*(sum((-1)*(val2/val1)*(log(2,(val2/val1))))) sum_val
from t3 group by cl1,t3.val1,'||v_numrows||')
select '||v_entropy||' - round(sum(sum_val),4) from t4' into v_gain;
if( v_gain > v_gainm) then
v_gainm:=v_gain;
v_rootnode:=c1.col;
end if;
end loop;
dbms_output.put_line('Root Node '||v_rootnode||' '||v_gainm);
execute immediate v_createview||v_rootnode||' as cl1, '||v_classifier||' as cl2 from '||v_table;
for c2 in ( select col.cl1,col.val total_count,clsfr.cl2,clsfr.val classifier_count from
(select cl1,count(*) val from temp_vw_jp group by cl1) col,
(select cl1,cl2,count(*) val from temp_vw_jp group by cl1,cl2) clsfr
where col.cl1 = clsfr.cl1 ) loop
if (c2.total_count = c2.classifier_count) then
dbms_output.put_line(v_rootnode||' -> '||c2.cl1||'('||c2.total_count||') -> '||v_classifier||' -> '||
c2.cl2||'('||c2.classifier_count||')');
v_rootvalue:=c2.cl1;
v_predicate:=' and '||v_rootnode||' not in ('''||v_rootvalue||''')';
else
null;
end if;
end loop;
for c3 in (select column_name from dba_tab_columns where table_name=v_table
and column_name not in (v_rootnode,v_classifier) order by 1) loop
v_centropy:=0;
execute immediate v_createview||v_rootnode||' cl1,'||c3.column_name||' cl2,'||v_classifier||
' cl3 from '||v_table||' where 1=1 '||v_predicate;
v_gain:=0;
for c31 in (select distinct cl1 from temp_vw_jp order by 1) loop
execute immediate 'with t1 as (select count(*) val from temp_vw_jp where cl1='''||c31.cl1||'''),
t2 as (select cl3,count(*) val from temp_vw_jp where cl1='''||c31.cl1||''' group by cl3)
select sum((t2.val/t1.val)*((-1)*(log(2,(t2.val/t1.val))))) from t1,t2' into v_centropy;
dbms_output.put_line(c31.cl1||' '||v_centropy);
for c32 in (
with t1 as (select count(*) val from temp_vw_jp where cl1=c31.cl1),
t2 as (select cl1,cl2,count(*) val from temp_vw_jp where cl1=c31.cl1 group by cl1,cl2),
t3 as (select cl1,cl2,cl3,count(*) val from temp_vw_jp where cl1=c31.cl1 group by cl1,cl2,cl3),
t4 as (select sum((-1)*(t3.val/t1.val)*(log(2,(t3.val/t2.val)))) val from t1,t2,t3 where t2.cl1=t3.cl1
and t2.cl2=t3.cl2) select t2.cl1,t2.cl2,t3.cl3,t2.val val2,t3.val val3,t4.val val4
from t2,t3,t4 where t2.cl1=t3.cl1 and t2.cl2=t3.cl2) loop
if (c32.val2 = c32.val3) then
dbms_output.put_line(v_rootnode||' -> '||c32.cl1||' '||c3.column_name||
' -> '||c32.cl2||'('||c32.val2||') '||v_classifier||' -> '||c32.cl3||' ('||c32.val3||')');
v_predicate:=v_predicate||' and '||v_rootnode||' not in ('''||c32.cl1||''')';
execute immediate v_createview||v_rootnode||' cl1,'||c3.column_name||' cl2,'||v_classifier||
' cl3 from '||v_table||' where 1=1 '||v_predicate;
end if;
end loop;
end loop;
end loop;
end;
/