OraFAQ Forum: Text & interMedia » Stoplists and Nickname lists

Home » Server Options » Text & interMedia » Stoplists and Nickname lists (Oracle Database 12c Enterprise Edition Release 12.1.0.2.0 - 64bit Production)

Show: Today's Messages :: Polls :: Message Navigator
E-mail to friend

Stoplists and Nickname lists [message #670903]

Wed, 01 August 2018 08:12

vka2b
Messages: 21
Registered: June 2018

Junior Member

Now that I've got a working NDATA name matching solution, I am trying to introduce some additional functionality into it, namely the use of stopword lists and nickname substitution lists. Despite performing the steps I think are necessary, these lists don't seem to be taken into account, so I'm guessing the same issue exists for both. I've pasted examples below that demonstrate the issue, using similar steps to the tests from my previous post, assuming Barbara is the one who'll be responding. Wink

I created a name in one table "John Smith" and "John almost Smith" under the assumption that 'almost' wouldn't be considered because it is part of the default English stoplist (according to the following link: https://docs.oracle.com/cd/B28359_01/text.111/b28304/astopsup.htm#CCREF1400 ). When that didn't occur, I tried to explicitly add it to my own stoplist, but experienced the same issue:


-- tables and data:

DROP TABLE candidate_name_stage;
CREATE TABLE candidate_name_stage (candidate_name VARCHAR2(350));
INSERT INTO candidate_name_stage values ('John Smith');

DROP TABLE target_name_stage;
CREATE TABLE target_name_stage (target_name VARCHAR2(350), type_cd VARCHAR2(1));
insert into target_name_stage values ('John almost Smith', 'P');

-- index on candidate_name_stage:
exec ctx_ddl.drop_preference     ('c_ds');
exec ctx_ddl.create_preference   ('c_ds', 'MULTI_COLUMN_DATASTORE');
exec ctx_ddl.set_attribute       ('c_ds', 'COLUMNS', 'candidate_name name');
exec ctx_ddl.drop_section_group  ('c_secgrp');
exec ctx_ddl.create_section_group('c_secgrp', 'BASIC_SECTION_GROUP');
exec ctx_ddl.add_ndata_section   ('c_secgrp', 'name', 'name');

-- add stoplist
exec ctx_ddl.drop_stoplist('c_sl');
exec ctx_ddl.create_stoplist('c_sl', 'BASIC_STOPLIST');
exec ctx_ddl.add_stopword('c_sl', 'almost');

CREATE INDEX candidate_name_idx ON candidate_name_stage(candidate_name) INDEXTYPE IS CTXSYS.CONTEXT
  parameters ('datastore c_ds  section group c_secgrp stoplist c_sl');


-- query:
column candidate_name format a25 word_wrapped;
column target_name    format a25 word_wrapped;

select distinct candidate_name, target_name, type_cd, score(1) as fuzzy_match_score
from   candidate_name_stage, target_name_stage
where  contains (candidate_name, 'ndata (name, ' || target_name || ')', 1) > 70
order  by candidate_name, score(1) desc;

This results in "no rows selected" because the score isn't above 70 (I was expecting it to be 100 or close to 100 since without "almost" the names would match). For completeness, I'm also pasting how I tried to use the default stoplist, which also resulted in not the match I was expecting:

-- tables and data:

DROP TABLE candidate_name_stage;
CREATE TABLE candidate_name_stage (candidate_name VARCHAR2(350));
INSERT INTO candidate_name_stage values ('John Smith');

DROP TABLE target_name_stage;
CREATE TABLE target_name_stage (target_name VARCHAR2(350), type_cd VARCHAR2(1));
insert into target_name_stage values ('John almost Smith', 'P');

-- index on candidate_name_stage:
exec ctx_ddl.drop_preference     ('c_ds');
exec ctx_ddl.create_preference   ('c_ds', 'MULTI_COLUMN_DATASTORE');
exec ctx_ddl.set_attribute       ('c_ds', 'COLUMNS', 'candidate_name name');
exec ctx_ddl.drop_section_group  ('c_secgrp');
exec ctx_ddl.create_section_group('c_secgrp', 'BASIC_SECTION_GROUP');
exec ctx_ddl.add_ndata_section   ('c_secgrp', 'name', 'name');

CREATE INDEX candidate_name_idx ON candidate_name_stage(candidate_name) INDEXTYPE IS CTXSYS.CONTEXT
  parameters ('datastore c_ds  section group c_secgrp stoplist CTXSYS.DEFAULT_STOPLIST');

-- query:
column candidate_name format a25 word_wrapped;
column target_name    format a25 word_wrapped;

select distinct candidate_name, target_name, type_cd, score(1) as fuzzy_match_score
from   candidate_name_stage, target_name_stage
where  contains (candidate_name, 'ndata (name, ' || target_name || ')', 1) > 70
order  by candidate_name, score(1) desc;

Finally, I will move this to a separate topic if it's not related, but I'm also not seeing my loading of nicknames being taken into account and I'm wondering if it's the same underlying issue. I'm trying to do something like:

-- tables and data:

DROP TABLE candidate_name_stage;
CREATE TABLE candidate_name_stage (candidate_name VARCHAR2(350));
INSERT INTO candidate_name_stage values ('Jonathan Smith');

DROP TABLE target_name_stage;
CREATE TABLE target_name_stage (target_name VARCHAR2(350), type_cd VARCHAR2(1));
insert into target_name_stage values ('John Smith', 'P');

-- index on candidate_name_stage:
exec ctx_ddl.drop_preference     ('c_ds');
exec ctx_ddl.create_preference   ('c_ds', 'MULTI_COLUMN_DATASTORE');
exec ctx_ddl.set_attribute       ('c_ds', 'COLUMNS', 'candidate_name name');
exec ctx_ddl.drop_section_group  ('c_secgrp');
exec ctx_ddl.create_section_group('c_secgrp', 'BASIC_SECTION_GROUP');
exec ctx_ddl.add_ndata_section   ('c_secgrp', 'name', 'name');

-- Load thesaurus

HOST ctxload -user testuser/testuser -thes -name nicknames -file nicknames.txt

exec ctx_ddl.drop_preference('name_wl');
exec ctx_ddl.create_preference('name_wl', 'BASIC_WORDLIST');
exec ctx_ddl.set_attribute('name_wl', 'NDATA_ALTERNATE_SPELLING', 'FALSE');
exec ctx_ddl.set_attribute('name_wl', 'NDATA_BASE_LETTER', 'TRUE');
exec ctx_ddl.set_attribute('name_wl', 'NDATA_THESAURUS', 'nicknames');
exec ctx_ddl.set_attribute('name_wl', 'NDATA_JOIN_PARTICLES',
   'de:di:la:da:el:del:qi:abd:los:la:dos:do:an:li:yi:yu:van:jon:un:sai:ben:al');

CREATE INDEX candidate_name_idx ON candidate_name_stage(candidate_name) INDEXTYPE IS CTXSYS.CONTEXT
  parameters ('datastore c_ds  section group c_secgrp wordlist name_wl');

-- query:
column candidate_name format a25 word_wrapped;
column target_name    format a25 word_wrapped;

select distinct candidate_name, target_name, type_cd, score(1) as fuzzy_match_score
from   candidate_name_stage, target_name_stage
where  contains (candidate_name, 'ndata (name, ' || target_name || ')', 1) > 70
order  by candidate_name, score(1) desc;

My nicknames.txt thesaurus contains:

Jonathan
SYN John (Jonathan)

But, I get "no rows returned" because "John" and "Jonathan" are not being considered as a match despite the thesaurus entry.

Any ideas as to why this might be? Thank you in advance.

Report message to a moderator

Re: Stoplists and Nickname lists [message #670904 is a reply to message #670903]

Wed, 01 August 2018 11:41

vka2b
Messages: 21
Registered: June 2018

Junior Member

I just realized that the first test of mine has a bit of a flaw (I added the stopword to the target table, but I indexed the candidate table). That said, I still experience the same behavior when I do things the right way. However, I will double-check this once I'm back at my machine. The nickname test should be accurate as is.

Report message to a moderator

Re: Stoplists and Nickname lists [message #670905 is a reply to message #670904]

Wed, 01 August 2018 13:03

vka2b
Messages: 21
Registered: June 2018

Junior Member

I got the stoplist issue worked out, it was in fact my mistake, and I apologize for not catching it earlier. The issue with the nickname list still remains though, but I'm still actively trying to work though that myself as well unless anybody can easily see something wrong with what I'm doing.

Report message to a moderator

Re: Stoplists and Nickname lists [message #670907 is a reply to message #670904]

Wed, 01 August 2018 13:30

Barbara Boehmer
Messages: 9090
Registered: November 2002
Location: California, USA

Senior Member

I will address the stoplist issue first in this reply, then the other issue in a separate reply later.

I am still scratching my head a bit on the stoplist issue. I can demonstrate how to confirm that "almost" is in the default stoplist and "almost" is in your user-defined stoplist and that your stoplist with the word "almost" is part of your index. However, I can then also demonstrate that the word "almost" is tokenized and indexed. Then, I can demonstrate that searching for "John Smith" or "John almost Smith" or "John however Smith" will find all three with lower scores for some. I tested these because, under normal circumstances, there is a pecuiliarity with stopwords, such that even though it does not index it, it still expects a word, any word, in its place. This did not seem to hold true here as it did not consider "almost" and "however" equivalent. So, it appears that if a stopword is not in the search string, then it ignores any stopwords in the results and scores them the same. However, it also appears that if a stopword is in the search string, then it includes that stopword in the search and in the scoring.

Although I was unable to locate any section of the online documentation that mentions it, according to Oracle Text product manager Roger Ford, "NDATA fields do not use the stopword list." He stated this in his response in the following thread:

https://community.oracle.com/message/14582153#14582153

Please see my test below. I made several modifications to what you posted, in order to test and demonstrate some things.

-- check for almost in default stoplist:

SCOTT@orcl_12.1.0.2.0> select spw_word from ctx_stopwords where spw_stoplist = 'DEFAULT_STOPLIST' and spw_word like 'al%' order by spw_word
  2  /

SPW_WORD
--------------------------------------------------------------------------------
all
almost
also
although

4 rows selected.

-- add stoplist with almost and check for it:

SCOTT@orcl_12.1.0.2.0> exec ctx_ddl.drop_stoplist('c_sl');

PL/SQL procedure successfully completed.

SCOTT@orcl_12.1.0.2.0> exec ctx_ddl.create_stoplist('c_sl', 'BASIC_STOPLIST');

PL/SQL procedure successfully completed.

SCOTT@orcl_12.1.0.2.0> exec ctx_ddl.add_stopword('c_sl', 'almost');

PL/SQL procedure successfully completed.

SCOTT@orcl_12.1.0.2.0> select spw_word from ctx_user_stopwords where spw_stoplist = 'C_SL'
  2  /

SPW_WORD
--------------------------------------------------------------------------------
almost

1 row selected.

-- tables and data:

SCOTT@orcl_12.1.0.2.0> DROP TABLE candidate_name_stage;

Table dropped.

SCOTT@orcl_12.1.0.2.0> CREATE TABLE candidate_name_stage (candidate_name VARCHAR2(350));

Table created.

SCOTT@orcl_12.1.0.2.0> INSERT INTO candidate_name_stage values ('John Smith');

1 row created.

SCOTT@orcl_12.1.0.2.0> INSERT INTO candidate_name_stage values ('John almost Smith');

1 row created.

SCOTT@orcl_12.1.0.2.0> INSERT INTO candidate_name_stage values ('John however Smith');

1 row created.

SCOTT@orcl_12.1.0.2.0> DROP TABLE target_name_stage;

Table dropped.

SCOTT@orcl_12.1.0.2.0> CREATE TABLE target_name_stage (target_name VARCHAR2(350), type_cd VARCHAR2(1));

Table created.

SCOTT@orcl_12.1.0.2.0> insert into target_name_stage values ('John Smith', 'P');

1 row created.

SCOTT@orcl_12.1.0.2.0> insert into target_name_stage values ('John almost Smith', 'P');

1 row created.

SCOTT@orcl_12.1.0.2.0> insert into target_name_stage values ('John however Smith', 'P');

1 row created.

-- index on candidate_name_stage:

SCOTT@orcl_12.1.0.2.0> exec ctx_ddl.drop_preference	('c_ds');

PL/SQL procedure successfully completed.

SCOTT@orcl_12.1.0.2.0> exec ctx_ddl.create_preference	('c_ds', 'MULTI_COLUMN_DATASTORE');

PL/SQL procedure successfully completed.

SCOTT@orcl_12.1.0.2.0> exec ctx_ddl.set_attribute	('c_ds', 'COLUMNS', 'candidate_name name');

PL/SQL procedure successfully completed.

SCOTT@orcl_12.1.0.2.0> exec ctx_ddl.drop_section_group	('c_secgrp');

PL/SQL procedure successfully completed.

SCOTT@orcl_12.1.0.2.0> exec ctx_ddl.create_section_group('c_secgrp', 'BASIC_SECTION_GROUP');

PL/SQL procedure successfully completed.

SCOTT@orcl_12.1.0.2.0> exec ctx_ddl.add_ndata_section	('c_secgrp', 'name', 'name');

PL/SQL procedure successfully completed.

SCOTT@orcl_12.1.0.2.0> CREATE INDEX candidate_name_idx ON candidate_name_stage(candidate_name) INDEXTYPE IS CTXSYS.CONTEXT
  2    parameters ('datastore c_ds  section group c_secgrp stoplist c_sl');

Index created.

-- describe index and confirm that stoplist and stopword almost are included:

SCOTT@orcl_12.1.0.2.0> select ctx_report.describe_index ('CANDIDATE_NAME_IDX') from dual
  2  /

CTX_REPORT.DESCRIBE_INDEX('CANDIDATE_NAME_IDX')
--------------------------------------------------------------------------------
===========================================================================
                             INDEX DESCRIPTION
===========================================================================
index name:                    "SCOTT"."CANDIDATE_NAME_IDX"
index id:                      6398
index type:                    context

base table:                    "SCOTT"."CANDIDATE_NAME_STAGE"
primary key column:
text column:                   CANDIDATE_NAME
text column type:              VARCHAR2(350)
language column:
format column:
charset column:
configuration column:
Query Stats Enabled:           NO


status:                        INDEXED
full optimize token:
full optimize count:
docid count:                   3
nextid:                        4

===========================================================================
                               INDEX OBJECTS
===========================================================================
datastore:                     MULTI_COLUMN_DATASTORE
   columns:                       candidate_name name

filter:                        NULL_FILTER

section group:                 BASIC_SECTION_GROUP
   ndata section:                 NAME
      section tag:                   NAME

lexer:                         BASIC_LEXER

wordlist:                      BASIC_WORDLIST
   stemmer:                       ENGLISH
   fuzzy_match:                   GENERIC

stoplist:                      BASIC_STOPLIST
   stop_word:                     almost

storage:                       BASIC_STORAGE
   r_table_clause:                lob (data) store as (cache)
   i_index_clause:                compress 2



1 row selected.

-- what words are tokenized and indexed:

SCOTT@orcl_12.1.0.2.0> select token_text from dr$candidate_name_idx$i order by token_text
  2  /

TOKEN_TEXT
----------------------------------------------------------------
^alm
^alo
^amo
^hoe
^how
^hwe
^jhn
^joh
^jon
^lmo
^mit
^ohn
^owe
^sit
^smi
^smt
almo
alms
alos
amos
eer$
eve$
ever
evr$
hoev
howe
howv
hwev
ith$
jhn$
joh$
john
jon$
lmos
lmot
lmst
lost
mih$
mit$
mith
mos$
most
mot$
mst$
mth$
oeve
ohn$
ost$
owee
owev
owve
sith
smih
smit
smth
ver$
weer
weve
wevr
wver

60 rows selected.

-- query:

SCOTT@orcl_12.1.0.2.0> column searched_for   format a25 word_wrapped
SCOTT@orcl_12.1.0.2.0> column found	     format a25 word_wrapped
SCOTT@orcl_12.1.0.2.0> select distinct target_name searched_for, candidate_name found, type_cd, score(1) as fuzzy_match_score
  2  from   candidate_name_stage, target_name_stage
  3  where  contains (candidate_name, 'ndata (name, ' || target_name || ')', 1) > 0
  4  order  by target_name, score(1) desc
  5  /

SEARCHED_FOR              FOUND                     T FUZZY_MATCH_SCORE
------------------------- ------------------------- - -----------------
John Smith                John Smith                P               100
John Smith                John almost Smith         P               100
John Smith                John however Smith        P               100
John almost Smith         John almost Smith         P               100
John almost Smith         John Smith                P                65
John almost Smith         John however Smith        P                65
John however Smith        John however Smith        P               100
John however Smith        John Smith                P                62
John however Smith        John almost Smith         P                62

9 rows selected.

Report message to a moderator

Re: Stoplists and Nickname lists [message #670908 is a reply to message #670907]

Wed, 01 August 2018 14:02

vka2b
Messages: 21
Registered: June 2018

Junior Member

I'm thinking perhaps you posted your response before you saw my follow-ups indicating that I got the results I expected -- I apologize that you spent time on this if so (though I'm still interested in hearing your thoughts on the other issue). The following works:

-- tables and data:

DROP TABLE candidate_name_stage;
CREATE TABLE candidate_name_stage (candidate_name VARCHAR2(350));
INSERT INTO candidate_name_stage values ('John almost Smith');

DROP TABLE target_name_stage;
CREATE TABLE target_name_stage (target_name VARCHAR2(350), type_cd VARCHAR2(1));
insert into target_name_stage values ('John Smith', 'P');

-- index on candidate_name_stage:
exec ctx_ddl.drop_preference     ('c_ds');
exec ctx_ddl.create_preference   ('c_ds', 'MULTI_COLUMN_DATASTORE');
exec ctx_ddl.set_attribute       ('c_ds', 'COLUMNS', 'candidate_name name');
exec ctx_ddl.drop_section_group  ('c_secgrp');
exec ctx_ddl.create_section_group('c_secgrp', 'BASIC_SECTION_GROUP');
exec ctx_ddl.add_ndata_section   ('c_secgrp', 'name', 'name');

-- add stoplist
exec ctx_ddl.drop_stoplist('c_sl');
exec ctx_ddl.create_stoplist('c_sl', 'BASIC_STOPLIST');
exec ctx_ddl.add_stopword('c_sl', 'almost');

CREATE INDEX candidate_name_idx ON candidate_name_stage(candidate_name) INDEXTYPE IS CTXSYS.CONTEXT
  parameters ('datastore c_ds  section group c_secgrp stoplist c_sl');


-- query:
column candidate_name format a25 word_wrapped;
column target_name    format a25 word_wrapped;

select distinct candidate_name, target_name, type_cd, score(1) as fuzzy_match_score
from   candidate_name_stage, target_name_stage
where  contains (candidate_name, 'ndata (name, ' || target_name || ')', 1) > 70
order  by candidate_name, score(1) desc;

Result:

CANDIDATE_NAME            TARGET_NAME               T FUZZY_MATCH_SCORE
------------------------- ------------------------- - -----------------
John almost Smith         John Smith                P               100

Report message to a moderator

Re: Stoplists and Nickname lists [message #670909 is a reply to message #670908]

Wed, 01 August 2018 14:05

Barbara Boehmer
Messages: 9090
Registered: November 2002
Location: California, USA

Senior Member

Please post the result of the following:

select thp_phrase from ctx_thes_phrases where thp_thesaurus ='NICKNAMES';

Report message to a moderator

Re: Stoplists and Nickname lists [message #670910 is a reply to message #670908]

Wed, 01 August 2018 14:42

Barbara Boehmer
Messages: 9090
Registered: November 2002
Location: California, USA

Senior Member

Please compare what you did to what I did below. In the nicknames.txt the second line begins with a tab. I made sure that any pre-existing nicknames thesaurus was droppedd. I loaded the thesaurus as the user ctxsys and added it to the knowledge base using ctxkbtc. The query after including that in the index returns a higher score than the query without the thesaurus.

-- c:\my_oracle_files\nicknames.txt:

Jonathan
	SYN John (Jonathan)

-- tables and data:

SCOTT@orcl_12.1.0.2.0> DROP TABLE candidate_name_stage;

Table dropped.

SCOTT@orcl_12.1.0.2.0> CREATE TABLE candidate_name_stage (candidate_name VARCHAR2(350));

Table created.

SCOTT@orcl_12.1.0.2.0> INSERT INTO candidate_name_stage values ('Jonathan Smith');

1 row created.

SCOTT@orcl_12.1.0.2.0> DROP TABLE target_name_stage;

Table dropped.

SCOTT@orcl_12.1.0.2.0> CREATE TABLE target_name_stage (target_name VARCHAR2(350), type_cd VARCHAR2(1));

Table created.

SCOTT@orcl_12.1.0.2.0> insert into target_name_stage values ('John Smith', 'P');

1 row created.

-- index on candidate_name_stage without thesaurus:

SCOTT@orcl_12.1.0.2.0> exec ctx_ddl.drop_preference     ('c_ds');

PL/SQL procedure successfully completed.

SCOTT@orcl_12.1.0.2.0> exec ctx_ddl.create_preference   ('c_ds', 'MULTI_COLUMN_DATASTORE');

PL/SQL procedure successfully completed.

SCOTT@orcl_12.1.0.2.0> exec ctx_ddl.set_attribute       ('c_ds', 'COLUMNS', 'candidate_name name');

PL/SQL procedure successfully completed.

SCOTT@orcl_12.1.0.2.0> exec ctx_ddl.drop_section_group  ('c_secgrp');

PL/SQL procedure successfully completed.

SCOTT@orcl_12.1.0.2.0> exec ctx_ddl.create_section_group('c_secgrp', 'BASIC_SECTION_GROUP');

PL/SQL procedure successfully completed.

SCOTT@orcl_12.1.0.2.0> exec ctx_ddl.add_ndata_section   ('c_secgrp', 'name', 'name');

PL/SQL procedure successfully completed.

SCOTT@orcl_12.1.0.2.0> exec ctx_ddl.drop_preference('name_wl');

PL/SQL procedure successfully completed.

SCOTT@orcl_12.1.0.2.0> exec ctx_ddl.create_preference('name_wl', 'BASIC_WORDLIST');

PL/SQL procedure successfully completed.

SCOTT@orcl_12.1.0.2.0> exec ctx_ddl.set_attribute('name_wl', 'NDATA_ALTERNATE_SPELLING', 'FALSE');

PL/SQL procedure successfully completed.

SCOTT@orcl_12.1.0.2.0> exec ctx_ddl.set_attribute('name_wl', 'NDATA_BASE_LETTER', 'TRUE');

PL/SQL procedure successfully completed.

SCOTT@orcl_12.1.0.2.0> exec ctx_ddl.set_attribute('name_wl', 'NDATA_JOIN_PARTICLES', 'de:di:la:da:el:del:qi:abd:los:la:dos:do:an:li:yi:yu:van:jon:un:sai:ben:al');

PL/SQL procedure successfully completed.

SCOTT@orcl_12.1.0.2.0> CREATE INDEX candidate_name_idx ON candidate_name_stage(candidate_name) INDEXTYPE IS CTXSYS.CONTEXT
  2    parameters ('datastore c_ds  section group c_secgrp wordlist name_wl');

Index created.

-- query without thesaurus:

SCOTT@orcl_12.1.0.2.0> column candidate_name format a25 word_wrapped;
SCOTT@orcl_12.1.0.2.0> column target_name    format a25 word_wrapped;
SCOTT@orcl_12.1.0.2.0> select distinct candidate_name, target_name, type_cd, score(1) as fuzzy_match_score
  2  from   candidate_name_stage, target_name_stage
  3  where  contains (candidate_name, 'ndata (name, ' || target_name || ')', 1) > 70
  4  order  by candidate_name, score(1) desc;

CANDIDATE_NAME            TARGET_NAME               T FUZZY_MATCH_SCORE
------------------------- ------------------------- - -----------------
Jonathan Smith            John Smith                P                88

1 row selected.

-- Load thesaurus

SCOTT@orcl_12.1.0.2.0> connect ctxsys/ctxsys
Connected.
CTXSYS@orcl_12.1.0.2.0> exec ctx_thes.drop_thesaurus ('NICKNAMES')

PL/SQL procedure successfully completed.

CTXSYS@orcl_12.1.0.2.0> host ctxload -user ctxsys/ctxsys -thes -name nicknames -file C:\my_oracle_files\nicknames.txt
Connecting...
Creating thesaurus nicknames...
Thesaurus nicknames created...
Processing...
2 lines processed successfully
Beginning insert...2 lines inserted successfully
Disconnected

CTXSYS@orcl_12.1.0.2.0> host ctxkbtc -user ctxsys/ctxsys -name nicknames
Oracle Text knowledge base extension: thesaurus compiler
Connecting...
Deleting old extended knowledge base.
Deleting C:\app\baboehme\product\12.1.0\dbhome_2\ctx\data\enlx\drelsUS.dat
Deleting C:\app\baboehme\product\12.1.0\dbhome_2\ctx\data\enlx\dreldUS.dat
Deleting C:\app\baboehme\product\12.1.0\dbhome_2\ctx\data\enlx\dreliUS.dat
Deleting C:\app\baboehme\product\12.1.0\dbhome_2\ctx\data\enlx\drelkUS.dat
Deleting C:\app\baboehme\product\12.1.0\dbhome_2\ctx\data\enlx\drefdUS.dat
Deleting C:\app\baboehme\product\12.1.0\dbhome_2\ctx\data\enlx\drefiUS.dat
Processing thesaurus: NICKNAMES
Processed 2 terms.
Done processing thesaurus: NICKNAMES
Compiling and writing new flat files.
Writing extended knowledge base to files.
.............................................
.......................................................................................................................
Oracle Text knowledge base successfully extended.
Disconnected

CTXSYS@orcl_12.1.0.2.0> connect scott/tiger
Connected.
SCOTT@orcl_12.1.0.2.0> select thp_phrase from ctx_thes_phrases where thp_thesaurus ='NICKNAMES';

THP_PHRASE
--------------------------------------------------------------------------------
JOHN
JONATHAN

2 rows selected.

-- add thesaurus to index:

SCOTT@orcl_12.1.0.2.0> drop index candidate_name_idx
  2  /

Index dropped.

SCOTT@orcl_12.1.0.2.0> exec ctx_ddl.set_attribute('name_wl', 'NDATA_THESAURUS', 'NICKNAMES');

PL/SQL procedure successfully completed.

SCOTT@orcl_12.1.0.2.0> CREATE INDEX candidate_name_idx ON candidate_name_stage(candidate_name) INDEXTYPE IS CTXSYS.CONTEXT
  2    parameters ('datastore c_ds  section group c_secgrp wordlist name_wl');

Index created.

-- query with thesaurus:

SCOTT@orcl_12.1.0.2.0> column candidate_name format a25 word_wrapped;
SCOTT@orcl_12.1.0.2.0> column target_name    format a25 word_wrapped;
SCOTT@orcl_12.1.0.2.0> select distinct candidate_name, target_name, type_cd, score(1) as fuzzy_match_score
  2  from   candidate_name_stage, target_name_stage
  3  where  contains (candidate_name, 'ndata (name, ' || target_name || ')', 1) > 70
  4  order  by candidate_name, score(1) desc;

CANDIDATE_NAME            TARGET_NAME               T FUZZY_MATCH_SCORE
------------------------- ------------------------- - -----------------
Jonathan Smith            John Smith                P                97

1 row selected.

Report message to a moderator

Re: Stoplists and Nickname lists [message #670923 is a reply to message #670910]

Thu, 02 August 2018 19:17

vka2b
Messages: 21
Registered: June 2018

Junior Member

I'm not sure if it was due to dropping a potentially existing thesaurus or making sure that second line in nicknames.txt was a tab, but everything is working as expected now. Thanks again for your help!

Report message to a moderator

Re: Stoplists and Nickname lists [message #671772 is a reply to message #670903]

Tue, 18 September 2018 20:59

vka2b
Messages: 21
Registered: June 2018

Junior Member

Again I have to resurrect an old thread, as I appear to have discovered something related to this thread. When we were discussing stopwords, you had pointed out that Roger Ford had said "NDATA fields do not use the stopword list." I think this is correct, and I think because of some other peculiarity it appeared as if our test was actually using the stopword list, but I don't think it actually was. We had compared "John Smith" and "John almost Smith" and assumed that because "almost" was part of the default stoplist, the reason the score came back as 100 is that "almost" was being removed from the comparison. However, I discovered that regardless of that "almost," the fact that "John" and "John" match and "Smith" and "Smith" match is what is causing that score to come back as 100. In other words, run the test with "John Smith" and "John garbage Smith" The score still comes back as 100! But this isn't because "garbage" is a stopword. For some other reason it isn't being considered in the comparison, resulting in a 100 score. Do you know why this is? To demonstrate this another way, try a test with 2 tokens that don't match at all, along with a stopword, e.g. "M almost" compared to "O almost" You would expect a score of 0 (since "almost" would be ignored, and "M vs "O" don't match at all). But instead, you get a score of 71, because "almost" is actually taken into account!

Report message to a moderator

Re: Stoplists and Nickname lists [message #671773 is a reply to message #671772]

Tue, 18 September 2018 22:02

Barbara Boehmer
Messages: 9090
Registered: November 2002
Location: California, USA

Senior Member

Part of what you are seeing is what I described in an answer in a previous thread. "John" is contained in "John almost Smith" and "John garbage Smith" and "Smith" is contained in "John almost Smith" and "John garbage Smith", so searching for "John Smith" using a "contains" query that uses ndata matches both 100%. It checks whether those words are contained, not whether the entire group of words matches.

You can create an index without ndata and then drop and recreate the same index with ndata and see the differences in the tokens created. Without ndata, stopwords like almost are not tokenized. With ndata, stopwords like almost are tokenized and therefore included in searching. So, if you compare "m almost" and "o almost", the two "almost" match, but the m and o do not match each other, so it is a partial match.

When tokenization using ndata is done, it creates multiple strings where ^ indicates starting and $ indicates ending and others may be in the middle. So, when you search for something using ndata, it checks to see if what you searched for is similar to any of those little strings.

Please see the demonstration below that shows indexing and resulting tokens without ndata, then indexing and resulting tokens with ndata.

-- table and data:

SCOTT@orcl_12.1.0.2.0> DROP TABLE candidate_name_stage;

Table dropped.

SCOTT@orcl_12.1.0.2.0> CREATE TABLE candidate_name_stage (candidate_name VARCHAR2(350));

Table created.

SCOTT@orcl_12.1.0.2.0> INSERT INTO candidate_name_stage values ('John almost Smith');

1 row created.

-- index on candidate_name_stage using stop list with "almost" WITHOUT ndata, and resulting tokens:

SCOTT@orcl_12.1.0.2.0> exec ctx_ddl.drop_stoplist('c_sl')

PL/SQL procedure successfully completed.

SCOTT@orcl_12.1.0.2.0> exec ctx_ddl.create_stoplist('c_sl', 'BASIC_STOPLIST')

PL/SQL procedure successfully completed.

SCOTT@orcl_12.1.0.2.0> exec ctx_ddl.add_stopword('c_sl', 'almost')

PL/SQL procedure successfully completed.

SCOTT@orcl_12.1.0.2.0> CREATE INDEX candidate_name_idx ON candidate_name_stage(candidate_name) INDEXTYPE IS CTXSYS.CONTEXT
  2    parameters ('stoplist c_sl');

Index created.

SCOTT@orcl_12.1.0.2.0> select token_text from dr$candidate_name_idx$i;

TOKEN_TEXT
----------------------------------------------------------------
JOHN
SMITH

2 rows selected.

-- index on candidate_name_stage using stop list with "almost" WITH ndata, and resulting tokens:

SCOTT@orcl_12.1.0.2.0> exec ctx_ddl.drop_preference	('c_ds')

PL/SQL procedure successfully completed.

SCOTT@orcl_12.1.0.2.0> exec ctx_ddl.create_preference	('c_ds', 'MULTI_COLUMN_DATASTORE')

PL/SQL procedure successfully completed.

SCOTT@orcl_12.1.0.2.0> exec ctx_ddl.set_attribute	('c_ds', 'COLUMNS', 'candidate_name name')

PL/SQL procedure successfully completed.

SCOTT@orcl_12.1.0.2.0> exec ctx_ddl.drop_section_group	('c_secgrp')

PL/SQL procedure successfully completed.

SCOTT@orcl_12.1.0.2.0> exec ctx_ddl.create_section_group('c_secgrp', 'BASIC_SECTION_GROUP')

PL/SQL procedure successfully completed.

SCOTT@orcl_12.1.0.2.0> exec ctx_ddl.add_ndata_section	('c_secgrp', 'name', 'name')

PL/SQL procedure successfully completed.

SCOTT@orcl_12.1.0.2.0> drop index candidate_name_idx;

Index dropped.

SCOTT@orcl_12.1.0.2.0> CREATE INDEX candidate_name_idx ON candidate_name_stage(candidate_name) INDEXTYPE IS CTXSYS.CONTEXT
  2    parameters ('datastore c_ds  section group c_secgrp stoplist c_sl');

Index created.

SCOTT@orcl_12.1.0.2.0> select token_text from dr$candidate_name_idx$i;

TOKEN_TEXT
----------------------------------------------------------------
^alm
^alo
^amo
^jhn
^joh
^jon
^lmo
^mit
^ohn
^sit
^smi
^smt
almo
alms
alos
amos
ith$
jhn$
joh$
john
jon$
lmos
lmot
lmst
lost
mih$
mit$
mith
mos$
most
mot$
mst$
mth$
ohn$
ost$
sith
smih
smit
smth

39 rows selected.

Report message to a moderator

Re: Stoplists and Nickname lists [message #671784 is a reply to message #671773]

Wed, 19 September 2018 08:26

vka2b
Messages: 21
Registered: June 2018

Junior Member

Thank you for your response. Your example clearly illustrates what you have stated: "Without ndata, stopwords like almost are not tokenized. With ndata, stopwords like almost are tokenized and therefore included in searching." So does the fact that with ndata stopwords are tokenized mean that the following statement is true: "NDATA fields do not use the stopword list." If so, that changes my solution slightly as I thought I could use the stopword list with ndata. As I mentioned, I think I was thrown off because I took the fact that "John almost Smith" and "John Smith" matched at 100 to mean that "almost" was being ignored, but as you have stated, the reason for the 100 is because of the CONTAINS, not because "almost" is not being tokenized. Correct?

Report message to a moderator

Re: Stoplists and Nickname lists [message #671785 is a reply to message #671784]

Wed, 19 September 2018 08:46

Barbara Boehmer
Messages: 9090
Registered: November 2002
Location: California, USA

Senior Member

vka2b wrote on Wed, 19 September 2018 06:26

Thank you for your response. Your example clearly illustrates what you have stated: "Without ndata, stopwords like almost are not tokenized. With ndata, stopwords like almost are tokenized and therefore included in searching." So does the fact that with ndata stopwords are tokenized mean that the following statement is true: "NDATA fields do not use the stopword list." If so, that changes my solution slightly as I thought I could use the stopword list with ndata. As I mentioned, I think I was thrown off because I took the fact that "John almost Smith" and "John Smith" matched at 100 to mean that "almost" was being ignored, but as you have stated, the reason for the 100 is because of the CONTAINS, not because "almost" is not being tokenized. Correct?

As Roger Ford said, "NDATA fields do not use the stopword list" and we can see that this is true.

With NDATA, stopwords like "almost" are not being ignored and are being tokenized and included in searches.

It is the combination of CONTAINS and NDATA that causes "John Smith" to match "John anything Smith" 100%. It is like searching for "John AND Smith" without NDATA.

So, although you can create a stoplist and include it in indexing along with NDATA without raising an error, the stoplist is ignored. So, functionally you cannot use a stoplist with NDATA.

Report message to a moderator

Re: Stoplists and Nickname lists [message #671790 is a reply to message #671785]

Wed, 19 September 2018 09:36

vka2b
Messages: 21
Registered: June 2018

Junior Member

Thank you for confirming.

Report message to a moderator

Previous Topic:	Oracle Text Performance
Next Topic:	Uncertainty while searching string which contains æøå characters

Goto Forum:

-=] Back to Top [=-

[ Syndicate this forum (XML) ] [

]

Current Time: Fri Apr 26 11:41:18 CDT 2024