Home » Server Options » Text & interMedia » Stoplists and Nickname lists (Oracle Database 12c Enterprise Edition Release 12.1.0.2.0 - 64bit Production)
Stoplists and Nickname lists [message #670903] Wed, 01 August 2018 08:12 Go to previous message
vka2b
Messages: 21
Registered: June 2018
Junior Member
Now that I've got a working NDATA name matching solution, I am trying to introduce some additional functionality into it, namely the use of stopword lists and nickname substitution lists. Despite performing the steps I think are necessary, these lists don't seem to be taken into account, so I'm guessing the same issue exists for both. I've pasted examples below that demonstrate the issue, using similar steps to the tests from my previous post, assuming Barbara is the one who'll be responding. Wink

I created a name in one table "John Smith" and "John almost Smith" under the assumption that 'almost' wouldn't be considered because it is part of the default English stoplist (according to the following link: https://docs.oracle.com/cd/B28359_01/text.111/b28304/astopsup.htm#CCREF1400 ). When that didn't occur, I tried to explicitly add it to my own stoplist, but experienced the same issue:


-- tables and data:

DROP TABLE candidate_name_stage;
CREATE TABLE candidate_name_stage (candidate_name VARCHAR2(350));
INSERT INTO candidate_name_stage values ('John Smith');

DROP TABLE target_name_stage;
CREATE TABLE target_name_stage (target_name VARCHAR2(350), type_cd VARCHAR2(1));
insert into target_name_stage values ('John almost Smith', 'P');

-- index on candidate_name_stage:
exec ctx_ddl.drop_preference     ('c_ds');
exec ctx_ddl.create_preference   ('c_ds', 'MULTI_COLUMN_DATASTORE');
exec ctx_ddl.set_attribute       ('c_ds', 'COLUMNS', 'candidate_name name');
exec ctx_ddl.drop_section_group  ('c_secgrp');
exec ctx_ddl.create_section_group('c_secgrp', 'BASIC_SECTION_GROUP');
exec ctx_ddl.add_ndata_section   ('c_secgrp', 'name', 'name');

-- add stoplist
exec ctx_ddl.drop_stoplist('c_sl');
exec ctx_ddl.create_stoplist('c_sl', 'BASIC_STOPLIST');
exec ctx_ddl.add_stopword('c_sl', 'almost');

CREATE INDEX candidate_name_idx ON candidate_name_stage(candidate_name) INDEXTYPE IS CTXSYS.CONTEXT
  parameters ('datastore c_ds  section group c_secgrp stoplist c_sl');


-- query:
column candidate_name format a25 word_wrapped;
column target_name    format a25 word_wrapped;

select distinct candidate_name, target_name, type_cd, score(1) as fuzzy_match_score
from   candidate_name_stage, target_name_stage
where  contains (candidate_name, 'ndata (name, ' || target_name || ')', 1) > 70
order  by candidate_name, score(1) desc;


This results in "no rows selected" because the score isn't above 70 (I was expecting it to be 100 or close to 100 since without "almost" the names would match). For completeness, I'm also pasting how I tried to use the default stoplist, which also resulted in not the match I was expecting:

-- tables and data:

DROP TABLE candidate_name_stage;
CREATE TABLE candidate_name_stage (candidate_name VARCHAR2(350));
INSERT INTO candidate_name_stage values ('John Smith');

DROP TABLE target_name_stage;
CREATE TABLE target_name_stage (target_name VARCHAR2(350), type_cd VARCHAR2(1));
insert into target_name_stage values ('John almost Smith', 'P');

-- index on candidate_name_stage:
exec ctx_ddl.drop_preference     ('c_ds');
exec ctx_ddl.create_preference   ('c_ds', 'MULTI_COLUMN_DATASTORE');
exec ctx_ddl.set_attribute       ('c_ds', 'COLUMNS', 'candidate_name name');
exec ctx_ddl.drop_section_group  ('c_secgrp');
exec ctx_ddl.create_section_group('c_secgrp', 'BASIC_SECTION_GROUP');
exec ctx_ddl.add_ndata_section   ('c_secgrp', 'name', 'name');

CREATE INDEX candidate_name_idx ON candidate_name_stage(candidate_name) INDEXTYPE IS CTXSYS.CONTEXT
  parameters ('datastore c_ds  section group c_secgrp stoplist CTXSYS.DEFAULT_STOPLIST');

-- query:
column candidate_name format a25 word_wrapped;
column target_name    format a25 word_wrapped;

select distinct candidate_name, target_name, type_cd, score(1) as fuzzy_match_score
from   candidate_name_stage, target_name_stage
where  contains (candidate_name, 'ndata (name, ' || target_name || ')', 1) > 70
order  by candidate_name, score(1) desc;

Finally, I will move this to a separate topic if it's not related, but I'm also not seeing my loading of nicknames being taken into account and I'm wondering if it's the same underlying issue. I'm trying to do something like:

-- tables and data:

DROP TABLE candidate_name_stage;
CREATE TABLE candidate_name_stage (candidate_name VARCHAR2(350));
INSERT INTO candidate_name_stage values ('Jonathan Smith');

DROP TABLE target_name_stage;
CREATE TABLE target_name_stage (target_name VARCHAR2(350), type_cd VARCHAR2(1));
insert into target_name_stage values ('John Smith', 'P');

-- index on candidate_name_stage:
exec ctx_ddl.drop_preference     ('c_ds');
exec ctx_ddl.create_preference   ('c_ds', 'MULTI_COLUMN_DATASTORE');
exec ctx_ddl.set_attribute       ('c_ds', 'COLUMNS', 'candidate_name name');
exec ctx_ddl.drop_section_group  ('c_secgrp');
exec ctx_ddl.create_section_group('c_secgrp', 'BASIC_SECTION_GROUP');
exec ctx_ddl.add_ndata_section   ('c_secgrp', 'name', 'name');

-- Load thesaurus

HOST ctxload -user testuser/testuser -thes -name nicknames -file nicknames.txt

exec ctx_ddl.drop_preference('name_wl');
exec ctx_ddl.create_preference('name_wl', 'BASIC_WORDLIST');
exec ctx_ddl.set_attribute('name_wl', 'NDATA_ALTERNATE_SPELLING', 'FALSE');
exec ctx_ddl.set_attribute('name_wl', 'NDATA_BASE_LETTER', 'TRUE');
exec ctx_ddl.set_attribute('name_wl', 'NDATA_THESAURUS', 'nicknames');
exec ctx_ddl.set_attribute('name_wl', 'NDATA_JOIN_PARTICLES',
   'de:di:la:da:el:del:qi:abd:los:la:dos:do:an:li:yi:yu:van:jon:un:sai:ben:al');

CREATE INDEX candidate_name_idx ON candidate_name_stage(candidate_name) INDEXTYPE IS CTXSYS.CONTEXT
  parameters ('datastore c_ds  section group c_secgrp wordlist name_wl');

-- query:
column candidate_name format a25 word_wrapped;
column target_name    format a25 word_wrapped;

select distinct candidate_name, target_name, type_cd, score(1) as fuzzy_match_score
from   candidate_name_stage, target_name_stage
where  contains (candidate_name, 'ndata (name, ' || target_name || ')', 1) > 70
order  by candidate_name, score(1) desc;

My nicknames.txt thesaurus contains:

Jonathan
SYN John (Jonathan)

But, I get "no rows returned" because "John" and "Jonathan" are not being considered as a match despite the thesaurus entry.

Any ideas as to why this might be? Thank you in advance.
 
Read Message
Read Message
Read Message
Read Message
Read Message
Read Message
Read Message
Read Message
Read Message
Read Message
Read Message
Read Message
Read Message
Previous Topic: Oracle Text Performance
Next Topic: Uncertainty while searching string which contains æøå characters
Goto Forum:
  


Current Time: Thu Apr 25 22:30:44 CDT 2024