Home » SQL & PL/SQL » SQL & PL/SQL » Term Match with Jaro-Winkler logic (11)
Term Match with Jaro-Winkler logic [message #666934] Fri, 01 December 2017 13:52 Go to next message
OracleUser2017
Messages: 7
Registered: December 2017
Junior Member
I'm looking for a solution that works for the speed and does not need days or even months / years.

«SEARCH_Term_tbl» has about 300 thousand records and
«TEXT_Term_tbl» has about 10 million records

Would like to search every single word in table «SEARCH_Term_tbl» against every single word in table «TEXT_Term_tbl» and compare it with the Jaro-Winkler logic. The output should be, count every single word found in table «TEXT_Term_tbl» which matches with Jaro-Winkler logic > 96%.

Table Example:

Insert into SEARCH_Term_tbl
SELECT 1 as ST_ID 'Géraldine Jerome dos Santos' as Search_Term FROM DUAL UNION ALL
SELECT 2 as ST_ID 'Magitta Evolet Jerome van der Vart' as Search_Term FROM DUAL UNION ALL
SELECT 3 as ST_ID 'Adi Putera Messi' as Search_Term FROM DUAL UNION ALL
SELECT 4 as ST_ID 'Ronaldo Cristiano Perez Portugal' as Search_Term FROM DUAL
SELECT 5 etc etc

Insert into TEXT_Term_tbl
SELECT 1 as ST_ID ' Rafael Ferdinand van der Vaart ist ein niederländischer Fußballspieler. Der Mittelfeldspieler steht seit August 2016 beim FC Midtjylland in der dänischen Superliga unter Vertrag' as Text_Term FROM DUAL
SELECT 2 as ST_ID 'Cristiano Ronaldo dos Santos Aveiro ist ein portugiesischer Fußballspieler. Er steht nach dem seinerzeit teuersten Transfer der Fußballgeschichte seit Sommer 2009 bei Real Madrid unter Vertrag und ist seit 2015 Rekordtorschütze des Vereins' as Text_Term FROM DUAL
SELECT 3 etc etc
Re: Term Match with Jaro-Winkler logic [message #666935 is a reply to message #666934] Fri, 01 December 2017 14:50 Go to previous messageGo to next message
BlackSwan
Messages: 26766
Registered: January 2009
Location: SoCal
Senior Member
Welcome to this forum

Please read and follow the forum guidelines, to enable us to help you:
OraFAQ Forum Guide
How to use {code} tags and make your code easier to read
Re: Term Match with Jaro-Winkler logic [message #666937 is a reply to message #666934] Fri, 01 December 2017 23:09 Go to previous messageGo to next message
Barbara Boehmer
Messages: 9088
Registered: November 2002
Location: California, USA
Senior Member
The following separates the words in each table, then compares and counts them. It counts occurrences of words, not distinct words, so for example it counts der in the search_term_tbl matching 3 occurrences of der in the text_term_tbl. If you want to count distinct words, then you could add the distinct keyword, but that would slow it down significantly. Anything that you use is going to be slow. You will have to test the following to see how fast or slow it is.

-- test data:
SCOTT@orcl_12.1.0.2.0> create table SEARCH_Term_tbl as
  2  SELECT 1 as ST_ID, 'Géraldine Jerome dos Santos' as Search_Term FROM DUAL UNION ALL
  3  SELECT 2 as ST_ID, 'Magitta Evolet Jerome van der Vart' as Search_Term FROM DUAL UNION ALL
  4  SELECT 3 as ST_ID, 'Adi Putera Messi' as Search_Term FROM DUAL UNION ALL
  5  SELECT 4 as ST_ID, 'Ronaldo Cristiano Perez Portugal' as Search_Term FROM DUAL
  6  /

Table created.

SCOTT@orcl_12.1.0.2.0> create table TEXT_Term_tbl as
  2  SELECT 1 as ST_ID, ' Rafael Ferdinand van der Vaart ist ein niederländischer Fußballspieler. Der Mittelfeldspieler steht seit August 2016 beim FC Midtjylland in der dänischen Superliga unter Vertrag' as Text_Term FROM DUAL UNION ALL
  3  SELECT 2 as ST_ID, 'Cristiano Ronaldo dos Santos Aveiro ist ein portugiesischer Fußballspieler. Er steht nach dem seinerzeit teuersten Transfer der Fußballgeschichte seit Sommer 2009 bei Real Madrid unter Vertrag und ist seit 2015 Rekordtorschütze des Vereins' as Text_Term FROM DUAL
  4  /

Table created.

-- query:
SCOTT@orcl_12.1.0.2.0> column  term format a30
SCOTT@orcl_12.1.0.2.0> column  text format a30
SCOTT@orcl_12.1.0.2.0> select  terms.term, count(texts.text)
  2  from    (select  regexp_substr (search_term, '[^ ]+', 1, column_value) term
  3  	      from    search_term_tbl,
  4  		      table
  5  			(cast
  6  			  (multiset
  7  			    (select level
  8  			     from   dual
  9  			     connect by level <= regexp_count (search_term, ' ') + 1)
 10  			 as sys.odcinumberlist))) terms,
 11  	     (select  regexp_substr (text_term, '[^ ]+', 1, column_value) text
 12  	      from    text_term_tbl,
 13  		      table
 14  			(cast
 15  			  (multiset
 16  			    (select level
 17  			     from   dual
 18  			     connect by level <= regexp_count (text_term, ' ') + 1)
 19  			   as sys.odcinumberlist))) texts
 20  where  utl_match.jaro_winkler_similarity(terms.term, texts.text) > 96
 21  group  by terms.term
 22  /

TERM                           COUNT(TEXTS.TEXT)
------------------------------ -----------------
van                                            1
dos                                            1
Cristiano                                      1
Santos                                         1
Ronaldo                                        1
der                                            3

6 rows selected.

Re: Term Match with Jaro-Winkler logic [message #666942 is a reply to message #666937] Sat, 02 December 2017 07:11 Go to previous messageGo to next message
OracleUser2017
Messages: 7
Registered: December 2017
Junior Member
Hi, thanks for the feedback, the script works if you have small data to compare, but if you run 10 terms against 10 million text records (ca have 1:n terms), then it taks too long.
Think the "connect by level" taks to long.
No idea how I can run 300 thousand against 10 million?
Is there a way to run the script in parallel or maybe with a smal data dump?
Thanks

[Updated on: Sat, 02 December 2017 07:52]

Report message to a moderator

Re: Term Match with Jaro-Winkler logic [message #666949 is a reply to message #666942] Sat, 02 December 2017 11:32 Go to previous messageGo to next message
Barbara Boehmer
Messages: 9088
Registered: November 2002
Location: California, USA
Senior Member
The connect by level is part of separating each word from each string. There may be more efficient methods. You can find several on this forum. However, the main problem is the shear bulk. Comparing every word from one table to every word in another table requires a huge Cartesian product. I once responded to a similar problem using Levenshtein distance instead of jaro_winkler, with the same speed problem. Tom Kyte said he had a similar problem and speeded it up by first creating the Cartesian product in one table with a blank score, then doing the comparison in sections in parallel. You might be able to do something similar. You can view his comments in the response to my response at the following link, which includes a reference to another link:

https://asktom.oracle.com/pls/asktom/asktom.search?tag=soundex-searches#10577325169281

It might also help, if you would state here what your ultimate goal is, as there may be a simpler solution, perhaps involving Oracle Text features and Oracle Text indexes instead of jaro_winkler which is devised purely for names and is slow. The main cause of the slowness is not the process of splitting the strings into words, but the huge Cartesian product and the jaro_winkler that does not use an index.


Re: Term Match with Jaro-Winkler logic [message #666956 is a reply to message #666949] Sun, 03 December 2017 04:37 Go to previous messageGo to next message
OracleUser2017
Messages: 7
Registered: December 2017
Junior Member
Hi, thanks for the reply
the goal is to check if the first, last name or company names in the table appear in the second table and if so, how exact is the match

It still raises the question of whether the search of names in the second table can not be performed in parallel?

At the moment my script is running the first term over the second table and only when this is finish it goes on with
the next term etc. etc.

I'll work again on this topic tomorrow.....

[Updated on: Sun, 03 December 2017 04:49]

Report message to a moderator

Re: Term Match with Jaro-Winkler logic [message #666957 is a reply to message #666956] Sun, 03 December 2017 11:33 Go to previous messageGo to next message
Barbara Boehmer
Messages: 9088
Registered: November 2002
Location: California, USA
Senior Member
It sounds like an Oracle Text index and Oracle Text queries using the CONTAINS operator with soundex and fuzzy matching might be a good solution for you. I have provided some examples below, using your sample data. This is just a small sampling of what can be done. The fuzzy matching can be limited to scores above anywhere from 1 to 80. The overall text score can be limited as well. The simple queries that I posted typically raise an error when comparing two large tables, which is why I posted the looping examples, which still use the index. It does fine with one large table, one search term or search phrase at a time. I think you will find that looping with a text index and contains and soundex and fuzzy will be far faster than looping with jaro_winkler. There is also Oracle Text ndata, which is specifically intended for names and classification, but those are much more complex and probably more than you need. I don't know whether it is sufficient to count matched names or if you really need to count matched individual parts of names, so I included both options. I used dbms_output for simple display of results, but you could just as easily insert the results into a table using insert into ... select ... instead of the second loop, which should be faster.

-- test tables and data:
SCOTT@orcl_12.1.0.2.0> create table SEARCH_Term_tbl as
  2  SELECT 1 as ST_ID, 'Géraldine Jerome dos Santos' as Search_Term FROM DUAL UNION ALL
  3  SELECT 2 as ST_ID, 'Magitta Evolet Jerome van der Vart' as Search_Term FROM DUAL UNION ALL
  4  SELECT 3 as ST_ID, 'Adi Putera Messi' as Search_Term FROM DUAL UNION ALL
  5  SELECT 4 as ST_ID, 'Ronaldo Cristiano Perez Portugal' as Search_Term FROM DUAL
  6  /

Table created.

SCOTT@orcl_12.1.0.2.0> create table TEXT_Term_tbl as
  2  SELECT 1 as ST_ID, ' Rafael Ferdinand van der Vaart ist ein niederländischer Fußballspieler. Der Mittelfeldspieler steht seit August 2016 beim FC Midtjylland in der dänischen Superliga unter Vertrag' as Text_Term FROM DUAL UNION ALL
  3  SELECT 2 as ST_ID, 'Cristiano Ronaldo dos Santos Aveiro ist ein portugiesischer Fußballspieler. Er steht nach dem seinerzeit teuersten Transfer der Fußballgeschichte seit Sommer 2009 bei Real Madrid unter Vertrag und ist seit 2015 Rekordtorschütze des Vereins' as Text_Term FROM DUAL
  4  /

Table created.

-- Oracle Text context index:
SCOTT@orcl_12.1.0.2.0> create index text_idx on text_term_tbl (text_term) indextype is ctxsys.context
  2  /

Index created.

-- queries that use Oracle Text context index to search for soundex or fuzzy matches:

-- soundex match or close fuzzy match (score >= 80) on any word in name:
SCOTT@orcl_12.1.0.2.0> select s.search_term, count(*)
  2  from   search_term_tbl s, text_term_tbl t
  3  where  contains
  4  	      (t.text_term,
  5  	       '!' || s.search_term ||
  6  	       ' or fuzzy(' || replace (s.search_term, ' ', ',80,5000,w) or fuzzy(') || ',80,5000,w)',
  7  	       1) > 0
  8  group  by s.search_term
  9  /

SEARCH_TERM                                     COUNT(*)
--------------------------------------------- ----------
Magitta Evolet Jerome van der Vart                     2
Géraldine Jerome dos Santos                            1
Ronaldo Cristiano Perez Portugal                       1

3 rows selected.

SCOTT@orcl_12.1.0.2.0> select s.search_term as search_term, count(*)
  2  from   (select  regexp_substr (search_term, '[^ ]+', 1, column_value) search_term
  3  	     from    search_term_tbl,
  4  		     table
  5  		       (cast
  6  			 (multiset
  7  			   (select level
  8  			    from   dual
  9  			    connect by level <= regexp_count (search_term, ' ') + 1)
 10  			  as sys.odcinumberlist))) s,
 11  		     text_term_tbl t
 12  where  contains (t.text_term, '!' || s.search_term || ' or fuzzy(' || s.search_term || ',80,5000,w)', 1) > 0
 13  group  by s.search_term
 14  /

SEARCH_TERM                                     COUNT(*)
--------------------------------------------- ----------
van                                                    1
dos                                                    1
Cristiano                                              1
Santos                                                 1
Vart                                                   1
Ronaldo                                                1
der                                                    2

7 rows selected.

-- soundex match or fuzzy match (score >= 1) on any word in name:
SCOTT@orcl_12.1.0.2.0> select s.search_term, count(*)
  2  from   search_term_tbl s, text_term_tbl t
  3  where  contains
  4  	      (t.text_term,
  5  	       '!' || s.search_term ||
  6  	       ' or fuzzy(' || replace (s.search_term, ' ', ',1,5000,w) or fuzzy(') || ',1,5000,w)',
  7  	       1) > 0
  8  group  by s.search_term
  9  /

SEARCH_TERM                                     COUNT(*)
--------------------------------------------- ----------
Magitta Evolet Jerome van der Vart                     2
Géraldine Jerome dos Santos                            2
Ronaldo Cristiano Perez Portugal                       1

3 rows selected.

SCOTT@orcl_12.1.0.2.0> select s.search_term as search_term, count(*)
  2  from   (select  regexp_substr (search_term, '[^ ]+', 1, column_value) search_term
  3  	     from    search_term_tbl,
  4  		     table
  5  		       (cast
  6  			 (multiset
  7  			   (select level
  8  			    from   dual
  9  			    connect by level <= regexp_count (search_term, ' ') + 1)
 10  			  as sys.odcinumberlist))) s,
 11  		     text_term_tbl t
 12  where  contains (t.text_term, '!' || s.search_term || ' or fuzzy('||s.search_term||',1,5000,w)', 1) > 0
 13  group  by s.search_term
 14  /

SEARCH_TERM                                     COUNT(*)
--------------------------------------------- ----------
van                                                    1
dos                                                    1
Cristiano                                              1
Santos                                                 1
Vart                                                   2
Ronaldo                                                1
der                                                    2
Géraldine                                              1

8 rows selected.

-- examples of looping using Oracle text context index:
-- soundex match or close fuzzy match (score >= 80) on any word in name:
SCOTT@orcl_12.1.0.2.0> begin
  2    for s in
  3  	 (select search_term from search_term_tbl)
  4    loop
  5  	 for tt in
  6  	   (select s.search_term, count(*) cnt
  7  	    from   text_term_tbl t
  8  	    where  contains
  9  		     (t.text_term,
 10  		      '!' || s.search_term ||
 11  		      ' or fuzzy(' || replace (s.search_term, ' ', ',80,5000,w) or fuzzy(') || ',80,5000,w)',
 12  		      1) > 0
 13  	    group  by s.search_term)
 14  	 loop
 15  	   dbms_output.put_line (rpad (tt.search_term, 60) || tt.cnt);
 16  	 end loop;
 17    end loop;
 18  end;
 19  /
Géraldine Jerome dos Santos                                 1
Magitta Evolet Jerome van der Vart                          2
Ronaldo Cristiano Perez Portugal                            1

PL/SQL procedure successfully completed.

SCOTT@orcl_12.1.0.2.0> begin
  2    for s in
  3  	 (select  regexp_substr (search_term, '[^ ]+', 1, column_value) search_term
  4  	  from	  search_term_tbl,
  5  		     table
  6  		       (cast
  7  			 (multiset
  8  			   (select level
  9  			    from   dual
 10  			    connect by level <= regexp_count (search_term, ' ') + 1)
 11  			  as sys.odcinumberlist)))
 12    loop
 13  	 for tt in
 14  	   (select s.search_term, count(*) cnt
 15  	    from   text_term_tbl t
 16  	    where  contains (t.text_term, '!' || s.search_term || ' or fuzzy('||s.search_term||',80,5000,w)', 1) > 0
 17  	    group  by s.search_term)
 18  	 loop
 19  	   dbms_output.put_line (rpad (tt.search_term, 60) || tt.cnt);
 20  	 end loop;
 21    end loop;
 22  end;
 23  /
dos                                                         1
Santos                                                      1
van                                                         1
der                                                         2
Vart                                                        1
Ronaldo                                                     1
Cristiano                                                   1

PL/SQL procedure successfully completed.



Re: Term Match with Jaro-Winkler logic [message #666985 is a reply to message #666957] Mon, 04 December 2017 05:41 Go to previous messageGo to next message
OracleUser2017
Messages: 7
Registered: December 2017
Junior Member
Hi, get not the smae results, maybe in cas of the index?
I I create --> indextype is ctxsys.context then it shows me the indextype domain ?
Re: Term Match with Jaro-Winkler logic [message #667000 is a reply to message #666985] Mon, 04 December 2017 13:05 Go to previous messageGo to next message
Barbara Boehmer
Messages: 9088
Registered: November 2002
Location: California, USA
Senior Member
OracleUser2017 wrote on Mon, 04 December 2017 03:41
Hi, get not the smae results, maybe in cas of the index?
I I create --> indextype is ctxsys.context then it shows me the indextype domain ?

Are you saying that when you create the same table and same index, and run the same queries as I did that you do not get the same results that I posted? If so, then please post a copy and paste of a run from SQL*Plus of the complete process as I did. If this is not the problem, then please clarify, preferably with a copy and paste of an example.

Oracle Text context indexes are a type of domain index. If you are using autotrace or some such thing to show the execution plan that the optimizer has chosen, then the context index will be listed as a domain index.

Re: Term Match with Jaro-Winkler logic [message #667002 is a reply to message #667000] Tue, 05 December 2017 01:12 Go to previous messageGo to next message
OracleUser2017
Messages: 7
Registered: December 2017
Junior Member
Hi, yes it was in case of the index, thanks, but now, if I run the script over huge dataset, I receive a
Oracle Text Error:
text query parser syntax error on line 1, column 1,
error in executing ODCIndexStart() routine
Do you know why?
Re: Term Match with Jaro-Winkler logic [message #667003 is a reply to message #667002] Tue, 05 December 2017 01:20 Go to previous messageGo to next message
Barbara Boehmer
Messages: 9088
Registered: November 2002
Location: California, USA
Senior Member
I can't diagnose what I can't see. You need to post a complete copy and paste of a run of what you did, complete with line numbers and error message. Did you just run a query or a loop? As I said previously, running the query alone comparing two tables will produce an error, which is why you need to loop.

Re: Term Match with Jaro-Winkler logic [message #667006 is a reply to message #667003] Tue, 05 December 2017 01:55 Go to previous messageGo to next message
OracleUser2017
Messages: 7
Registered: December 2017
Junior Member
I do it with a loop, but maybe there is a possibility to loop next, if an error appears?
Re: Term Match with Jaro-Winkler logic [message #667008 is a reply to message #667006] Tue, 05 December 2017 02:01 Go to previous messageGo to next message
Barbara Boehmer
Messages: 9088
Registered: November 2002
Location: California, USA
Senior Member
I still need to see copy and paste.
Re: Term Match with Jaro-Winkler logic [message #667013 is a reply to message #667008] Tue, 05 December 2017 03:39 Go to previous messageGo to next message
OracleUser2017
Messages: 7
Registered: December 2017
Junior Member
unfortunately I can not find the record, there are too many,
I can simply re-run the procedure until it interrupts again etc etc
work on it to find out the error
Re: Term Match with Jaro-Winkler logic [message #667036 is a reply to message #667013] Tue, 05 December 2017 10:48 Go to previous message
Barbara Boehmer
Messages: 9088
Registered: November 2002
Location: California, USA
Senior Member
As the error message says you have a syntax error, you probably have a syntax error. If you will just post a copy and paste of the code you are actually running that produces the error, I might be able to tell what the problem is. You can certainly add a loop within your loop to log errors and log the values. I suspect that some of your names that you are searching for may have some special characters in them that have special meaning to Oracle Text and cause a syntax error. You can eliminate this by either removing those characters or enclosing the string within double quotes or escaping the characters with a backslash. If you use double quotes it only looks for an exact match and will not use soundex or fuzzy. I cannot help you if you will not post your code. I don't know why you keep ignoring that.

Previous Topic: remove duplicate values by date
Next Topic: Number Conversion
Goto Forum:
  


Current Time: Thu Apr 18 08:06:24 CDT 2024