Oracle BLOB PDF Text Search Question (Oracle9i Enterprise Edition Release 9.2.0.4.0 - 64bit Production)
Oracle BLOB PDF Text Search Question [message #416454] Sun, 02 August 2009 23:52
mike_s_6
Messages: 2
Registered: August 2009
Junior Member
Good day!

I would like to ask for help with searching for strings inside a PDF in Oracle. Let's say I have a table named "documents", with "id" as the primary key and a BLOB column named "document" that holds the PDF. The CONTAINS function is used to search for strings inside the document:

SELECT id FROM documents WHERE CONTAINS(document, 'value') > 0;

Here is a sample of what part of the PDF might look like:

[Image: sample.jpg]

The issue is that a search for the string "9518 9502" (the first two values in the first column) returns a match:

SELECT id FROM documents WHERE CONTAINS(document, '9518 9502') > 0;

But as you can see, the document does not visibly contain 9518(space)9502; the two values are separated by a table cell boundary.

I have explained to the users that the PDF's formatting causes this, but I think they still want the search to recognize that '9518 9502' is not visible in the PDF. So my question is: since the users want the search to return false here, is there a way for the code to detect this?

  • Attachment: sample.jpg
Re: Oracle BLOB PDF Text Search Question [message #416468 is a reply to message #416454] Mon, 03 August 2009 00:49
Michel Cadot
Messages: 68625
Registered: March 2007
Location: Nanterre, France, http://...
Senior Member
Account Moderator
A PDF file is binary; Oracle functions work on the character datatype family.
If you want functions that work on binary data, you have to write them yourself.

Regards
Michel
Re: Oracle BLOB PDF Text Search Question [message #416515 is a reply to message #416468] Mon, 03 August 2009 03:18
Frank
Messages: 7901
Registered: March 2000
Senior Member
Michel Cadot wrote on Mon, 03 August 2009 07:49
A PDF file is binary; Oracle functions work on the character datatype family.
If you want functions that work on binary data, you have to write them yourself.

Regards
Michel


Not true: Text indexes can also search inside binary formats such as Word documents.
I guess this one is for Barbara, our Text index expert.
Re: Oracle BLOB PDF Text Search Question [message #416516 is a reply to message #416515] Mon, 03 August 2009 03:20
Michel Cadot
Messages: 68625
Registered: March 2007
Location: Nanterre, France, http://...
Senior Member
Account Moderator
OK, so I am moving this topic to the "Text & interMedia" forum.

Regards
Michel
Re: Oracle BLOB PDF Text Search Question [message #416661 is a reply to message #416454] Mon, 03 August 2009 14:32
Barbara Boehmer
Messages: 9077
Registered: November 2002
Location: California, USA
Senior Member
I don't think you can change how that works. If the document reads from the top of one column to the bottom, then from the top of the next column to the bottom, I believe Oracle Text filtering will produce the tokens in that same order. If you think about it, CONTAINS(document, '9518 9502') means 9518 immediately followed by 9502, and that is what is there, just top to bottom rather than left to right.
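
To illustrate the point about token order, here is a minimal Python sketch. It is only a simulation and does not use Oracle Text or reproduce its filter. It assumes the filter emits the table's cells column by column, and it treats a phrase search as a match on adjacent tokens. Only the values 9518 and 9502 come from the question; the other cell values and the function names (column_major, row_major, contains_phrase) are hypothetical.

```python
from typing import List


def column_major(table: List[List[str]]) -> List[str]:
    """Flatten a table top to bottom within each column, then move to the next column."""
    return [cell for column in zip(*table) for cell in column]


def row_major(table: List[List[str]]) -> List[str]:
    """Flatten a table left to right within each row, then move to the next row."""
    return [cell for row in table for cell in row]


def contains_phrase(tokens: List[str], phrase: str) -> bool:
    """Return True if the words of `phrase` occur as adjacent tokens, in order."""
    words = phrase.split()
    n = len(words)
    return any(tokens[i:i + n] == words for i in range(len(tokens) - n + 1))


# Hypothetical table: only 9518 and 9502 are taken from the question;
# the second column holds made-up placeholder values.
table = [
    ["9518", "0001"],
    ["9502", "0002"],
    ["9470", "0003"],
]

if __name__ == "__main__":
    # Order assumed for the filtered text (column by column).
    print(contains_phrase(column_major(table), "9518 9502"))
    # Order a human reader might expect (row by row).
    print(contains_phrase(row_major(table), "9518 9502"))
```

Under these assumptions the phrase matches in the column-by-column token stream, because 9518 and 9502 end up next to each other, and it does not match in the row-by-row stream, where a value from the second column sits between them. The token stream carries no information about cell boundaries, which is why the phrase search alone cannot tell the two layouts apart.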
Re: Oracle BLOB PDF Text Search Question [message #416679 is a reply to message #416661] Mon, 03 August 2009 21:23
mike_s_6
Messages: 2
Registered: August 2009
Junior Member
Ah, is that so? Thank you all for your replies. If anyone has an idea of how to get around this, please let me know :)