Problem with filtering binary documents (.doc, .pdf, etc...) [message #287618] Wed, 12 December 2007 23:31
fatennn
Hi, I have a problem with filtering binary documents (.doc, .pdf, etc.). I use SQL*Plus for remote access to Oracle 10.2 on Linux, and I created this table:

CREATE TABLE test (id NUMBER PRIMARY KEY, text VARCHAR2(100));

I inserted these rows into the table:

INSERT into test values(1,'PATH/text1.doc');
INSERT into test values(2,'PATH/text2.doc');

and then created the index:

CREATE INDEX test_index ON test(text) indextype is ctxsys.context
parameters ('datastore ctxsys.file_datastore
filter ctxsys.auto_filter');

The message "Index created" is displayed, but the objects DR$test_index$I, DR$test_index$K, DR$test_index$N, DR$test_index$R and DR$test_index$P are empty, so the index probably wasn't populated.

I don't know where the bug is: either it is somewhere in this code or on the server (a faulty Oracle installation or restricted privileges). Do you know what the problem is?

Re: Problem with filtering binary documents (.doc, .pdf, etc...) [message #287717 is a reply to message #287618] Thu, 13 December 2007 04:31
Maaher
What happens if you gather statistics?

MHE
Re: Problem with filtering binary documents (.doc, .pdf, etc...) [message #287840 is a reply to message #287618] Thu, 13 December 2007 14:21
Barbara Boehmer
The following is an excerpt from the 10g online documentation. Note the items that I have put in bold.

FILE_DATASTORE

The FILE_DATASTORE type is used for text stored in files accessed through the local file system.

Note:
FILE_DATASTORE may not work with certain types of remote mounted file systems.

FILE_DATASTORE has the following attribute(s):

Table 2-4 FILE_DATASTORE Attributes

Attribute    Attribute Value
path         path1:path2:pathn

path

Specify the full directory path name of the files stored externally in a file system. When you specify the full directory path as such, you need only include file names in your text column.

You can specify multiple paths for path, with each path separated by a colon (:) on UNIX and a semicolon (;) on Windows. File names are stored in the text column in the text table.

If you do not specify a path for external files with this attribute, Oracle Text requires that the path be included in the file names stored in the text column.

PATH Attribute Limitations

The PATH attribute has the following limitations:

* If you specify a PATH attribute, you can only use a simple filename in the indexed column. You cannot combine the PATH attribute with a path as part of the filename. If the files exist in multiple folders or directories, you must leave the PATH attribute unset, and include the full file name, with PATH, in the indexed column.

* On Windows systems, the files must be located on a local drive. They cannot be on a remote drive, whether the remote drive is mapped to a local drive letter.
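
As a rough illustration of the PATH attribute described in the excerpt above, the following sketch creates a custom FILE_DATASTORE preference and stores only simple file names in the indexed column. The preference name my_file_ds and the directory /u01/app/docs are hypothetical placeholders, not values from this thread. It assumes the user has been granted the CTXAPP role (or EXECUTE on CTX_DDL) and that the directory exists on the database server and is readable by the Oracle server process.

```sql
-- Hypothetical preference name and directory; replace with your own values.
BEGIN
  ctx_ddl.create_preference('my_file_ds', 'FILE_DATASTORE');
  -- On UNIX, several directories could be listed here, separated by colons.
  ctx_ddl.set_attribute('my_file_ds', 'PATH', '/u01/app/docs');
END;
/

-- With PATH set, the indexed column must hold simple file names only.
INSERT INTO test VALUES (1, 'text1.doc');
INSERT INTO test VALUES (2, 'text2.doc');
COMMIT;

-- Reference the custom preference instead of ctxsys.file_datastore.
CREATE INDEX test_index ON test(text) INDEXTYPE IS ctxsys.context
  PARAMETERS ('datastore my_file_ds filter ctxsys.auto_filter');
```

The alternative, as the excerpt notes, is to leave PATH unset, keep ctxsys.file_datastore, and store the full path with each file name. In both cases the paths are resolved on the database server, not on the machine running SQL*Plus.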

Re: Problem with filtering binary documents (.doc, .pdf, etc...) [message #287841 is a reply to message #287717] Thu, 13 December 2007 14:24
Barbara Boehmer
Maaher wrote on Thu, 13 December 2007 02:31

What happens if you gather statistics?

MHE


Gathering statistics is not necessary to populate the tables associated with a context index, as demonstrated below.

SCOTT@orcl_11g> CREATE TABLE test (id NUMBER PRIMARY KEY, text VARCHAR2(100));

Table created.

SCOTT@orcl_11g> 
SCOTT@orcl_11g> 
SCOTT@orcl_11g> INSERT into test values(1,'c:\oracle11g\banana.pdf');

1 row created.

SCOTT@orcl_11g> INSERT into test values(2,'c:\oracle11g\cranberry.pdf');

1 row created.

SCOTT@orcl_11g> 
SCOTT@orcl_11g> CREATE INDEX test_index ON test(text) indextype is ctxsys.context
  2  parameters ('datastore ctxsys.file_datastore
  3  filter ctxsys.auto_filter');

Index created.

SCOTT@orcl_11g> 
SCOTT@orcl_11g> select count(*) from dr$test_index$i
  2  /

  COUNT(*)
----------
       608

SCOTT@orcl_11g> 
Re: Problem with filtering binary documents (.doc, .pdf, etc...) [message #287844 is a reply to message #287841] Thu, 13 December 2007 15:01
Maaher
Aha, thanks for the correction, it has been a while since I've worked with Oracle Text. I remember we did a nightly refresh of the indexes and that must have triggered my reply.

I should have checked.

MHE

[Updated on: Thu, 13 December 2007 15:01]


Re: Problem with filtering binary documents (.doc, .pdf, etc...) [message #287861 is a reply to message #287841] Thu, 13 December 2007 18:38
Barbara Boehmer
Here is a little further demonstration just to show what is happening. In the following, I used a non-existent path and non-existent file name, which produces the same results as when you use a remote path that does not exist locally.

SCOTT@orcl_11g> CREATE TABLE test (id NUMBER PRIMARY KEY, text VARCHAR2(100));

Table created.

SCOTT@orcl_11g> 
SCOTT@orcl_11g> 
SCOTT@orcl_11g> INSERT into test values(3,'c:\nosuchpath\nosuchfile.pdf');

1 row created.

SCOTT@orcl_11g> 
SCOTT@orcl_11g> CREATE INDEX test_index ON test(text) indextype is ctxsys.context
  2  parameters ('datastore ctxsys.file_datastore
  3  filter ctxsys.auto_filter');

Index created.

SCOTT@orcl_11g> 
SCOTT@orcl_11g> select count(*) from dr$test_index$i
  2  /

  COUNT(*)
----------
         0

SCOTT@orcl_11g> 
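
When the index is reported as created but the DR$ tables stay empty, as in the demonstration above, one way to see what happened to each row is to query the CTX_USER_INDEX_ERRORS view, where Oracle Text records per-document indexing errors. This is a minimal sketch that assumes the index is named test_index as in this thread and is owned by the current user. The exact error text depends on your environment, so no output is shown.

```sql
-- List the errors logged while indexing, oldest first.
-- ERR_TEXTKEY holds the ROWID of the row that could not be indexed.
SELECT err_timestamp, err_textkey, err_text
FROM   ctx_user_index_errors
WHERE  err_index_name = 'TEST_INDEX'
ORDER  BY err_timestamp;

-- Map the failing ROWIDs back to the file names stored in the table.
SELECT t.id, t.text, e.err_text
FROM   test t
JOIN   ctx_user_index_errors e
       ON e.err_textkey = ROWIDTOCHAR(t.rowid)
WHERE  e.err_index_name = 'TEST_INDEX';
```

If rows appear here for every document, the files were probably not found or not readable at the stored path on the database server, which would be consistent with the empty index.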

[Updated on: Thu, 13 December 2007 18:38]

