Indexes and Deadlocks: indexing document stored in IMAGE fields

Monday, March 19, 2012

indexing document stored in IMAGE fields

Hi guys,
in the Full Text Retrieval documentation of MS SQL Server 2000 it's said that documents stored in IMAGE
fields are "filtered" using the Microsoft-provided filters (for these file extensions: .doc, .xls, .ppt, .txt
and .htm) or third-party filters (e.g. Adobe for .pdf).
A little further on there's a note stating that "For full-text indexing, a document must be less than 16 megabytes (MB)
in size and must not contain more than 256 kilobytes (KB) of filtered text.".
While I can check whether a file is larger than the max supported size, how can I check whether a document contains more
than 256 KB of filtered text? Is this information "exported" in some way by the filter applied to the document?
If I store a document that does not satisfy the MSSearch requirements (size > 16 MB or "filtered size" > 256 KB), what
does MSSearch do with it? Does indexing simply ignore it?
Many THXS for your kind reply
MadMax
|||The best way to do this is to get filtdump from the Platform SDK, run
filtdump -b mydoc.doc >c:\out.out
and then measure the size of the output.
In other versions of Microsoft Search products there were limits on the amount of text per document that would be indexed. You could adjust this with a registry key setting. Any bytes over this limit would not be retrieved or indexed.
There are settings within MSSearch which allow you to control the maximum raw size of a document you are indexing, but AFAIK there is no setting that lets you increase the maximum amount of textual data it will index.
Looking for a SQL Server replication book?
http://www.nwsu.com/0974973602.html
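The filtdump approach above can be scripted. The following is only a rough sketch: it assumes filtdump is on the PATH and accepts the `-b` switch exactly as in the command above, and it treats the byte count of the dump as an approximation of the "filtered text" size (the dump may contain more than the raw text, so the real figure could be smaller). The function names and the two constants are made up for illustration; the limits are simply the ones quoted from the documentation.

```python
import subprocess

# Limits as quoted from the documentation in the question above.
MAX_RAW_BYTES = 16 * 1024 * 1024    # document must be LESS than 16 MB
MAX_FILTERED_BYTES = 256 * 1024     # no MORE than 256 KB of filtered text


def filtered_size(doc_path, filtdump="filtdump"):
    """Run 'filtdump -b <doc>' and return the size of its output in bytes.

    Assumption: filtdump writes the filtered text to stdout, as the
    redirect in the original command suggests.
    """
    result = subprocess.run(
        [filtdump, "-b", str(doc_path)],
        capture_output=True,
        check=True,
    )
    return len(result.stdout)


def check_limits(raw_bytes, filtered_bytes):
    """Return the names of the documented limits that the sizes exceed."""
    problems = []
    if raw_bytes >= MAX_RAW_BYTES:
        problems.append("raw size")
    if filtered_bytes > MAX_FILTERED_BYTES:
        problems.append("filtered text")
    return problems
```

For a real file you would call something like `check_limits(os.path.getsize(path), filtered_size(path))` before storing the document in the IMAGE column, and treat an empty list as "within both limits".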
|||THX very much Hilary.
Do you know what happens if the indexed document is bigger than the MSSearch limits (size 16 MB, filtered size 256 KB)?
Is the document indexed (maybe only the part < 16 MB / filtered size < 256 KB) or does the indexing process fail completely?
Max
|||My understanding is that only the first 16 MB is extracted and only the first 256 KB of text is indexed.
The rest is ignored.
|||Max, Hilary,
I posted this reply in Nov. 2003 on this same subject... "I believe that
this is a DOC bug, i.e., a bug or incorrect information in
Books Online (BOL). To the best of my knowledge, no public KB exists for
this bug; however, a related KB article - 308771 (Q308771) "PRB: A Full-Text
Search May Not Return Any Hits If It Fails to Index a File" at
http://support.microsoft.com/default...b;en-us;308771 has
information on the use of the Registry key FilterProcessMemoryQuota.
You can control the size of document that can be FT Indexed via the Registry
key FilterProcessMemoryQuota by setting its value. Specifically,
HKLM\Software\Microsoft\Search\1.0\gathering manager\filterProcessMemoryQuota (DWORD).
It should default to 25 MB and you can make it larger, as it only affects the
limit for memory usage in the daemon [MSSdmn] process. In addition to the
memory allocated to SQL Server, it is recommended that a minimum of 15 MB of
RAM be reserved for the Microsoft Search service and a maximum of 512 MB of
RAM be allocated for the Microsoft Search service. If you plan on FT
Indexing large documents, you will need to set aside more memory for the
"Microsoft Search" (mssearch.exe) service and TEST the performance of FT
Indexing very large documents, as well as ensure that you have enough free
disk space on your system drive and the drive where your FT Catalogs reside
at all times."
Regards,
John
"Hilary Cotter" <hilaryk@.att.net> wrote in message
news:32335399-ADC7-45AD-B179-8D97C46CBC73@.microsoft.com...
> My understanding is that only the first 16 MB is extracted and only the
> first 256 KB of text is indexed.
> The rest is ignored.
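To see what FilterProcessMemoryQuota is currently set to before changing it, the value can be read from the registry path John gives. This is a minimal sketch using Python's standard `winreg` module; the function name is made up, the key path is copied from the post as-is, and the unit of the returned DWORD is not stated in the thread, so interpret the number with care. On a machine without Windows, or without the key, it just returns None.

```python
try:
    import winreg  # standard library, available on Windows only
except ImportError:
    winreg = None

# Key path and value name as given in the post above.
KEY_PATH = r"Software\Microsoft\Search\1.0\gathering manager"
VALUE_NAME = "filterProcessMemoryQuota"


def read_filter_quota():
    """Return the current DWORD value, or None if it cannot be read."""
    if winreg is None:
        return None
    try:
        with winreg.OpenKey(winreg.HKEY_LOCAL_MACHINE, KEY_PATH) as key:
            value, _value_type = winreg.QueryValueEx(key, VALUE_NAME)
            return value
    except OSError:
        # Key or value missing, or access denied.
        return None
```

Reading first and writing the new value by hand in regedit keeps the script harmless; back up the key before changing it.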
