Wednesday, March 21, 2012

Indexing Service and hyphens

I am trying to search for a word such as "e-business" using the Indexing
Service Query object (CissoQuery). I would like to be able to search for
e-bus and get back variations of this term, e.g. e-business or e-busi. In
effect, I would like to do a wildcard search.

Unfortunately, when I search for this term, it returns documents that do
not contain e-business but do contain variations of e (I have modified the
noise list to remove noise words) and business, as well as ebusiness. I
don't want this to happen. I can search for the phrase "e-business" and it
returns the correct results. However, if I search for "e-bus" it returns no
results, because it is looking for the entire phrase. If I search for
e-business without the quotes, I get the variations I described above, from
documents that don't contain that phrase.

How do I configure Indexing Service to return results for terms with
hyphens? I have yet to find an answer anywhere on the web where this
question has been addressed sufficiently. If this is a bug and cannot be
done in Indexing Service, please tell me and I will stop trying to figure
it out. I am aware that this is a general Indexing Service question, but I
know SQL Server uses the service internally, or something like it, so I am
posting this question to this newsgroup.

Hammad,
It might be best to post this question to the
microsoft.public.sqlserver.fulltext or
microsoft.public.inetserver.indexserver newsgroups, as this is a somewhat
specialized area...
The Indexing Service (IS) uses the OS-supplied word breakers, which
determine the language-specific breaking of text into tokens. For example,
a URL such as 'http://jtkane.com?search=what#is#my+name', which includes
punctuation characters such as :, /, ?, =, and +, will be tokenized as
follows on Windows Server 2003 and Windows XP using the LangWrbk.dll
wordbreaker:
Original text: 'http://jtkane.com?search=what#is#my+name'
IWordSink::PutWord: cwcSrcLen 4, cwcSrcPos 0, cwc 4, 'http'
IWordSink::PutWord: cwcSrcLen 6, cwcSrcPos 7, cwc 6, 'jtkane'
IWordSink::PutWord: cwcSrcLen 3, cwcSrcPos 14, cwc 3, 'com'
IWordSink::PutWord: cwcSrcLen 6, cwcSrcPos 18, cwc 6, 'search'
IWordSink::PutWord: cwcSrcLen 4, cwcSrcPos 25, cwc 4, 'what'
IWordSink::PutWord: cwcSrcLen 2, cwcSrcPos 30, cwc 2, 'is'
IWordSink::PutWord: cwcSrcLen 2, cwcSrcPos 33, cwc 2, 'my'
IWordSink::PutWord: cwcSrcLen 4, cwcSrcPos 36, cwc 4, 'name'
However, on Windows 2000 Server the same URL will be tokenized as a single
token using the infosoft.dll wordbreaker:
Original text: 'http://jtkane.com?search=what#is#my+name'
IWordSink::PutWord: cwcSrcLen 40, cwcSrcPos 0, cwc 39,
'http://jtkane.com?searchwhat#is#my+name'
The same is true for SQL Server's Full Text Search (FTS) component as for
the Indexing Service, since both depend upon the OS-supplied wordbreakers.
Could you post the full output of SELECT @@version? It would be most
helpful in troubleshooting your question.
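
To make the effect of the two behaviours above concrete, here is a minimal
Python sketch. It is not the real word breaker code: the function names are
hypothetical, and the regular expressions only approximate what the logs
show (for instance, the sketch does not reproduce the dropped '=' in the
Windows 2000 output). It illustrates why a breaker that splits on
punctuation turns "e-business" into the separate tokens e and business.

```python
import re

def break_on_punctuation(text):
    """Hypothetical stand-in for a breaker that splits on punctuation:
    returns (source position, token) for every alphanumeric run."""
    return [(m.start(), m.group()) for m in re.finditer(r"[A-Za-z0-9]+", text)]

def break_on_whitespace(text):
    """Hypothetical stand-in for a breaker that only splits on whitespace,
    so punctuation stays inside the token."""
    return [(m.start(), m.group()) for m in re.finditer(r"\S+", text)]

url = "http://jtkane.com?search=what#is#my+name"

# Print positions and lengths in a shape similar to the PutWord log lines.
for pos, token in break_on_punctuation(url):
    print(f"cwcSrcPos {pos}, cwc {len(token)}, '{token}'")

# The hyphenated term from the question under each approach.
print(break_on_punctuation("e-business"))
print(break_on_whitespace("e-business"))
```

Under the first approach an unquoted query for e-business has only the
tokens e and business to match against, which is consistent with the
unwanted hits described in the question.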
Thanks,
John
--
SQL Full Text Search Blog
http://spaces.msn.com/members/jtkane/

Hi John,
Thanks for your quick response. The version I obtained from using that
command is the following:
Microsoft SQL Server 2000 - 8.00.760 (Intel X86) Dec 17 2002 14:22:05
Copyright (c) 1988-2003 Microsoft Corporation Developer Edition on Windows
NT 5.1 (Build 2600: Service Pack 2)
I've done a little reading on word breakers, but I'm not sure how to
programmatically configure a word breaker to use for indexing, or whether
this is even necessary. I'm not exactly sure how the Indexing Service
works, but I assume that if it finds the word e-business in a document, it
indexes e, business, ebusiness, and e-business, because when I use the
CissoQuery object and specify the exact phrase "e-business" using Dialect
2, it does find it.

The only issue I have is how to specify a wildcard-type search, so that if
I type in "e-bus" it finds all words with e-bus as a prefix. If I don't put
quotes around e-business, it finds documents that contain the variations of
e-business I detailed previously, so documents that don't have e-business
in them show up because they contain those variations. If I specify "e-bus"
in quotes, it looks for the exact phrase rather than prefix-based words, so
it won't find documents that contain words starting with that prefix. Is it
possible to do such a thing?
Thanks,
Hammad
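
One possible workaround, which is not something suggested in the thread, is
to run the broader unquoted query and then filter the returned documents on
the client side for words that really start with the hyphenated prefix. The
Python sketch below assumes the document text is already available in
memory; the function names, file names, and sample texts are all
hypothetical, and it does not call the Indexing Service itself.

```python
import re

def has_hyphenated_prefix(text, prefix):
    """True if text contains a word starting with prefix, where a word may
    contain hyphens and must not be preceded by a word character or hyphen."""
    pattern = re.compile(r"(?<![\w-])" + re.escape(prefix) + r"[\w-]*",
                         re.IGNORECASE)
    return pattern.search(text) is not None

def filter_hits(documents, prefix):
    """Keep only the names of documents whose text passes the prefix check."""
    return [name for name, text in documents.items()
            if has_hyphenated_prefix(text, prefix)]

# Hypothetical result set from a broad query, keyed by file name.
docs = {
    "a.txt": "Our e-business strategy for next year.",
    "b.txt": "The letter e and general business news.",
    "c.txt": "An ebusiness portal.",
    "d.txt": "Early e-busi draft notes.",
}

print(filter_hits(docs, "e-bus"))
```

The trade-off is that the index still returns the unwanted candidates, so
this only helps when the result set is small enough to re-scan.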
