Monday, March 26, 2012

influence the length of sub-tokens indexed in fuzzy look-up

Can I influence the length of the sub-tokens that are indexed in Fuzzy Lookup? Is it just fixed at 4?

Thanks

At this time the size is fixed. There is a performance trade-off between Error Tolerant Index size and the length and number of sub-tokens indexed per record. Using a fixed length may cause errors in short tokens to be missed if there are no other tokens in common between the input and target records.

One approach, if your reference table is small, is to set the Exhaustive property to True. This will make Fuzzy Lookup skip the ETI and compare against each and every record in the reference table. Again, this is an expensive operation for large ref tables, so you might consider only doing it if the input record has only one short token. Likewise, you might also create a view of your reference table that contains only records of short length and do the Exhaustive match on just the view. You could have this as a separate branch in your Data Flow pipeline and use the Conditional Split transform to direct only short input records down it.
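To see why a fixed sub-token length can miss errors in short tokens, here is a rough Python sketch. It is not how the ETI is actually built; it assumes a simplified model in which each token is indexed by its overlapping substrings of length 4, and two records can only be paired through the index if they share at least one such substring. The function names (`qgrams`, `shared_qgrams`, `needs_exhaustive`) and the length threshold are hypothetical, chosen only for illustration.

```python
def qgrams(token, q=4):
    """Return the set of overlapping length-q substrings of a token.

    Simplified stand-in for index sub-tokens (an assumption, not the real ETI).
    Tokens no longer than q are returned whole.
    """
    token = token.lower()
    if len(token) <= q:
        return {token}
    return {token[i:i + q] for i in range(len(token) - q + 1)}


def shared_qgrams(a, b, q=4):
    """Sub-tokens that two tokens have in common under the model above."""
    return qgrams(a, q) & qgrams(b, q)


def needs_exhaustive(record, q=4):
    """Hypothetical routing rule, in the spirit of a Conditional Split condition.

    Send a record down the exhaustive branch when it consists of a single
    token of at most 2*q - 1 characters. In this model, one substituted
    character can destroy every sub-token of such a token.
    """
    tokens = record.split()
    return len(tokens) == 1 and len(tokens[0]) <= 2 * q - 1


# One wrong letter in a short token leaves nothing in common.
print(shared_qgrams("smith", "smyth"))
# The same kind of error in a longer token leaves several sub-tokens intact.
print(sorted(shared_qgrams("washington", "washingtan")))
# Only single short-token records are routed to the exhaustive branch.
print(needs_exhaustive("smyth"), needs_exhaustive("john smyth"))
```

In this model, "smith" and "smyth" share no sub-token, so an index lookup alone would never pair them, while "washington" and "washingtan" still share several. That is the case the exhaustive branch is meant to cover, and the routing rule keeps the expensive comparison limited to records that are at risk.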

Hope this helps,
-Kris
