Showing posts with label lookup. Show all posts
Showing posts with label lookup. Show all posts

Monday, March 26, 2012

influence the length of sub-tokens indexed in fuzzy look-up

can I influence the lenghth of sub-tokens that are indexed in fuzzy lookup? Is it just fixed as 4?

ThanksAt this time the size is fixed. There is a performance trade-off between Error Tolerant Index size and the length and number of sub-tokens indexed per record. Using a fixed length may cause errors in short tokens to be missed if there are no other tokens in common between the input and target records. One approach, if your reference table is small, is to set the Exhaustive property to True. This will make Fuzzy Lookup skip the ETI and compare against each and every record in the reference table. Again, this is an expensive operation for large ref tables, so you might consider only doing it if the input record has only one short token. Likewise, you might also create a view of your reference table that contains only records of short length and do the Exhaustive match on just the view. You could have this as a separate branch in your Data Flow pipeline and use the Conditional Split transform to direct only short input records down it.

Hope this helps,
-Kris

Wednesday, March 7, 2012

Indexes on datasets created froma Union All task

Hi

I have a package that takes data from 3 different sources, unions them all together and then performs a lookup against a tabel on a different server. The 3 sources of data creates a dataset with quite a large volume of rows (approx 100,000) and the look up takes an age to complete.

Is there anyway of adding an index of sorts to the dataset created by the union to speed the process up? I guess i could output the union all to a temp table and look up against this but wondered if there was another way

thanks

scott

100,000 rows is not much in my experience and fully-cached lookups are very fast once the reference data is loaded. I assume you're not running the lookup in cache mode?

If you are running in non-cached mode, I would think you would be focusing your indexing efforts on the reference table, not the rows in the pipeline. And is that really necessary?

I don't follow you on the temp table idea. It sounds like the data from the union all is the reference data. I can't visualize the scenario.

If the lookup is cached, are you sure the slowdown is in the Lookup and not downstream? A component cannot output rows if the next component is not ready to receive them. Try replacing the everything below the Lookup with a Union All and no destination. It should scream.
|||

Hi Jay

thanks for the reply

I made a mistake it is only 10,000 rows so even less!

I have 3 sources going into a union all which takes about a 2 seconds to 'go green'. next the Union and the look up go yellow and it takes about 90 seconds to complete. After this I simply have 2 destinations - success from look up goes to one tabel and error output goes to another table. Once the look up is complete the population of the 2 destination tables takes the blink of an eye.

I am not sure what you mean by caching though?

cheers

scott

|||

On the Advanced tab of the Lookup component, make sure that "Enable Memory Restriction" is not checked. Also, make sure that you are not using the table selection method in the lookup - you should specify a SQL statement that returns the smallest number of columns possible. So ideally you would return two columns, the value to match on, and the value to return.

How many rows in your lookup?

|||

Racsco wrote:

I have 3 sources going into a union all which takes about a 2 seconds to 'go green'. next the Union and the look up go yellow and it takes about 90 seconds to complete. After this I simply have 2 destinations - success from look up goes to one tabel and error output goes to another table. Once the look up is complete the population of the 2 destination tables takes the blink of an eye.

90 seconds for 10,000 rows is 111 rows/second. That could be indicative of not caching. In this situation, the Lookup would issue 10,000 select statements to your reference table. When caching is enabled, the lookup should be almost as fast as the union.

I want to make sure you understand my point about the downstream components. If one of those destinations is slow (i.e. not using fast load), then the lookup and the union will appear slow as they wait for the destination to accept new rows.
|||

Hi John

I checked the advanced tab and enable memory restriction was checked. I unchecked it and ran it again and it was rapid to say the least. However out of the 10000 rows only 770 matched the lookup condition and the rest errored. I tested it again and checked the enable memory restriction box, ran it and all passed the look up condition

any ideas as to why?

If i run the look up as a join in query analyser all 10000 rows pass the look up condition

|||

Hi Jay

both destinations are set to table or view fast load and both are tables. Please see my previous reply above your most recent post as i got it going much faster but now have a problem with the look up

cheers

scott

|||

Racsco wrote:

Hi John

I checked the advanced tab and enable memory restriction was checked. I unchecked it and ran it again and it was rapid to say the least. However out of the 10000 rows only 770 matched the lookup condition and the rest errored. I tested it again and checked the enable memory restriction box, ran it and all passed the look up condition

any ideas as to why?

If i run the look up as a join in query analyser all 10000 rows pass the look up condition

I bet it's the collation. SQL Server performs case-insensitive comparisons by default. The lookup is case-sensitive. The would enable SQL Server to make matches that the lookup can't. You can try converting everything to uppercase.