Wednesday, March 21, 2012

Indexing Question

I am somewhat new to SQL and have an indexing question.

I have a table that consists of 60 columns of mostly numeric data. Included in the table are fields called CustomerID, CustDate and CustTime. Every hour data is posted to the table with the current time and date as well as CustomerID. The table is about 6 million rows.
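
For reference, the relevant part of the table looks like this (simplified; I'm leaving out the other ~57 mostly numeric columns):

CREATE TABLE Mytable (
    CustomerID int,
    CustDate smalldatetime,
    CustTime smalldatetime
    -- ... roughly 57 more, mostly numeric, columns ...
)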

I have a query that reads:

SELECT CustomerID, custdate + custtime as date

FROM Mytable

WHERE CustID = 905 and (date between '1/1/7' and '1/17/7').

The query works, but it takes longer than I would like, so I am wondering if it's an indexing issue.

Currently my table is indexed as CustomerID + CustDate + CustTime. None of the fields are unique, but the combination of the 3 is unique.

Would my query run faster if I added an index on CustDate + CustTime + CustomerID?

Should I have both indices in the table?

How do I specify which index the query should use?

If I add the new index, do I need to change my WHERE statement to be:

WHERE date between '1/1/7' and '1/17/7' and CustID = 905

Thoughts?

Hi,

Did you create a normal index or a clustered index?

How many rows do you have per customer (on average)? If the number of rows per customer is not too big, I would recommend creating a clustered index on CustomerID.
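
For example, something like this (guessing at your table name from the post):

CREATE CLUSTERED INDEX CIX_Mytable_CustomerID ON Mytable (CustomerID)

Keep in mind that a table can have only one clustered index, and creating it physically reorders the whole table.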

Greetz,

Geert

Geert Verhoeven
Consultant @ Ausy Belgium

My Personal Blog

|||

Hi

Normally, it's best to use the most selective WHERE condition first. If the customer_id is more selective, it's better to use it as the first WHERE condition, as you do.

Note that you cannot refer to the date column in the WHERE clause, because you only give it that name in the SELECT clause. You have to use (custdate between ...). Or maybe it's better to use a statement like

where custdate >= '1/1/7' and custdate <= '1/17/7'
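
To be complete, the whole statement would then look something like this (just a sketch, reusing the names from your post):

SELECT CustomerID, custdate + custtime AS [date]
FROM Mytable
WHERE CustomerID = 905
  AND custdate >= '1/1/7' AND custdate <= '1/17/7'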

Maybe you can use the Database Engine Tuning Advisor utility to get a hint from SQL Server itself.

Or maybe your statistics are not up to date?

|||

Thanks for the reply. Currently the indices are nonclustered.

Normally we receive data from CustomerIDs on an hourly basis, and we have about 75,000 CustomerIDs. As far as statistics go, the DB is too new at this point.

How do I create a clustered index, what will be the impact, and should I cluster on Date or CustomerID?

|||

Actually my WHERE clause is as you identify; I just typed it wrong in the post.

As I read the responses, I think clustering might be the best approach. That being the case, do I want to create a clustered index on CustomerID and Date, or just on CustomerID?

|||

Jim:

Something here to me doesn't add up. It looks to me that you have described a cover index for the query that you stated. I would think that this is optimal or nearly so. With 6,000,000 rows and 75,000 customer IDs this query figures to return on average about 80 rows per customer ID. This is a very low cardinality for a cover index. I would think that this would normally come back in 0 ms. To me it looks like something has been left out of the description. What is missing here? What are the datatypes of your CustomerID and date/time fields?


Dave

|||

ID is int, and the dates are smalldatetime.

|||

It is less of an indexing issue than an issue of using the wrong datatypes.

It appears that CustDate and CustTime are character fields. Your query would be faster if you changed CustDate and CustTime to be a single datetime field.
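
If they are character fields, the fix is roughly this (a sketch only; it assumes the strings are in a format SQL Server can convert, and the new column name is mine):

ALTER TABLE Mytable ADD CustDateTime datetime NULL

UPDATE Mytable
SET CustDateTime = CONVERT(datetime, CustDate + ' ' + CustTime)

After that, index and search on CustDateTime directly.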

And you are using non-standard date notation, e.g., '1/1/7'. I think that the WHERE clause [ date between '1/1/7' and '1/17/7' ] would be a very inefficient search. This seems extremely odd.

I can't imagine that this will work properly:

SELECT CustomerID, custdate + custtime as date

FROM Mytable

WHERE CustID = 905 and (date between '1/1/7' and '1/17/7').

The WHERE clause refers to [date], which is an alias composed in the SELECT list; an alias is not visible to the WHERE clause.

|||

>>Normally, it's best to use the most selective WHERE condition first. If the customer_id is more selective, it's better to use it as the first WHERE condition, as you do.<<

This is unlikely to make any difference. SQL Server uses a cost-based optimizer that is smart enough to determine which column is the most selective. It takes a really large query to get to the point where SQL Server doesn't make good choices regardless of the order of the conditions in the WHERE clause.

The index, on the other hand, should be ordered from most selective to least selective.
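
In your case that means CustomerID (the single-value predicate) should lead, which is what your existing index already does (index names here are made up):

-- seeks straight to one customer, then range-scans the dates:
CREATE INDEX IX_Mytable_CustFirst ON Mytable (CustomerID, CustDate, CustTime)

-- would touch every customer's rows in the date range before filtering:
CREATE INDEX IX_Mytable_DateFirst ON Mytable (CustDate, CustTime, CustomerID)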

>>It appears that CustDate and CustTime are character fields. Your query would be faster if you changed CustDate and CustTime to be a single datetime field. <<

If Arnie is right about this, then this is definitely something to fix. Range checks on dates that are stored in non-standard character formats would be at worst simply incorrect, and at best not nearly as performant as if they were datetime columns.
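
A tiny illustration of the "incorrect" part (assumes the default us_english date format):

-- as character strings, '1/10/7' sorts before '1/2/7':
SELECT CASE WHEN '1/10/7' < '1/2/7' THEN 'string compare: 1/10 before 1/2' END

-- as datetime values the comparison behaves correctly:
SELECT CASE WHEN CONVERT(datetime, '1/10/7') > CONVERT(datetime, '1/2/7') THEN 'date compare: 1/10 after 1/2' END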

Either way, check the plan of the query and make sure it is using the index. It is probably not using an index, or it is using the primary key if you have a clustered one on a surrogate (identity) key. Consider posting the actual table structure, if possible with keys, constraints, etc.
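
The quickest way to check is from a query window (these are standard commands):

SET STATISTICS IO ON
SET STATISTICS TIME ON
-- run your SELECT here; the Messages tab then shows logical reads and timings,
-- and Ctrl+L (Display Estimated Execution Plan) shows whether your index is used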

|||

Thanks for the commentary and info. CustDate and CustTime are both smalldatetime. I am executing the query from ASP, and for purposes of simplifying my post I minimized the actual SELECT statement,

but here is the actual statement:

SELECT SiteControlData.SITEID,
       SiteControlData.TS1MIN,
       SiteControlData.TS1MAX,
       SiteControlData.TS1AVG,
       Site.TS1,
       SiteControlData.VIEWTS1,
       SiteControlData.R1PROGRAM,
       SiteControlData.R1PROGRAM AS ModifierSolStorBoi,
       SiteControlData.R1PROGRAM * SiteControlData.RSOFPROGTEMP AS ModifierRet,
       ((SiteControlData.R1PROGRAM - 100) * SiteControlData.DSMULT) + 100 AS ModifierDel,
       SiteControlData.R2PROGRAM,
       SiteControlData.M1,
       SiteControlData.M2,
       SiteControlData.M3,
       SiteControlData.M4,
       SiteControlData.CALC_DATE,
       SiteControlData.DSMAXTEMP,
       SiteControlData.DSMINTEMP,
       SiteControlData.DSMULT,
       SiteControlData.DSDEGPROG,
       SiteControlData.RSMINTEMP,
       SiteControlData.RSOFPROGTEMP,
       SiteControlData.BHMAXTEMP,
       SiteControlData.BHMINTEMP,
       SiteControlData.SOLARMAXTEMP,
       SiteControlData.CALC_DATE + SiteControlData.CALC_TIME AS DATE,
       ISNULL(Site.SETSENSORR1, '0') AS SETSENSORR1,
       ISNULL(Site.SETSENSORR2, '0') AS SETSENSORR2,
       ISNULL(Site.SETSENSORR3, '0') AS SETSENSORR3,
       ISNULL(Site.SETSENSORR4, '0') AS SETSENSORR4
FROM SiteControlData
INNER JOIN Site ON SiteControlData.SITEID = Site.SITEID
WHERE (SiteControlData.SITEID = @SITEID)
  AND (SiteControlData.CALC_DATE BETWEEN @txtStartDate AND @txtEndDate)
ORDER BY DATE DESC

|||

Jim:

I just mocked this up with 75,000 SITE table entries and 6 million SiteControlData entries. The siteIDs were initially sprayed out evenly in the SiteControlData table. I then added an additional 32,767 rows for siteID 950 between 1/1/7 and 1/31/7 to badly skew the data for this particular siteID. I made sure that I had primary keys on both tables that were not clustered. In addition, I added an index to the SiteControlData table based on (1) siteID, (2) calc_date and (3) calc_time. My record size for the SiteControlData table is about 1,200 bytes per record. This requires a LOT of space!

Finally, I punched up a test SELECT based on (1) siteID = 950, (2) date range 1/7/7 through 1/17/7. This query returned 17,969 rows in 440 ms, and it consumed 18,057 logical reads on the SiteControlData table and 3 logical reads on the Site table. The query plan is based on index seeks with bookmark lookups for both tables. Under these conditions, in my view, it is performing quite well.

I then killed the SiteControlData table and rebuilt it with a clustered index based on (1) siteID, (2) calc_date, (3) calc_time and (4) an identity field. I knew that this would improve the performance by eliminating the bookmark lookups on SiteControlData. I then re-tested. The same 17,969 rows are returned in 326 ms, and the logical reads on the SiteControlData table were reduced from 18,057 to 3,805. HOWEVER! This by itself might not be sufficient motivation to build this as a clustered index.
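
The rebuild itself was nothing exotic; roughly this, with the identity column acting as a uniquifier (the names and omitted columns here are mine):

CREATE TABLE SiteControlData (
    RowID int IDENTITY(1,1) NOT NULL,
    SITEID int NOT NULL,
    CALC_DATE smalldatetime NOT NULL,
    CALC_TIME smalldatetime NOT NULL
    -- ... the remaining data columns ...
)

CREATE UNIQUE CLUSTERED INDEX CIX_SiteControlData
ON SiteControlData (SITEID, CALC_DATE, CALC_TIME, RowID)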

My real question is whether the overall performance of this table would be improved by adding this clustered index. In this case it might be, but there is no way that I can make that determination because I don't know the real picture of how this table is used. Since this clustered index would be based primarily on SiteID it might be of benefit, but I really don't want to more-or-less blindly add a clustered index to this table.

Now, all of the above is to some extent a bunch of baloney without getting a look at the query plan you are getting for your query and also without getting a look at your table definition -- basically, Louis Davidson's suggestion. Which version of SQL Server are you running? And do you have access to the database with either Query Analyzer (for SQL 2000) or Server Management Studio (for SQL 2005)?


Dave

|||

Dave -

I can't thank you enough for taking the time to look at my situation. I have participated in many forums, but the responses I have received here have been very helpful!

Here's some more of the scenario that might help. I have migrated my application from a DBF environment to SQL Server 2005 w/Server Management Studio.

More about my application: each site records data every 1/2 hour. That data is sent to a server, where it resides in CSV format until uploaded into SQL Server. The CSV uploads take place every 2-4 hours, as scheduled via a C# application. There is little consistency in what is available for import; we just upload what is there. The data is always queried the same way... find site XYZ for the date range identified and return a dataset. In a typical ASP page we return 7-8 different datasets for the same siteID and date range, and then all of the datasets are charted using SoftwareFX charting software.

Thanks again for the insight and help
