Sunday, February 19, 2012

indexed views

Hello,
I'm currently performance tuning a table with 100 million rows in it
(about 18 GB of data) and would like to know the following -

1. Is the table too large to be performance tuned, or is it better to just
redesign the schema?
2. Can techniques such as indexed views really help me with tuning such a table?
3. How long would it take to create a clustered or non-clustered index on
a varchar column (for instance) on a table with 100 million rows?
(I know this is a function of hardware as well - let's assume I'm using
a fairly maxed-out DL 360, i.e. dual processor with 4 GB of memory.)

Thank you very much
Alan|||1. Nope, that's a good size for tuning.
2. Yes.
3. It usually takes me about 20 seconds to type a CREATE INDEX statement.

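(For concreteness, the statements in question look roughly like the sketch below. The table, column, and view names are invented for illustration - they are not from the poster's actual schema.)

-- A stand-in for the 100-million-row table (hypothetical schema)
CREATE TABLE dbo.ActivityLog (
    LogId    bigint IDENTITY(1,1) NOT NULL,
    UserName varchar(100) NOT NULL,
    Activity varchar(50)  NOT NULL,
    LogDate  datetime     NOT NULL
);
GO

-- Question 3: a clustered and a non-clustered index on varchar columns
CREATE CLUSTERED INDEX IX_ActivityLog_UserName
    ON dbo.ActivityLog (UserName, LogId);

CREATE NONCLUSTERED INDEX IX_ActivityLog_Activity
    ON dbo.ActivityLog (Activity)
    INCLUDE (LogDate);                 -- covering column for common lookups
GO

-- Question 2: an indexed view that pre-aggregates the log.
-- SQL Server requires SCHEMABINDING and COUNT_BIG(*) here.
CREATE VIEW dbo.vw_ActivityCounts
WITH SCHEMABINDING
AS
SELECT UserName, Activity, COUNT_BIG(*) AS ActivityCount
FROM   dbo.ActivityLog
GROUP BY UserName, Activity;
GO

-- Materializing the view with a unique clustered index is what makes it "indexed"
CREATE UNIQUE CLUSTERED INDEX IX_vw_ActivityCounts
    ON dbo.vw_ActivityCounts (UserName, Activity);

Once the view is indexed, matching aggregate queries can be answered from the small materialized result instead of a scan of the 100-million-row table (automatically on Enterprise Edition, or with a NOEXPAND hint on other editions).
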
-PatP|||My question wasn't about your typing skills.
It's about how long it takes to create a clustered / non-clustered
index on a varchar column of a table with 100 million rows.
In this case I'm not joining other tables (the indexed view question);
I'm only selecting from this one table with 100 million rows.|||For a clustered index: it really depends on how fragmented the data is and what other indexes are on the same table. In my experience, 150M records take less than half an hour for a regular index.|||Thank you Kaiowas. That's good to know.
How long does it typically take to perform a query on your indexed,
150 million row table?

In this case, the fragmentation is low. The table serves as a user activity log.
This is currently a 20 GB table with 100 million rows, and I'd like to
decide between re-modelling the database design (normalizing the table into
a few more tables) and simply tuning the table as it stands.|||Sorry, I answered your question as asked instead of what you meant. It has been one of those weeks.

Building indexes isn't really CPU bound, and is rarely RAM bound (although for your table the RAM could be a constraint). Disk performance is usually the key component for building indexes. The DL 360 can be configured a lot of different ways, and disk performance is wildly variable depending on how the system is configured.

Normalizing the table is usually a good idea. When dealing with 20 GB of data, normalization is really important. If you normalize the data, you might be able to turn one table with 20 GB of data into six tables with 8 GB (total) of data. That will almost always improve your performance all by itself, just because the machine needs to schlepp a lot less data from point A to point B in order to resolve any query.

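(To make that concrete, a hedged sketch of what normalizing a wide activity log might look like - all names below are invented, not taken from the actual schema.)

-- Pull the repeated varchar values out into small lookup tables...
CREATE TABLE dbo.Users (
    UserId   int IDENTITY(1,1) NOT NULL PRIMARY KEY,
    UserName varchar(100) NOT NULL UNIQUE
);

CREATE TABLE dbo.ActivityTypes (
    ActivityTypeId smallint IDENTITY(1,1) NOT NULL PRIMARY KEY,
    ActivityName   varchar(50) NOT NULL UNIQUE
);

-- ...so each of the 100 million log rows carries a few integers
-- instead of long, endlessly repeated strings.
CREATE TABLE dbo.ActivityLogNormalized (
    LogId          bigint   IDENTITY(1,1) NOT NULL PRIMARY KEY,
    UserId         int      NOT NULL REFERENCES dbo.Users (UserId),
    ActivityTypeId smallint NOT NULL REFERENCES dbo.ActivityTypes (ActivityTypeId),
    LogDate        datetime NOT NULL
    -- plus whatever narrow columns remain from the original table
);

If the wide columns really are repetitive strings, swapping them for 4-byte and 2-byte keys per row is roughly where a shrink of that order comes from.
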
-PatP|||Just out of curiosity, how wide is this 20GB table? For me, that would be one of the places to look. Since creating an index on this table could take up to half an hour, this is something that you will not be able to do by trial and error. Are you certain there is table scanning going on? And, more importantly, that the queries you are tuning can use indexes?|||Hi MCrowley,
What do you mean by wide? I'm sorry, I didn't quite understand.
By wide do you mean the size of the largest row in the table?

Thank you for your response|||Also MCrowley,
The queries on this log table are pretty straightforward and simple.
The kind of vanilla queries you'd typically run on a log table that
tracks online utilization.|||Sorry. I meant how many columns.

As for the queries, if they are of the form:

select *
from [really hugenormous table]
where logmessage like '%error%'

then no amount of indexing is going to help. Kinda like looking for all people in the phone book who are named '%son'.|||MCrowley,
The queries are not wildcarded. There are 15 columns in the table.
Queries are vanilla, of the form -

select ...
from ...
where col1 = val1
and col2 = val2
...

no like clauses etc.|||As PatP points out, normalization will give you significant performance benefits. I'd say it would trump disk speed and will improve performance for caching and index seeking and searching. Then, index those tables to benefit your application.|||"Normalizing the table is usually a good idea. When dealing with 20 GB of data, normalization is really important. If you normalize the data, you might be able to turn one table with 20 GB of data into six tables with 8 GB (total) of data. That will almost always improve your performance all by itself, just because the machine needs to schlepp a lot less data from point A to point B in order to resolve any query."

Ok, I'm gonna stick my neck out on this one and disagree with Pat.

Normalizing data does NOT necessarily result in an increase in performance, because many of your queries will now require joins which would not be required in a denormalized schema. The primary purpose of denormalizing data is to increase the performance of output queries. Witness the common Star Schema used (all too frequently) in Data Warehouses.

What you lose by denormalizing data is control over relational integrity. When you have hierarchical relationships between datasets more than two levels deep, and then try to represent this in a flat table, enforcing business rules for input becomes cumbersome, and you will find yourself doing a lot of coding to make up for it. That is why Star Schemas are best used for applications that focus on reporting, and which receive their data from external sources which can be trusted to verify the integrity of the data they submit.

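(For anyone who hasn't met the term: a minimal star schema is one central fact table surrounded by flat, denormalized dimension tables, roughly as sketched below with invented names.)

-- Denormalized dimensions: each one is a single flat table, no snowflaking
CREATE TABLE dbo.DimDate (
    DateKey      int NOT NULL PRIMARY KEY,     -- e.g. 20120219
    MonthName    varchar(20) NOT NULL,
    CalendarYear smallint NOT NULL
);

CREATE TABLE dbo.DimCustomer (
    CustomerKey  int NOT NULL PRIMARY KEY,
    CustomerName varchar(100) NOT NULL,
    City         varchar(50) NULL,
    Region       varchar(50) NULL              -- geography flattened into the same row
);

-- The central fact table holds the measures plus a key into each dimension
CREATE TABLE dbo.FactSales (
    DateKey     int NOT NULL REFERENCES dbo.DimDate (DateKey),
    CustomerKey int NOT NULL REFERENCES dbo.DimCustomer (CustomerKey),
    Quantity    int NOT NULL,
    SalesAmount decimal(18,2) NOT NULL
);
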
That said, I never use Star Schemas (the idiot's "One Size Fits All" solution to data warehousing), and the databases I design are almost always fully normalized.

Pat?|||If ROLAP is an option, then MOLAP is probably the right answer. The extra cost/effort is well worth the additional benefit.

You need to understand how current the data needs to be.

If you think about what work goes on at a primitive level, schlepping about 8 GB of data which you join at the last minute to produce 20 GB of results is going to be faster than hauling the entire 20 GB from start to finish. The overhead needed to support joins when indices are available is trivial, well less than 1% of total query cost for each joined table. When you compare the cost of the join to the cost of the raw I/O, the normalized design will run roughshod over the non-normalized design.

The reason that data warehouses store non-normalized data doesn't really have anything to do with machine performance... The intent of a data warehouse is to make data available to the end user. End users don't do joins naturally, so a star schema suits them well. When you look at the actual performance issues, a normalized design usually outperforms the warehouse design from the machine's perspective.

In a warehouse situation, people time is a lot more expensive than machine time... If the subject matter experts using a data warehouse have to spend an extra minute composing a query, that delay can cost more than the query can save. If the delay impacts a decision (for example in a board meeting), it will definitely cost the business money.

Disk is usually the weakest link in a database chain between a web/application server and the data it needs. Reducing the total amount of disk I/O is the highest payback optimization that I've found. If you can normalize data, it almost always reduces the amount of disk I/O needed, and that speeds up the whole system.

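(Using the invented tables from the normalization sketch above, the "join at the last minute" looks something like this - the big table is scanned as narrow integer rows, and the long strings are only picked up at the end from the tiny lookup tables.)

SELECT u.UserName,
       a.ActivityName,
       l.LogDate
FROM   dbo.ActivityLogNormalized AS l
JOIN   dbo.Users                 AS u ON u.UserId = l.UserId
JOIN   dbo.ActivityTypes         AS a ON a.ActivityTypeId = l.ActivityTypeId
WHERE  l.LogDate >= '20120101'
AND    l.LogDate <  '20120201';
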
-PatP|||"The reason that data warehouses store non-normalized data doesn't really have anything to do with machine performance... The intent of a data warehouse is to make data available to the end user. End users don't do joins naturally, so a star schema suits them well."

So why don't the popular Data Warehouse "Gurus" (Kimball, for instance) advocate storing data in normalized form and providing users with a view that denormalizes it?|||That's actually an excellent question. I don't have any answer, although I can hazard a guess.

Most data warehouses are implemented using database engines that have been around for a long time (DB2, Oracle, and Sybase), because those systems tend to have lots more data that helps justify the idea of the warehouse. Particularly on the older versions of these products (especially the Z Series versions), the engine still implements joins in ways that are expensive, while the channel mode architecture reduces the cost of hauling around redundant data. That tends to reduce the "machine cost" of the denormalized data, which in turn enhances the reduction in "human cost" (for writing queries) of denormalized data.

If you take a look at some of the advanced presentations of groups like TDMI, they are heading in the direction of normalized structures that either feed directly into cubes or into views for a more traditional warehouse. I personally like the idea of the quasi-normalized atomic area feeding directly into the cube... This gets the benefits of reduced storage (making the warehouse cheaper to build), as well as making the rebuilding of the cubes much faster.

The cube itself is a much more "user friendly" structure than the traditional star. The user still needs to grasp the idea of cube dimensions, but the concept of manipulating and managing a cube is so much easier to pick up than the corresponding concepts within a star that even the subject matter experts will quickly make the jump on their own once they see real-world examples of both structures. Once you introduce automated cube grooming (to predictively manage the cube rollups), and show them that they can simply manage security, they'll never willingly go back to a star.

-PatP|||My view is much more cynical than yours. My impression is that the Star Schema is being pushed because it is much easier to market a "one-size-fits-all" solution that doesn't require the client to learn even basic database design principles. I think the fundamental ideas of Data Warehousing as espoused by Inmon get lost along the way (flexibility, iterative development, etc...). The Star Schema is simply a much easier "sell" than a true data warehouse, and these data warehousing organizations are more interested in selling contracts than selling solutions. I really think they are like modern Snake Oil salesmen.

I think that there is a lot of confusion over the terms used in data warehousing, and this leads to conflicting expectations within the organization regarding the scope of the project. I am sure this is what leads to the failure of many "data warehousing" projects.

I advise people that a DATA WAREHOUSE synthesizes information from across the Enterprise, both geographically and by business area. If it doesn't contain data from more than one business discipline (HR, Finance, Production, Sales...) then it isn't a data warehouse. The complexity of relating these business areas necessitates a normalized schema.

A DATA MART comprises information targeted to a specific user group, either geographic or department focused. These should be spun off of the data warehouse, and may take the form of a Star or Cube.

A REPORTING DATABASE pulls data from a single source application and redefines it for reporting. Too many people think they are building data warehouses when they are actually only building reporting databases.

My impression is that Inmon espouses true data warehousing, while Kimball (more popular these days) is selling data marts and calling them data warehouses.

You seem to differentiate greatly between Cubes and Star Schemas. What do you see as the fundamental difference? To me, a Cube is basically a highly tuned pivot table, which at its heart is a flat file and thus relationally equivalent to a Star.|||"You seem to differentiate greatly between Cubes and Star Schemas. What do you see as the fundamental difference? To me, a Cube is basically a highly tuned pivot table, which at its heart is a flat file and thus relationally equivalent to a Star."

To me, a star schema is exactly that - a schema that represents the logical view of the data. A cube is the physical implementation of that schema, using a storage mechanism (designed for queries against that schema, especially aggregates) that is fundamentally different from what an RDBMS uses.|||A star schema is still a schema. The user needs to think in terms of tables, rows, etc. This is fine for the hard-core data analyst, and is actually more powerful/flexible/efficient than a cube for them.

A cube is an N-dimensional structure that represents the gestalt of the business data in the warehouse. While the cube itself might not be possible to represent in a 3D model, the principles of the cube can be illustrated that way and then logically extrapolated to however many dimensions the cube needs without breaking a mental sweat.

At least in my experience, a subject matter expert can easily understand a cube. They quickly "get" the idea that you can construct queries that cut through the volume of the cube in all kinds of interesting ways. At first, they often limit themselves to two or three dimensions, but that quickly gives way to wildly complex cube expressions that would be monsters to create in a star schema, and often far worse in a normalized schema.

I don't have any religious attachment to one technology over another. There are some that I'm more comfortable with than others, but I'm in favor of whatever works. In most cases, I value human time/energy over machine time/cost, although sometimes you hit limits that force you to re-evaluate those values. At least so far, I've found cubes to work better than star schemas in all but two cases (and the jury is still out on one of them).

-PatP|||Inmon is correct in theory, Kimball's approach is more easily implemented. Ever try to centralize a very large organiziation intent on decentralization? Forget the battle, win the wars...|||That's the problem. Kimball's approach is becoming popular because it is easy, not because it is good. The result is a lot of data marts masquerading as data warehouses, giving the whole concept of data warehousing a bad name.

It's a cheap-ass sales tactic to me. Used cars and snake oil.|||It may be that the views idea is a bit discouraged, because you can never really predict what the end-user wants to see. This would mean you have to join pretty much all of the tables, even if the end-user really only wanted data from a few of them. I have seen this problem in a few of the applications here. I see a poorly performing query, dig into it, and see a view joining a bunch of tables with data the query does not even use. When I ask the developer about it, they always say "Well, the view had the data I needed."

Whoopsie! Slid into a rant. Sorry.
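
(To illustrate the anti-pattern MCrowley describes, with invented names: a "kitchen sink" view that joins everything, and a query that only ever needed two columns from one of those tables.)

-- The convenient-but-costly view: joins many tables "just in case"
CREATE VIEW dbo.vw_OrderDetails AS
SELECT o.OrderId, o.OrderDate, c.CustomerName, c.City,
       p.ProductName, p.ListPrice, s.ShipperName
FROM   dbo.Orders    AS o
JOIN   dbo.Customers AS c ON c.CustomerId = o.CustomerId
JOIN   dbo.Products  AS p ON p.ProductId  = o.ProductId
JOIN   dbo.Shippers  AS s ON s.ShipperId  = o.ShipperId;
GO

-- "The view had the data I needed" - but unless the optimizer can prove the
-- extra joins redundant (trusted foreign keys, unique keys), this pays for
-- all of them when a direct query against dbo.Orders would have done.
SELECT OrderId, OrderDate
FROM   dbo.vw_OrderDetails
WHERE  OrderDate >= '20120101';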
