Friday, March 9, 2012

Question about Clustered Indexes

Using SQL Server 2000 ... hopefully not too dumb a question.

Is there a performance hit using Clustered Index on a table that gets a lot of deletes?

I'm creating a Transaction Log table that will get about 4,000 inserts per day. The value of some of this historical data is worthless after a while, so I delete it.

It occurs to me that this may create a lot of fragmentation. If so, is this cleaned up during weekly "Reorganize data and index pages" in the Maintenance Plan? Do I also need to select "Remove unused space from database files"?

Additional question: I though that care needed to be taken that a clustered key be a value that always increments (datestamp, identity key, etc), yet in this write-up, it shows using randomly generated key values. I'm confused. Wouldn't it have to reorganize everything with greater values to insert the new row into the appropriate spot?
http://www.sql-server-performance.com/gv_clustered_indexes.aspI remember reading that in that article too and it didn't make sense to me either. Using a guid for a pk will certainly cause a lot of fragmentation if your fill factor on the clustered index is high, because page splits will occur often. Note that in 2005, you can use NEWSEQUENTIALID() for a guid pk to avoid this problem. in 2000 you an use Gert Draper's XPGUID.dll for the same feature.

If you set the fill factor on the index high enough, then page splits should occur only rarely if at all since you are deleting data from the table on a regular basis.

I wouldn't use the "remove unused space" task because allocating extents is a rather expensive operation for the server - generally what I do is allocate enough space to your db files to fit the largest amount of data you expect the db to have so that you aren't allocating extents at runtime. if you shrink the files, it seems likely that the server will just have to grow them again (assuming you have autogrowth turned on).|||If you set the fill factor on the index high enough, then page splits should occur only rarely if at all since you are deleting data from the table on a regular basis.Shouldn't that be low enough?


I wouldn't use the "remove unused space" task
+1

Monotonically increasing clustered keys (as you mentioned) are most useful for transactional tables with a high ratio\ number of inserts. With only 4000 a day and an adequte fill factor you should be ok. A reorganise will remove fragmentation. If you are worried, run it nightly (if you have the window) - there's no rule that says you can only reindex on a weekly basis.

Also, deletions will not cause fragmentation but they will reduce the amount of data per page - depending on the question you ask this may be considered a good or bad thing.

HTH|||Jezamine - what on earth is your local time?|||Vich: Clustered indexes on random values are really a bad idea, the indexes got to be fragmented. The only really good candidates to clustered indexes are (semi-)sequential data. Large amount of data with relatively few updates in combination with a suitable fillfactor may also do, but not random data.

With the amount of 4000 inserts a day and a limited lifetime of the data I doubt that you actually need a clustered index, so my advice would be to delete the clustered index and create a nonclustered instead.|||Shouldn't that be low enough?

um, yes, thanks. :o

Jezamine - what on earth is your local time?

i'm in the seattle area, why?

EDIT: here's some justification for why you shouldn't use that "remove unused space" task, which is essentially the same as having autoshrink on, from someone who actually knows what they are talking about - unlike me! ;)

http://blogs.msdn.com/sqlserverstorageengine/archive/2007/03/28/turn-auto-shrink-off.aspx|||i'm in the seattle area, why?Because you were active as I got into work 8:00 am bst. Late one?|||Because you were active as I got into work 8:00 am bst. Late one?

bst?? British Summer Time?

Regards,

hmscott|||yea, i'm up late sometimes I guess. it's an addiction. help me!|||bst?? British Summer Time?

Regards,

hmscottYa - our clocks have gone... um... forward.|||It's not that. I just saw BST and I kinda focused in on the first two letters.

Sorry, not much humor there, but it just kinda caught me...

All right folks, nothing to see here, move along...:o

Regards,

hmscott|||Shouldn't that be low enough?

+1

Monotonically increasing clustered keys (as you mentioned) are most useful for transactional tables with a high ratio\ number of inserts.
I would think that clustered indexes:

1. Saves on index maintenance time/space.

2. Decreases seek time. Last week I changed an index to Clustered and an associated update query I was trying to tune up changed from 100 seconds down to 20 seconds (to process about 15,000 rows). The table in question does not get a lot of inserts or deletes.

I can see how it would improve transactional table inserts for reason #1, but I would think that it would potentially vastly improve rarely manipulated but often linked to master tables (like a SKU list) for reason #2.

With only 4000 a day and an adequte fill factor you should be ok. A reorganise will remove fragmentation. If you are worried, run it nightly (if you have the window) - there's no rule that says you can only reindex on a weekly basis.
Thanks for that info - so any damage to seek performance gets cleaned up by Maintenance, therefore no big deal. My weekly should be fine - I'll run the delete process once a week and delete maybe 10%, and it'll run just before weekly maintenance (that currently only takes 10 minutes for the entire database).

Also, deletions will not cause fragmentation but they will reduce the amount of data per page - depending on the question you ask this may be considered a good or bad thing.
HTHI'm afraid I don't understand this. This reveals that I lack a fundamental understanding of how Clustered Tables are actually implemented. I thought they were just sequential by the key value and therefore a delete would cause a fragment and of course any block read may contain logically deleted records. A reorganize would mean the segments are rewritten to remove the fragments. Writing a row that doesn't fit neatly at the top of an existing segment means it has to be inserted by rewriting everything on top of the insert position to create a space. Is my "understanding" wrong or too oversimplified to be useful?

Vich: Clustered indexes on random values are really a bad idea, the indexes got to be fragmented. The only really good candidates to clustered indexes are (semi-)sequential data. Large amount of data with relatively few updates in combination with a suitable fillfactor may also do, but not random data.

Why "relatively few updates"? Does this have to do with how later updating of null columns are implemented in clustered tables. That's a subject that's always been an entire mystery to me (even for non-clustered tables).

I try to think about what happens internally. I insert a row but leave half the columns as "NULL". Isn't the rule in Relational Databases that null columns don't take space?

Some time later, I come along and create values for some NULL columns. Now; for a clustered table, where does it put the values?

Even a clustered row can't be fully pre-extended (given how big a nvarchar(8000) could be). Does such an update later cause reorganization?

With the amount of 4000 inserts a day and a limited lifetime of the data I doubt that you actually need a clustered index, so my advice would be to delete the clustered index and create a nonclustered instead.
Thanks for that brilliant thought. I just realized that I will NEVER use the unique primary key in this transaction file. Most often, I'll be using it to inquire on all transactions for a particular SKU in reverse chronological sequence back to a particular date.

I'd still like to understand more about Clustered Indexes, but clearly I shouldn't make this index Clustered. Perhaps I'll benifit by making the "Transaction Date" index clustered since I'll often want to do a reverse-sequential-read by Transaction Date and they're always inserted in sequence - also; once created they are never changed.|||I would think that clustered indexes:

1. Saves on index maintenance time/space.2. Decreases seek time. Last week I changed an index to Clustered and an associated update query I was trying to tune up changed from 100 seconds down to 20 seconds (to process about 15,000 rows). The table in question does not get a lot of inserts or deletes.
I can see how it would improve transactional table inserts for reason #1, but I would think that it would potentially vastly improve rarely manipulated but often linked to master tables (like a SKU list) for reason #2.My point was about monotonically increasing clustered indexes, not clustered indexes in general.


I'm afraid I don't understand this. This reveals that I lack a fundamental understanding of how Clustered Tables are actually implemented. I thought they were just sequential by the key value and therefore a delete would cause a fragment and of course any block read may contain logically deleted records. A reorganize would mean the segments are rewritten to remove the fragments. Writing a row that doesn't fit neatly at the top of an existing segment means it has to be inserted by rewriting everything on top of the insert position to create a space. Is my "understanding" wrong or too oversimplified to be useful?
Imagine ten rows with one integer column with values 1, 2, 3, 4, 5, 6, 7, 8, 9, 10. Clustered index on the integer column.
Rows 1, 2, 3 are in page one. Rows 4, 5, 6, 7 are in page two. Rows 8, 9, 10 are in page three. I delete rows 4, 5, 9. So now:
Page one: 1, 2, 3
Page two: 6, 7
Page three: 8, 10

No fragmentation - the data is still in consecutive pages, order by the values in the key. The pages are just a little less full, which is not the same thing. Performance is still affected but this is not fragmentation. If they were full to begin with and I inserted 5.5 (ok, ok - they were decimals :D) then we will introduce fragmentation because half the rows in page two would have to be moved to page four. Now the page order would be one, two, four, three.

Quickly on some of your other points not addressed at me:
I think roac meant few updates of the clustered index keys.
A nullable column has an extra bit "flag" to indicate when it is null. So no - nulls take up no extra space but nullable columns do take up more space to accommodate the flag.

Have you seen this?
http://www.microsoft.com/technet/prodtechnol/sql/2000/maintain/ss2kidbp.mspx|||I would think that clustered indexes.....
.........Actually what were addressing there lol? Just realised there was a bit in the quote about fillfactors and shrinks too.|||EDIT: here's some justification for why you shouldn't use that "remove unused space" task, which is essentially the same as having autoshrink on, from someone who actually knows what they are talking about - unlike me! ;)

http://blogs.msdn.com/sqlserverstorageengine/archive/2007/03/28/turn-auto-shrink-off.aspxIn case you are interested vich Paul Randall is a bit of a guru's guru - he wrote the dbcc dbreindex and a few other console commands and assisted with the paper on fragmentation I linked (though he isn't credited lol). If I see anything by that guy I make sure I read it twice :)|||I think roac meant few updates of the clustered index keys.
Yes I did. And variable-length columns. You may get a page split from updating variable length columns as well, as these are normally stored in-page. So, if you update a varchar(40) from a value 'Test' to 'Test value', that may eventually end up in a page split, since there is not enough room in the page to store 'Test value'. An even better example would of course be to update a variable length columns from NULL to a value.|||Yes I did. And variable-length columns. You may get a page split from updating variable length columns as well, as these are normally stored in-page. So, if you update a varchar(40) from a value 'Test' to 'Test value', that may eventually end up in a page split, since there is not enough room in the page to store 'Test value'. An even better example would of course be to update a variable length columns from NULL to a value.
I'll extend that to mean that even nullable non-variable length columns may cause a page split if populated after more rows are inserted (via an UPDATE statement).

Page split is a new concept for me. In truth, a "page" is a new concept. Is that an SQL managed thing, or is that part of the file system? I guess I need to go read some intro to SQL's physical layer.

It follows that a database is capable of inserting or reordering a page into the logical sequence - by definition deviating from the physical (OS File System) contiguous logical sequence.

I'm making a lot of uninformed leaps of logic here. Anyone got a link for the physical layer ... or at least the layer right above the OS file system? One that focuses on paging would be great.

EDIT: Scratch that ... finally looking at that article referenced above. It seems to cover it. :S Thanks pootle flump.|||I use GUIDs in my databases all the time, often as primary keys, and I have never seen any appreciable performance issues.|||I use GUIDs in my databases all the time, often as primary keys, and I have never seen any appreciable performance issues.Clustered pks? Do you compensate with low fill factors? Batch systems or OLTP? The old GUID or the new sequential one?

I will admit I have never benchmarked a page split cost. Has anyone?|||I'm making a lot of uninformed leaps of logic here. Anyone got a link for the physical layer ... or at least the layer right above the OS file system? One that focuses on paging would be great.

I just HAVE TO recommend Kalen Delaney's Inside SQL Server 2005 - The Storage Engine (http://www.microsoft.com/mspress/books/7436.aspx).|||I just HAVE TO recommend Kalen Delaney's Inside SQL Server 2005 - The Storage Engine (http://www.microsoft.com/mspress/books/7436.aspx).Lol - I was going to do the same until he mentioned he was on top of it. I wonder, though, if she assumes a little knowledge on the part of the reader. I am not sure if it is a good "first pass" on physical storage for someone new to the concept of, for example, data pages. I could be wrong of course. All I know is there are a fair few bits I will need to revisit for it to truly sink in.|||Lol - I was going to do the same until he mentioned he was on top of it. I wonder, though, if she assumes a little knowledge on the part of the reader. I am not sure if it is a good "first pass" on physical storage for someone new to the concept of, for example, data pages. I could be wrong of course. All I know is there are a fair few bits I will need to revisit for it to truly sink in.
Thanks for the recommendation. Amazon has one used for $26. I'll also get T-SQL Querying (http://www.sql.co.il/books/insidetsql2005/), another in the Inside Microsoft SQL Server 2005 series that looks more up my alley (I'm not a DBA ... just a curious programmer who gets saddled with DBA type tasks).

That's a bit more than I was hoping simply to answer this one topic of "the ramifications of clustered indexing" - I guess I can't hope for a single sitting explanation. The linked article is helping to answer a lot however.

If someone has another link to shed another angle on it, I'd appreciate that. However; I think these books will have a chapter or two that will suffice (if I can understand them that is).|||I was honored with a private EMAIL conversation with I Itzik Ben-Gan on this subject. He referenced an interview he did with David Cambell. He noted that he asked a pointed question on Clustered Indexed tables.

Interview here:
http://www.sqlmag.com/article/articleid/96048/96048.html

One quote in particular addresses the concern I was having: ...
Back in the days when you were deeply involved in the shaping of the storage
engine of SQL Server 7.0 you used to participate in some private SQL
newsgroups. Your explanations were so detailed and clear and were considered
pure gold by many of us. For example, I remember how you solved a major
mystery that involved querying a small heap that reported a huge number of
logical reads, after the table underwent an update statement expanding varchar
strings. You explained that the expansion of the varchar strings that didn’t have
room to expand in their hosting pages caused a large number of forwarding
pointers, and that SQL Server had to jump back and forth between the page
holding the pointer and the page holding the pointed record. Suddenly it all
seemed to make sense. For us teachers and students, someone with both a lot
of knowledge and great explanatory skills is a rare sight, and the SQL Server
community can benefit from having this knowledge conveyed through books.
Have you considered/are considering writing a book and passing on what you
know through such means?

I received Itzik's book last night. No time to read his recommendation yet (Chapter 3, Query Tuning - a fairly long chapter).

The question I quote above implies that page splits are done with a pointer to the actual record that must have gotten physically added elsewhere.

I should note that this article references some advice to NOT use incrementing keys. That so thoroughly confused me that I asked him for clarification. He explained that it's for a bigger shop that splits tables across multiple physical drives, and this is a method for taking advantage of simultaneous writing because it's not all getting heaped onto the top.

I think I'm displaying my ignorance here, but that's OK. Just though I'd mention that here in case somebody else found David's recommendations confusing.|||I should note that this article references some advice to NOT use incrementing keys. That so thoroughly confused me that I asked him for clarification. He explained that it's for a bigger shop that splits tables across multiple physical drives, and this is a method for taking advantage of simultaneous writing because it's not all getting heaped onto the top.Isn't that about hot spots and isn't that out of date since SQL 2000? It was an issue in SQL 7.0 but not now?

I only know from my own experience that I recently worked with an inherited medium sized database (~40GB) with about 10 or so very narrow tables (typically 4-5 fields). Not terribly large db but pretty deep tables of several hundred million rows each. Monthly loads in a batch of around 30 million records per batch per table. Changing the clustered keys from "random" to incrementing bought something like a 90% drop in load time.

Good job on getting some first hand knowledge from Itzik Ben-Gan - I haven't read his books yet but some of his articles. Obviously he is very well respected.|||Isn't that about hot spots and isn't that out of date since SQL 2000? It was an issue in SQL 7.0 but not now?

I only know from my own experience that I recently worked with an inherited medium sized database (~40GB) with about 10 or so very narrow tables (typically 4-5 fields). Not terribly large db but pretty deep tables of several hundred million rows each. Monthly loads in a batch of around 30 million records per batch per table. Changing the clustered keys from "random" to incrementing bought something like a 90% drop in load time.

Good job on getting some first hand knowledge from Itzik Ben-Gan - I haven't read his books yet but some of his articles. Obviously he is very well respected.Thanks, he didn't elaborate much - just a couple of sentences to clarify and said that for a single-drive DB ordered keys IS better (darn, my EMAIL is deleted). I can only imagine it would have to be done in a particular way or (even if spread across physical drives) it would go page-split crazy inserting into the center of that section of a heap.

BTW: May I assume that in SQL Server that "Heap" equates to "Cluster Indexed"?

As for Version 7 vs. 2000 (and even 2005) the article mentions that most of the "SQL Engine" is still from version 7, and the subsequent versions have just added Enterprise capability (to expand it beyond just being a DB - mining, reporting, etc). I imagine your question was rhetorical (or directed at some expert) since I'm a comparative rookie.

Surely some aspects of performance have been tuned up (or rewritten) and this could be one of them, do you know if that's the case? Can you tell me if Page Split is done with pointers from the calculated location? Sort of an extension of the calculated (supposed key) location - so the split is physically located close-by or far-away (physically).

If page splitting was and still is handled that way, then the problem described would still be an issue.|||BTW: May I assume that in SQL Server that "Heap" equates to "Cluster Indexed"?No - you may not :) It is EXACTLY the opposite. The definition of a Heap is a table without a clustered index.

I imagine your question was rhetorical (or directed at some expert) since I'm a comparative rookie.Yeah - wondering aloud. Happy for clarification.|||Yeah - wondering aloud. Happy for clarification.
LOL. I have no problem keeping the information flowing mainly in the direction of you guys out. I'm learning to mainly chime in when I have a question or have stumbled across something interesting like Mr. Ben-Gan's interview with Mr. Cambell (aka Mr. SQL Server).

This was the reference to heap and indexing.

3. Watch out for insert rates on clustered indexes with monotonically
increasing keys. (You can distribute inserts sometimes by doing some key
munging tricks)

4. For logging tables with no indexes and high insert rates go with a heap
as we can distribute the inserts across a number of pages.
On #3, I'm guessing that "distribute inserts" means distributing it across multiple drives.

Drrrr, heap! Thanks for the clarification. You can put the brick away now (it was required to get through my thick skull - Hey, I've got lots of stupid questions where that one came from! LOL :p).|||Heh heh - I was tootling around on SQL team and ended back at a classic thread:
http://www.sqlteam.com/forums/topic.asp?TOPIC_ID=53085&whichpage=2&SearchTerms=guid%2Cindex

Yup - good old Paul Randall again.

You will get an exam on Merry-Go-Round Scans and their implications shortly ;)

No comments:

Post a Comment