Indexes and Deadlocks: Information of Sequence clustering

Wednesday, March 28, 2012

Information of Sequence clustering

I’m a college student in my 10th semester at the Universidad de los Andes, Colombia, and I’m working on a data mining project. I need to use the sequence clustering approach, so I need to understand completely how it works: which inputs it uses, how the algorithm works, and what type of outputs it produces. Do you have any idea where I can find this kind of information, along with examples?

Any help would be appreciated.

Thank you for your time.

The typical use of the sequence clustering algorithm is to cluster clickstream data, i.e. to group the navigation patterns of website users.

You can find the algorithm details in this Microsoft Research paper that the implementation is based on: ftp://ftp.research.microsoft.com/pub/tr/tr-2000-18.pdf.

The book "Data Mining with SQL Server 2005" also has a chapter on the sequence clustering algorithm and examples of its usage.

|||

Thanks for the information, it was useful.

I have two more questions about this algorithm.

1) For which problems can the algorithm be used? The book you recommended uses the example of clicks on a web page, but in which other environments can sequence clustering be used?

2) Is it possible for a state in a sequence to be a set of elements? For example, if I want to use the algorithm on a medical data set, one state of the sequence could be two or more diagnoses.

|||

The algorithm can be used wherever there is a discrete sequence of events. In general, the accuracy of the algorithm degrades if there are too many possible sequence states (more than 70 or so).
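Before building a model, it may help to check how many distinct states your sequences contain. The following is only a minimal sketch in Python, outside of SQL Server; the function names, the sample data, and the hard limit of 70 (taken from the rough figure above) are assumptions for illustration.

```python
# Rough guideline from the answer above ("over 70 or so"); not an exact cutoff.
STATE_LIMIT = 70

def distinct_states(sequences):
    """Collect every distinct state that appears in any sequence."""
    return {state for seq in sequences for state in seq}

def too_many_states(sequences, limit=STATE_LIMIT):
    """Return True when the number of distinct states exceeds the guideline."""
    return len(distinct_states(sequences)) > limit

# Hypothetical sample data: each inner list is one ordered sequence of events.
sample = [["A", "B", "D"], ["A", "C", "D"], ["A", "D"]]
print(len(distinct_states(sample)), too_many_states(sample))
```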

The algorithm considers every discrete state as distinct and will not provide a "this or that" state. However, it could end up with a result such as:

A->C 40%

A->B 40%

A->D 20%

C->D 100%

B->D 100%

This result says that there are three paths from A to D, with varying probabilities: A->D (20%), A->C->D (40%), and A->B->D (40%).
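To see where the path probabilities come from, you can multiply the transition probabilities along each path. The following is a minimal Python sketch using the example numbers above; the `transitions` dictionary and the `paths_to` helper are illustrative names, not part of the SQL Server algorithm.

```python
# Transition probabilities copied from the example result above.
transitions = {
    "A": {"C": 0.4, "B": 0.4, "D": 0.2},
    "C": {"D": 1.0},
    "B": {"D": 1.0},
}

def paths_to(start, target, transitions):
    """Return sorted (path, probability) pairs for every route from start to target."""
    results = []
    stack = [([start], 1.0)]
    while stack:
        path, prob = stack.pop()
        state = path[-1]
        if state == target:
            results.append(("->".join(path), prob))
            continue
        for nxt, p in transitions.get(state, {}).items():
            if nxt not in path:  # skip states already visited to avoid cycles
                stack.append((path + [nxt], prob * p))
    return sorted(results)

for path, prob in paths_to("A", "D", transitions):
    print(f"{path}: {prob:.0%}")
```

The three path probabilities add up to 100% because every sequence that starts at A eventually reaches D in this example.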

If you want to explicitly equate two diagnoses, e.g. B and C are functionally identical, you should prepare your data that way and combine the states. You can do so using a calculated column in the data source view. The "Using SQL Server Data Mining" chapter in the book "Data Mining with SQL Server 2005" provides an example of this.
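The same idea of combining states can be sketched outside of the data source view. The Python below is only an illustration of the preprocessing step, not the calculated-column approach from the book; the mapping, the combined state name `B_or_C`, and the sample sequences are all hypothetical.

```python
# Hypothetical mapping: diagnoses B and C are treated as one combined state.
EQUIVALENT = {"B": "B_or_C", "C": "B_or_C"}

def combine_states(sequence, mapping):
    """Replace each state with its combined label, leaving unmapped states as-is."""
    return [mapping.get(state, state) for state in sequence]

# Hypothetical input: one ordered list of diagnoses per patient.
sequences = [["A", "B", "D"], ["A", "C", "D"], ["A", "D"]]
merged = [combine_states(seq, EQUIVALENT) for seq in sequences]
print(merged)
```

After this step the model sees only the states A, B_or_C, and D, so the two paths through B and C become a single path.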
