Monday, March 12, 2012

indexing

Hello,

I need some help understanding why my indexes do not seem to be affecting my
searches. I would really appreciate help understanding what indexes I need
to make this query run faster. I realize that I use wildcards when searching
for g1.gene_name, but is there anything I can do to make that less of a
problem? I ran EXPLAIN on the search I wanted to optimize and got the
following:

EXPLAIN SELECT c1.SFID FROM Gene g1, cDNA c1, Transcript t1, Refseq r1 WHERE
(c1.SFID = t1.cDNA_SFID AND t1.gene_SFID = g1.SFID AND (g1.gene_sym = 'hh'
OR g1.genbank_acc = 'hh' OR g1.gene_name LIKE '%hh%')) OR (c1.genbank_acc =
'hh' OR c1.SUID = 'hh') OR (c1.SFID = t1.cDNA_SFID AND t1.gene_SFID =
g1.SFID AND g1.locuslink_id = r1.locuslink_id AND (r1.mRNA_acc = 'hh'));
+---+---+--------+--+---+--+---
+--------+
| table | type | possible_keys | key | key_len | ref | rows
| Extra |
+---+---+--------+--+---+--+---
+--------+
| r1 | index | mRNA_acc,llid,rma,rllid | rma | 25 | NULL | 20093
| Using index |
| g1 | ALL | PRIMARY,llid,ggs,gga,gll | NULL | NULL | NULL | 190475
| |
| c1 | ALL | PRIMARY,cga,cs | NULL | NULL | NULL | 43714
| where used |
| t1 | index | gene_SFID,gS,cS,tg,tc | gS | 4 | NULL | 47238
| where used; Using index |
+---+---+--------+--+---+--+---
+--------+

I have the following indexes (which were all added after the database was
populated):

ALTER TABLE cDNA ADD INDEX cga
(genbank_acc, SFID);

ALTER TABLE cDNA ADD INDEX co
(organism, SFID);

ALTER TABLE cDNA ADD INDEX cs
(SUID, SFID);

ALTER TABLE Gene ADD INDEX ggs
(gene_sym, SFID);

ALTER TABLE Gene ADD INDEX gga
(genbank_acc, SFID);

ALTER TABLE Gene ADD INDEX ggn
(gene_name, SFID);

ALTER TABLE Gene ADD INDEX go
(organism, SFID);

ALTER TABLE Gene ADD INDEX gll
(locuslink_id, SFID);

ALTER TABLE Gene ADD INDEX gui
(unigene_id, SFID);

ALTER TABLE Transcript ADD INDEX tg
(gene_SFID, cDNA_SFID);

ALTER TABLE Transcript ADD INDEX tc
(cDNA_SFID);

ALTER TABLE Refseq ADD INDEX rma
(mRNA_acc, locuslink_id);

ALTER TABLE Refseq ADD INDEX rllid
(locuslink_id);"superfly2" <darius_fatakia@.yahoo.com> wrote in message
news:cb030a$11i$1@.news.Stanford.EDU...
> Hello,
> I need some help understanding why my indexes do not seem to be affecting
my
> searches. I would really appreciate help understanding what indexes I need
> to make this query run faster. I realize that I use wildcards when
searching
> for g1.gene_name, but is there anything I can do to make that less of a
> problem? I ran EXPLAIN on the search I wanted to optimize and got the
> following:
> EXPLAIN SELECT c1.SFID FROM Gene g1, cDNA c1, Transcript t1, Refseq r1
WHERE
> (c1.SFID = t1.cDNA_SFID AND t1.gene_SFID = g1.SFID AND (g1.gene_sym = 'hh'
> OR g1.genbank_acc = 'hh' OR g1.gene_name LIKE '%hh%')) OR (c1.genbank_acc
=
> 'hh' OR c1.SUID = 'hh') OR (c1.SFID = t1.cDNA_SFID AND t1.gene_SFID =
> g1.SFID AND g1.locuslink_id = r1.locuslink_id AND (r1.mRNA_acc = 'hh'));
+---+---+--------+--+---+--+---
> +--------+
> | table | type | possible_keys | key | key_len | ref | rows
> | Extra |
+---+---+--------+--+---+--+---
> +--------+
> | r1 | index | mRNA_acc,llid,rma,rllid | rma | 25 | NULL |
20093
> | Using index |
> | g1 | ALL | PRIMARY,llid,ggs,gga,gll | NULL | NULL | NULL |
190475
> | |
> | c1 | ALL | PRIMARY,cga,cs | NULL | NULL | NULL |
43714
> | where used |
> | t1 | index | gene_SFID,gS,cS,tg,tc | gS | 4 | NULL |
47238
> | where used; Using index |
+---+---+--------+--+---+--+---
> +--------+
>
> I have the following indexes (which were all added after the database was
> populated):
> ALTER TABLE cDNA ADD INDEX cga
> (genbank_acc, SFID);
> ALTER TABLE cDNA ADD INDEX co
> (organism, SFID);
> ALTER TABLE cDNA ADD INDEX cs
> (SUID, SFID);
>
> ALTER TABLE Gene ADD INDEX ggs
> (gene_sym, SFID);
> ALTER TABLE Gene ADD INDEX gga
> (genbank_acc, SFID);
> ALTER TABLE Gene ADD INDEX ggn
> (gene_name, SFID);
> ALTER TABLE Gene ADD INDEX go
> (organism, SFID);
> ALTER TABLE Gene ADD INDEX gll
> (locuslink_id, SFID);
> ALTER TABLE Gene ADD INDEX gui
> (unigene_id, SFID);
>
> ALTER TABLE Transcript ADD INDEX tg
> (gene_SFID, cDNA_SFID);
> ALTER TABLE Transcript ADD INDEX tc
> (cDNA_SFID);
>
> ALTER TABLE Refseq ADD INDEX rma
> (mRNA_acc, locuslink_id);
> ALTER TABLE Refseq ADD INDEX rllid
> (locuslink_id);

I believe that EXPLAIN is a MySQL statement - this is a Microsoft SQL Server
nesgroup, so I guess you'll get a better answer in a MySQL forum.

Simon|||[posted and mailed, please reply in news]

superfly2 (darius_fatakia@.yahoo.com) writes:
> EXPLAIN SELECT c1.SFID FROM Gene g1, cDNA c1, Transcript t1, Refseq r1
> WHERE (c1.SFID = t1.cDNA_SFID AND t1.gene_SFID = g1.SFID AND
> (g1.gene_sym = 'hh' OR g1.genbank_acc = 'hh' OR g1.gene_name LIKE
> '%hh%')) OR (c1.genbank_acc = 'hh' OR c1.SUID = 'hh') OR (c1.SFID =
> t1.cDNA_SFID AND t1.gene_SFID = g1.SFID AND g1.locuslink_id =
> r1.locuslink_id AND (r1.mRNA_acc = 'hh'));

Obviously you are in the wrong newsgroup, beause there is no EXPLAIN
command in SQL Server, and you don't use ALTER TABLE to add indexes.

Nevertheless, I might be able to give some input. You have this
condition:

g1.gene_name LIKE '%hh%'

To resolve this condition, the DB engine cannot use the index for a
quick look up. Compare with looking through the index of a book and
try to find all keywords with 'hh' in them. You would have to scan
the entire index.

I don't know about your DBMS, but MS SQL Server could use the index
for a scan, and it would do it, if the index includes all columns in the
table that aer referred to. If I see right, you would have to add
locuslink_id to

ALTER TABLE Gene ADD INDEX ggs
(gene_sym, SFID);

Not that I know if this would help on your engine, but most DBMS's are
fond of using covering indexes.

Then again, I have no idea how your DBMS handle all those OR.

--
Erland Sommarskog, SQL Server MVP, esquel@.sommarskog.se

Books Online for SQL Server SP3 at
http://www.microsoft.com/sql/techin.../2000/books.asp

No comments:

Post a Comment