[GRLUG] A data organization problem related to tags. Any database experts around?

Adam Tauno Williams awilliam at whitemice.org
Mon Apr 12 21:24:21 EDT 2010


> The simplest way to implement it might be a tag table, or a simple
> key-value pair table that can associate every object with any other
> object. That seems problematic, though, because that table is going to
> get *huge*, and the larger it gets, the more expensive it will be to
> query and update it.

Define "huge".  I see this argument occasionally, and generally I think
it is bogus.  As a wise man said: 'premature optimization is evil'
-Knuth.  

On any, remotely modern, RDBMS you are going to have no problems at all
until that table is well into the millions of records.

And the issue is not really record count - but value cardinality.  A low
cardinality causes indexes trees to be deep and expensive to update, and
slower to search.  A higher cardinality - which is common for a
linking/tagging example - will scale much more efficiently.  

With low cardinality - if you EXPLAIN your queries - you'll frequently
discover that your RDBMS' optimizer is [wisely] discarding your
carefully [or not] crafted indexes anyway, regardless of table size.

In addition to that every modern [which possibly excludes MySQL - but,
seriously, who cares?] supports conditional indexes and fragmented
tables.   These let you keep the simple link-table design [easily
supported in all languages] while having the increased scalability of a
more 'chunky' design.

For example, in OGo, we have an obj_info table;  which is basically
[with some dressing]: link_id INT, source_id INT, target_id INT.  Since
source_id and target_id have a naturally high cardinality you can pretty
much pour as much linking in there as you want.

Worried about index depth or append speed?  Create multiple fragmented
indexes:

create index obj_link_1m on obj_link(target_id) WHERE target_id >
1000000 AND target_id < 2000000;
create index obj_link_2m on obj_link(target_id) WHERE target_id >
1000000 AND target_id < 2000000;
create index obj_link_3m on obj_link(target_id) WHERE target_id >
2000000 AND target_id < 3000000;

.. or whatever ..

Don't worry, the optimizer will choose the correct one.

You can do an equivalent [with slightly different syntax] in Informix,
DB2, or M$-SQL.   Each also provides transparent ways to fragment to
data table itself [though the details of their strategies differ quite a
bit - they all work].

Unless your RDBMS tables balloon into the hundreds of millions of
records most performance problems point to DBA incompetence or just
apathy.  Know thy tools!




More information about the grlug mailing list