[GRLUG] A data organization problem related to tags. Any database experts around?

Mon Apr 12 21:24:21 EDT 2010

> The simplest way to implement it might be a tag table, or a simple
> key-value pair table that can associate every object with any other
> object. That seems problematic, though, because that table is going to
> get *huge*, and the larger it gets, the more expensive it will be to
> query and update it.

Define "huge".  I see this argument occasionally, and generally I think
it is bogus.  As a wise man said: 'premature optimization is evil'
-Knuth.  

On any, remotely modern, RDBMS you are going to have no problems at all
until that table is well into the millions of records.

And the issue is not really record count - but value cardinality.  A low
cardinality causes indexes trees to be deep and expensive to update, and
slower to search.  A higher cardinality - which is common for a
linking/tagging example - will scale much more efficiently.  

With low cardinality - if you EXPLAIN your queries - you'll frequently
discover that your RDBMS' optimizer is [wisely] discarding your
carefully [or not] crafted indexes anyway, regardless of table size.

In addition to that every modern [which possibly excludes MySQL - but,
seriously, who cares?] supports conditional indexes and fragmented
tables.   These let you keep the simple link-table design [easily
supported in all languages] while having the increased scalability of a
more 'chunky' design.

For example, in OGo, we have an obj_info table;  which is basically
[with some dressing]: link_id INT, source_id INT, target_id INT.  Since
source_id and target_id have a naturally high cardinality you can pretty
much pour as much linking in there as you want.

Worried about index depth or append speed?  Create multiple fragmented
indexes:

create index obj_link_1m on obj_link(target_id) WHERE target_id >
1000000 AND target_id < 2000000;
create index obj_link_2m on obj_link(target_id) WHERE target_id >
1000000 AND target_id < 2000000;
create index obj_link_3m on obj_link(target_id) WHERE target_id >
2000000 AND target_id < 3000000;

.. or whatever ..

Don't worry, the optimizer will choose the correct one.

You can do an equivalent [with slightly different syntax] in Informix,
DB2, or M$-SQL.   Each also provides transparent ways to fragment to
data table itself [though the details of their strategies differ quite a
bit - they all work].

Unless your RDBMS tables balloon into the hundreds of millions of
records most performance problems point to DBA incompetence or just
apathy.  Know thy tools!