[GRLUG] A data organization problem related to tags. Any database experts around?

Tue Apr 13 15:21:48 EDT 2010

Ok, I'm going to try this again, a bit humbler, and with a bit clearer head.

On Mon, Apr 12, 2010 at 9:24 PM, Adam Tauno Williams
<awilliam at whitemice.org> wrote:
>> The simplest way to implement it might be a tag table, or a simple
>> key-value pair table that can associate every object with any other
>> object. That seems problematic, though, because that table is going to
>> get *huge*, and the larger it gets, the more expensive it will be to
>> query and update it.
>
> Define "huge".  I see this argument occasionally, and generally I think
> it is bogus.  As a wise man said: 'premature optimization is evil'
> -Knuth.

I'll admit; this got me a little steamed, as not only is at a classic
and known quote, but I felt it completely irrelevant to the point of
what I was to ask. I described my problem, and I've since described
where I expect it to have scalability problems, what with the
potential exponential increase in relationships based on the various
ways of considering the source data.

I'm also not trying to look for optimization so much as consider a
base structure. Once a thing is written, it becomes harder and harder
to change it. With an RDBMS, a tag table would probably be fine where
I am currently, but I expect the record count to grow into the
millions within a year.

I also didn't mentioning versioning, but I don't think the tag table
necessarily needs to be versioned; I think that can be managed as part
of reparsing the example body during a rollback.

>
> On any, remotely modern, RDBMS you are going to have no problems at all
> until that table is well into the millions of records.
>
> And the issue is not really record count - but value cardinality.  A low
> cardinality causes indexes trees to be deep and expensive to update, and
> slower to search.  A higher cardinality - which is common for a
> linking/tagging example - will scale much more efficiently.

I had difficulty thinking through the cardinality issue last night.
Let's say I've got 2000 tags in use, 100,000 examples, each example
averages 30 tags, for a total around 3,000,000 rows. The cardinality
of the index of the exampleID column would be 100,000, while the
cardinality of the tagID column would be 2,000, correct? That does
seem like a comfortably-smaller set of numbers.

>
> With low cardinality - if you EXPLAIN your queries - you'll frequently
> discover that your RDBMS' optimizer is [wisely] discarding your
> carefully [or not] crafted indexes anyway, regardless of table size.
>
> In addition to that every modern [which possibly excludes MySQL - but,
> seriously, who cares?] supports conditional indexes and fragmented
> tables.   These let you keep the simple link-table design [easily
> supported in all languages] while having the increased scalability of a
> more 'chunky' design.
>
> For example, in OGo, we have an obj_info table;  which is basically
> [with some dressing]: link_id INT, source_id INT, target_id INT.  Since
> source_id and target_id have a naturally high cardinality you can pretty
> much pour as much linking in there as you want.
>
> Worried about index depth or append speed?  Create multiple fragmented
> indexes:
>
> create index obj_link_1m on obj_link(target_id) WHERE target_id >
> 1000000 AND target_id < 2000000;
> create index obj_link_2m on obj_link(target_id) WHERE target_id >
> 1000000 AND target_id < 2000000;
> create index obj_link_3m on obj_link(target_id) WHERE target_id >
> 2000000 AND target_id < 3000000;
>
> .. or whatever ..
>
> Don't worry, the optimizer will choose the correct one.
>
> You can do an equivalent [with slightly different syntax] in Informix,
> DB2, or M$-SQL.   Each also provides transparent ways to fragment to
> data table itself [though the details of their strategies differ quite a
> bit - they all work].
>
> Unless your RDBMS tables balloon into the hundreds of millions of
> records most performance problems point to DBA incompetence or just
> apathy.

I do expect the tag relationship table to grow; the relationships I
described earlier are nowhere near where I want to stop.

> Know thy tools!

That last line rubbed a little raw, too, considering the ^&*% I've had
to go through to keep the site working reasonably well in confined
spaces, in spare minutes on lunch breaks and other obligatory time
concerns.

-- 
:wq