[GRLUG] A data organization problem related to tags. Any database experts around?

Ben DeMott ben.demott at gmail.com
Tue Apr 13 16:06:35 EDT 2010


Michael, buddy - Is this a public-facing database? Do you spawn 20
worker threads at it, or does it just run a single process - like an
inventory or financial update?
You're correct, Michael - 300,000 rows is nothing... any database can zip
through that; heck, it can probably hold that whole table in memory if
that's all it's doing.
You really have to re-adjust your thinking when you are writing
web applications; there may be 10, 20, 30, or 1,000 requests per minute.
You may get hit by a search engine crawler, and it may spawn off
20 or 30 requests at once.

The problem with pivoting relationships through joins is latency...
latency, latency, latency.
It is not the right way to approach this problem in 2010; there are too
many other good solutions.
Databases like MySQL are good for strongly relational tabular data, and
even then MySQL falls short quite a lot.

IDs that reference IDs are basically the definition of an INDEX... so
indexing an index or "lookup" table is a recipe for latency.

In any database, query and insert times will grow as the record count
grows - that is just physics.
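For reference, the join-based "tag link table" design being argued about looks roughly like this - a minimal sketch using Python's sqlite3 in place of MySQL, with hypothetical table and column names:

```python
import sqlite3

# In-memory stand-in for the MySQL schema under discussion.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE documents (document_id INTEGER PRIMARY KEY, title TEXT);
    CREATE TABLE tags      (tag_id      INTEGER PRIMARY KEY, name  TEXT);
    -- The link ("lookup") table: one row per (document, tag) pair.
    CREATE TABLE document_tags (
        document_id INTEGER REFERENCES documents(document_id),
        tag_id      INTEGER REFERENCES tags(tag_id)
    );
    CREATE INDEX idx_doc_tags_tag ON document_tags(tag_id);
""")
conn.executemany("INSERT INTO documents VALUES (?, ?)",
                 [(1, "cookbook entry"), (2, "another entry")])
conn.executemany("INSERT INTO tags VALUES (?, ?)",
                 [(10, "python"), (20, "sorting")])
conn.executemany("INSERT INTO document_tags VALUES (?, ?)",
                 [(1, 10), (1, 20), (2, 10)])

# The pivot: find every document carrying a given tag, via two joins
# through the link table.
rows = conn.execute("""
    SELECT d.document_id, d.title
      FROM documents d
      JOIN document_tags dt ON dt.document_id = d.document_id
      JOIN tags t           ON t.tag_id       = dt.tag_id
     WHERE t.name = ?
     ORDER BY d.document_id
""", ("python",)).fetchall()
print(rows)  # -> [(1, 'cookbook entry'), (2, 'another entry')]
```

Every tag lookup pays for those two joins; that per-request cost, multiplied by concurrent web traffic, is the latency being objected to above.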


Mike, this is the solution I would suggest.
Put the information in a search index such as Solr/Lucene - an
inverted-index engine built for exactly this kind of tag lookup.
Search the Solr/Lucene index, and have it return document IDs.
When you want to retrieve the documents themselves, you can then
perform a query against MySQL:
SELECT * FROM documents WHERE document_id IN (1,2,3)
Solr auto-warms and caches its searches, so performance will go through the roof.
It also will do faceted (layered) navigation for you... which solves your
problem with displaying the "cookbook"-style records.

In your scenario this is the correct tool to use.
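The two-step lookup described above can be sketched as follows. The `search_index()` function is a hypothetical stand-in for the Solr query (a real one would hit Solr over HTTP and read the id field out of each hit), and sqlite3 stands in for MySQL; only the IN-clause retrieval mirrors the SQL quoted above:

```python
import sqlite3

def search_index(query):
    """Stand-in for a Solr/Lucene search: return matching document IDs.
    Hard-coded here; a real version would query the search server."""
    return [1, 2, 3]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE documents (document_id INTEGER PRIMARY KEY, body TEXT)")
conn.executemany("INSERT INTO documents VALUES (?, ?)",
                 [(i, "document %d" % i) for i in range(1, 6)])

# Step 1: the search engine resolves the query to document IDs.
doc_ids = search_index("tag:cookbook")

# Step 2: fetch the full rows from the relational store with an IN
# clause, one placeholder per ID to keep the query parameterized.
placeholders = ",".join("?" * len(doc_ids))
sql = ("SELECT * FROM documents WHERE document_id IN (%s) "
       "ORDER BY document_id" % placeholders)
rows = conn.execute(sql, doc_ids).fetchall()
print(rows)  # -> [(1, 'document 1'), (2, 'document 2'), (3, 'document 3')]
```

The relational database is reduced to a primary-key fetch - the cheapest query it can run - while all the tag pivoting happens in the index built for it.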

On Tue, Apr 13, 2010 at 3:21 PM, Michael Mol <mikemol at gmail.com> wrote:
> Ok, I'm going to try this again, a bit humbler, and with a bit clearer head.
>
> On Mon, Apr 12, 2010 at 9:24 PM, Adam Tauno Williams
> <awilliam at whitemice.org> wrote:
>>> The simplest way to implement it might be a tag table, or a simple
>>> key-value pair table that can associate every object with any other
>>> object. That seems problematic, though, because that table is going to
>>> get *huge*, and the larger it gets, the more expensive it will be to
>>> query and update it.
>>
>> Define "huge".  I see this argument occasionally, and generally I think
>> it is bogus.  As a wise man said: 'premature optimization is evil'
>> -Knuth.
>
> I'll admit; this got me a little steamed, as not only is it a classic
> and well-known quote, but I felt it completely irrelevant to the point
> of what I was trying to ask. I described my problem, and I've since
> described where I expect it to have scalability problems, given the
> potential exponential increase in relationships based on the various
> ways of considering the source data.
>
> I'm also not trying to look for optimization so much as consider a
> base structure. Once a thing is written, it becomes harder and harder
> to change it. With an RDBMS, a tag table would probably be fine where
> I am currently, but I expect the record count to grow into the
> millions within a year.
>
> I also didn't mention versioning, but I don't think the tag table
> necessarily needs to be versioned; I think that can be managed as part
> of reparsing the example body during a rollback.
>
>>
>> On any, remotely modern, RDBMS you are going to have no problems at all
>> until that table is well into the millions of records.
>>
>> And the issue is not really record count - but value cardinality.  A low
>> cardinality causes indexes trees to be deep and expensive to update, and
>> slower to search.  A higher cardinality - which is common for a
>> linking/tagging example - will scale much more efficiently.
>
> I had difficulty thinking through the cardinality issue last night.
> Let's say I've got 2000 tags in use, 100,000 examples, each example
> averages 30 tags, for a total around 3,000,000 rows. The cardinality
> of the index of the exampleID column would be 100,000, while the
> cardinality of the tagID column would be 2,000, correct? That does
> seem like a comfortably-smaller set of numbers.
>
>>
>> With low cardinality - if you EXPLAIN your queries - you'll frequently
>> discover that your RDBMS' optimizer is [wisely] discarding your
>> carefully [or not] crafted indexes anyway, regardless of table size.
>>
>> In addition to that, every modern RDBMS [which possibly excludes MySQL
>> - but, seriously, who cares?] supports conditional indexes and fragmented
>> tables.   These let you keep the simple link-table design [easily
>> supported in all languages] while having the increased scalability of a
>> more 'chunky' design.
>>
>> For example, in OGo, we have an obj_info table;  which is basically
>> [with some dressing]: link_id INT, source_id INT, target_id INT.  Since
>> source_id and target_id have a naturally high cardinality you can pretty
>> much pour as much linking in there as you want.
>>
>> Worried about index depth or append speed?  Create multiple fragmented
>> indexes:
>>
>> create index obj_link_1m on obj_link(target_id) WHERE target_id <
>> 1000000;
>> create index obj_link_2m on obj_link(target_id) WHERE target_id >=
>> 1000000 AND target_id < 2000000;
>> create index obj_link_3m on obj_link(target_id) WHERE target_id >=
>> 2000000 AND target_id < 3000000;
>>
>> .. or whatever ..
>>
>> Don't worry, the optimizer will choose the correct one.
>>
>> You can do an equivalent [with slightly different syntax] in Informix,
>> DB2, or M$-SQL.   Each also provides transparent ways to fragment the
>> data table itself [though the details of their strategies differ quite a
>> bit - they all work].
>>
>> Unless your RDBMS tables balloon into the hundreds of millions of
>> records most performance problems point to DBA incompetence or just
>> apathy.
>
> I do expect the tag relationship table to grow; the relationships I
> described earlier are nowhere near where I want to stop.
>
>> Know thy tools!
>
> That last line rubbed a little raw, too, considering the ^&*% I've had
> to go through to keep the site working reasonably well in confined
> spaces, in spare minutes on lunch breaks and other obligatory time
> concerns.
>
> --
> :wq
> _______________________________________________
> grlug mailing list
> grlug at grlug.org
> http://shinobu.grlug.org/cgi-bin/mailman/listinfo/grlug
