[GRLUG] A data organization problem related to tags. Any database experts around?

Mon Apr 12 18:26:32 EDT 2010

I'm not just talking about relational databases, and I'm not looking
for "Check out *some database server*", I'm wondering specifically
about the elegance and performance of implementing an unknown solution
to a specific data organization problem I'm dealing with (and that
sites like TVTropes also apparently have).

Let's take TVTropes as a simple, familiar example.

TVTropes has two major axis by which you navigate their data--you have
'tropes' and 'works', between which there's a many-many relationship;
a trope will have many works which use it, a work will have many
tropes, and there may be multiple examples for each trope-work
binding.  TVTropes allows you to go to a trope's page, and you'll see
a listing of the works that use it, often multiple instance examples
for occurrence in each work. You can also go to a work's page and see
the tropes that it uses.

There's a major breakdown in their consistency and normalization,
though; many times, you might click on a work link on a trope's page,
take a look at the work's page, and see that the trope you were just
at isn't listed. Other times, you might be on a work's page, click on
a particular trope listed there, and not be able to find the work
listed on that trope's page. It's even very rare that the same wording
that was used to describe the trope usage instance

It seems obvious to me that the fundamental piece of information in
this entire arrangement is the trope usage instance, a particular
example of a particular trope being used in a particular work.
Ideally, you should be able to look at a trope and see a listed
intersection of that trope with all works.

Now let's step the complexity up a dimension.

Each trope is going be categorized. For example, "Badass Longcoat"
sits with "Nice Hat" in the "The Utterly And Completely Definitive
Guide To Cool" category. "Eject... Eject... Eject..." happens to share
space with "If It Swims It Flies" in the "Vehicl Tropes" category.

There's not a lot of difference between a trope being tied to a work,
and a trope being tied to a category, with the possible constraint
that a trope is only in a category once, where it might be in a work
multiple times. (For the relational database crowd, that means one
could optimize by making the paired key between the trope and category
a unique key in a link table, but you couldn't make it unique between
tropes and works, or tropes and actors, or necessarily even between
tropes and characters.)

The simplest way to implement it might be a tag table, or a simple
key-value pair table that can associate every object with any other
object. That seems problematic, though, because that table is going to
get *huge*, and the larger it gets, the more expensive it will be to
query and update it.

I've been mulling this topic over for months, because I face a similar
problem with Rosetta Code. I have programming langauges, tasks and
libraries that should be browseably cross-linked, and someone recently
asked me to extend the site to support Russian as a language. I'd love
to do that, but I can't because of normalization on RC's fundamental
object--the code example; I don't want the existing matrix of code
examples to have to be copied or manually managed each time another
human language is thrown in the mix. (And it gets trickier still,
because while the source code should be common across languages, the
surrounding descriptive text should *not* be--but it would be helpful
to fall back to a secondary language in the case a primary language
isn't available.)

The Question:
Using the different database systems, what would an implementation of
a data structure given the described problems look like, tuned either
for performance or elegance?

-- 
:wq