[GRLUG] A data organization problem related to tags. Any database experts around?

Michael Mol mikemol at gmail.com
Mon Apr 12 23:14:23 EDT 2010


On Mon, Apr 12, 2010 at 10:44 PM, Michael Mol <mikemol at gmail.com> wrote:
> On Mon, Apr 12, 2010 at 9:24 PM, Adam Tauno Williams
> <awilliam at whitemice.org> wrote:
>>> The simplest way to implement it might be a tag table, or a simple
>>> key-value pair table that can associate every object with any other
>>> object. That seems problematic, though, because that table is going to
>>> get *huge*, and the larger it gets, the more expensive it will be to
>>> query and update it.
>>
>> Define "huge".  I see this argument occasionally, and generally I think
>> it is bogus.  As a wise man said: 'premature optimization is evil'
>> -Knuth.
>
> For starters, I'm looking for orthogonal access to code examples. I
> currently can see all languages' examples that solve a task.
> ("Chrestomathy style") I want to also be able to pull up a page on a
> language, and see all examples of that language on the site.
> ("Cookbook style")
>
> Let's start with two axes.
>
> The site currently has 391 tasks and 249 programming languages.
> Assuming a single code example for each task and programming language,
> at 100% completion, that's 114,954 code examples for 2*(114954)
> example-tag rows. We will never (and can never) reach 100% completion,
> so that number goes down a bit, but we often see alternate approaches,
> so that number also goes up. 100% completion is actually pretty
> extreme, so I went through and counted the tasks implemented by the
> most popular 150 or so languages. That's inherently low, but it'll
> help: 15,131 examples, for 2*(15,131) example-tag rows. (sum: 30262)
>
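
For concreteness, here is a minimal sketch of the "tag table" idea with
just the two axes above, using Python's built-in sqlite3 module. The
table, column, and function names (example, example_tag, add_example,
examples_tagged) are made up for illustration and are not the site's
actual schema. Each example gets one row per axis, which is where the
2 * (number of examples) row count comes from.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE example (id INTEGER PRIMARY KEY, body TEXT NOT NULL);
    CREATE TABLE example_tag (
        example_id INTEGER NOT NULL REFERENCES example(id),
        kind TEXT NOT NULL,   -- 'task' or 'language' (hypothetical axis names)
        name TEXT NOT NULL,
        PRIMARY KEY (example_id, kind, name)
    );
    -- Lets a lookup by tag avoid scanning the whole table.
    CREATE INDEX example_tag_by_name ON example_tag (kind, name);
""")

def add_example(body, task, language):
    """Store one example plus its two tag rows (one per axis)."""
    cur = conn.execute("INSERT INTO example (body) VALUES (?)", (body,))
    rows = [(cur.lastrowid, "task", task),
            (cur.lastrowid, "language", language)]
    conn.executemany("INSERT INTO example_tag VALUES (?, ?, ?)", rows)
    return cur.lastrowid

def examples_tagged(kind, name):
    """Return example bodies carrying the given tag, in insertion order."""
    sql = """SELECT e.body FROM example e
             JOIN example_tag t ON t.example_id = e.id
             WHERE t.kind = ? AND t.name = ? ORDER BY e.id"""
    return [row[0] for row in conn.execute(sql, (kind, name))]

add_example("print('Hello')", "Hello World", "Python")
add_example("puts 'Hello'", "Hello World", "Ruby")
add_example("print(sorted(xs))", "Sort a list", "Python")

# "Chrestomathy style": every language's example for one task.
print(examples_tagged("task", "Hello World"))
# "Cookbook style": every example written in one language.
print(examples_tagged("language", "Python"))
# Two tag rows per example: 3 examples -> 6 rows.
print(conn.execute("SELECT COUNT(*) FROM example_tag").fetchone()[0])
```

Both views are the same query with a different (kind, name) pair, so the
index on those two columns serves either direction.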

[snip]

> There are around 70 additional categories relating to programming
> paradigms and language features that I want to get associated with
> code examples, rather than generally with the tasks and languages, and
> they're not generally inheritable by task or language presence, so
> that's easily 70*(15131) additional example-tag counts, assuming an
> example only gets associated with one such category (which is
> lowballing it significantly). (sum: 1,165,087)

Er, I accidentally highballed that one very, very badly. Not every
example would have an association with every paradigm and task
category. More likely, with the current set of those two classes of
categories, it would work out to around 10 associations per example,
so the sum would come up to 257,227.

Extrapolating based on the health and lang/task ratios would bring it
to about 3.78 million rows, not over 17 million.
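
A quick sketch of the per-example arithmetic, in case anyone wants to
plug in their own numbers. It only covers the row counts that can be
derived from the 15,131 counted examples; the running sums also fold in
figures from the snipped part of the thread, so they are not reproduced
here. The function name tag_rows is just for illustration.

```python
EXAMPLES = 15_131  # examples counted for the ~150 most popular languages

def tag_rows(tags_per_example, examples=EXAMPLES):
    """Rows in an example-tag table if every example carries this many tags."""
    return tags_per_example * examples

task_and_language = tag_rows(2)   # one task tag + one language tag each
all_70_categories = tag_rows(70)  # the overestimate: every category on every example
ten_per_example = tag_rows(10)    # the revised guess: about 10 associations each

print(f"{task_and_language:,}")   # 30,262
print(f"{all_70_categories:,}")   # 1,059,170
print(f"{ten_per_example:,}")     # 151,310
```

Going from 70 associations per example down to about 10 shrinks that
component by a factor of seven.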

-- 
:wq

