[GRLUG] A data organization problem related to tags. Any database experts around?

Michael Mol mikemol at gmail.com
Mon Apr 12 22:44:02 EDT 2010


On Mon, Apr 12, 2010 at 9:24 PM, Adam Tauno Williams
<awilliam at whitemice.org> wrote:
>> The simplest way to implement it might be a tag table, or a simple
>> key-value pair table that can associate every object with any other
>> object. That seems problematic, though, because that table is going to
>> get *huge*, and the larger it gets, the more expensive it will be to
>> query and update it.
>
> Define "huge".  I see this argument occasionally, and generally I think
> it is bogus.  As a wise man said: 'premature optimization is evil'
> -Knuth.

For starters, I'm looking for orthogonal access to code examples. I
currently can see all language's examples that solve a task.
("Chrestomathy style") I want to also be able to pull up a page on a
language, and see all examples of that language on the site.
("Cookbook style")

Let's start with two axes.

The site currently has 391 tasks and 249 programming languages.
Assuming a single code example for each task and programming language,
at 100% completion, that's 114,954 code examples for 2*(114954)
example-tag rows. We will never (and can never) reach 100% completion,
so that number goes down a bit, but we often see alternate approaches,
so that number also goes up. 100% completion is actually pretty
extreme, so I went through and counted the tasks implemented by the
most popular 150 or so languages. That's inherently low, but it'll
help: 15,131 examples, for 2*(15,131) example-tag rows. (sum: 30262)

Let's throw in another axis, libraries. The same way I can look at all
the languages' implementation of a particular task, I want to be able
cross-reference libraries against languages and libraries against
tasks. For any language which has a standard library, that library
would typically be applied in code examples for non-syntactic tasks,
so that's an additional round of example-tag rows. Then there are the
examples which pull in GDI, xlib, GMP, pthreads, Win32 or some other
common library. (Not adding this to the sum; I don't have a good
estimate of where we sit or where we're going, but it's a direction I
want to expand in.)

There's another big one lurking, too. I want to index code examples by
keyword usage (Conveniently, the syntax highlighter puts keywords in
their own <span></span> blocks, and I want to abuse that for some
automated smart indexing). Assuming all languages have have syntax
highlighting support (they don't, currently, but *that's* certainly
doable), that's an additional set of tag rows, I would estimate a low
average of around 6 keywords per example, adding 15*(15,131) tag-row
counts. (sum: 24096)

There are around 70 additional categories relating to programming
paradigms and language features that I want to get associated with
code examples, rather than generally with the tasks and languages, and
they're not generally inheritable by task or language presence, so
that's easily 70*(15131) additional example-tag counts, assuming an
example only gets associated with one such category (which is
lowballing it significantly). (sum: 1,165,087)

So, starting with current data, lowballing it at every step, I come
out to 1,165,087 rows.

>
> On any, remotely modern, RDBMS you are going to have no problems at all
> until that table is well into the millions of records.

If I look at my ideal example count as 114954 examples, and my current
example count as 15131 examples, that puts me at 13% implementation,
which is nowhere near where I would like to be, and a ratio of 1.57:1
tasks/language.  Assume I stay at that 13%, and see the site grow by
200 more tasks (and, implicitly, 127 more languages), and that 15,131
example count grows to 222,216 examples. The sum, based on the above
estimates, comes to 17,110,632 rows. I'm expecting that kind of
growth.

>
> And the issue is not really record count - but value cardinality.  A low
> cardinality causes indexes trees to be deep and expensive to update, and
> slower to search.  A higher cardinality - which is common for a
> linking/tagging example - will scale much more efficiently.
>
> With low cardinality - if you EXPLAIN your queries - you'll frequently
> discover that your RDBMS' optimizer is [wisely] discarding your
> carefully [or not] crafted indexes anyway, regardless of table size.
>
> In addition to that every modern [which possibly excludes MySQL - but,
> seriously, who cares?] supports conditional indexes and fragmented
> tables.   These let you keep the simple link-table design [easily
> supported in all languages] while having the increased scalability of a
> more 'chunky' design.
>
> For example, in OGo, we have an obj_info table;  which is basically
> [with some dressing]: link_id INT, source_id INT, target_id INT.  Since
> source_id and target_id have a naturally high cardinality you can pretty
> much pour as much linking in there as you want.
>
> Worried about index depth or append speed?  Create multiple fragmented
> indexes:
>
> create index obj_link_1m on obj_link(target_id) WHERE target_id >
> 1000000 AND target_id < 2000000;
> create index obj_link_2m on obj_link(target_id) WHERE target_id >
> 1000000 AND target_id < 2000000;
> create index obj_link_3m on obj_link(target_id) WHERE target_id >
> 2000000 AND target_id < 3000000;
>
> .. or whatever ..
>
> Don't worry, the optimizer will choose the correct one.
>
> You can do an equivalent [with slightly different syntax] in Informix,
> DB2, or M$-SQL.   Each also provides transparent ways to fragment to
> data table itself [though the details of their strategies differ quite a
> bit - they all work].
>
> Unless your RDBMS tables balloon into the hundreds of millions of
> records most performance problems point to DBA incompetence or just
> apathy.  Know thy tools!

You'll have to estimate the cardinality of the described data; I can't
calculate it. I already know I don't know enough about SQL to do the
thing efficiently, hence the original post. (Note I didn't exclude
relational databases)

-- 
:wq


More information about the grlug mailing list