[GRLUG] A data organization problem related to tags. Any database experts around?

Matthew Seeley matthew at threadlight.com
Mon Apr 12 23:10:18 EDT 2010


For the scaling out / performance issue -- Adam's absolutely right.
(Or at least, my experience matches his.)

MySQL on our shared production box at work (an older 2006-era server)
easily breezes through over 500 megabytes of data (one table has over
300,000 rows, with thirty columns or so).

And I'm the incompetent "DBA", so I know we definitely don't do any
kind of special indexing or the like -- and it's still very, very snappy.

Until you actually hit some kind of performance issue, I wouldn't
worry about MySQL performance. (Or the performance of whatever RDBMS
you use).

And if you do hit a performance issue, it's much more likely to be
your query, rather than the server itself.
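
A quick way to check, for what it's worth, is to EXPLAIN the suspect
query. (Table and column names below are made up for illustration.)

  EXPLAIN SELECT example_id FROM example_tag WHERE tag_id = 42;
  -- In MySQL, if the "key" column of the output is NULL, the query
  -- is doing a full table scan -- which is usually fixed by an index
  -- or a rewritten query, not by faster hardware.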


-- Matthew Seeley

On Mon, Apr 12, 2010 at 10:44 PM, Michael Mol <mikemol at gmail.com> wrote:
> On Mon, Apr 12, 2010 at 9:24 PM, Adam Tauno Williams
> <awilliam at whitemice.org> wrote:
>>> The simplest way to implement it might be a tag table, or a simple
>>> key-value pair table that can associate every object with any other
>>> object. That seems problematic, though, because that table is going to
>>> get *huge*, and the larger it gets, the more expensive it will be to
>>> query and update it.
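>>>
>>> (A minimal sketch of such a tag table, with made-up names:)
>>>
>>>   CREATE TABLE example_tag (
>>>       example_id INT NOT NULL,  -- the tagged code example
>>>       tag_id     INT NOT NULL,  -- the tag: a task, a language, etc.
>>>       PRIMARY KEY (example_id, tag_id)
>>>   );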
>>
>> Define "huge".  I see this argument occasionally, and generally I think
>> it is bogus.  As a wise man said: 'premature optimization is evil'
>> -Knuth.
>
> For starters, I'm looking for orthogonal access to code examples. I
> can currently see all languages' examples that solve a given task
> ("chrestomathy style"). I want to also be able to pull up a page on a
> language and see all examples in that language on the site ("cookbook
> style").
>
> Let's start with two axes.
>
> The site currently has 391 tasks and 249 programming languages.
> Assuming a single code example for each task and programming language,
> at 100% completion, that's 114,954 code examples, for 2*(114,954)
> example-tag rows (one task tag and one language tag per example). We
> will never (and can never) reach 100% completion, so that number goes
> down a bit, but we often see alternate approaches, so that number also
> goes up. 100% completion is actually pretty extreme, so I went through
> and counted the tasks implemented by the most popular 150 or so
> languages. That's inherently low, but it'll help: 15,131 examples, for
> 2*(15,131) example-tag rows. (sum: 30,262)
>
> Let's throw in another axis: libraries. The same way I can look at
> all the languages' implementations of a particular task, I want to be
> able to cross-reference libraries against languages and libraries
> against tasks. For any language which has a standard library, that
> library would typically be applied in code examples for non-syntactic
> tasks, so that's an additional round of example-tag rows. Then there
> are the examples which pull in GDI, xlib, GMP, pthreads, Win32 or
> some other common library. (Not adding this to the sum; I don't have
> a good estimate of where we sit or where we're going, but it's a
> direction I want to expand in.)
>
> There's another big one lurking, too. I want to index code examples
> by keyword usage. (Conveniently, the syntax highlighter puts keywords
> in their own <span></span> blocks, and I want to abuse that for some
> automated smart indexing.) Assuming all languages have syntax
> highlighting support (they don't, currently, but *that's* certainly
> doable), that's an additional set of tag rows. I would estimate a low
> average of around 5 keywords per example, adding 5*(15,131) = 75,655
> tag rows. (sum: 105,917)
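>
> (Per example, that keyword tagging would boil down to rows like this
> -- names made up for illustration:)
>
>   INSERT INTO example_tag (example_id, tag_id)
>   SELECT 12345, tag_id      -- 12345 = some example's id
>     FROM tag                -- hypothetical lookup table of tag names
>    WHERE name IN ('def', 'return', 'if', 'for', 'print');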
>
> There are around 70 additional categories relating to programming
> paradigms and language features that I want to get associated with
> code examples, rather than generally with the tasks and languages,
> and they're not generally inheritable by task or language presence,
> so that's easily 70*(15,131) = 1,059,170 additional example-tag rows,
> assuming an example gets associated with each such category at most
> once (which is lowballing it significantly). (sum: 1,165,087)
>
> So, starting with current data, lowballing it at every step, I come
> out to 1,165,087 rows.
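>
> (Spelled out, that's just per-example tag rows times example count:
> (2 + 5 + 70) * 15,131 = 77 * 15,131 = 1,165,087.)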
>
>>
>> On any remotely modern RDBMS you are going to have no problems at
>> all until that table is well into the millions of records.
>
> If I look at my ideal example count as 114,954 examples, and my
> current example count as 15,131 examples, that puts me at 13%
> implementation, which is nowhere near where I would like to be, and a
> ratio of 1.57:1 tasks/language. Assume the site grows by 200 more
> tasks (and, implicitly, 127 more languages, keeping that ratio); the
> ideal example count grows to 591*376 = 222,216 examples, and the sum,
> based on the above estimates, comes to 77*(222,216) = 17,110,632 rows
> at full completion. I'm expecting that kind of growth.
>
>>
>> And the issue is not really record count, but value cardinality.  A
>> low cardinality causes index trees to be deep and expensive to
>> update, and slower to search.  A higher cardinality - which is
>> common for a linking/tagging example - will scale much more
>> efficiently.
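>>
>> (Cardinality here means the number of distinct values in a column.
>> Using the hypothetical example_tag table from above, a quick way to
>> eyeball it:)
>>
>>   SELECT COUNT(DISTINCT tag_id), COUNT(*) FROM example_tag;
>>   -- the closer the two counts, the more selective an index
>>   -- on tag_id will be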
>>
>> With low cardinality - if you EXPLAIN your queries - you'll frequently
>> discover that your RDBMS' optimizer is [wisely] discarding your
>> carefully [or not] crafted indexes anyway, regardless of table size.
>>
>> In addition to that, every modern RDBMS [which possibly excludes
>> MySQL - but, seriously, who cares?] supports conditional indexes and
>> fragmented tables.   These let you keep the simple link-table design
>> [easily supported in all languages] while having the increased
>> scalability of a more 'chunky' design.
>>
>> For example, in OGo, we have an obj_link table, which is basically
>> [with some dressing]: link_id INT, source_id INT, target_id INT.
>> Since source_id and target_id have a naturally high cardinality, you
>> can pretty much pour as much linking in there as you want.
>>
>> Worried about index depth or append speed?  Create multiple fragmented
>> indexes:
>>
>> create index obj_link_1m on obj_link(target_id)
>>   WHERE target_id < 1000000;
>> create index obj_link_2m on obj_link(target_id)
>>   WHERE target_id >= 1000000 AND target_id < 2000000;
>> create index obj_link_3m on obj_link(target_id)
>>   WHERE target_id >= 2000000 AND target_id < 3000000;
>>
>> .. or whatever ..
>>
>> Don't worry, the optimizer will choose the correct one.
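>>
>> (In PostgreSQL, for instance, you can verify that with EXPLAIN -- a
>> point lookup in the 2-3M range should show obj_link_3m in the plan:)
>>
>>   EXPLAIN SELECT source_id FROM obj_link WHERE target_id = 2500123;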
>>
>> You can do an equivalent [with slightly different syntax] in
>> Informix, DB2, or M$-SQL.   Each also provides transparent ways to
>> fragment the data table itself [though the details of their
>> strategies differ quite a bit - they all work].
>>
>> Unless your RDBMS tables balloon into the hundreds of millions of
>> records, most performance problems point to DBA incompetence or just
>> apathy.  Know thy tools!
>
> You'll have to estimate the cardinality of the described data; I
> can't calculate it. I already know I don't know enough about SQL to
> do the thing efficiently, hence the original post. (Note I didn't
> exclude relational databases.)
>
> --
> :wq
> _______________________________________________
> grlug mailing list
> grlug at grlug.org
> http://shinobu.grlug.org/cgi-bin/mailman/listinfo/grlug

