[GRLUG] A data organization problem related to tags. Any database experts around?

Michael Mol mikemol at gmail.com
Tue Apr 13 00:02:02 EDT 2010


I *frequently* hit performance issues with MediaWiki. Every six
months, a record surge of traffic spurs me to get my hands on a
larger VPS, even after I've gone through and optimized parameters for
the InnoDB and MyISAM tables. I mix and match the two engines to meet
the performance and reliability requirements I have; by this point,
I'm very much accustomed to running on virtual hardware below what
would be 'recommended' for my load. I think I might be good through
the end of the year this time; my move from Linode to prgmr.com gave
me a 4x RAM-per-dollar improvement, so I'm sitting on a ridiculous
amount of caching capacity, which I've put to work with memcached,
Apache's prefork MPM, PHP's XCache and the kernel file cache.
Still, this, too, shall pass. I'm already seeing slow-period
15-minute load averages of 0.11, and more I/O wait than I'm
comfortable with; I need to get in and tweak InnoDB's settings again.
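
(Concretely, the kind of check I mean before a round of tuning is
nothing exotic -- just MySQL's standard status counters. A rough
sketch; the counter and variable names are standard MySQL, the
thresholds are only rules of thumb:)

-- How often InnoDB misses the buffer pool and has to go to disk:
SHOW GLOBAL STATUS LIKE 'Innodb_buffer_pool_read%';
-- If Innodb_buffer_pool_reads is more than a fraction of a percent of
-- Innodb_buffer_pool_read_requests, innodb_buffer_pool_size is too
-- small for the working set.

-- Non-zero log waits suggest bumping innodb_log_buffer_size:
SHOW GLOBAL STATUS LIKE 'Innodb_log_waits';

-- Current settings, to compare against what's actually in my.cnf:
SHOW GLOBAL VARIABLES LIKE 'innodb%';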

That's without MediaWiki even being able to represent the data the
way I want to represent and browse it. MediaWiki doesn't arrange its
schema the way I described in my original post; as far as data
organization goes, it has a pages table and a categories table, and I
can't hammer it into showing things the way I want without contorting
it into something very, very weird.

As far as database size goes, the InnoDB file is 1.6GB, and the
MyISAM portion of the wiki is 28MB. The raw SQL dump sits at 1.1GB
and 620,489 rows.

I'm looking hard at managing the design and creation of a new wiki
software package from the ground up, approaching it from the
direction of expanded comparison and side-by-side cross-referencing.
That's why I'm paying close attention to how the content of my site
is (or should be) structured, as well as to other sites with similar
data-structure properties.

My database has *always* been my bottleneck, and, knowing this and
knowing the nature of my content, I'm looking for the most efficient
way I can represent it.
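
To make the shape of the problem concrete, here is a minimal sketch
of the plain link/tag design discussed in the thread below; table and
column names are illustrative only, not a schema I've settled on. The
two queries at the end are the "chrestomathy" and "cookbook" views I
describe further down.

-- One row per code example; one row per thing an example can be
-- tagged with (task, language, library, keyword, paradigm, ...).
CREATE TABLE example (
  example_id INT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY,
  body       MEDIUMTEXT NOT NULL
) ENGINE=InnoDB;

CREATE TABLE tag (
  tag_id INT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY,
  kind   ENUM('task','language','library','keyword','paradigm') NOT NULL,
  name   VARCHAR(255) NOT NULL,
  UNIQUE KEY (kind, name)
) ENGINE=InnoDB;

-- The table everyone worries about getting huge:
CREATE TABLE example_tag (
  example_id INT UNSIGNED NOT NULL,
  tag_id     INT UNSIGNED NOT NULL,
  PRIMARY KEY (example_id, tag_id),
  KEY (tag_id)
) ENGINE=InnoDB;

-- Chrestomathy view: every language's take on one task.
SELECT e.example_id
  FROM example_tag et
  JOIN tag t     ON t.tag_id = et.tag_id
  JOIN example e ON e.example_id = et.example_id
 WHERE t.kind = 'task' AND t.name = 'Quicksort';

-- Cookbook view: everything the site has for one language.
SELECT e.example_id
  FROM example_tag et
  JOIN tag t     ON t.tag_id = et.tag_id
  JOIN example e ON e.example_id = et.example_id
 WHERE t.kind = 'language' AND t.name = 'C';

Both queries walk the secondary index on tag_id; the composite
primary key covers the reverse direction (all tags for a given
example).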

On Mon, Apr 12, 2010 at 11:10 PM, Matthew Seeley
<matthew at threadlight.com> wrote:
> For the scaling out / performance issue -- Adam's absolutely right.
> (Or at least, my experience matches his.)
>
> MySQL on our shared production box at work (an older 2006-era server)
> easily breezes through over 500 megabytes' worth of data (one table
> has over 300,000 rows, with thirty columns or so on it).
>
> And I'm the incompetent "DBA", so I know we definitely don't do any
> kind of special indexing -- and it's still very, very snappy.
>
> Until you actually hit some kind of performance issue, I wouldn't
> worry about MySQL performance. (Or the performance of whatever RDBMS
> you use).
>
> And if you do hit a performance issue, it's much more likely to be
> your query, rather than the server itself.
>
>
> -- Matthew Seeley
>
> On Mon, Apr 12, 2010 at 10:44 PM, Michael Mol <mikemol at gmail.com> wrote:
>> On Mon, Apr 12, 2010 at 9:24 PM, Adam Tauno Williams
>> <awilliam at whitemice.org> wrote:
>>>> The simplest way to implement it might be a tag table, or a simple
>>>> key-value pair table that can associate every object with any other
>>>> object. That seems problematic, though, because that table is going to
>>>> get *huge*, and the larger it gets, the more expensive it will be to
>>>> query and update it.
>>>
>>> Define "huge".  I see this argument occasionally, and generally I think
>>> it is bogus.  As a wise man said: 'premature optimization is the
>>> root of all evil' -Knuth.
>>
>> For starters, I'm looking for orthogonal access to code examples. I
>> can currently see every language's examples that solve a given task.
>> ("Chrestomathy style") I want to also be able to pull up a page on a
>> language, and see all of that language's examples on the site.
>> ("Cookbook style")
>>
>> Let's start with two axes.
>>
>> The site currently has 391 tasks and 249 programming languages.
>> Assuming a single code example for each task/language pair, at 100%
>> completion, that's 114,954 code examples, for 2*(114,954)
>> example-tag rows. We will never (and can never) reach 100%
>> completion, so that number goes down a bit, but we often see
>> alternate approaches, so it also goes up. 100% completion is
>> actually pretty extreme, so I went through and counted the tasks
>> implemented by the most popular 150 or so languages. That count is
>> inherently low, but it's a useful floor: 15,131 examples, for
>> 2*(15,131) example-tag rows. (sum: 30,262)
>>
>> Let's throw in another axis: libraries. The same way I can look at
>> all the languages' implementations of a particular task, I want to
>> be able to cross-reference libraries against languages and libraries
>> against
>> tasks. For any language which has a standard library, that library
>> would typically be applied in code examples for non-syntactic tasks,
>> so that's an additional round of example-tag rows. Then there are the
>> examples which pull in GDI, xlib, GMP, pthreads, Win32 or some other
>> common library. (Not adding this to the sum; I don't have a good
>> estimate of where we sit or where we're going, but it's a direction I
>> want to expand in.)
>>
>> There's another big one lurking, too. I want to index code examples
>> by keyword usage (conveniently, the syntax highlighter puts keywords
>> in their own <span></span> blocks, and I want to abuse that for some
>> automated smart indexing). Assuming all languages have syntax
>> highlighting support (they don't, currently, but *that's* certainly
>> doable), that's an additional set of tag rows; I'd estimate a low
>> average of around 6 keywords per example -- call it a conservative
>> 5*(15,131) additional tag rows. (sum: 105,917)
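
(Sketching what I mean against the illustrative schema above: the
keyword just becomes one more tag kind, harvested from the
highlighter's <span> output. 'Haskell' and 'foldr' are placeholders,
not real site data.)

-- All examples in a given language that use a given keyword:
SELECT et_lang.example_id
  FROM example_tag et_lang
  JOIN tag lang          ON lang.tag_id = et_lang.tag_id
  JOIN example_tag et_kw ON et_kw.example_id = et_lang.example_id
  JOIN tag kw            ON kw.tag_id = et_kw.tag_id
 WHERE lang.kind = 'language' AND lang.name = 'Haskell'
   AND kw.kind   = 'keyword'  AND kw.name   = 'foldr';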
>>
>> There are around 70 additional categories relating to programming
>> paradigms and language features that I want associated with code
>> examples, rather than generally with the tasks and languages, and
>> they're not generally inheritable from the task or the language, so
>> that's easily 70*(15,131) additional example-tag rows, assuming an
>> example only gets associated with one such category (which is
>> lowballing it significantly). (sum: 1,165,087)
>>
>> So, starting with current data, lowballing it at every step, I come
>> out to 1,165,087 rows.
>>
>>>
>>> On any remotely modern RDBMS you are going to have no problems at all
>>> until that table is well into the millions of records.
>>
>> If I look at my ideal example count as 114,954 examples, and my
>> current example count as 15,131 examples, that puts me at 13%
>> implementation, which is nowhere near where I would like to be, and
>> a ratio of 1.57:1 tasks/language.  Now assume the site grows by 200
>> more tasks (and, implicitly, 127 more languages): the ideal example
>> count grows to 222,216, and the sum, based on the above estimates,
>> comes to 17,110,632 rows. Even at today's 13% completion, that is
>> still millions of tag rows, and I'm expecting that kind of growth.
>>
>>>
>>> And the issue is not really record count - but value cardinality.  A low
>>> cardinality causes index trees to be deep and expensive to update, and
>>> slower to search.  A higher cardinality - which is common for a
>>> linking/tagging example - will scale much more efficiently.
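
(In other words, the ratio of distinct values to total rows. A quick,
generic way to eyeball that on a link table like the one described
below:)

SELECT COUNT(*)                  AS total_rows,
       COUNT(DISTINCT target_id) AS distinct_targets
  FROM obj_link;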
>>>
>>> With low cardinality - if you EXPLAIN your queries - you'll frequently
>>> discover that your RDBMS' optimizer is [wisely] discarding your
>>> carefully [or not] crafted indexes anyway, regardless of table size.
>>>
>>> In addition to that, every modern RDBMS [which possibly excludes
>>> MySQL - but, seriously, who cares?] supports conditional indexes and
>>> fragmented tables.   These let you keep the simple link-table design
>>> [easily supported in all languages] while having the increased
>>> scalability of a more 'chunky' design.
>>>
>>> For example, in OGo, we have an obj_link table, which is basically
>>> [with some dressing]: link_id INT, source_id INT, target_id INT.
>>> Since source_id and target_id have a naturally high cardinality, you
>>> can pretty much pour as much linking in there as you want.
>>>
>>> Worried about index depth or append speed?  Create multiple fragmented
>>> indexes:
>>>
>>> create index obj_link_1m on obj_link(target_id) WHERE target_id <
>>> 1000000;
>>> create index obj_link_2m on obj_link(target_id) WHERE target_id >=
>>> 1000000 AND target_id < 2000000;
>>> create index obj_link_3m on obj_link(target_id) WHERE target_id >=
>>> 2000000 AND target_id < 3000000;
>>>
>>> .. or whatever ..
>>>
>>> Don't worry, the optimizer will choose the correct one.
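
(And, at least on PostgreSQL, where the partial-index syntax above is
valid, that choice is easy to sanity-check. A sketch; the constant is
arbitrary:)

EXPLAIN SELECT * FROM obj_link WHERE target_id = 2345678;
-- Assuming the table is large enough that an index scan wins, the
-- plan should show obj_link_3m being used, since 2345678 falls inside
-- that index's WHERE range.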
>>>
>>> You can do an equivalent [with slightly different syntax] in Informix,
>>> DB2, or M$-SQL.   Each also provides transparent ways to fragment the
>>> data table itself [though the details of their strategies differ quite
>>> a bit - they all work].
>>>
>>> Unless your RDBMS tables balloon into the hundreds of millions of
>>> records, most performance problems point to DBA incompetence or just
>>> apathy.  Know thy tools!
>>
>> You'll have to estimate the cardinality of the described data; I
>> can't calculate it. I already know I don't know enough about SQL to
>> do this efficiently, hence the original post. (Note that I didn't
>> exclude relational databases.)
>>
>> --
>> :wq



-- 
:wq

