mikeash.com: just this guy, you know?

Posted at 2007-11-26 01:08 | RSS feed (Full text feed) | Blog Index
Next article: Leopard: First Impressions
Previous article: Perform Better With Garbage Collection
Tags: c++ casestudy performance stl
Algorithmic Optimizations: a Case Study
by Mike Ash  

Those who know me from a programming standpoint know that I am a big opponent of needless optimization. But sometimes optimization is necessary, and when that comes I'm a big proponent of examining algorithms over twiddling low-level code. I recently had a good opportunity to perform algorithmic optimizations in a somewhat unconventional scenario, and this post will describe what I did.

I've been working on a little project that needed a dictionary file format. This file format was intended to be something that could be memory mapped and searched efficiently without having to access most of the file for any individual search. In order to allow the rest of the file to store constant-size records and make it easier to work with, I moved all of the strings into a separate string table which is stored at the end of the file.

In order to reduce storage requirements, this string table is treated as a big blob, and strings are indicated using a pointer and a length, so that storage can be shared between longer strings and their substrings. Here's an example:

        "Renew my books."    "book"
        offset 42 length 15  offset 51 length 4
                     |        |
                     v        v
    ...a little lamb.Renew my books.Frog blast the...
                       ^   ^
                       |   |
                       |  "my"
                       |  offset 48 length 2
                    offset 44 length 3

These offset/length pairs are stored throughout the file to indicate strings. When a string is loaded, the offset and length are used to load it out of the string table. This arrangement is very simple for the reader.

It's a lot more complicated for the string table generator, though. It's easy to make a naive generator which just writes the strings one-by-one and generates the offsets for each one, but it's much harder to write a smart generator which makes good use of shared storage for substrings.

The basic algorithm for the generator which I decided to use was this:

        get offset of blob in str
        if offset does not exist
            offset is current length of blob
            append str to end of blob

By sorting the strings by length from largest to smallest, it's guaranteed that any string which is a substring of another string will be able to share storage. Storage can also be shared across string boundaries but it's less likely to happen, so I didn't put any effort into optimizing the arrangement of the blob for that.

This algorithm is pretty straightforward, but the "get offset of blob in str" line is the killer. The obvious technique would be to use a function like strstr() or, since this was sometimes using non-ASCII (UTF-16) data, memmem().

Of course one problem is that there is no memmem() function in the standard C library. Not to fear, I'll just write my own:

static const void *memmem( const void *big, size_t biglen,
                           const void *small, size_t smalllen )
	if( smalllen > biglen )
		return NULL;
	size_t offset = 0;
	while( offset <= biglen - smalllen )
		if( memcmp( big + offset, small, smalllen ) == 0 )
			return big + offset;
	return NULL;

Of course some of the more perceptive readers out there might notice that the performance of this function totally blows. In fact it was too slow to build a dictionary of any reasonable size in any reasonable amount of time, so I started looking around for better ways.

The first better way I tried was using the C++ STL's search algorithm. When given a char * it works almost exactly like memmem(). Performs almost exactly like it too. It's too general to be able to use a better algorithm so the only performance gain came from being able to inline all those pesky memcmp() calls. I went searching further.

(I can hear you screaming back there. Yes, you in the last row. The one constantly yelling, "Use Boyer-Moore!" I heard you the first time you said it. Just sit down and be patient, I'm getting there.)

After asking around a little, I found out that the standard algorithm for these searches is called the Boyer-Moore string search algorithm. This algorithm uses some ridiculously clever ideas to skip over large portions of the string being searched so that the running time doesn't get worse the longer the smaller string is. In fact it tends to perform better the longer it is. I even found a very nice implementation of this algorithm under very generous terms of use, so I grabbed it and used it in place of my memmem() function.

And speed was attained! The new algorithm made it practical to build larger dictionaries. I started working with a data set that produced about 6MB of output and it took around ten minutes to build. It was slightly annoying when I discovered a bug and had to rebuild it, but it was entirely doable.

A little later I started working with a data set that produced about 24MB of output and its running time returned to being essentially impossible. It took over an hour and a half just to build the string table, and a couple of minutes more to parse the file and build the rest of the output, and I kept discovering bugs in the parser that only showed up after I used it to build the file, so the turnaround time was getting to be somewhat insane. I decided to look into more improvements.

After some cogitating I hit upon an idea to trade memory for speed. I would build a gigantic hash table containing the offsets of every string of a certain constant length. Then when performing the search, I could examine only the offsets in the hash table, and nothing more. If there were no offsets, or if none of the offsets contained a match, then the string didn't exist in the table.

I finally settled on using a three-byte string prefix and simply gluing those three bytes together to produce a 24-bit value. I then created the offsets table to index using the prefix directly, meaning it contained 16.8 million entries. Each entry would be a pointer to a list of offsets, so this table of pointers would be 64MB, plus the space taken up by the offsets. There is approximately one offset for each byte in the string table (minus two, since it can't go all the way to the end) and each offset is four bytes, so this carries an additional four-to-one overhead over the size of the string table. With this data set the total overhead would be something around 150MB, an entirely reasonable tradeoff for smoething that will never run on a user's computer.

To better illustrate, here is a diagram of what the offsets table looks like in mid-run:

    |"aaa"|-->(532, 557, 1989)
    |"bon"|-->(3, 499, 2012)
    |"boo"|-->(12, 51, 732, 888)

Each three-character entry in the offsets table is a pointer to a list of offsets. Any entry which doesn't exist in the string table is a NULL pointer.

When adding the string "book" to the table, the "boo" entry is consulted. Offset 12 starts with "boolean logic" and is rejected. Next, offset 51 is examined. It starts with "book" and so 51 is used as the offset for this string.

When adding the string "bonus" to the table, the "bon" list is consulted. Offset 3 starts with "bon is" and is rejected. Offsets 499 and 2012 likewise do not contain "bonus", so the string "bonus" is appended to the end. This adds five new three-character prefixes to the string table, three in "bonus" itself and two for the last two characters in the string table, and these new prefixes and their offsets are added to the offsets table.

As you can see, this search strategy considerably reduces the amount of text which needs to be searched.

One last question is what to do with strings under three characters. One possibility would be to compute all the three-character prefixes which contain the string in question, and then search all of those offsets. Another possibility would be to pad with zero bytes and generate three prefixes for each offset. But I chose to keep things simple and just fall back to Boyer-Moore for this case, as such small strings are not all that common in my data. (Of course Boyer-Moore's speed advantage is basically useless for such small strings, but I already have the code handy, so why not?)

And what were the results? The small data set now generates its string table in about ten seconds. The large data set which used to take 100 minutes to generate the string table now takes about 2.5 minutes, for a speedup of roughly a factor of 40. I'm sure that even an insanely micro-optimized and vectorized Boyer-Moore would have a tough time being anywhere near 40 times faster than the original, once again showing how algorithmic improvements can totally beat the pants off more obvious and "traditional" micro-optimization.

Performance analysis shows that the code is spending most of its time in memcmp(), so additional speedups could potentially be gained from inlining the initial comparisons, or by expanding the length of the prefix so as to help provide fewer offsets to examine per prefix. The current level of performance is acceptable to me, however, so I plan to leave it be for now.

Interestingly the asymptotic running time of this algorithm is actually worse than the Boyer-Moore case. Boyer-Moore is O(n), where n is the size of the large string. The worst case of my new algorithm is O(n*m), where m is the size of the small string. The hashing is effectively a constant-time divisor to that n*m, which as we all know does not affect the big-O performance of an algorithm. However, since the divisor is about 16.8 million, it would take a really huge data set before the Boyer-Moore version really does start to perform better, and as can be seen from my real-world data sets, the multi-megabyte sizes I'm dealing with attain much better performance with the prefix table algorithm.

Did you enjoy this article? I'm selling whole books full of them! Volumes II and III are now out! They're available as ePub, PDF, print, and on iBooks and Kindle. Click here for more information.


[This adds five new three-character prefixes to the string table, three in "bonus itself" and two for the last two characters in thes tring table, and these new prefixes and their offsets are added to the offsets table.] -> [This adds five new three-character prefixes to the string table, three in "bonus" itself and two for the last two characters in the string table, and these new prefixes and their offsets are added to the offsets table.]

Sorry, been copy editing at work. ;-)
Actually, your final algorithm is O(N). Since you define M as the size of your short string (aka the size of the hash) that effectively makes M a constant and such if falls out of the running time equation.

Although I wonder if a string-tree implementation might have given you similar speed up, although all the pointer chasing might have made it worthwhile.
Nope, it's O(n*m).

Consider what happens when running this algorithm on a collection of large strings of identical lengths which contain the letter "x" for all but the last few characters. They will all hash to the same value. That one value will contain offsets pointing to nearly all of the large string. At each offset a string compare must be performed, which will fail on average halfway through the length of the string. (When comparing at the beginning of an existing string it must compare nearly the entire thing, and at the end it compares almost none; the average is one half.) Ignoring fenceposting, the total runtime is therefore (1 - epsilon) * n * m / 2, where epsilon is the proportion of the strings which is unique. And of course the constants fall out, so the big-O is O(n*m). Recall that big-O is worst case, not average case, so how it performs on non-pathological data is irrelevant.

To show this another way: consider that a naive string search (i.e. not Boyer-Moore or something else that's clever) is O(n*m). This just adds a hash table on top of that naive search algorithm, but of course hash tables fail on pathological data, so the worst case stays O(n*m).

Fancier algorithms could certainly improve this for pathological data, and probably for the data I actually deal with. However since this is code which is meant for my own personal use (it compiles dictionaries and users get the compiled data directly), the cost of implementing something fancier outweights any benefits I'd get.
Then I misunderstood what you described as M initially, as I understood it to be the length of strings that fell out of the hash table (too short for it), but your expansion implies that it is the length of the shortest string ever presented to the algorithm.
I can see where that would be confusing. The big-O analysis is for a single search of the bucket, and in that instance you have a large string, and then you're searching for a smaller string inside it. In other words, the large string is the haystack, and the small string is the needle. In this case O(n*m) has n as the length of the haystack and m as the length of the needle. The worst-case running time for the full algorithm, building the string table from scratch, is going to be around O(n^2*m^2), where n is the number of strings and m is the average length. But the running time on good data is going to be more like O(n), since a roughly constant number of locations needs to be checked for each string, and on good data the string matching will fail early.

Comments RSS feed for this page

Add your thoughts, post a comment:

Spam and off-topic posts will be deleted without notice. Culprits may be publicly humiliated at my sole discretion.

The Answer to the Ultimate Question of Life, the Universe, and Everything?
Formatting: <i> <b> <blockquote> <code>.
NOTE: Due to an increase in spam, URLs are forbidden! Please provide search terms or fragment your URLs so they don't look like URLs.
Code syntax highlighting thanks to Pygments.
Hosted at DigitalOcean.