Next article: Friday Q&A 2015-08-14: An Xcode Plugin for Unsmoothed Text
Previous article: Friday Q&A 2015-07-17: When to Use Swift Structs and Classes
Tags: fridayqna objectivec
Tagged pointers are an interesting technology used to increase performance and reduce memory usage. As of OS X 10.10, NSString
got the tagged pointer treatment, and today I'm going to take a look at how it works, a topic suggested by Ken Ferry.
Recap
Objects are aligned in memory, such that their address is always at least a multiple of the pointer size, and in practice typically a multiple of 16
. Object pointers are stored as a full 64-bit integer, but this alignment means that some of the bits will always be zero.
Tagged pointers take advantage of this fact to give special meaning to object pointers where those bits are not zero. In Apple's 64-bit Objective-C implementation, object pointers with the least significant bit set to one (which is to say, odd numbers) are considered tagged pointers. Instead of doing the standard isa
dereference to figure out the class, the next three bits are considered as an index into a tagged class table. This index is used to look up the class of the tagged pointer. The remaining 60 bits are then left up to the tagged class to use as they please.
A simple use for this would be to make suitable NSNumber
instances be tagged pointers, with the extra 60
bits used to hold the numeric value. The bottom bit would be set to 1
. The next three bits would be set to the appropriate index for the NSNumber
tagged class. The following 60
bits could then hold, for example, any integer value that fits within that space.
From the outside, such a pointer would look and act like any other object. It responds to messages like any other object, because objc_msgSend
knows about tagged pointers. If you ask it for its integerValue
, it will extract the data from those 60
bits and return it to you. However, you've saved a memory allocation, a pointer indirection on every access, and reference counting can be a no-op since there's no memory to free. For commonly used classes, this can make for a substantial performance improvement.
NSString
doesn't seem like a good candidate for tagged pointers, since it's of variable length and can be much longer than the 60
bits in a tagged pointer. However, a tagged pointer class can coexist with a normal class, using tagged pointers for some values and normal pointers for others. For example, with NSNumber
, any integer larger than 2^60 - 1
won't fit in the tagged pointer, and instead needs to be stored in a normal NSNumber
objcet allocated in memory. As long as the code that creates the objects is written appropriately, this all works fine.
NSString
could do the same thing. For strings that fit inside 60
bits, it could create a tagged pointer. Other strings would be placed in normal objects. This assumes that small strings are used often enough for this to be a noticeable performance gain. Is that actually the case in real-world code? It appears Apple thinks it is, since they went ahead and implemented it.
Possible Implementations
Before we look at how Apple did things, let's take a moment to think about how we might implement tagged string storage. The basics are simple: set the bottom bit to one, set the remaining bits to the appropriate tagged class index, set the remaining bits to whatever. How to use the remaining bits is the big question. We want to take the maximum advantage of the 60
bits available to us.
A Cocoa string is conceptually a sequence of Unicode code points. There are 1,112,064 valid Unicode code points, so one code point takes 21
bits to represent. That means we could fit two code points into these 60
bits, with 18
bits wasted. We could borrow some of those extra bits to hold a length, so that a tagged string could be zero, one, or two code points. Being limited to only two code points doesn't seem very useful, though.
The NSString
API is actually implemented in terms of UTF-16
, not raw Unicode code points. UTF-16
represents Unicode as a sequence of 16
bit values. The most common code points, which occupy the Basic Multilingual Plane or BMP, fit into a single 16
bit value, while code points above 65,535 require two. We could fit three 16
bit values into the 60
bits available, with 12
bits left over. Borrowing some bits for the length would allow us to represent 0-3 UTF-16
code units. This would allow up to three code points in the BMP, and one code point beyond the BMP (plus, optionally, one code point within it). Being limited to three is still pretty tight though.
Most strings in an app are probably ASCII. Even if the app is localized into a non-ASCII language, strings are used for far more than just displaying UI. They're used for URL components, file extensions, object keys, property list values, and much more. The UTF-8
encoding is an ASCII-compatible encoding that encodes each ASCII character as a single byte, while using up to four bytes for other Unicode code points. We can fit seven bytes in the 60
bits allotted to us, with 4
bits left over to use as a length. This allows our tagged strings to hold seven ASCII characters, or somewhat fewer non-ASCII characters, depending on exactly what they are.
If we're optimizing for ASCII, we might as well drop full Unicode support altogether. Strings containing non-ASCII characters can use real objects, after all. ASCII is a seven-bit encoding, so what if we allot only 7
bits per character? That lets us store up to eight ASCII characters in the 60
bits available, plus 4
bits left over for the length. This is starting to sound useful. There are probably a lot of strings in an app which are pure ASCII and contain eight characters or fewer.
Can we take it further? The full ASCII range contains a lot of stuff that isn't frequently used. There are a bunch of control characters, for example, and unusual symbols. Alphanumerics make up most of what's used. What can we squeeze into 6
bits?
6
bits results in 64
possible values. There are 26
letters in the ASCII alphabet. Including both upper and lower case brings it up to 52
. Including all digits 0-9 brings it up to 62
. There are two spots to spare, which we might give to, say, space and period. There are probably a lot of strings which only contain these characters. At 6
bits each, we can fit ten in our 60
bits of storage! But wait! We don't have any leftover bits to store the length. So either we store nine characters plus a length, or we remove one of the 64
possible values (I nominate the space to be the one to go) and use zero as a terminator for strings shorter than ten characters.
How about 5
bits? This isn't totally ludicrous. There are probably a lot of strings which are just lowercase, for example. 5
bits gives 32
possible values. If you include the whole lowercase alphabet, there are 6
extra values, which you could allot to the more common uppercase letters, or some symbols, or digits, or some mix. If you find that some of these other possibilities are more common, you could even remove some of the less common lowercase letters, like q
. 5
bits per character gives eleven characters if we save room for length, or twelve if we borrow a symbol and use a terminator.
Can we take it further? 5
bits is about as far as you can reasonably go with a fixed-size alphabet. You could switch to a variable length encoding, for example using a Huffman code. This would allow the letter e
, which is common, to be encoded with fewer bits than the letter q
. This might allow for as little as 1
bit per character, in the unlikely case that your string is all e
s. This would come at the cost of more complex and presumably slower code.
Which approach did Apple take? Let's find out.
Exercising Tagged Strings
Here's a bit of code that creates a tagged string and prints its pointer value:
NSString *a = @"a";
NSString *b = [[a mutableCopy] copy];
NSLog(@"%p %p %@", a, b, object_getClass(b));
The mutableCopy
/copy
dance is necessary for two reasons. First, although a string like @"a"
could be stored as a tagged pointer, constant strings are never tagged pointers. Constant strings must remain binary compatible across OS releases, but the internal details of tagged pointers are not guaranteed. This works fine as long as tagged pointers are always generated by Apple's code at runtime, but it would break down if the compiler embedded them in your binary, as would be the case with a constant string. Thus we need to copy the constant string to get a tagged pointer.
The mutableCopy
is necessary because NSString
is too clever for us, and knows that a copy
of an immutable string is a pointless operation, and returns the original string as the "copy." Constant strings are immutable, so the result of [a copy]
is just a
. A mutable copy forces it to make an actual copy, and then making an immutable copy of the result is enough to convince the system to give us a tagged pointer string.
Note that you must never depend on these details in your own code! The circumstances in which the NSString
code decides to give you a tagged pointer could change at any time and if you write code that somehow relies on it, that code will eventually break. Fortunately, you have to go out of your way to do that, and all normal, sensible code will work fine, blissfully ignorant of tagged anything.
Here's what the above code prints on my computer:
0x10ba41038 0x6115 NSTaggedPointerString
You can see the original pointer first, a nice round number indicative of an object pointer. The copy is the second value, and its taggedness is abundantly clear. It's an odd number, which means it can't be a valid object pointer. It's also a small number, well within the unmapped and unmappable 4GB zero page at the beginning of the 64-bit Mac address space, making it doubly imposible to be an object pointer.
What can we deduce from this value of 0x6115
? We know that the bottom four bits are part of the tagged pointer mechanism itself. The lowest nybble 5
is 0101
in binary. The bottom one bit indicates that it's a tagged pointer. The next three bits indicate the tagged class. Here those bits are 010
, indicating that the tagged string class occupies index 2
. Not that there's much to do with that information.
The 61
that leads the value is suggestive. 61
in hexadecimal happens to be the ASCII value of the lowercase letter a
, which is exactly what the string contains. It looks like there's a straight ASCII encoding in use. Convenient!
The class name makes it obvious what this class is for, and gives us a good starting point when it comes to looking into the actual code that implements this feature. We'll get to that shortly, but let's do some more outside inspection first.
Here's a loop that builds up strings of the form abcdef...
and prints them out one by one until it stops getting tagged pointers back:
NSMutableString *mutable = [NSMutableString string];
NSString *immutable;
char c = 'a';
do {
[mutable appendFormat: @"%c", c++];
immutable = [mutable copy];
NSLog(@"0x%016lx %@ %@", immutable, immutable, object_getClass(immutable));
} while(((uintptr_t)immutable & 1) == 1);
The first iteration prints:
0x0000000000006115 a NSTaggedPointerString
This matches what we saw above. Note that I'm printing out the full pointer with all leading zeroes to make things a bit more clear when comparing subsequent iterations. Let's compare with the second iteration:
0x0000000000626125 ab NSTaggedPointerString
The bottom four bits didn't change, as we'd expect. That 5
will remain constant, always indicating that this is a tagged pointer of type NSTaggedPointerString
.
The original 61
stays where it was, joined now by a 62
. 62
is, of course, the ASCII code for b
, so we can see that this encoding is an eight-bit encoding that uses ASCII. The four bits just before the terminal 5
changed from 1
to 2
, suggesting that this might be the length. Subsequent iterations confirm this:
0x0000000063626135 abc NSTaggedPointerString
0x0000006463626145 abcd NSTaggedPointerString
0x0000656463626155 abcde NSTaggedPointerString
0x0066656463626165 abcdef NSTaggedPointerString
0x6766656463626175 abcdefg NSTaggedPointerString
Presumably that's the end of it. The tagged pointer is full, and the next iteration will allocate an object and terminate the loop. Right? Wrong!
0x0022038a01169585 abcdefgh NSTaggedPointerString
0x0880e28045a54195 abcdefghi NSTaggedPointerString
0x00007fd275800030 abcdefghij __NSCFString
The loop goes through two more iterations before terminating. The length section continues to increase, but the rest of the tagged pointer turns into gibberish. What's going on? Let's turn to the implementation to find out.
Disassembly
The NSTaggedPointer
class lives in the CoreFoundation framework. It seems like it ought to live in Foundation, but a lot of the core Objective-C classes have been moved to CoreFoundation these days as Apple slowly gives up on making CoreFoundation an independent entity.
Let's start by looking at the implementation of -[NSTaggedPointerString length]
:
push rbp
mov rbp, rsp
shr rdi, 0x4
and rdi, 0xf
mov rax, rdi
pop rbp
ret
Hopper provides this handy decompilation to go along with it:
unsigned long long -[NSTaggedPointerString length](void * self, void * _cmd) {
rax = self >> 0x4 & 0xf;
return rax;
}
In short, to obtain the length, extract bits 4-7 and return them. This confirms what we observed above, that the four bits just before the terminal 5
indicate the length of the string.
The other primitive method for an NSString
subclass is characterAtIndex:
. I'll skip the lengthy disassembly and go straight to Hopper's decompiled output, which is pretty readable:
unsigned short -[NSTaggedPointerString characterAtIndex:](void * self, void * _cmd, unsigned long long arg2) {
rsi = _cmd;
rdi = self;
r13 = arg2;
r8 = ___stack_chk_guard;
var_30 = *r8;
r12 = rdi >> 0x4 & 0xf;
if (r12 >= 0x8) {
rbx = rdi >> 0x8;
rcx = "eilotrm.apdnsIc ufkMShjTRxgC4013bDNvwyUL2O856P-B79AFKEWV_zGJ/HYX";
rdx = r12;
if (r12 < 0xa) {
do {
*(int8_t *)(rbp + rdx + 0xffffffffffffffbf) = *(int8_t *)((rbx & 0x3f) + rcx);
rdx = rdx - 0x1;
rbx = rbx >> 0x6;
} while (rdx != 0x0);
}
else {
do {
*(int8_t *)(rbp + rdx + 0xffffffffffffffbf) = *(int8_t *)((rbx & 0x1f) + rcx);
rdx = rdx - 0x1;
rbx = rbx >> 0x5;
} while (rdx != 0x0);
}
}
if (r12 <= r13) {
rbx = r8;
___CFExceptionProem(rdi, rsi);
[NSException raise:@"NSRangeException" format:@"%@: Index %lu out of bounds; string length %lu"];
r8 = rbx;
}
rax = *(int8_t *)(rbp + r13 + 0xffffffffffffffc0) & 0xff;
if (*r8 != var_30) {
rax = __stack_chk_fail();
}
return rax;
}
Let's clean this up a little. The first three lines are just Hopper telling us which registers get which arguments. Let's go ahead and replace all uses of rsi
with _cmd
and rdi
with self
. arg2
is actually the index
parameter, so let's replace all uses of r13
with index
. And we'll get rid of the __stack_chk
stuff, as it's just a security hardening thing and not relevant to the actual workings of the method. Here's what the code looks like when cleaned up in this way:
unsigned short -[NSTaggedPointerString characterAtIndex:](void * self, void * _cmd, unsigned long long index) {
r12 = self >> 0x4 & 0xf;
if (r12 >= 0x8) {
rbx = self >> 0x8;
rcx = "eilotrm.apdnsIc ufkMShjTRxgC4013bDNvwyUL2O856P-B79AFKEWV_zGJ/HYX";
rdx = r12;
if (r12 < 0xa) {
do {
*(int8_t *)(rbp + rdx + 0xffffffffffffffbf) = *(int8_t *)((rbx & 0x3f) + rcx);
rdx = rdx - 0x1;
rbx = rbx >> 0x6;
} while (rdx != 0x0);
}
else {
do {
*(int8_t *)(rbp + rdx + 0xffffffffffffffbf) = *(int8_t *)((rbx & 0x1f) + rcx);
rdx = rdx - 0x1;
rbx = rbx >> 0x5;
} while (rdx != 0x0);
}
}
if (r12 <= index) {
rbx = r8;
___CFExceptionProem(self, _cmd);
[NSException raise:@"NSRangeException" format:@"%@: Index %lu out of bounds; string length %lu"];
r8 = rbx;
}
rax = *(int8_t *)(rbp + index + 0xffffffffffffffc0) & 0xff;
return rax;
}
Right before the first if
statement is this line:
r12 = self >> 0x4 & 0xf
We can recognize this as the same length extraction code that we saw in the implementation of -length
above. Let's go ahead and replace r12
with length
throughout:
unsigned short -[NSTaggedPointerString characterAtIndex:](void * self, void * _cmd, unsigned long long index) {
length = self >> 0x4 & 0xf;
if (length >= 0x8) {
rbx = self >> 0x8;
rcx = "eilotrm.apdnsIc ufkMShjTRxgC4013bDNvwyUL2O856P-B79AFKEWV_zGJ/HYX";
rdx = length;
if (length < 0xa) {
do {
*(int8_t *)(rbp + rdx + 0xffffffffffffffbf) = *(int8_t *)((rbx & 0x3f) + rcx);
rdx = rdx - 0x1;
rbx = rbx >> 0x6;
} while (rdx != 0x0);
}
else {
do {
*(int8_t *)(rbp + rdx + 0xffffffffffffffbf) = *(int8_t *)((rbx & 0x1f) + rcx);
rdx = rdx - 0x1;
rbx = rbx >> 0x5;
} while (rdx != 0x0);
}
}
if (length <= index) {
rbx = r8;
___CFExceptionProem(self, _cmd);
[NSException raise:@"NSRangeException" format:@"%@: Index %lu out of bounds; string length %lu"];
r8 = rbx;
}
rax = *(int8_t *)(rbp + index + 0xffffffffffffffc0) & 0xff;
return rax;
}
Looking inside the if
statement, the first line shifts self
right by 8
bits. The bottom 8
bits are bookkeeping: the tagged pointer indicator, and the string length. The remainder is then, we presume, the actual data. Let's replace rbx
with stringData
to make that more clear. The next line appears to put some sort of lookup table into rcx
, so let's replace rcx
with table
. Finally, rdx
gets a copy of the value of length
. It looks like it gets used as some sort of cursor later, so let's replace rdx
with cursor
. Here's what we have now:
unsigned short -[NSTaggedPointerString characterAtIndex:](void * self, void * _cmd, unsigned long long index) {
length = self >> 0x4 & 0xf;
if (length >= 0x8) {
stringData = self >> 0x8;
table = "eilotrm.apdnsIc ufkMShjTRxgC4013bDNvwyUL2O856P-B79AFKEWV_zGJ/HYX";
cursor = length;
if (length < 0xa) {
do {
*(int8_t *)(rbp + cursor + 0xffffffffffffffbf) = *(int8_t *)((stringData & 0x3f) + table);
cursor = cursor - 0x1;
stringData = stringData >> 0x6;
} while (cursor != 0x0);
}
else {
do {
*(int8_t *)(rbp + cursor + 0xffffffffffffffbf) = *(int8_t *)((stringData & 0x1f) + table);
cursor = cursor - 0x1;
stringData = stringData >> 0x5;
} while (cursor != 0x0);
}
}
if (length <= index) {
rbx = r8;
___CFExceptionProem(self, _cmd);
[NSException raise:@"NSRangeException" format:@"%@: Index %lu out of bounds; string length %lu"];
r8 = rbx;
}
rax = *(int8_t *)(rbp + index + 0xffffffffffffffc0) & 0xff;
return rax;
}
That's almost everything labeled. One raw register name remains: rbp
. That's actually the frame pointer, so the compiler is doing something tricky indexing directly off the frame pointer. Adding the constant 0xffffffffffffffbf
is the two's-complement "everything is ultimately an unsigned integer" way to subtract 65
. Later on, it subtracts 64
. This is all probably the same local variable on the stack. Given the bytewise indexing going on, it's probably a buffer placed on the stack. But it's weird, because there's a code path which does nothing but read from that buffer, without ever writing to it. What's going on?
It turns out that what's going on is Hopper forgot to decompile the else
branch of that outer if
statement. The relevant assembly looks like this:
mov rax, rdi
shr rax, 0x8
mov qword [ss:rbp+var_40], rax
var_40
is how Hopper shows that offset of 64
in the disassembly. (40
being the hexadecimal version of 64
.) Let's call the pointer to this location buffer
. The C version of this missing branch would look like:
*(uint64_t *)buffer = self >> 8
Let's go ahead and insert that, and replace the other places where rbp
is used to access buffer
with more readable versions of the code, plus add a declaration of buffer
to remind us what's going on:
unsigned short -[NSTaggedPointerString characterAtIndex:](void * self, void * _cmd, unsigned long long index) {
int8_t buffer[11];
length = self >> 0x4 & 0xf;
if (length >= 0x8) {
stringData = self >> 0x8;
table = "eilotrm.apdnsIc ufkMShjTRxgC4013bDNvwyUL2O856P-B79AFKEWV_zGJ/HYX";
cursor = length;
if (length < 0xa) {
do {
*(int8_t *)(buffer + cursor - 1) = *(int8_t *)((stringData & 0x3f) + table);
cursor = cursor - 0x1;
stringData = stringData >> 0x6;
} while (cursor != 0x0);
}
else {
do {
*(int8_t *)(buffer + cursor - 1) = *(int8_t *)((stringData & 0x1f) + table);
cursor = cursor - 0x1;
stringData = stringData >> 0x5;
} while (cursor != 0x0);
}
} else {
*(uint64_t *)buffer = self >> 8;
}
if (length <= index) {
rbx = r8;
___CFExceptionProem(self, _cmd);
[NSException raise:@"NSRangeException" format:@"%@: Index %lu out of bounds; string length %lu"];
r8 = rbx;
}
rax = *(int8_t *)(buffer + index) & 0xff;
return rax;
}
Getting better. All those crazy pointer manipulation statements are a bit hard to read, though, and they're really just array indexing. Let's fix those up:
unsigned short -[NSTaggedPointerString characterAtIndex:](void * self, void * _cmd, unsigned long long index) {
int8_t buffer[11];
length = self >> 0x4 & 0xf;
if (length >= 0x8) {
stringData = self >> 0x8;
table = "eilotrm.apdnsIc ufkMShjTRxgC4013bDNvwyUL2O856P-B79AFKEWV_zGJ/HYX";
cursor = length;
if (length < 0xa) {
do {
buffer[cursor - 1] = table[stringData & 0x3f];
cursor = cursor - 0x1;
stringData = stringData >> 0x6;
} while (cursor != 0x0);
}
else {
do {
buffer[cursor - 1] = table[stringData & 0x1f];
cursor = cursor - 0x1;
stringData = stringData >> 0x5;
} while (cursor != 0x0);
}
} else {
*(uint64_t *)buffer = self >> 8;
}
if (length <= index) {
rbx = r8;
___CFExceptionProem(self, _cmd);
[NSException raise:@"NSRangeException" format:@"%@: Index %lu out of bounds; string length %lu"];
r8 = rbx;
}
rax = buffer[index];
return rax;
}
Now we're getting somewhere.
We can see that there are three cases depending on the length. Length values less than 8 go through that missing else
branch that just dumps the value of self
, shifted, into buffer
. This is the plain ASCII case. Here, index
is used to index into the value of self
to extract the given byte, which is then returned to the caller. Since ASCII character values match Unicode code points within the ASCII range, there's no additional manipulation needed to make the value come out correctly. We guessed above that the string stored plain ASCII in this case, and this confirms it.
What about the cases where the length is 8
or more? If the length is 8
or more but less than 10
(0xa
), then the code enters a loop. This loop extracts the bottom 6
bits of stringData
, uses that as an index into table
, and then copies that value into buffer
. It then shifts stringData
down by 6
bits and repeats until it runs through the entire string. This is a six-bit encoding where the mapping from raw six-bit values to ASCII character values is stored in the table. A temporary version of the string is built up in buffer
, and the indexing operation at the end then extracts the requested character from it.
What about the case where the length is 10
or more? The code there is almost identical, except that it works five bits at a time instead of six. This is a more compact encoding that would allow the tagged string to store up to 11
characters, but only using an alphabet of 32
values. This will use the first half of table
as its truncated alphabet.
Thus we can see that the structure of the tagged pointer strings is:
- If the length is between
0
and7
, store the string as raw eight-bit characters. - If the length is
8
or9
, store the string in a six-bit encoding, using the alphabet"eilotrm.apdnsIc ufkMShjTRxgC4013bDNvwyUL2O856P-B79AFKEWV_zGJ/HYX"
. - If the length is
10
or11
, store the string in a five-bit encoding, using the alphabet"eilotrm.apdnsIc ufkMShjTRxgC4013"
Let's compare with the data we generated earlier:
0x0000000000006115 a NSTaggedPointerString
0x0000000000626125 ab NSTaggedPointerString
0x0000000063626135 abc NSTaggedPointerString
0x0000006463626145 abcd NSTaggedPointerString
0x0000656463626155 abcde NSTaggedPointerString
0x0066656463626165 abcdef NSTaggedPointerString
0x6766656463626175 abcdefg NSTaggedPointerString
0x0022038a01169585 abcdefgh NSTaggedPointerString
0x0880e28045a54195 abcdefghi NSTaggedPointerString
0x00007fbad9512010 abcdefghij __NSCFString
The binary expansion of 0x0022038a01169585
minus the bottom eight bits and divided into six-bit chunks is:
001000 100000 001110 001010 000000 010001 011010 010101
Using these to index into the table, we can see that this does indeed spell out "abcdefgh". Similarly, the binary expansion of
0x0880e28045a54195` minus the bottom eight bits and divided into six-bit chunks is:
001000 100000 001110 001010 000000 010001 011010 010101 000001
We can see that it's the same string, plus i
at the end.
But then it goes off the rails. After this point, it should switch to a five-bit encoding and give us two more strings, but instead it starts allocating objects at length 10
. What gives?
The five-bit alphabet is extremely limited, and doesn't include the letter b
! That letter must not be common enough to warrant a place in the 32
hallowed characters of the five-bit alphabet. Let's try this again, but instead start the string at c
. Here's the output:
0x0000000000006315 c NSTaggedPointerString
0x0000000000646325 cd NSTaggedPointerString
0x0000000065646335 cde NSTaggedPointerString
0x0000006665646345 cdef NSTaggedPointerString
0x0000676665646355 cdefg NSTaggedPointerString
0x0068676665646365 cdefgh NSTaggedPointerString
0x6968676665646375 cdefghi NSTaggedPointerString
0x0038a01169505685 cdefghij NSTaggedPointerString
0x0e28045a54159295 cdefghijk NSTaggedPointerString
0x01ca047550da42a5 cdefghijkl NSTaggedPointerString
0x39408eaa1b4846b5 cdefghijklm NSTaggedPointerString
0x00007fbd6a511760 cdefghijklmn __NSCFString
We now have tagged strings all the way up to a length of 11
. The binary expansions of the last two tagged strings are:
01110 01010 00000 10001 11010 10101 00001 10110 10010 00010
01110 01010 00000 10001 11010 10101 00001 10110 10010 00010 00110
Exactly what we're expecting.
Creating Tagged Strings
Since we know how tagged strings are encoded, I won't go into much detail on the code that creates them. The code in question is found within a private function called __CFStringCreateImmutableFunnel3
which handles every conceivable string creation case all in one gigantic function. This function is included in the open source release of CoreFoundation available on opensource.apple.com, but don't get excited: the tagged pointer strings code is not included in the open source version.
The code here is essentially the inverse of what's above. If the string's length and content fits what a tagged pointer string can hold, it builds a tagged pointer piece by piece, containing either ASCII, six-bit, or five-bit characters. There's an inverse of the lookup table. The table seen above as a constant string lives as a global variable called sixBitToCharLookup
, and there's a corresponding table called charToSixBitLookup
in the Funnel3
function.
That Weird Table
The full six-bit encoding table is:
eilotrm.apdnsIc ufkMShjTRxgC4013bDNvwyUL2O856P-B79AFKEWV_zGJ/HYX
A natural question here is: why is it in such a strange order?
Because this table is used for both six-bit and five-bit encodings, it makes sense that it wouldn't be entirely in alphabetical order. Characters that are used most frequently should be in the first half, while characters that are used less frequently should go in the second half. This ensures that the maximum number of longer strings can use the five-bit encoding.
However, with that division in place, the order within the individual halves doesn't matter. The halves themselves could be sorted alphabetically, but they're not.
The first few letters in the table is similar to the order in which letters appear in English, sorted by frequency. The most common letter in English is E, then T, then A, O, I, N, and S. E is right at the beginning of the table, and the others are near the beginning. The table appears to be sorted by frequency of use. The discrepancy with English probably arises because short strings in Cocoa apps aren't a random selection of words from English prose, but a more specialized bit of language.
I speculate that Apple originally intended to use a fancier variable-length encoding, probably based on a Huffman code. This turned out to be too difficult, or not worth the effort, or they just ran out of time, and so they dialed it back to the less ambitious version seen above, where strings choose between constant-length encodings using eight, six, or five bits per character. The weird table lives on as a leftover, and a starting point if they decide to go for a variable-length encoding in the future. This is pure guesswork, but that's what it looks like to me.
Conclusion
Tagged pointers are a really cool technology. Strings are an unusual application of it, but it's clear that Apple put a lot of thought into it and they must see a significant benefit. It's interesting to see how it's put together and how they got the most out of the very limited storage available.
That's it for today. Come back next time for more strange explorations of the trans-mundane. Friday Q&A is driven by reader suggestions, so if you have an idea for a subject you'd like to see covered here, please send it in!
Comments:
1. Did you notice that your original constant string pointer 0x10ba41038 is 8-aligned, not 16-aligned?
2. You could use a tagged pointer scheme even if you allowed byte-aligned objects. Just use the high bits instead of the low bits.
3. Given 8-aligned pointers, you don't need to dedicate a whole bit for a tag flag. The pointer must be fake if any of the low three bits are non-zero.
Also, there are probably still people at Apple that remember the whole "32-bit clean" debacle, where early MacOS programs used the top 8 of 32 bits of addresses for flags. Worked great when the machines it ran on had 512k of RAM - not so hot when they started to get 16M+. :)
For the 6 bit case:
if (length < 0xa) {
tableIndex = stringData >> (index * 6 * 2) & 0x3f;
return table[tableIndex];
}
Using the high bits works well as long as you know a chunk of the address space is unused. That's true on x86-64 for now, where only 48 bits are used. It's not the case on 32-bit, so a hypothetical tagged pointer implementation there wouldn't be able to use the high bits. If x86-64 expands to the full 64-bit address space then that would stop working, too.
And yes, you could use any of the bottom three bits as a signal for taggedness. This would expand the number of tagged classes you can discriminate between. I suspect it's less efficient to check for, though. Or maybe they just don't need that many (last I checked, the tagged class table wasn't fully used yet) and went for simplicity.
Bob Peterson: I know six-bit encodings had some history, and five-bit codings substantially pre-date the computer (e.g. Baudot code), but I never heard about the link between 18-bit words, six-bit encodings, and the standard three-character file extension. Interesting tidbit.
Matt: Using the high bits wouldn't break ASLR. You'd only use the high bits if there were portions of the address space that were always unmapped. This is the case in the current implementations of x86-64, for example, where the top 16 bits of a pointer must always be set to the same value, and on the Mac the high values are reserved for the kernel. That means that everything above 2^48 is unmapped from the perspective of userspace, and you can count on those top bits always being zero in a real pointer. ASLR will work within the 48-bit address space, but can't go beyond it.
People always bring up stuff like "32-bit clean" when I post articles about these things, but it simply doesn't apply. The problem with the 24->32-bit transition on the Mac was that third-party apps were pulling tricks like this. Apple could update their code, but third-party apps would break until they were fixed. When these tricks are only being done in the OS, it's not a problem. The OS gets updated whenever fundamental things like the process memory map is updated (for the x86-64 48-bit address space limitation, it would require new CPU hardware) and Apple can update their tagged pointer implementation accordingly. If it's burdensome, they can turn it off. It's a single
#define
in the Objective-C runtime to toggle it, and I believe you can even toggle it on a process-by-process basis using an environment variable.
As for why Apple doesn't use the high bits, I imagine they just don't need to, and using the low bit as the tagged pointer indicator keeps things simple.
Joseph Gentle: Good question! The exact same loops can be found in the implementation in other methods, like
-[NSTaggedPointerString getCharacters:range:]
and -[NSTaggedPointerString UTF8String]
. My guess is that the code that converts the tagged string storage to a sequence of ASCII bytes lives in a single location in the source code, and then gets inlined into these different methods. characterAtIndex:
would use this rather than reading the character directly to keep the implementation in a single location. The overhead is probably not important, as performance-sensitive code won't typically use characterAtIndex:
in the first place.The way it's implemented here, arithmetic (with NSNumber) has to mask off the low bits before doing anything. I guess that must be a big gain over the old way, since masking a couple bits is still a lot faster than dereferencing a pointer. Also (IME), NSNumber is really only used for storage, not arithmetic, so people don't really expect it to be fast. When you want fast integer math in Cocoa, you use "int".
In contrast, SBCL, for example, uses all 0's as the fixint tag value, so when they want to add 3 plus 4, they can just call x86 ADD. But this tagged pointer is the default way to do arithmetic, in SBCL, so making this case fast is important to them.
https://github.com/sbcl/sbcl/blob/master/src/compiler/generic/early-objdef.lisp
In fact, for fast integer math, they only need to test (and potentially branch) once. Just bitwise-OR the arguments together, and if the low bits are all still 0 (which in practice they almost always are), you know it's safe to ADD them together!
NaNs propagate through a calculation. Any calculation that involves a NaN will produce a NaN. That means that you can run through your calculation without checking the inputs, and just fall back to the slow path if the result is NaN.
I don't think Cocoa cares since, as you say, you don't use NSNumber for speed. The Objective-C tagged pointer implementation appears to be all about reducing memory allocations, which will also improve performance but as a sort of side effect. NSNumber gets the tagged pointer treatment not because apps are adding them together a lot, but because they're using a lot of NSNumbers to store in arrays and dictionaries and such.
__stack_chk_guard
:
In a malformed tagged pointer, this would still overflow
buffer
by up to four bytes, (probably garbling up the stored index
parameter on stack?) overwriting them with the ASCII code for “e”.
If it's really writing into the memory of
index
, that would then enter the exception branch for any original index
parameter reporting a ridiculously large out-of-range index—even if the caller passed 0.The vast majority of strings stored by almost any program (even in other languages) are ASCII. This optimization only works at all for really short strings. So the only case where this would matter at all to a "Chinese app" would be if it dealt with a huge number of tiny user-entered strings ... without doing any I/O, perhaps, since I/O tends to dwarf even memory dereferencing.
Remember, the short form of the encoding also doesn't work for 10-character ASCII strings that contain the letter "b" or the number "2". Does that mean apps that use "b" and "2" a lot will be 'slower'? Probably not enough for anybody to ever notice.
0xf5
just prints eeeeeeeeeeeeeee
and continues normally.
Note that if you experiment with this, you have to make sure that your code does something else with
NSString
first, like a dummy call to [NSString class]
, in order to get it properly registered with the runtime. Otherwise you crash in objc_msgSend
trying to mess with a tagged pointer it doesn't understand. NSString *seven = @"7";
NSString *taggedSeven = [[seven mutableCopy] copy];
NSLog(@"taggedSeven: %p, class: %@", taggedSeven, [taggedSeven class]); //taggedSeven: 0x3715, class: NSTaggedPointerString
NSLog(@"seven: %p, class: %@", seven, [seven class]); //seven: 0x100002078, class: __NSCFConstantString
NSArray *array = @[@4, @2.0, taggedSeven, @6, @10];
NSArray *result = [array sortedArrayUsingSelector:@selector(compare:)];
NSLog(@"result: %@", result); // 2, 4, 6, 10, 7
array = @[@4, @2.0, seven, @6, @10];
result = [array sortedArrayUsingSelector:@selector(compare:)]; //-[__NSCFNumber length]: unrecognized selector sent to instance 0x627
Operations also get slower, like taking the hashcode of a packed string. You want both the packed and unpacked version of the string "foo" to have the same hashcode, and since you can't store it on the tagged pointer string itself, you have to recompute it each time.
I googled that hardcoded "eilotrm..." string lookup, and got this:
https://llvm.org/svn/llvm-project/lldb/trunk/source/DataFormatters/CXXFormatterFunctions.cpp
The method in question is NSTaggedString_SummaryProvider
-[NSNumber compare:]
requires an NSNumber*
argument, and -[NSString compare:]
requires an NSString*
argument. You are (indirectly) passing an NSNumber*
argument to -[NSString compare:]
by mixing the objects in your array.David: Nice find. Makes sense that the debugger would want to know about the internal details like this.
Shane:
NSTaggedPointer
's compare:
implementation assumes that any tagged pointer passed in to the method is also an NSTaggedPointer
and does a direct comparison between the contents of the two. This is a valid optimization because the parameter to compare:
is NSString *
and by passing an NSNumber
in, you're violating that static type. So instead of blowing up trying to treat the tagged NSNumber
as a string by sending it string messages, it performs a bogus comparison on the tagged NSNumber
's internal data.Object equality is still perfectly safe, but I think the problem boils down to it meaning something a bit different from what you (and probably many other people) think it does. Pointer equality just checks: "Are these literally the exact same object in memory?" It sounds like you're using it to answer the question: "Were these created in different places?" There questions always had the same answer for a long time, but nothing says they have to.
I should say, I don't blame you for thinking this way and it seems like an easy mistake to make! Sometimes we just get it wrong.
It sounds like maybe you shouldn't be using string objects at all, but rather create your own custom class and pass those around. Your class could then contain a string, and since it's your own class, you'd have full control over how objects get allocated.
If it's not possible to have a completely separate class, you may be able to achieve the same effect by subclassing NSString.
https://github.com/llvm-mirror/lldb/blob/master/source/Plugins/Language/ObjC/NSString.cpp#L338
Comments RSS feed for this page
Add your thoughts, post a comment:
Spam and off-topic posts will be deleted without notice. Culprits may be publicly humiliated at my sole discretion.
That said, this is yet an other very interesting Friday Q&A. Thanks for the hard work.