It's been a long time since I wrote a Friday Q&A, but I'm back, with a brand-new post about a brand-new topic: Swift. Over the last few posts, Mike's gone into some detail about what Swift's internal structures looked like, but he's only touched very lightly on what the linker sees when it looks at Swift-containing binaries: mangled symbol names.
In a language such as C, where there can only ever be one function or piece of data by any given name (a symbol), name mangling is not required. Even so, if you look at the symbol table of a typical pure-C binary, you will find that each function name has had an _ (underscore) prepended to it. For example:
$ echo 'int main() { return 0; }' | xcrun clang -x c - -o ./test
$ xcrun nm ./test
0000000100000000 T __mh_execute_header
0000000100000f80 T _main
U dyld_stub_binder
$
This simple "mangling" is now largely historical, serving little useful purpose, but remains intact for compatibility and consistency reasons. By convention, names defined in C will have an underscore, while global symbols defined by pure assembly will not (although many assembly language writers will prepend the underscore anyway for consistency).
Objective-C also does not have collisions between symbol names; Objective-C method implementations are always of the form -[class selector], and Objective-C does not allow overloading of identical selectors on the same class with different type signatures.
Okay, let's mangle some names already!
Matters become more complicated in languages where a simple name, without any further information, can be ambiguous. Consider this example in C++:
$ cat | xcrun clang -x c++ - -o test
int foo(int a) { return a * 2; }
int foo(double a) { return a * 2.0; }
int main() { return foo(1) + foo(1.0); }
^D
$ xcrun nm -a test
0000000100000f30 T __Z3food
0000000100000f10 T __Z3fooi
0000000100000000 T __mh_execute_header
0000000100000f60 T _main
U dyld_stub_binder
Because foo refers to two different functions with different signatures, which is legal in C++, it is impossible to simply generate two _foo symbols; the linker would not know which was which. As a result, the C++ compiler "mangles" the symbols, using a strict set of encoding rules.
Unlike C and Objective-C, in C++ and Swift function names by themselves are not enough to tell apart each individual implementation of a function. Functions with the same name which take different parameter types (foo(int) and foo(double), for example) require more information to set them apart. Using the full signature given in code (such as "foo(int)") would lead to a lot of extra code in the linker and confusion when multiple type names map to the same underlying type (such as unsigned and unsigned int). Instead, in C++, the language's somewhat arcane type promotion and conversion rules are applied, and the result is mangled into a form the compiler and linker can use easily and without any confusion. The process is similar for Swift.
The simple example of foo above is trivially broken down:
- First, the leading _ common to C-style symbols.
- Next, _Z, a prefix marking the symbol as a mangled global C++ name.
- The number defines how many characters appear in the next identifier in the name; in this case 3. 3foo thus means "the name 'foo'".
- The d and i are respectively the double and int builtin type names; return values are not part of a function's signature in C++, so the parameter list simply follows the function's full name.
For more information on how typical C++ compilers mangle names, see the Itanium C++ ABI documentation.
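The breakdown above can be turned around into a toy mangler. This is only a sketch under heavy assumptions: mangle_simple is a hypothetical helper, it handles free functions whose parameters are a handful of builtin types, and it ignores namespaces, classes, templates, pointers, and everything else the real ABI covers. The single-letter type codes are the ones the Itanium ABI assigns to those builtins.

```cpp
#include <map>
#include <stdexcept>
#include <string>
#include <vector>

// Hypothetical helper: mangles a free function taking only builtin types.
// The result does not include the extra leading "_" that Mach-O adds.
std::string mangle_simple(const std::string& name,
                          const std::vector<std::string>& params) {
    // A small subset of the Itanium ABI builtin type codes.
    static const std::map<std::string, std::string> codes = {
        {"void", "v"},         {"bool", "b"}, {"char", "c"},  {"int", "i"},
        {"unsigned int", "j"}, {"long", "l"}, {"float", "f"}, {"double", "d"},
    };

    // Prefix, then the length-prefixed identifier.
    std::string out = "_Z" + std::to_string(name.size()) + name;

    // An empty parameter list is encoded as a single "v" (void).
    if (params.empty()) {
        return out + "v";
    }

    // The parameter types follow directly; the return type is not encoded.
    for (const auto& param : params) {
        auto it = codes.find(param);
        if (it == codes.end()) {
            throw std::invalid_argument("unsupported type: " + param);
        }
        out += it->second;
    }
    return out;
}
```

With this, mangle_simple("foo", {"int"}) produces _Z3fooi and mangle_simple("foo", {"double"}) produces _Z3food, matching the nm output earlier once the Mach-O underscore is added.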
That's all very interesting, but for a Swift article, you're taking a long time to get there!
Swift's name mangling is somewhat different from C++'s. Swift uses an encoding clearly based on the C++ scheme in principle, but containing considerably more information and expressing concepts only available in a more mature type system.
I'll jump right in with a complex example. Consider the following excessively contrived and completely useless Swift code:
$ xcrun swiftc -emit-library -o test -
struct e {
    enum f {
        case G, H, I
    }
}
class a {
    class b {
        class c {
            func d(y: a, x w: b, v u: (x: Int) -> Int) -> e.f {
                return e.f.G
            }
        }
    }
}
^D
$ xcrun nm -g test
...
0000000000001c90 T __TFCCC4test1a1b1c1dfS2_FTS0_1xS1_1vFT1xSi_Si_OVS_1e1f
...
$
Swift will have generated over 100 more symbols, but this is the complex mangled name we'll tear apart: __TFCCC4test1a1b1c1dfS2_FTS0_1xS1_1vFT1xSi_Si_OVS_1e1f
Let's take it in order:
- Sure enough, the leading extra _ is there even for Swift symbols.
- _T is the marker for a Swift global symbol.
- F tells us that the overall type of the symbol is a function.
- C represents a "class" type. In this case, we're dealing with three nested classes, so it appears 3 times.
- 4test is the "module name", and 1a is the class name itself, yielding a class named test.a.
- At this point, the Swift parser will set up a stack of parsed names, looking for the first non-name token in the mangled name. In this case, it will find f after 1d. It then goes back and unwinds the stack of nested types from the inside out, yielding test.a, test.a.b, and test.a.b.c as class names. Since 1d has no corresponding nesting type (there were only three Cs), it becomes the innermost part of the symbol's name: test.a.b.c.d.
- The lowercase f marks this symbol as an "uncurried function" type; in this case, a class method taking an implicitly bound first parameter, the instance itself.
- Because we're now parsing a function type, the list of argument types comes next, followed by the return type. For an uncurried function type, the curried parameter(s) come first.
- S2_ is a substitution, meaning it will use the third non-substituted type encountered during parsing of the name thus far (the index is zero-based). In this case, this would be test.a.b.c (the third class type).
- F now marks the beginning of the function's parameter list, in the guise of a fresh function type. By now, it should be very obvious that the name mangling is heavily oriented around types.
- T marks the beginning of a "tuple", which in this context is a list of types.
- S0_ is a substitution of the first type encountered in parsing, in this case test.a; the first parameter has this type.
- 1x is the external name of the second parameter. Notice that Swift does not encode internal names as part of the mangled signature.
- S1_ is a substitution of the second type encountered in parsing, in this case test.a.b; the second parameter has this type and the name x.
- 1v is the external name of the third parameter.
- F marks the start of another function type.
- T marks the start of another tuple, the function's parameters (the function type is unnamed).
- 1x is the external name of the closure's first parameter.
- Si is Swift.Int, a shorthand for the Int builtin type.
- _ marks the end of the closure's arguments tuple.
- Si is another Int, the closure's return type.
- _ marks the end of the uncurried function's arguments tuple.
- O marks the start of an enum type.
- V marks the start of a struct type, which will contain the enum. (As we saw with the classes earlier, types are nested from the inside out in mangled names.)
- S_ substitutes the (only) seen module name, test. Notice that this is not a type substitution!
- 1e is the name of the struct.
- 1f is the name of the enum.
- The parser sees the end of the mangled name and unwinds through the two parsed names as it did with the class names earlier.
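The name-stack step described above can be sketched in a few lines. This is not how Apple's demangler is written, and it handles only the narrow case from this example: a _TF symbol whose context is a run of nested classes. The names Context, read_identifier, and parse_context are all hypothetical. It collects the length-prefixed identifiers up to the first non-name token, then unwinds them into the class paths that later serve as the S0_, S1_, and S2_ substitutions.

```cpp
#include <cctype>
#include <cstddef>
#include <string>
#include <vector>

// Reads one <length><identifier> token starting at pos. On success, stores
// the identifier in out, advances pos past it, and returns true.
bool read_identifier(const std::string& s, std::size_t& pos, std::string& out) {
    std::size_t p = pos;
    std::size_t len = 0;
    while (p < s.size() && std::isdigit(static_cast<unsigned char>(s[p]))) {
        len = len * 10 + static_cast<std::size_t>(s[p] - '0');
        ++p;
    }
    if (len == 0 || p + len > s.size()) {
        return false;  // no length prefix here, or a truncated name
    }
    out = s.substr(p, len);
    pos = p + len;
    return true;
}

struct Context {
    std::string entity;              // e.g. "test.a.b.c.d"
    std::vector<std::string> types;  // candidates for S0_, S1_, S2_, ...
    std::size_t rest = 0;            // index of the first unparsed character
};

// Parses "_TF", a run of 'C' markers, then module + class names + the
// function name. Expects the symbol without the extra Mach-O underscore.
bool parse_context(const std::string& sym, Context& ctx) {
    const std::string prefix = "_TF";
    if (sym.compare(0, prefix.size(), prefix) != 0) {
        return false;
    }
    std::size_t pos = prefix.size();

    std::size_t class_count = 0;
    while (pos < sym.size() && sym[pos] == 'C') {
        ++class_count;
        ++pos;
    }

    // Push names until the first non-name token (the lowercase f here).
    std::vector<std::string> names;
    std::string name;
    while (read_identifier(sym, pos, name)) {
        names.push_back(name);
    }
    // We need the module, one name per class, and the function's own name.
    if (names.size() != class_count + 2) {
        return false;
    }

    // Unwind: each class extends the path and becomes a substitution.
    std::string path = names[0];
    for (std::size_t i = 1; i <= class_count; ++i) {
        path += "." + names[i];
        ctx.types.push_back(path);
    }
    ctx.entity = path + "." + names.back();
    ctx.rest = pos;
    return true;
}
```

Fed the symbol from this example, the sketch yields the entity test.a.b.c.d and the three types test.a, test.a.b, and test.a.b.c, and stops at the lowercase f where the function type begins.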
We thus have an uncurried function, named test.a.b.c.d, taking a bound parameter of type test.a.b.c, parameters of names and types (test.a, x: test.a.b, v: (x: Swift.Int) -> Swift.Int), and return type test.e.f. As swift-demangle shows us, the "official" demangling of this symbol is:
$ xcrun swift-demangle _TFCCC4test1a1b1c1dfS2_FT1zS0_1xS1_1vFT1xSi_Si_OVS_1e1f
_TFCCC4test1a1b1c1dfS2_FT1zS0_1xS1_1vFT1xSi_Si_OVS_1e1f ---> test.a.b.c.d (test.a.b.c)(z : test.a, x : test.a.b, v : (x : Swift.Int) -> Swift.Int) -> test.e.f
So what does it all mean?
Well, to most people, not a lot. Reading mangled names is fairly straightforward, in an algorithmic sense, but needlessly difficult for human eyes. That's why demangling tools exist; should you run across mangled symbol names in practice, there's no need to squint and mentally parse it all out. There are many, many, many more variations on mangled symbol names; I haven't touched on operator overloads, generics, protocols, or Objective-C compatible types, just to name a few. Here are just a few examples the compiler provided for free from the Swift code given above:
_TFV4test1eCfMS0_FT_S0_ ---> test.e.init (test.e.Type)() -> test.e
_TMLCCC4test1a1b1c ---> lazy cache variable for type metadata for test.a.b.c
_TMmCCC4test1a1b1c ---> metaclass for test.a.b.c
_TMnCC4test1a1b ---> nominal type descriptor for test.a.b
_TTWOV4test1e1fSs9EquatableFS2_oi2eeUS2___fMQPS2_FTS3_S3__Sb ---> protocol witness for Swift.Equatable.== infix <A : Swift.Equatable>(Swift.Equatable.Self.Type)(Swift.Equatable.Self, Swift.Equatable.Self) -> Swift.Bool in conformance test.e.f : Swift.Equatable
_TWoFC4test1aCfMS0_FT_S0_ ---> witness table offset for test.a.__allocating_init (test.a.Type)() -> test.a
_TWoFCCC4test1a1b1c1dfS2_FT1zS0_1xS1_1vFT1xSi_Si_OVS_1e1f ---> witness table offset for test.a.b.c.d (test.a.b.c)(z : test.a, x : test.a.b, v : (x : Swift.Int) -> Swift.Int) -> test.e.f
And so on.
To top it off, the Swift name mangling algorithm is completely undocumented and subject to change, as with most things Swift-related. The above examples were all produced using Xcode 6 beta 5.
In conclusion
Apple has taken a concept pioneered by C++ and expanded on it, based on Swift's unique and powerful type system. While Swift mangling shares some basic concepts with C++ mangling, it is in fact considerably different, and in some ways more powerful. It will be exciting to see whether Apple open sources, or at least documents, the logic behind Swift in general and the name mangling logic in particular, and opens up the secrets behind Swift's innovative design.
Easter egg
In case anyone was wondering, here's what happens when you add Unicode to the mix:
$ xcrun swiftc -emit-library -o test -
func 💛(lhs: Int, rhs: Int) -> Int {
    return 0;
}
^D
$ nm -g test
...
0000000000001420 T __TF4testX4GrIhFTSiSi_Si
...
$ xcrun swift-demangle __TF4testX4GrIhFTSiSi_Si
_TF4testX4GrIhFTSiSi_Si ---> test.💛 (Swift.Int, Swift.Int) -> Swift.Int
X4GrIh translates to:
- X: eXtended character set
- 4: the encoded length of the name
- GrIh: the modified-Punycode encoding of the 💛 emoji (U+1F49B)
Swift does not use standard Punycode encoding as used in DNS domain names, but it is similar. For more information, see RFC 3492, the Punycode standard.
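For the curious, here is a sketch of a decoder for the modified encoding. It rests on two assumptions that are not documented anywhere official: that the only changes from RFC 3492 are uppercase A through J standing in for the digits 0 through 9, and that _ replaces - as the delimiter after any plain ASCII characters. The function name decode_modified_punycode is made up, and the overflow checks the RFC calls for are left out to keep it short.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

// Decoding parameters taken from RFC 3492.
const std::uint32_t kBase = 36, kTMin = 1, kTMax = 26, kSkew = 38, kDamp = 700;
const std::uint32_t kInitialBias = 72, kInitialN = 128;

// Assumed alphabet: a-z are 0..25 and A-J are 26..35 (RFC 3492 uses 0-9).
static int digit_value(char c) {
    if (c >= 'a' && c <= 'z') return c - 'a';
    if (c >= 'A' && c <= 'J') return c - 'A' + 26;
    return -1;
}

// Bias adaptation as given in RFC 3492 section 6.1.
static std::uint32_t adapt(std::uint32_t delta, std::uint32_t points, bool first) {
    delta = first ? delta / kDamp : delta / 2;
    delta += delta / points;
    std::uint32_t k = 0;
    while (delta > ((kBase - kTMin) * kTMax) / 2) {
        delta /= kBase - kTMin;
        k += kBase;
    }
    return k + (kBase - kTMin + 1) * delta / (delta + kSkew);
}

// Decodes into Unicode code points; returns false on malformed input.
bool decode_modified_punycode(const std::string& in, std::vector<std::uint32_t>& out) {
    out.clear();
    std::size_t pos = 0;

    // Plain ASCII characters come first, before the last '_' (assumed delimiter).
    const std::size_t delim = in.rfind('_');
    if (delim != std::string::npos) {
        for (std::size_t j = 0; j < delim; ++j) {
            out.push_back(static_cast<unsigned char>(in[j]));
        }
        pos = delim + 1;
    }

    std::uint32_t n = kInitialN, i = 0, bias = kInitialBias;
    while (pos < in.size()) {
        const std::uint32_t old_i = i;
        std::uint32_t w = 1;
        // Read one variable-length integer (a "delta").
        for (std::uint32_t k = kBase;; k += kBase) {
            if (pos >= in.size()) return false;
            const int d = digit_value(in[pos++]);
            if (d < 0) return false;
            const std::uint32_t digit = static_cast<std::uint32_t>(d);
            i += digit * w;
            const std::uint32_t t =
                k <= bias ? kTMin : (k >= bias + kTMax ? kTMax : k - bias);
            if (digit < t) break;
            w *= kBase - t;
        }
        // The delta encodes both the code point and where to insert it.
        const std::uint32_t count = static_cast<std::uint32_t>(out.size()) + 1;
        bias = adapt(i - old_i, count, old_i == 0);
        n += i / count;
        i %= count;
        out.insert(out.begin() + i, n);
        ++i;
    }
    return true;
}
```

Under those assumptions, decoding GrIh gives the single code point U+1F49B, which matches the demangler's output above.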
Comments:
I see that the Punycoding is adding an extra underscore for the acute "á", however I don't know how to generate this programmatically.
As you might know, the modified Punycode is just Punycode with upper-case letters instead of digits. See implementation here: https://gist.github.com/xtravar/6b52f59fb133229b360e