mikeash.com: Friday Q&A 2014-08-15: Swift Name Mangling

Posted at 2014-08-15 14:20
Tags: fridayqna guest namemangling swift

Friday Q&A 2014-08-15: Swift Name Mangling

by Gwynne Raskind

It's been a long time since I wrote a Friday Q&A, but I'm back, with a brand-new post about a brand-new topic: Swift. Over the last few posts, Mike's gone into some detail about what Swift's internal structures looked like, but he's only touched very lightly on what the linker sees when it looks at Swift-containing binaries: mangled symbol names.

In a language such as C, where there can only ever be one function or piece of data by any given name (a symbol), name mangling is not required. Even so, if you look at the symbol table of a typical pure-C binary, you will find that each function name has had an _ (underscore) prepended to it. For example:

    $ echo 'int main() { return 0; }' | xcrun clang -x c - -o ./test
    $ xcrun nm ./test
    0000000100000000 T __mh_execute_header
    0000000100000f80 T _main
                     U dyld_stub_binder
    $

This simple "mangling" is now largely historical, serving little useful purpose, but remains intact for compatibility and consistency reasons. By convention, names defined in C will have an underscore, while global symbols defined by pure assembly will not (although many assembly language writers will prepend the underscore anyway for consistency).

Objective-C also does not have collisions between symbol names; Objective-C method implementations are always of the form -[class selector], and Objective-C does not allow overloading of identical selectors on the same class with different type signatures.

Okay, let's mangle some names already!
Matters become more complicated in languages where a simple name without any further information might be more ambiguous. Consider this example in C++:

    $ cat | xcrun clang -x c++ - -o test
    int foo(int a) { return a * 2; }
    int foo(double a) { return a * 2.0; }
    int main() { return foo(1) + foo(1.0); }
    ^D
    $ xcrun nm -a test
    0000000100000f30 T __Z3food
    0000000100000f10 T __Z3fooi
    0000000100000000 T __mh_execute_header
    0000000100000f60 T _main
                     U dyld_stub_binder

Because foo refers to two different functions with different signatures, which is legal in C++, it is impossible to simply generate two _foo symbols; the linker would not know which was which. As a result, the C++ compiler "mangles" the symbols, using a strict set of encoding rules.

Unlike C and Objective-C, in C++ and Swift function names by themselves are not enough to tell apart each individual implementation of a function. Functions with the same name which take different parameter types (foo(int) and foo(double), for example) require more information to set them apart. Using the full signature given in code (such as "foo(int)") would lead to a lot of extra code in the linker and confusion when multiple type names map to the same underlying type (such as unsigned and unsigned int). Instead, in C++, the language's somewhat arcane type promotion and conversion rules are applied, and the result is mangled into a form the compiler and linker can use easily and without any confusion. The process is similar for Swift.
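
For comparison, here is a minimal sketch of the same overload pair in Swift. It uses present-day Swift syntax (the `_` argument labels), which is an assumption on my part since the article predates it. The point is only that both functions share the base name foo, so the compiler has to emit two distinct symbols.

```swift
// Two functions share the base name `foo` but differ in parameter type,
// so a single `_foo` symbol could not identify both of them.
func foo(_ a: Int) -> Int { return a * 2 }
func foo(_ a: Double) -> Int { return Int(a * 2.0) }

// Overload resolution happens at compile time, based on the argument type:
// the integer literal picks foo(Int), the floating-point literal foo(Double).
print(foo(1) + foo(1.0))
```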

The simple example of foo above is trivially broken down:

First, the leading _ common to C-style symbols.
Next, _Z, a prefix marking the symbol as a mangled global C++ name.
The number defines how many characters appear in the next identifier in the name; in this case 3. 3foo thus means "the name 'foo'".
The d and i are respectively double and int builtin type names; return values are not part of a function's signature in C++, so the parameter list simply follows the function's full name.

For more information on how typical C++ compilers mangle names, see the Itanium C++ ABI documentation.
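
The breakdown above can be sketched as a tiny demangler. This is only an illustration of the four steps for the two foo symbols: the function name `demangleSimple` and the two-entry type table are hypothetical, and the real Itanium scheme covers far more than this.

```swift
// Builtin type codes used in this sketch; the Itanium ABI defines many more.
let builtinTypes: [Character: String] = ["i": "int", "d": "double"]

/// Demangles symbols of the form [_]_Z<length><name><builtin codes>.
/// Returns nil for anything outside that tiny subset.
func demangleSimple(_ symbol: String) -> String? {
    var chars = Array(symbol)
    // Drop the extra leading underscore common to C-style symbols.
    if chars.starts(with: Array("__Z")) { chars.removeFirst() }
    // "_Z" marks a mangled global C++ name.
    guard chars.starts(with: Array("_Z")) else { return nil }
    var index = 2

    // Read the decimal length prefix of the identifier.
    var length = 0
    var digitCount = 0
    while index < chars.count, chars[index].isASCII,
          let digit = chars[index].wholeNumberValue {
        length = length * 10 + digit
        index += 1
        digitCount += 1
    }
    guard digitCount > 0, length > 0, index + length <= chars.count else { return nil }
    let name = String(chars[index..<(index + length)])
    index += length

    // Every remaining character must be a known builtin parameter type.
    var parameters: [String] = []
    for code in chars[index...] {
        guard let typeName = builtinTypes[code] else { return nil }
        parameters.append(typeName)
    }
    guard !parameters.isEmpty else { return nil }
    return name + "(" + parameters.joined(separator: ", ") + ")"
}

print(demangleSimple("__Z3food") ?? "not in subset") // foo(double)
print(demangleSimple("__Z3fooi") ?? "not in subset") // foo(int)
```

Note that the return type never appears in the output, because it is not part of the mangled name for these functions.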

That's all very interesting, but for a Swift article, you're taking a long time to get there!
Swift's name mangling is somewhat different from C++'s. Swift uses an encoding clearly based on the C++ scheme in principle, but containing considerably more information and expressing concepts only available in a more mature type system.

I'll jump right in with a complex example. Consider the following excessively contrived and completely useless Swift code:

    $ xcrun swiftc -emit-library -o test -
    struct e {
        enum f {
            case G, H, I
        }
    }
    class a {
        class b {
            class c {
                func d(y: a, x w: b, v u: (x: Int) -> Int) -> e.f {
                    return e.f.G
                }
            }
        }
    }
    ^D
    $ xcrun nm -g test
    ...
    0000000000001c90 T __TFCCC4test1a1b1c1dfS2_FTS0_1xS1_1vFT1xSi_Si_OVS_1e1f
    ...
    $

Swift will have generated over 100 more symbols, but this is the complex mangled name we'll tear apart: __TFCCC4test1a1b1c1dfS2_FTS0_1xS1_1vFT1xSi_Si_OVS_1e1f

Let's take it in order:

Sure enough, the leading extra _ is there even for Swift symbols.
_T is the marker for a Swift global symbol.
F tells us that the overall type of the symbol is a function.
C represents a "class" type. In this case, we're dealing with three nested classes, so it appears 3 times.
4test is the "module name", and 1a is the class name itself, yielding a class named test.a.
At this point, a parser for the mangled name will set up a stack of parsed names, looking for the first non-name token in the mangled name. In this case, it will find f after 1d. It then goes back and unwinds the stack of nested types from the inside out, yielding test.a, test.a.b, and test.a.b.c as class names. Since 1d has no corresponding nesting type (there were only three Cs), it becomes the innermost part of the symbol's name: test.a.b.c.d.
The lowercase f marks this symbol as an "uncurried function" type; in this case, a class method taking an implicitly bound first parameter, the instance itself.
Because we're now parsing a function type, the list of argument types comes next, followed by the return type. For an uncurried function type, the curried parameter(s) come first. S2_ is a substitution, meaning it will use the third non-substituted type encountered during parsing of the name thus far (the index is zero-based). In this case, this would be test.a.b.c (the third class type).
F now marks the beginning of the function's parameter list, in the guise of a fresh function type. By now, it should be very obvious that the name mangling is heavily oriented around types.
T marks the beginning of a "tuple", which in this context is a list of types.
S0_ is a substitution of the first type encountered in parsing, in this case test.a; the first parameter has this type.
1x is the external name of the second parameter. Notice that Swift does not encode internal names as part of the mangled signature.
S1_ is a substitution of the second type encountered in parsing, in this case test.a.b; the second parameter has this type and the name x.
1v is the external name of the third parameter.
F marks the start of another function type.
T marks the start of another tuple, the function's parameters (the function type is unnamed).
1x is the external name of the closure's first parameter.
Si is Swift.Int, a shorthand for the Int builtin type.
_ marks the end of the closure's arguments tuple.
Si is another Int, the closure's return type.
_ marks the end of the uncurried function's arguments tuple.
O marks the start of an enum type.
V marks the start of a struct type, which will contain the enum. (As we saw with the classes earlier, types are nested from the inside out in mangled names).
S_ substitutes the (only) seen module name, test. Notice that this is not a type substitution!
1e is the name of the struct.
1f is the name of the enum.
The parser sees the end of the mangled name and unwinds through the two parsed names as it did with the class names earlier.
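
The token-by-token reading above can be sketched as a small tokenizer. This is not the real demangler, just a hypothetical `tokenize` function that handles only the constructs appearing in this one symbol, and it assumes the extra leading underscore has already been stripped.

```swift
/// Splits a mangled name into the tokens used in the walkthrough above.
/// Only the constructs from that one example are handled.
func tokenize(_ mangled: String) -> [String] {
    let chars = Array(mangled)
    var tokens: [String] = []
    var index = 0

    // The "_T" prefix marks a Swift global symbol.
    if chars.starts(with: Array("_T")) {
        tokens.append("_T")
        index = 2
    }

    while index < chars.count {
        let current = chars[index]
        if current.isASCII, current.isNumber {
            // Length-prefixed identifier: read the digits, then that many characters.
            var length = 0
            var end = index
            while end < chars.count, chars[end].isASCII,
                  let digit = chars[end].wholeNumberValue {
                length = length * 10 + digit
                end += 1
            }
            let stop = min(end + length, chars.count)
            tokens.append(String(chars[index..<stop]))
            index = stop
        } else if current == "S", index + 1 < chars.count {
            // Substitution ("S_", "S0_", ...) or two-letter shorthand ("Si").
            var end = index + 1
            while end < chars.count, chars[end].isASCII, chars[end].isNumber { end += 1 }
            if end < chars.count, chars[end] == "_" {
                end += 1 // include the trailing underscore of "S_" or "S<n>_"
            } else {
                end = index + 2 // shorthand such as "Si" for Swift.Int
            }
            tokens.append(String(chars[index..<end]))
            index = end
        } else {
            // Single-character marker such as F, C, V, O, f, T or _.
            tokens.append(String(current))
            index += 1
        }
    }
    return tokens
}

let exampleSymbol = "_TFCCC4test1a1b1c1dfS2_FTS0_1xS1_1vFT1xSi_Si_OVS_1e1f"
print(tokenize(exampleSymbol).joined(separator: " "))
```

The tokens come out in the same order as the list above; giving them meaning (nesting, tuples, substitutions) is the part a real demangler adds on top.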

We thus have an uncurried function, named test.a.b.c.d, taking a bound parameter of type test.a.b.c, parameters of names and types (test.a, x: test.a.b, v: (x: Swift.Int) -> Swift.Int), and return type test.e.f. As swift-demangle shows us, the "official" demangling of this symbol is:

    $ xcrun swift-demangle _TFCCC4test1a1b1c1dfS2_FTS0_1xS1_1vFT1xSi_Si_OVS_1e1f
    _TFCCC4test1a1b1c1dfS2_FTS0_1xS1_1vFT1xSi_Si_OVS_1e1f ---> test.a.b.c.d (test.a.b.c)(test.a, x : test.a.b, v : (x : Swift.Int) -> Swift.Int) -> test.e.f
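
The substitutions used in the walkthrough (S_, S0_, S1_ and S2_) can be modelled with two small lookup tables. This is a sketch of the description given above, not of the compiler's actual data structures; the names `seenModules`, `seenTypes` and `resolveSubstitution` are hypothetical, and the tables are filled in by hand rather than built while parsing.

```swift
// Names recorded while reading the example symbol, in order of first appearance.
let seenModules = ["test"]
let seenTypes = ["test.a", "test.a.b", "test.a.b.c"]

/// Resolves a substitution token as described in the walkthrough:
/// "S_" is the module seen so far, "S<n>_" is the n-th type seen (zero-based).
func resolveSubstitution(_ token: String) -> String? {
    guard token.hasPrefix("S"), token.hasSuffix("_") else { return nil }
    let digits = token.dropFirst().dropLast()
    if digits.isEmpty { return seenModules.first }
    guard digits.allSatisfy({ $0.isASCII && $0.isNumber }),
          let n = Int(digits), n < seenTypes.count else { return nil }
    return seenTypes[n]
}

for token in ["S_", "S0_", "S1_", "S2_"] {
    print(token, "->", resolveSubstitution(token) ?? "unknown")
}
```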

So what does it all mean?
Well, to most people, not a lot. Reading mangled names is fairly straightforward, in an algorithmic sense, but needlessly difficult for human eyes. That's why demangling tools exist; should you run across mangled symbol names in practice, there's no need to squint and mentally parse it all out. There are many, many, many more variations on mangled symbol names; I haven't touched on operator overloads, generics, protocols, or Objective-C compatible types, just to name a few. Here are just a few examples the compiler provided for free from the Swift code given above:

    _TFV4test1eCfMS0_FT_S0_ ---> test.e.init (test.e.Type)() -> test.e
    _TMLCCC4test1a1b1c ---> lazy cache variable for type metadata for test.a.b.c
    _TMmCCC4test1a1b1c ---> metaclass for test.a.b.c
    _TMnCC4test1a1b ---> nominal type descriptor for test.a.b
    _TTWOV4test1e1fSs9EquatableFS2_oi2eeUS2___fMQPS2_FTS3_S3__Sb ---> protocol witness for Swift.Equatable.== infix <A : Swift.Equatable>(Swift.Equatable.Self.Type)(Swift.Equatable.Self, Swift.Equatable.Self) -> Swift.Bool in conformance test.e.f : Swift.Equatable
    _TWoFC4test1aCfMS0_FT_S0_ ---> witness table offset for test.a.__allocating_init (test.a.Type)() -> test.a
    _TWoFCCC4test1a1b1c1dfS2_FTS0_1xS1_1vFT1xSi_Si_OVS_1e1f ---> witness table offset for test.a.b.c.d (test.a.b.c)(test.a, x : test.a.b, v : (x : Swift.Int) -> Swift.Int) -> test.e.f

And so on.

To top it off, the Swift name mangling algorithm is completely undocumented and subject to change, as with most things Swift-related. The above examples were all produced using Xcode 6 beta 5.

In conclusion
Apple has taken a concept pioneered by C++ and expanded on it, based on Swift's unique and powerful type system. While Swift mangling shares some basic concepts with C++ mangling, it is in fact considerably different, and in some ways more powerful. It will be exciting to see whether Apple open sources, or at least documents, the logic behind Swift in general and the name mangling logic in particular, and opens up the secrets behind Swift's innovative design.

Easter egg
In case anyone was wondering, here's what happens when you add Unicode to the mix:

    $ xcrun swiftc -emit-library -o test -
    func 💛 (lhs: Int, rhs: Int) -> Int {
            return 0;
    }
    ^D
    $ nm -g test
    ...
    0000000000001420 T __TF4testX4GrIhFTSiSi_Si
    ...
    $ xcrun swift-demangle __TF4testX4GrIhFTSiSi_Si
    _TF4testX4GrIhFTSiSi_Si ---> test.💛 (Swift.Int, Swift.Int) -> Swift.Int

X4GrIh translates to:

X: eXtended character set
4: the encoded length of the name
GrIh: the modified-Punycode encoding of the 💛 emoji (U+1F49B)

Swift does not use the standard Punycode encoding used for DNS domain names, but its variant is similar. For more information, see RFC 3492, the Punycode standard.
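
As a rough check of the GrIh encoding, here is a sketch of the RFC 3492 encoder with one change: digits 26 to 35 are written as A-J instead of 0-9. That difference is an assumption taken from a reader comment on this post rather than from any Apple documentation, and the function name `encodeNonASCII` is hypothetical. The sketch only handles names with no ASCII characters, so Punycode's delimiter handling is left out entirely.

```swift
// RFC 3492 parameters.
let base = 36, tMin = 1, tMax = 26, skew = 38, damp = 700
let initialBias = 72, initialN = 128

// Digit alphabet: a-z for 0-25, then A-J for 26-35.
// Standard Punycode uses 0-9 for 26-35; the A-J variant is an assumption.
let digitAlphabet = Array("abcdefghijklmnopqrstuvwxyzABCDEFGHIJ")

/// Bias adaptation function from RFC 3492, section 6.1.
func adapt(delta: Int, numPoints: Int, firstTime: Bool) -> Int {
    var delta = firstTime ? delta / damp : delta / 2
    delta += delta / numPoints
    var k = 0
    while delta > ((base - tMin) * tMax) / 2 {
        delta /= base - tMin
        k += base
    }
    return k + ((base - tMin + 1) * delta) / (delta + skew)
}

/// Encodes a name made only of non-ASCII scalars, such as a lone emoji.
/// Returns nil if the input is empty or contains ASCII, which this sketch skips.
func encodeNonASCII(_ text: String) -> String? {
    let input = text.unicodeScalars.map { Int($0.value) }
    guard !input.isEmpty, input.allSatisfy({ $0 >= initialN }) else { return nil }

    var n = initialN, delta = 0, bias = initialBias, handled = 0
    var output = ""

    while handled < input.count {
        // Jump to the smallest code point not yet handled.
        let m = input.filter { $0 >= n }.min()!
        delta += (m - n) * (handled + 1)
        n = m
        for codePoint in input {
            if codePoint < n { delta += 1 }
            if codePoint == n {
                // Emit delta as a generalized variable-length integer.
                var q = delta
                var k = base
                while true {
                    let t = k <= bias ? tMin : (k >= bias + tMax ? tMax : k - bias)
                    if q < t { break }
                    output.append(digitAlphabet[t + (q - t) % (base - t)])
                    q = (q - t) / (base - t)
                    k += base
                }
                output.append(digitAlphabet[q])
                bias = adapt(delta: delta, numPoints: handled + 1, firstTime: handled == 0)
                delta = 0
                handled += 1
            }
        }
        delta += 1
        n += 1
    }
    return output
}

print(encodeNonASCII("💛") ?? "unsupported") // GrIh
```

With the standard 0-9 digits the same input would come out as 6r8h, which shows where the two alphabets diverge.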


Comments:

glasspusher at 2014-08-20 05:12:52:

Great set of articles! I don't get into the guts of things as much as I used to...my code doesn't need to be blazing fast, but understanding stuff underneath never hurts. Fascinating and I enjoy following along.

Tonny at 2014-11-05 22:33:08:

Good article. Even the first several paragraphs revealed things that I had never thought of, like the fact that in Objective-C you cannot overload methods, since it's not something you explore while doing regular stuff.

Paul Von Schrottky at 2015-02-03 21:05:33:

Do you know how to use the native Punycode converter? Let's say I have a ViewController, my app name is "Acá Estoy" and I want to generate the mangled name "_TtC10Aca__Estoy14ViewController".
I see that the Punycoding is adding an extra underscore for the acute "á", however I don't know how to generate this programmatically.

Paul Von Schrottky at 2015-02-04 12:58:03:

I found a way, mangle a known class and extract the target name using regex "_TtC[0-9]+([^0-9]+)[0-9]+". Then use that to generate mangled class names for any other class at runtime.

Gabriele at 2015-08-02 20:22:18:

In C++, when defining a function, we can use the extern "C" attribute to force the C style mangling, which makes the function callable from a C code (and maybe also change the ABI in use). Is there anything similar in Swift? Something similar to @objc but making a simple swift function callable from C?

Mike K at 2015-09-09 03:22:20:

Thanks for the great article. It was really helpful.

As you might know, the modified Punycode is just Punycode with upper-case letters instead of digits. See implementation here: https://gist.github.com/xtravar/6b52f59fb133229b360e
