Next article: Friday Q&A 2016-02-19: What Is the Secure Enclave?
Previous article: Friday Q&A 2015-12-25: Swifty Target/Action
Tags: fridayqna memory swift
Swift's classes tend to be straightforward for most people new to the language to understand. They work pretty much like classes in any other language. Whether you've come from Objective-C or Java or Ruby, you've worked with something similar. Swift's structs are another matter. They look sort of like classes, but they're value types, and they don't do inheritance, and there's this copy-on-write thing I keep hearing about? Where do they live, anyway, and how do they work? Today, I'm going to take a close look at just how structs get stored and manipulated in memory.
Simple structs
To explore how structs get stored in memory, I built a test program consisting of two files. I compiled the test program with optimizations enabled but without whole-module optimization. By building tests that make calls from one file to the other, I was able to prevent the compiler from inlining everything, providing a clearer picture of where everything gets stored and how the data is passed between functions.
To start out, I created a simple struct with three elements:
struct ExampleInts {
var x: Int
var y: Int
var z: Int
}
I created three functions that take an instance of this struct and return one of the fields:
func getX(parameter: ExampleInts) -> Int {
return parameter.x
}
func getY(parameter: ExampleInts) -> Int {
return parameter.y
}
func getZ(parameter: ExampleInts) -> Int {
return parameter.z
}
In the other file, I created an instance of the struct and called each get function:
func testGets() {
let s = ExampleInts(x: 1, y: 2, z: 3)
getX(s)
getY(s)
getZ(s)
}
The compiler generates this code for getX:
pushq %rbp
movq %rsp, %rbp
movq %rdi, %rax
popq %rbp
retq
Consulting our cheat sheet, we recall that arguments are passed sequentially in registers rdi, rsi, rdx, rcx, r8, and r9, and return values are placed in rax. The first two instructions here are just the function prologue, and the last two are the epilogue. The real work being done here is the movq %rdi, %rax, which takes the first parameter and returns it. Let's look at getY:
pushq %rbp
movq %rsp, %rbp
movq %rsi, %rax
popq %rbp
retq
This is almost the same, but it returns the second parameter. How about getZ?
pushq %rbp
movq %rsp, %rbp
movq %rdx, %rax
popq %rbp
retq
Again, almost the same, but it returns the third parameter. From this we can see that the individual struct elements are treated as separate parameters and passed to the functions individually. Picking out an element on the receiving end is a simple matter of picking the right register.
Let's confirm this on the calling end. Here's the generated code for testGets:
pushq %rbp
movq %rsp, %rbp
movl $1, %edi
movl $2, %esi
movl $3, %edx
callq __TF4main4getXFVS_11ExampleIntsSi
movl $1, %edi
movl $2, %esi
movl $3, %edx
callq __TF4main4getYFVS_11ExampleIntsSi
movl $1, %edi
movl $2, %esi
movl $3, %edx
popq %rbp
jmp __TF4main4getZFVS_11ExampleIntsSi
We can see it constructing the struct instance directly in the parameter registers. (The edi, esi, and edx registers refer to the lower 32 bits of the rdi, rsi, and rdx registers, respectively.) It doesn't even bother trying to save the values across the calls, but just rebuilds the struct instance each time. Since the compiler knows exactly what the contents are, it can deviate significantly from how the Swift code is written. Note how the call to getZ is generated a bit differently from the other two. Since it's the last thing in the function, the compiler generates it as a tail call, cleaning up the local call frame and setting up getZ to return directly to the function that called testGets.
Let's see what sort of code the compiler generates when it doesn't know the struct contents. Here's a variant on this test function which gets the struct instance from elsewhere:
func testGets2() {
let s = getExampleInts()
getX(s)
getY(s)
getZ(s)
}
getExampleInts just creates the struct instance and returns it, but it's in the other file so the compiler can't see what's going on when optimizing testGets2. Here's that function:
func getExampleInts() -> ExampleInts {
return ExampleInts(x: 1, y: 2, z: 3)
}
What sort of code does testGets2 generate, now that the compiler can't know what the struct contains? Here it is:
pushq %rbp
movq %rsp, %rbp
pushq %r15
pushq %r14
pushq %rbx
pushq %rax
callq __TF4main14getExampleIntsFT_VS_11ExampleInts
movq %rax, %rbx
movq %rdx, %r14
movq %rcx, %r15
movq %rbx, %rdi
movq %r14, %rsi
movq %r15, %rdx
callq __TF4main4getXFVS_11ExampleIntsSi
movq %rbx, %rdi
movq %r14, %rsi
movq %r15, %rdx
callq __TF4main4getYFVS_11ExampleIntsSi
movq %rbx, %rdi
movq %r14, %rsi
movq %r15, %rdx
addq $8, %rsp
popq %rbx
popq %r14
popq %r15
popq %rbp
jmp __TF4main4getZFVS_11ExampleIntsSi
Since the compiler can't reconstitute the values at each step, it has to save them. It places the three struct elements into the registers rbx, r14, and r15, then loads the parameter registers from those registers at each call. Those registers are saved by the caller, which means that their values are preserved across the call. And again, the compiler generates a tail call for getZ, with some more extensive cleanup beforehand.
At the top of the function, it calls getExampleInts and then loads values from rax, rdx, and rcx. Apparently the struct values are returned in those registers. Let's look at getExampleInts to confirm:
pushq %rbp
movl $1, %edi
movl $2, %esi
movl $3, %edx
popq %rbp
jmp __TFV4main11ExampleIntsCfMS0_FT1xSi1ySi1zSi_S0_
This places the values 1, 2, and 3 into the argument registers, then calls the struct's constructor. Here's the generated code for that constructor:
pushq %rbp
movq %rsp, %rbp
movq %rdx, %rcx
movq %rdi, %rax
movq %rsi, %rdx
popq %rbp
retq
Sure enough, it returns the three values in rax, rdx, and rcx. The cheat sheet says nothing about returning multiple values in multiple registers. How about the official PDF? It does say that two values can be returned in rax and rdx, but there's no mention of returning a third value in rcx. That's clearly what's happening, though. That's the fun of a new language: it doesn't always have to play by the old rules. If it was interoperating with C code it would have to follow the standard conventions, but Swift-to-Swift calls can invent new ones.
How about inout parameters? If they work like we'd do it in C, we'd expect the struct to be laid out in memory and a pointer passed in. Here are two test functions (in two different files, of course):
func testInout() {
var s = getExampleInts()
totalInout(&s)
}
func totalInout(inout parameter: ExampleInts) -> Int {
return parameter.x + parameter.y + parameter.z
}
Here's the generated code for testInout:
pushq %rbp
movq %rsp, %rbp
subq $32, %rsp
callq __TF4main14getExampleIntsFT_VS_11ExampleInts
movq %rax, -24(%rbp)
movq %rdx, -16(%rbp)
movq %rcx, -8(%rbp)
leaq -24(%rbp), %rdi
callq __TF4main10totalInoutFRVS_11ExampleIntsSi
addq $32, %rsp
popq %rbp
retq
In the prologue, it creates a 32-byte stack frame. It then calls getExampleInts, and after the call saves the resulting values into stack slots at offsets -24, -16, and -8. It then calculates a pointer to offset -24, loads that into the rdi parameter register, and calls totalInout. Here's the generated code for that function:
pushq %rbp
movq %rsp, %rbp
movq (%rdi), %rax
addq 8(%rdi), %rax
jo LBB4_3
addq 16(%rdi), %rax
jo LBB4_3
popq %rbp
retq
LBB4_3:
ud2
This loads the values by offset from the parameter that's passed in, totaling them up and returning the result in rax. The jo instructions are checking for overflow. If either of the addq instructions produce an oveflow, the jo instructions will jump down to the ud2 instruction which terminates the program.
We can see that it's exactly as we expected: when passing the struct to an inout parameter, the struct is laid out contiguously in memory and then a pointer to it is passed in.
Big structs
What happens if we're dealing with a larger struct, bigger than fits comfortably in registers? Here's a test struct with ten elements:
struct TenInts {
var elements = (1, 2, 3, 4, 5, 6, 7, 8, 9, 10)
}
Here's a get function that constructs an instance and returns it. This is placed in a separate file to avoid inlining:
func getHuge() -> TenInts {
return TenInts()
}
Here's a function that gets an element out of this struct:
func getHugeElement(parameter: TenInts) -> Int {
return parameter.elements.5
}
Finally, a test function that exercises these:
func testHuge() {
let s = getHuge()
getHugeElement(s)
}
Let's look at the generated code, starting with testHuge:
pushq %rbp
movq %rsp, %rbp
subq $160, %rsp
leaq -80(%rbp), %rdi
callq __TF4main7getHugeFT_VS_7TenInts
movups -80(%rbp), %xmm0
movups -64(%rbp), %xmm1
movups -48(%rbp), %xmm2
movups -32(%rbp), %xmm3
movups -16(%rbp), %xmm4
movups %xmm0, -160(%rbp)
movups %xmm1, -144(%rbp)
movups %xmm2, -128(%rbp)
movups %xmm3, -112(%rbp)
movups %xmm4, -96(%rbp)
leaq -160(%rbp), %rdi
callq __TF4main14getHugeElementFVS_7TenIntsSi
addq $160, %rsp
popq %rbp
retq
This code (excluding the function prologue and epilogue) can be broken into three pieces.
The first piece calculates the address for offset -80 relative to the stack frame, and calls getHuge, passing that address as a parameter. The getHuge function has no parameters in the source code, but it's common to use a hidden parameter to return larger structs. The caller allocates storage for the return value, then passes a pointer to that storage in the hidden parameter. That appears to be what's going on here, with that allocated storage residing on the stack.
The second piece copies the returned struct from stack offset -80 to stack offset -160. It loads pieces of the struct sixteen bytes at a time into five xmm registers, then places the contents of those registers back on the stack starting at offset -160. I'm not clear why the compiler generates this copy rather than using the original value in place. I suspect the optimizer just isn't quite smart enough to realize that it doesn't need the copy.
The third piece calculates the address for stack offset -160 and then calls getHugeElement passing that address as a parameter. In our previous experiment with a three-element struct, it was passed by value in registers. With this larger struct, it's passed by pointer instead.
The generated code for the other functions confirms this: the struct is passed in and out by pointer, and lives on the stack. Here's getHugeElement to start with:
pushq %rbp
movq %rsp, %rbp
movq 40(%rdi), %rax
popq %rbp
retq
This loads offset 40 from the parameter passed in. Each element is eight bytes, so offset 40 corresponds to elements.5. The function then returns this value.
Here's getHuge:
pushq %rbp
movq %rsp, %rbp
pushq %rbx
subq $88, %rsp
movq %rdi, %rbx
leaq -88(%rbp), %rdi
callq __TFV4main7TenIntsCfMS0_FT_S0_
movups -88(%rbp), %xmm0
movups -72(%rbp), %xmm1
movups -56(%rbp), %xmm2
movups -40(%rbp), %xmm3
movups -24(%rbp), %xmm4
movups %xmm0, (%rbx)
movups %xmm1, 16(%rbx)
movups %xmm2, 32(%rbx)
movups %xmm3, 48(%rbx)
movups %xmm4, 64(%rbx)
movq %rbx, %rax
addq $88, %rsp
popq %rbx
popq %rbp
retq
This looks a lot like testHuge above: it allocates stack space, calls a function, in this case, the TenInts constructor function, then copies the return value to its final location. Here, that final location is the pointer passed in as the implicit parameter.
While we're here, let's take a look at the TenInts constructor:
pushq %rbp
movq %rsp, %rbp
movq $1, (%rdi)
movq $2, 8(%rdi)
movq $3, 16(%rdi)
movq $4, 24(%rdi)
movq $5, 32(%rdi)
movq $6, 40(%rdi)
movq $7, 48(%rdi)
movq $8, 56(%rdi)
movq $9, 64(%rdi)
movq $10, 72(%rdi)
movq %rdi, %rax
popq %rbp
retq
Like the other functions, this takes a pointer to memory for the new struct as an implicit parameter. It then stores the values 1 through 10 into that memory and returns.
I came across an interesting case while building out these test cases. Here's a test function which makes three calls to getHugeElement intsead of just one:
func testThreeHuge() {
let s = getHuge()
getHugeElement(s)
getHugeElement(s)
getHugeElement(s)
}
Here's the generated code:
pushq %rbp
movq %rsp, %rbp
pushq %r15
pushq %r14
pushq %r13
pushq %r12
pushq %rbx
subq $392, %rsp
leaq -120(%rbp), %rdi
callq __TF4main7getHugeFT_VS_7TenInts
movq -120(%rbp), %rbx
movq %rbx, -376(%rbp)
movq -112(%rbp), %r8
movq %r8, -384(%rbp)
movq -104(%rbp), %r9
movq %r9, -392(%rbp)
movq -96(%rbp), %r10
movq %r10, -400(%rbp)
movq -88(%rbp), %r11
movq %r11, -368(%rbp)
movq -80(%rbp), %rax
movq -72(%rbp), %rcx
movq %rcx, -408(%rbp)
movq -64(%rbp), %rdx
movq %rdx, -416(%rbp)
movq -56(%rbp), %rsi
movq %rsi, -424(%rbp)
movq -48(%rbp), %rdi
movq %rdi, -432(%rbp)
movq %rbx, -200(%rbp)
movq %rbx, %r14
movq %r8, -192(%rbp)
movq %r8, %r15
movq %r9, -184(%rbp)
movq %r9, %r12
movq %r10, -176(%rbp)
movq %r10, %r13
movq %r11, -168(%rbp)
movq %rax, -160(%rbp)
movq %rax, %rbx
movq %rcx, -152(%rbp)
movq %rdx, -144(%rbp)
movq %rsi, -136(%rbp)
movq %rdi, -128(%rbp)
leaq -200(%rbp), %rdi
callq __TF4main14getHugeElementFVS_7TenIntsSi
movq %r14, -280(%rbp)
movq %r15, -272(%rbp)
movq %r12, -264(%rbp)
movq %r13, -256(%rbp)
movq -368(%rbp), %rax
movq %rax, -248(%rbp)
movq %rbx, -240(%rbp)
movq -408(%rbp), %r14
movq %r14, -232(%rbp)
movq -416(%rbp), %r15
movq %r15, -224(%rbp)
movq -424(%rbp), %r12
movq %r12, -216(%rbp)
movq -432(%rbp), %r13
movq %r13, -208(%rbp)
leaq -280(%rbp), %rdi
callq __TF4main14getHugeElementFVS_7TenIntsSi
movq -376(%rbp), %rax
movq %rax, -360(%rbp)
movq -384(%rbp), %rax
movq %rax, -352(%rbp)
movq -392(%rbp), %rax
movq %rax, -344(%rbp)
movq -400(%rbp), %rax
movq %rax, -336(%rbp)
movq -368(%rbp), %rax
movq %rax, -328(%rbp)
movq %rbx, -320(%rbp)
movq %r14, -312(%rbp)
movq %r15, -304(%rbp)
movq %r12, -296(%rbp)
movq %r13, -288(%rbp)
leaq -360(%rbp), %rdi
callq __TF4main14getHugeElementFVS_7TenIntsSi
addq $392, %rsp
popq %rbx
popq %r12
popq %r13
popq %r14
popq %r15
popq %rbp
retq
The structure of this function is similar to the previous version. It calls getHuge, copies the result, then calls getHugeElement three times. For each call, it copies the struct again, presumably to guard against getHugeElement making modifications. What I found really interesting is that the copies are all done one element at a time using integer registers, rather than two elements at a time in xmm registers as testHuge did. I'm not sure what causes the compiler to choose the integer registers here, as it seems like copying two elements at a time with the xmm registers would be more efficient and result in smaller code.
I also experimented with really large structs:
struct HundredInts {
var elements = (TenInts(), TenInts(), TenInts(), TenInts(), TenInts(), TenInts(), TenInts(), TenInts(), TenInts(), TenInts())
}
struct ThousandInts {
var elements = (HundredInts(), HundredInts(), HundredInts(), HundredInts(), HundredInts(), HundredInts(), HundredInts(), HundredInts(), HundredInts(), HundredInts())
}
func getThousandInts() -> ThousandInts {
return ThousandInts()
}
The generated code for getThousandInts is pretty crazy:
pushq %rbp
pushq %rbx
subq $8008, %rsp
movq %rdi, %rbx
leaq -8008(%rbp), %rdi
callq __TFV4main12ThousandIntsCfMS0_FT_S0_
movq -8008(%rbp), %rax
movq %rax, (%rbx)
movq -8000(%rbp), %rax
movq %rax, 8(%rbx)
movq -7992(%rbp), %rax
movq %rax, 16(%rbx)
movq -7984(%rbp), %rax
movq %rax, 24(%rbx)
movq -7976(%rbp), %rax
movq %rax, 32(%rbx)
movq -7968(%rbp), %rax
movq %rax, 40(%rbx)
movq -7960(%rbp), %rax
movq %rax, 48(%rbx)
movq -7952(%rbp), %rax
movq %rax, 56(%rbx)
movq -7944(%rbp), %rax
movq %rax, 64(%rbx)
movq -7936(%rbp), %rax
movq %rax, 72(%rbx)
...
movq -104(%rbp), %rax
movq %rax, 7904(%rbx)
movq -96(%rbp), %rax
movq %rax, 7912(%rbx)
movq -88(%rbp), %rax
movups -80(%rbp), %xmm0
movups -64(%rbp), %xmm1
movups -48(%rbp), %xmm2
movups -32(%rbp), %xmm3
movq %rax, 7920(%rbx)
movq -16(%rbp), %rax
movups %xmm0, 7928(%rbx)
movups %xmm1, 7944(%rbx)
movups %xmm2, 7960(%rbx)
movups %xmm3, 7976(%rbx)
movq %rax, 7992(%rbx)
movq %rbx, %rax
addq $8008, %rsp
popq %rbx
popq %rbp
retq
The compiler generates two thousand instructions to copy this struct. This seems like a good place to emit a call to memcpy, but I imagine that optimizing for absurdly gigantic structs isn't a high priority for the compiler team right now.
Class Fields
Let's take a look at what happens when the struct fields are more complicated than simple integers. Here's a simple class, and a struct which contains one:
class ExampleClass {}
struct ContainsClass {
var x: Int
var y: ExampleClass
var z: Int
}
Here's a set of functions (split across two files to defeat inlining) which exercise them:
func testContainsClass() {
let s = ContainsClass(x: 1, y: getExampleClass(), z: 3)
getClassX(s)
getClassY(s)
getClassZ(s)
}
func getExampleClass() -> ExampleClass {
return ExampleClass()
}
func getClassX(parameter: ContainsClass) -> Int {
return parameter.x
}
func getClassY(parameter: ContainsClass) -> ExampleClass {
return parameter.y
}
func getClassZ(parameter: ContainsClass) -> Int {
return parameter.z
}
Let's start by looking at the generated code for the getters. Here's getClassX:
pushq %rbp
movq %rsp, %rbp
pushq %rbx
pushq %rax
movq %rdi, %rbx
movq %rsi, %rdi
callq _swift_release
movq %rbx, %rax
addq $8, %rsp
popq %rbx
popq %rbp
retq
The three struct elements will be passed in the first three parameter registers, rdi, rsi, and rdx. This function wants to return the value in rdi by moving it to rax and then returning, but it has to do some bookkeeping first. It appears that the object reference passed in rsi is passed retained, and must be released before the function returns. This code moves rdi into a safe temporary register, rbx, then moves the object reference to rdi and calls swift_release to release it. It then moves the value in rbx to the return register rax and returns from the function.
The code for getClassZ is pretty much the same, except instead of taking the value from rdi, it takes it from rdx:
pushq %rbp
movq %rsp, %rbp
pushq %rbx
pushq %rax
movq %rdx, %rbx
movq %rsi, %rdi
callq _swift_release
movq %rbx, %rax
addq $8, %rsp
popq %rbx
popq %rbp
retq
The code for getClassY will be the odd one, since it returns an object reference rather than an integer. Here it is:
pushq %rbp
movq %rsp, %rbp
movq %rsi, %rax
popq %rbp
retq
This is short! It moves the value from rsi, which is the object reference, into rax and returns it. There's no bookkeeping, just a quick shuffling of data. Apparently, the value is passed in retained, but also returned retained, so this code doesn't have to do any memory management at all.
So far we've seen that the code for dealing with this struct is much like the code for dealing with the struct containing three Int fields, except that the object reference field is passed in retained and must be released by the callee. With that in mind, let's look at the generated code for testContainsClass:
pushq %rbp
movq %rsp, %rbp
pushq %r14
pushq %rbx
callq __TF4main15getExampleClassFT_CS_12ExampleClass
movq %rax, %rbx
movq %rbx, %rdi
callq _swift_retain
movq %rax, %r14
movl $1, %edi
movl $3, %edx
movq %rbx, %rsi
callq __TF4main9getClassXFVS_13ContainsClassSi
movq %r14, %rdi
callq _swift_retain
movl $1, %edi
movl $3, %edx
movq %rbx, %rsi
callq __TF4main9getClassYFVS_13ContainsClassCS_12ExampleClass
movq %rax, %rdi
callq _swift_release
movl $1, %edi
movl $3, %edx
movq %rbx, %rsi
popq %rbx
popq %r14
popq %rbp
jmp __TF4main9getClassZFVS_13ContainsClassSi
The first thing this function does is call getExampleClass to get the ExampleClass instance it stores in the struct. It takes the returned reference and moves it to rbx for safekeeping.
Next, it calls getClassX, and to do so it has to build a copy of the struct in the parameter registers. The two integer fields are easy, but the object field needs to be retained to match what the functions expect. The code calls swift_retain on the value stored in rbx, then places that value in rsi and places 1 and 3 in rdi and rdx to build the complete struct. Finally, it calls getClassX.
The code to call getClassY is nearly the same. However, getClassY returns an object reference which needs to be released. After the call, this code moves the return value into rdi and calls swift_release to take care of its required memory management.
This function calls getClassZ as a tail call, so the code here is a bit different. The object reference came retained from getExampleClass, so it doesn't need to be retained separately for this final call. This code places it into rsi, places 1 and 3 into rdi and rdx again, then cleans up the stack and jumps to getClassZ to make the final call.
Ultimately, there's little change from a struct with all Ints. The only real difference is that copying a struct with an object in it requires retaining that object, and disposing of that struct requires releasing the object.
Conclusionstruct storage in Swift is ultimately pretty straightforward, and much of what we've seen carries over from C's much simpler structs. A struct instance is largely treated as a loose collection of independent values, which can be manipulated collectively when required. Local struct variables might be stored on the stack or the individual pieces might be stored in registers, depending on the size of the struct, the register usage of the rest of the code, and the whims of the compiler. Small structs are passed and returned in registers, while larger structs are passed and returned by reference. structs get copied whenever they're passed and returned. Although you can use structs to implement copy-on-write data types, the base language construct is copied eagerly and more or less blindly.
That's it for today. Come back next time for more daring feats of programming. Friday Q&A is driven by reader ideas, so if you grow bored while waiting for the next installment and have something you'd like to see discussed, send it in!
Comments:
Jon: All of these examples are with optimization enabled, but not whole-module optimization. WMO will allow inlining and many other fancy things even when parts are in separate files, as long as they're all in the same module (e.g. in the same app target). This is great for real-world code but troublesome for testing codegen because inlining and dead-code elimination make it hard to get the compiler to generate anything useful (or indeed, anything at all) for simplified examples. If you have WMO enabled then some of the cases where struct data has to be loaded into registers or stack memory will be eliminated due to inlining, when the compiler judges it to be a net win. Where the compiler refrains from inlining, then the calls will look just like they do here.
Hope all of these mov will please the processor.
Using as much as possible of different registers will be better optimized by the out-of-order cpu exec.
Swift brags about copy-on-write structs, but it's the Swift stdlib collection types (Array, etc) that implement CoW. It's not a language-level feature.
In fact, the Swift team suggests using it, when possible, by simply stealing these implementations:
"To eliminate the cost of copying large values adopt copy-on-write behavior. The easiest way to implement copy-on-write is to compose existing copy-on-write data structures, such as Array."
Comments RSS feed for this page
Add your thoughts, post a comment:
Spam and off-topic posts will be deleted without notice. Culprits may be publicly humiliated at my sole discretion.