Andy Friesen

What does unique_ptr<> cost?

2014-10-07

I just watched Jonathan Blow’s proposal for a new programming language, which got me thinking about the difficulties that motivated the talk.

In particular, I think unique_ptr<> is fantastic, but I’m curious about how it affects compile times and code size. Let’s find out.

First, some C++

#include <cstdio>
#include <memory>

using std::unique_ptr;

struct S {
    unique_ptr<int[]> ints;
};

int main() {
    const int LEN = 50;
    auto s = S {
        unique_ptr<int[]> { new int[LEN] }
    };

    auto j = 0u;
    for (auto i = 0u; i < LEN; ++i) {
        s.ints[i] = ++j;
    }

    auto sum = 0;
    for (auto i = 0u; i < LEN; ++i) {
        sum += s.ints[i];
    }

    printf("Sum! %i\n", sum);
}

Next, the equivalent C:

#include <stdlib.h>
#include <stdio.h>

typedef struct S {
    int* ints;
} S;

#define LEN 50

int main() {
    S s;
    unsigned i, j;
    int sum;

    s.ints = (int*)malloc(LEN * sizeof(int));

    j = 0u;
    for (i = 0u; i < LEN; ++i) {
        s.ints[i] = ++j;
    }

    sum = 0;
    for (i = 0u; i < LEN; ++i) {
        sum += s.ints[i];
    }

    printf("Sum! %i\n", sum);

    free(s.ints);
}

On my machine (a mid-2012 Retina MBP), averaging 5 runs of clang and clang++ each, I get these build times; we'll look at code size too:

c1.c:      37ms / 8kb
cpp1.cpp: 123ms / 8kb
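
(Concretely, by "build time" I mean timing invocations along these lines. The exact flags matter less than keeping them identical across both files; -Os and -std=c++11 here just mirror the assembly dump later in the post.)

$ time clang -Os c1.c -o c1
$ time clang++ -Os -std=c++11 cpp1.cpp -o cpp1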

Wow! What’s going on?

First off, building the C version with clang++ runs at the same speed. That should have been obvious, but I wanted to test it anyway.

Secondly, if I add #include <memory> to the C version and run it through clang++, I see the same 123ms build times.

So, that’s probably most of the picture, but my mental model of templates is that you pay for them in two ways:

  1. When you #include the header, you pay the time it takes for the compiler to parse that header, and
  2. you pay again to instantiate the template with a particular set of types.
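
To make that split concrete, a pair of files like these (the names are just for illustration) separates the two costs: the first pays only for parsing <memory>, the second pays for the parse plus a single instantiation.

// parse_only.cpp: pays cost 1 only
#include <memory>

int main() {}

// instantiate_one.cpp: pays cost 1 plus cost 2, for unique_ptr<int>
#include <memory>

int main() {
    auto p = std::unique_ptr<int> { new int(42) };
    return *p;
}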

So, what’s the instantiation cost?

Let’s try this:

struct S0 { int a0; S0(int i) : a0(i) {} }; auto u0 = unique_ptr<S0> { new S0(0) };
struct S1 { int a1; S1(int i) : a1(i) {} }; auto u1 = unique_ptr<S1> { new S1(1) };
// 998 more!

vs this

struct S0 { int a0; S0(int i) : a0(i) {} }; auto u0 = new S0(0);
struct S1 { int a1; S1(int i) : a1(i) {} }; auto u1 = new S1(1);
// 998 more!

// also
delete u0;
delete u1;
// et cetera
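
(Both test files follow the same mechanical pattern, so a tiny generator along these lines, just a sketch, will spit them out.)

// gen.cpp: emits the 1000 repeated lines for the unique_ptr test file
#include <cstdio>

int main() {
    for (int i = 0; i < 1000; ++i) {
        std::printf(
            "struct S%d { int a%d; S%d(int i) : a%d(i) {} }; "
            "auto u%d = unique_ptr<S%d> { new S%d(%d) };\n",
            i, i, i, i, i, i, i, i);
    }
}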

The results:

1000_structs.cpp:       1303ms / 79kb
1000_unique_ptrs.cpp:   6668ms / 185kb

Wow! It looks like each unique_ptr<> instantiation costs about 5ms and 100 bytes: (6668 - 1303)ms over 1000 instantiations is roughly 5.4ms apiece, and (185 - 79)kb works out to a bit over 100 bytes apiece.

First, let’s look at compile speed. I wonder if it has to do with the number of unique (heh) unique_ptr<> instantiations, or mere utterance of the type name. If we change all the unique_ptr<>s so they have the same type, we get:

1000_identical_unique_ptrs.cpp: 599ms / 79kb
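
(By "the same type" I mean every line wraps the same struct, roughly like this:)

struct S0 { int a0; S0(int i) : a0(i) {} }; auto u0 = unique_ptr<S0> { new S0(0) };
struct S1 { int a1; S1(int i) : a1(i) {} }; auto u1 = unique_ptr<S0> { new S0(1) };
// 998 more, all unique_ptr<S0>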

Wait, what? Why is it faster than doing it the hard way? Shouldn’t our build times be worse because we’re asking it to expand a bunch of extra templates?

Also note that file size matches up with what we get when we deallocate explicitly: Our executable gets larger with the number of kinds of unique_ptr<>s we instantiate, but it doesn’t cost anything to use the same kind of pointer many times. This makes sense: the full implementation shouldn’t be much more than a deleted copy constructor, a move constructor, and a destructor.
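
To see why, here is roughly the machinery I mean; this is not libc++'s actual unique_ptr, just a sketch of the members whose code has to be emitted once per distinct pointee type.

// OwnPtr: a bare-bones stand-in for unique_ptr, showing the moving parts.
template <typename T>
struct OwnPtr {
    T* p;

    explicit OwnPtr(T* q) : p(q) {}
    OwnPtr(const OwnPtr&) = delete;             // copying is forbidden
    OwnPtr& operator=(const OwnPtr&) = delete;
    OwnPtr(OwnPtr&& other) : p(other.p) { other.p = nullptr; }  // move steals ownership
    ~OwnPtr() { delete p; }                     // the destructor releases it

    T& operator*() const { return *p; }
};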

Could it be that all those delete statements cost 120ms? What happens if we remove them?

1000_struct_no_free.cpp: 667ms / 63kb

This is unexpected: a very boring, monomorphic built-in language construct costs more to compile than a template class.

We still need to look into the code size:

$ clang++ -S -Os -std=c++11 1000_unique_ptrs.cpp
$ emacs 1000_unique_ptrs.s

I see this over and over:

    .private_extern __ZNSt3__110unique_ptrI2S1NS_14default_deleteIS1_EEED1Ev
    .globl  __ZNSt3__110unique_ptrI2S1NS_14default_deleteIS1_EEED1Ev
    .weak_def_can_be_hidden __ZNSt3__110unique_ptrI2S1NS_14default_deleteIS1_EEED1Ev
    .align  1, 0x90
__ZNSt3__110unique_ptrI2S1NS_14default_deleteIS1_EEED1Ev: ## @_ZNSt3__110unique_ptrI2S1NS_14default_deleteIS1_EEED1Ev
    .cfi_startproc
## BB#0:
    pushq   %rbp
Ltmp7:
    .cfi_def_cfa_offset 16
Ltmp8:
    .cfi_offset %rbp, -16
    movq    %rsp, %rbp
Ltmp9:
    .cfi_def_cfa_register %rbp
    movq    %rdi, %rax
    movq    (%rax), %rdi
    movq    $0, (%rax)
    testq   %rdi, %rdi
    je  LBB1_1
## BB#2:                                ## %_ZNKSt3__114default_deleteI2S1EclEPS1_.exit.i.i
    popq    %rbp
    jmp __ZdlPv                 ## TAILCALL
LBB1_1:                                 ## %_ZNSt3__110unique_ptrI2S1NS_14default_deleteIS1_EEED2Ev.exit
    popq    %rbp
    retq

FYI, the tail call at the end is the global operator delete:

$ c++filt __ZdlPv
operator delete(void*)
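
(And the symbol that keeps repeating above is, as you'd expect, the unique_ptr destructor for one of the structs:)

$ c++filt __ZNSt3__110unique_ptrI2S1NS_14default_deleteIS1_EEED1Ev
std::__1::unique_ptr<S1, std::__1::default_delete<S1> >::~unique_ptr()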

It looks like the destructor isn't being inlined. That's a shame. I wasn't able to coerce clang into inlining it, even though it already has the __always_inline__ attribute.

Recap

Code