Why does the performance of Golang's memmove (copy/append) vary so much?


I recently encountered a performance problem that I can't easily explain. I hope that someone here can help me understand why these two pieces of code perform so differently.

(Code snippets in this text are simplified. The actual code can be seen in this file on Github.)

I'm attempting to refactor code in a library, without causing a performance regression. In order to do that I need to implement an in-memory io.Writer that I can continuously append to without doing expensive realloc-and-copy operations as bytes.Buffer does. My solution is to use a slice-of-byte-slices and allocate new byte-slices as I write - the outer slice will need to be realloced, but it's relatively small compared to the size of the data, and the actual data never needs to be copied once written.

type noReallocWriter struct {
    data [][]byte
    idx  int
    off  int
}

In my first implementation, all the individual byte-slices were allocated with a fixed capacity and on Write I would copy into them until full, then allocate a new one. I assumed it would benefit from always having large buffers of contiguous data and needing to allocate somewhat infrequently. (And in fact it does in other places, but not when writing.)

func (w *noReallocWriter) Write(p []byte) (int, error) {
    lenP := len(p)
    if lenP == 0 {
        return 0, nil
    }
    for len(p) > 0 {
        if w.idx == len(w.data) {
            w.data = append(w.data, make([]byte, 0, FIXED_SIZE))
        }
        curData := w.data[w.idx]
        n := copy(curData[w.off:cap(curData)], p)
        w.data[w.idx] = curData[:w.off+n]
        w.off += n
        p = p[n:]
        if w.off >= cap(curData) {
            w.idx++
            w.off = 0
        }
    }
    return lenP, nil
}
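The chunking behavior is easiest to see with a tiny capacity. This self-contained sketch reuses the `Write` above with a stand-in `fixedSize` of 4 (my choice for illustration; the real values tested were 32KB-1MB), so a 10-byte write lands in chunks of 4, 4, and 2 bytes:

```go
package main

import "fmt"

const fixedSize = 4 // tiny stand-in for FIXED_SIZE, just to make chunking visible

type noReallocWriter struct {
	data [][]byte
	idx  int
	off  int
}

// Same fixed-capacity Write as above: fill the current chunk, then
// allocate a fresh one and keep copying until p is exhausted.
func (w *noReallocWriter) Write(p []byte) (int, error) {
	lenP := len(p)
	if lenP == 0 {
		return 0, nil
	}
	for len(p) > 0 {
		if w.idx == len(w.data) {
			w.data = append(w.data, make([]byte, 0, fixedSize))
		}
		curData := w.data[w.idx]
		n := copy(curData[w.off:cap(curData)], p)
		w.data[w.idx] = curData[:w.off+n]
		w.off += n
		p = p[n:]
		if w.off >= cap(curData) {
			w.idx++
			w.off = 0
		}
	}
	return lenP, nil
}

func main() {
	w := &noReallocWriter{}
	w.Write([]byte("abcdefghij")) // 10 bytes -> chunks "abcd", "efgh", "ij"
	for i, c := range w.data {
		fmt.Printf("chunk %d: %q\n", i, c)
	}
}
```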

However, this implementation performed worse than the earlier version of the code I was trying to refactor. To help diagnose the issue, I wrote another version that more closely mimics the behavior of the old, faster code: it allocates a new (variable-capacity) byte-slice for every write and copies the data into it.

func (w *noReallocWriter) Write(p []byte) (n int, err error) {
    if w.idx != len(w.data) {
        panic("noReallocWriter only supports appending to the end.")
    }

    if len(p) == 0 {
        return 0, nil
    }

    buf := make([]byte, 0, len(p))
    buf = append(buf, p...)
    w.data = append(w.data, buf)
    w.idx++
    return len(p), nil
}

In a very stripped-down synthetic test that just writes a bunch of data to these writers, they perform about the same. However, when used in the context of my application (also in a synthetic test), the second version performed much better. In both cases, essentially all of the time is spent in memmove (copy / append), moving data from the passed-in byte-slice (p) to the byte-slices stored inside the Writer. However, the version with fixed-size byte-slices spends 3-4x as long in memmove, and that time is split between lines 372-375 of memmove_amd64.s. By comparison, the version with variable-sized buffers spends all of its time on line 375. Normally I might not trust a profiler's line-level attribution of assembly, but that measurement is very consistent.

365:    // Aligned memory copying there
366: gobble_128_loop:
367:    VMOVDQU (SI), Y0
368:    VMOVDQU 0x20(SI), Y1
369:    VMOVDQU 0x40(SI), Y2
370:    VMOVDQU 0x60(SI), Y3
371:    ADDQ    AX, SI
372:    VMOVDQA Y0, (DI)
373:    VMOVDQA Y1, 0x20(DI)
374:    VMOVDQA Y2, 0x40(DI)
375:    VMOVDQA Y3, 0x60(DI)
376:    ADDQ    AX, DI
377:    SUBQ    AX, BX
378:    JA  gobble_128_loop

I don't fully understand the assembly, but I don't see how one implementation can spend so much time on lines 372-374 while the other doesn't, given that they copy the same amount of data, mostly in similarly-sized chunks, and nothing jumps directly to line 375.
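One variable that might be worth checking: VMOVDQA faults on destinations that are not 32-byte aligned, so memmove can only enter gobble_128_loop after aligning the destination pointer. In the fixed-size version, large copies frequently start mid-chunk (e.g. at offset 35 after a small write), while the variable-size version always copies into the start of a fresh allocation. This is a hypothetical diagnostic of my own, not part of the original code, that just prints where each destination falls relative to a 32-byte boundary:

```go
package main

import (
	"fmt"
	"unsafe"
)

// align32 reports the offset of a slice's first byte from the nearest
// 32-byte boundary (0 means the copy destination is VMOVDQA-aligned).
func align32(b []byte) uintptr {
	return uintptr(unsafe.Pointer(&b[0])) % 32
}

func main() {
	fixed := make([]byte, 64*1024)  // one fixed-capacity chunk
	fresh := make([]byte, 256*1024) // a fresh exact-size buffer, as in version two
	fmt.Println("fixed chunk start:   ", align32(fixed))
	fmt.Println("mid-chunk, offset 35:", align32(fixed[35:])) // where a large copy starts after a 35-byte write
	fmt.Println("fresh buffer start:  ", align32(fresh))
}
```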

I ran performance profiles, which you can see here (fixed-size) and here (variable).

Some notes about the testing that generated those profiles:

In both versions, the application writes the same data in the same pattern. In practice, write sizes alternate between small (35 bytes) and large (just under 256KB, but variable). I tested the first version with different FIXED_SIZE values between 32KB and 1MB - performance was about the same from 64KB to 256KB, and worse above or below that range.
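The exact harness isn't shown, but the alternating 35-byte / just-under-256KB pattern can be reproduced with a rough sketch like this, using the variable-size `Write` from above. The iteration count, the specific large-write size, and the timing code are my own choices, not from the original test:

```go
package main

import (
	"fmt"
	"time"
)

// The variable-size version of the writer described above: every write
// gets a fresh exactly-sized buffer.
type noReallocWriter struct {
	data [][]byte
	idx  int
}

func (w *noReallocWriter) Write(p []byte) (int, error) {
	if len(p) == 0 {
		return 0, nil
	}
	buf := make([]byte, 0, len(p))
	buf = append(buf, p...)
	w.data = append(w.data, buf)
	w.idx++
	return len(p), nil
}

func main() {
	small := make([]byte, 35)           // the small writes from the question
	large := make([]byte, 256*1024-512) // "just under 256KB"; real sizes varied
	const iters = 500                   // ~128MB total, to keep memory modest
	w := &noReallocWriter{}
	var total int
	start := time.Now()
	for i := 0; i < iters; i++ {
		n, _ := w.Write(small)
		total += n
		n, _ = w.Write(large)
		total += n
	}
	elapsed := time.Since(start)
	fmt.Printf("wrote %d bytes in %v (%.0f MB/s)\n",
		total, elapsed, float64(total)/elapsed.Seconds()/1e6)
}
```

Swapping in the fixed-size `Write` from earlier, with the same pattern, is how I compared the two.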
