How could I write 8 bytes of data to memory at once in 16-bit mode by nasm?

NASM rejects the following code with "error: instruction not supported in 16-bit mode":

[bits 16]
xor ax, ax
mov ds, ax
mov qword [ds:0x0], 0x0

But the following code is ok:

[bits 16]
xor ax, ax
mov ds, ax
mov dword [ds:0x0], 0x0
mov dword [ds:0x0], 0x0

Who can tell me the reasons, I just want to write 8 bytes of data to memory at once in 16-bit mode.

I am trying to write some data to memory.

Brendan

> How could I write 8 bytes of data to memory at once in 16-bit mode by nasm?

To write exactly 8 bytes of data at once in real mode, your choices are:

a) The cmpxchg8b instruction, which requires a Pentium CPU or later

b) A floating point store (fst or fstp), which requires an FPU and requires the 8 bytes to be a valid double-precision floating-point encoding.

c) The movq instruction, which requires a CPU that supports MMX.
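As a rough illustration of option (a), here is a minimal sketch of an unconditional 8-byte store built from cmpxchg8b. The label `retry` and the choice of SI as the pointer are just for illustration. No lock prefix is used, which assumes a single CPU where only interrupts could interfere.

```nasm
[bits 16]
    xor ax, ax
    mov ds, ax
    xor si, si              ; DS:SI -> 0000:0000, the destination

    mov ebx, 0x0            ; low dword of the value to store
    mov ecx, 0x0            ; high dword of the value to store

    mov eax, [si]           ; EDX:EAX = first guess at the current contents
    mov edx, [si + 4]
retry:
    cmpxchg8b [si]          ; if [si] == EDX:EAX, store ECX:EBX; else EDX:EAX = [si]
    jnz retry               ; ZF clear means the compare failed, so try again
```

If the compare fails, the instruction reloads EDX:EAX with the current memory contents, so the next attempt succeeds unless something changed the memory in between.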

You might also be able to write more than 8 bytes (e.g. read the last 8 bytes, then write 16 bytes with pusha so that the last 8 bytes are the same as they were previously). In that case, there's no atomicity guarantee with pusha (or pushad) so it may or may not meet your definition of "at once". There are multiple SSE and AVX stores that could work if your CPU is new enough to support them (if you do the required setup where necessary).

Of course, if you don't need atomicity then it's easier to just do a pair of 32-bit stores (which require an 80386 or later CPU), like:

    mov dword [ds:0x0], 0x0
    mov dword [ds:0x0 + 4], 0x0

...or four 16-bit stores (which will work on all 80x86 CPUs in real mode).
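The four 16-bit stores might look like the sketch below. It reuses the zero already in AX as the data, so as written it only covers the all-zeros case from the question; for other values you would load each word into a register or use word immediates.

```nasm
[bits 16]
    xor ax, ax
    mov ds, ax              ; DS = 0, and AX is still zero afterwards
    mov [0x0], ax           ; bytes 0..1
    mov [0x2], ax           ; bytes 2..3
    mov [0x4], ax           ; bytes 4..5
    mov [0x6], ax           ; bytes 6..7
```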

The other alternative is to use no instructions at all, like:

    section .data
    dq 0                 ; Create 8 bytes of explicitly zeroed memory
    section .text

> Who can tell me the reasons, I just want to write 8 bytes of data to memory at once in 16-bit mode.

There are always limits, and 8 bytes is four times as much as "16-bit" (or 2 bytes) was originally designed for. Over the years the limits have been increased with the introduction of new features (e.g. from 16-bit to 32-bit). The last increase (up to 64-bit) couldn't be made compatible with real mode due to not having enough previously unused opcodes that could become size override prefixes (AMD literally recycled previously used opcodes to make 64-bit code viable).

Peter Cordes

For a single 64-bit store instruction, your options include x87: fild / fistp can copy an arbitrary 64-bit integer, and fldz / fstp can store a zero. On old CPUs this is probably not faster than two integer stores, and maybe not even on Pentium or newer CPUs.
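A minimal sketch of the fild / fistp copy follows. The `org 0x7C00` line and the `value` label are assumptions made only so the snippet assembles on its own as a flat binary with DS = 0 (a boot-sector style setup); adjust them to wherever your code and data really live.

```nasm
[bits 16]
[org 0x7C00]                    ; assumption: loaded at 0000:7C00, so labels are valid with DS = 0

    xor ax, ax
    mov ds, ax
    fninit                      ; put the x87 into a known state
    fild  qword [value]         ; load the 64-bit integer (exact: the x87 has a 64-bit mantissa)
    fistp qword [0x0]           ; store all 8 bytes with one instruction and pop the x87 stack
    jmp $                       ; stop here so execution doesn't run into the data

value: dq 0x0123456789ABCDEF    ; hypothetical source data
```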

Or use MMX or SSE (xorps xmm0,xmm0 / movlps [0], xmm0). MMX pxor mm0, mm0 / movq [0], mm0 would require an emms before you could use x87 instructions again, so prefer SSE if available (Pentium 3 or later), especially if you want to zero more than 8 bytes, like 16 at a time. Or consider rep stosd: optimized microcode on P6 and later CPUs will use wide stores if available, and it's not terrible on old CPUs.
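The rep stosd option could look roughly like this sketch. Note that stos always writes through ES:DI, so it sets ES rather than DS; the count of 2 dwords matches the 8 bytes from the question and can be raised to zero larger blocks.

```nasm
[bits 16]
    xor eax, eax            ; EAX = 0 is the value to store
    mov es, ax              ; stos writes through ES, not DS
    xor di, di              ; ES:DI -> 0000:0000
    mov cx, 2               ; 2 dwords = 8 bytes
    cld                     ; make DI count upward
    rep stosd               ; store EAX to ES:DI, CX times
```

This is not a single 8-byte store and is not atomic as a whole, so it only fits if "at once" just means "with one instruction".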

If the address is aligned, these are guaranteed atomic on P5 Pentium or later. And being a single instruction, they're atomic wrt. interrupts on anything, in case that matters. (Unless the x87 fistp is emulated by a trap to software on an old CPU without hardware x87.)
Fun fact: GCC -m32 -march=pentium or ppro without SSE uses fild / fistp for std::atomic<uint64_t> .load and .store (Godbolt)


Since you're zeroing a register anyway and writing code that requires a 386 or later (for dword operand-size for mov), you could save code size by replacing all the displacements and immediates with that register (except for the +4; your second example stores 4 bytes to the same place twice):

 xor esi, esi          ; 3 bytes (in 16-bit mode, including the 66 operand-size prefix)
 mov ds, si            ; 2 bytes
 mov [si + 0], esi     ; 3 bytes (66-prefix + opcode + modrm)
 mov [si + 4], esi     ; 4 bytes (66-prefix + opcode + modrm + disp8)

12 bytes total vs. your 22 even after removing the redundant ds prefixes. (In NASM, [ds:0] emits an actual ds prefix byte in the machine code, even though that's already implied by an addressing mode where the base isn't bp/ebp or esp. You only need ds:0 for silly assemblers like MASM that ignore brackets when there aren't registers involved; in NASM syntax, [anything] is a memory operand.)

I picked ESI instead of EAX because [si] is a valid 16-bit addressing mode, so those instructions don't need the address-size prefix that [eax] would.

Using [si] and [si+4] addressing modes works for x87 or MMX/SSE as well, like movlps [si], xmm0 (3 bytes), vs. 4 bytes for SSE2 movq [si], xmm0.

SSE1 ...ps instructions have 2-byte opcodes (0F xx), later SSE instructions have larger code-size. movlps is the most efficient way to do a 64-bit store from an XMM register. (But for loads, use movsd or movq to zero extend; movlps loads merge a new low half into the existing XMM register, which is a false dependency and takes an ALU uop.)