I'm writing an x86_64 disassembler to get a better understanding of the instruction-encoding rules. I have a working version, and I understand most things about prefixes, ModRM and so on. But I'm unsure what the smartest way is to detect the type of instruction (after all prefixes have been evaluated, and before the ModRM byte is checked).
Initially, after seeing that an instruction like "push" encodes the register in the low 3 bits of the opcode, I started parsing instructions like this:
struct EncodedInstructionByte
{
    uint8_t encoding : 3; // low 3 bits, assuming the compiler allocates bit-fields LSB-first
    uint8_t header : 5;   // high 5 bits
};
At first, this seemed to work for most instructions. push could be identified by checking "header == 0x0A", pop with 0x0B, and so forth. But the setcc/jcc instructions seem to have an actual "header" of 4 bits, and encode the condition code in their low 4 bits:
struct EncodedInstructionByteCC
{
    uint8_t ccCode : 4; // low 4 bits, assuming the compiler allocates bit-fields LSB-first
    uint8_t header : 4; // high 4 bits
};
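One caveat with both structs: the order in which bit-fields are allocated within a byte is implementation-defined in C and C++, so overlaying them on an opcode byte depends on the compiler. A minimal sketch of the same two splits using shifts and masks, which behave the same everywhere (the helper names are made up for illustration):

```c
#include <stdint.h>

/* Hypothetical helpers; the names are not from any library. */

/* 5:3 split, e.g. push r (0x50..0x57): the upper 5 bits select the
   instruction, the lower 3 bits hold the register number. */
static inline uint8_t header5(uint8_t opcode) { return (uint8_t)(opcode >> 3); }
static inline uint8_t low3(uint8_t opcode)    { return (uint8_t)(opcode & 0x07); }

/* 4:4 split, e.g. short jcc (0x70..0x7F): the upper 4 bits select the
   group, the lower 4 bits hold the condition code. */
static inline uint8_t header4(uint8_t opcode) { return (uint8_t)(opcode >> 4); }
static inline uint8_t low4(uint8_t opcode)    { return (uint8_t)(opcode & 0x0F); }
```

With these, header5(0x50) gives 0x0A and header5(0x58) gives 0x0B, matching the push/pop checks above, and low4(0x74) gives condition code 4.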
Some of the instructions didn't seem to fit that scheme at all, like call-indirect (which is just 0xFF). Note that I was far from having encountered or detected all possible instructions, so my understanding at that point was pretty limited.
Now, after looking at the opcode table in the Intel 64 and IA-32 Architectures Software Developer's Manual (Table B-13), it seems that the general format is 4-4 bits. For example, "push r" is given as
0101 0 reg
whereas "pop r" would then be
0101 1 reg
Meaning that I could check whether the high four bits == 0101b/0x05, then check the next bit to see if it's push or pop, followed by the reg index stored in the low 3 bits.
Does that actually make sense, or is the 0101 "header" for both push and pop purely incidental?
I guess I'm having a bit of trouble forming this into one pointed question. Aside from wanting to understand, my end goal is to have a condensed and smart opcode-detection scheme. Seeing how extensive the x86_64 instruction set is, I would not want to manually check all 8 opcode variants each for "push r" and "pop r", but would rather have a general detection/parsing scheme that works for the entire instruction set. So I'm wondering whether starting to identify opcodes by looking at the high 4 bits for a first grouping, and then discerning further, makes any sense; or whether this is too broad an attempt at categorization, and I'm seeing patterns that are not really intended. If my approach is not viable, I would appreciate it if someone could suggest an alternative opcode detection scheme.
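The check described above could be sketched roughly like this, taking the 0101 pattern from the upper four bits of the opcode byte. The struct and function names are made up for illustration. The rex_b parameter is an addition: in 64-bit mode the B bit of a REX prefix extends the 3-bit register field to reach r8..r15.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical result type for this sketch. */
struct push_pop {
    bool    is_pop; /* bit 3 of the opcode: 0 = push, 1 = pop */
    uint8_t reg;    /* register number, 0..15 */
};

/* Returns true and fills *out if opcode is in 0x50..0x5F.
   rex_b is the B bit of a preceding REX prefix (0 if there was none). */
static bool decode_push_pop(uint8_t opcode, unsigned rex_b, struct push_pop *out)
{
    if ((opcode & 0xF0) != 0x50)
        return false;               /* upper four bits are not 0101 */
    out->is_pop = (opcode & 0x08) != 0;
    out->reg    = (uint8_t)(((rex_b & 1u) << 3) | (opcode & 0x07u));
    return true;
}
```

For example, opcode 0x53 without REX decodes as push with register 3, and 0x5C with REX.B set decodes as pop with register 12.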
You probably just want a table, indexed by opcode byte, of how to handle each one, with only a few different handlers. That way your code is not too messy, but the data array initializer looks like an opcode map (http://ref.x86asm.net/coder64.html). Actually a few different maps: one for the 0F xx 2-byte opcodes, others for the 0F 3A xx and 0F 38 xx 3-byte opcodes. And then there are prefixes, e.g. F3 90 pause looks like rep nop.
For the push reg short forms, 8 entries pointing to that 5:3 handler, or for cmovcc/jcc/setcc 16 entries each pointing at the condition-code handler.
The entries could be a struct that also includes the mnemonic as a string, except for opcode bytes like FF and others where the 3-bit ModRM /r field is effectively another three opcode bits; see "How to read the Intel Opcode notation".
You probably don't want separate maps for combinations of prefixes, so you might record which prefixes have been seen before the opcode as bit flags (or as counts if you want to be able to print rep rep rep add eax, ecx for f3 f3 f3 01 c8, which has 3 meaningless REP/REPZ prefixes).
Currently that's meaningless, reserved for future use, but in practice inapplicable REP prefixes will be silently ignored. (So when CPU vendors want to add extensions that are backwards-compatible with existing CPUs, like performance hints, or like how tzcnt can run as rep bsf and give the same result for inputs other than 0, they can pick an encoding that includes a REP prefix.) No extension has ever required 2 of the same prefix, or two from the same category like F3 F2 (REPZ / REPNZ), on the same instruction.
Anyway, your struct that says what to do next to disassemble the current instruction, given an opcode byte, might have pointers to more data structures for a different instruction if there's a rep prefix. For example, the entry for 90 is nop with no prefixes, a 2-byte nop (xchg ax,ax) with a 66 prefix, or pause with an F3 prefix. So the struct insn *with_repz member might be a pointer to a struct with const char *mnemonic = "pause";. Or maybe a table of meaningful prefixes that you linear-search? Lots of ways to go about this.
Note that VEX and EVEX in 32-bit mode (and 16-bit protected mode) overlap invalid encodings of instructions like les and lds. VEX and EVEX prefixes don't decode in 16-bit real mode; in that case the invalid encodings will #UD fault instead of being VEX prefixes. Apparently some DOS software like SoftPC used C4 intentionally as a trap, and later NTVDM. In 64-bit mode, les, lds, and bound don't exist, so C4 and C5 bytes always begin a VEX prefix, and 62 always begins EVEX. (Slipping between the cracks of invalid encodings in other modes is why they have some fields NOTed, and why the full 16 or 32 registers can be encoded only in 64-bit mode.)
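A minimal sketch of the table idea, assuming a one-byte map only, three handler kinds, and prefixes recorded as bit flags. All names here (enum handler, PFX_*, one_byte_map, lookup, format_insn) are made up for illustration, the "regN" output is a placeholder rather than real register naming, and a real disassembler would add the 0F, 0F 38 and 0F 3A maps plus ModRM handling.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

/* All names below are hypothetical, chosen for this sketch. */
enum handler { H_INVALID, H_PLAIN, H_REG_LOW3, H_CC_REL8 };
enum { PFX_OPSIZE = 1u << 0, PFX_REP = 1u << 1 };   /* 66 and F3 seen */

struct insn {
    const char *mnemonic;
    enum handler handler;
    const struct insn *with_opsize; /* meaning with a 66 prefix, or NULL */
    const struct insn *with_repz;   /* meaning with an F3 prefix, or NULL */
};

static const struct insn insn_pause   = { "pause",       H_PLAIN, NULL, NULL };
static const struct insn insn_xchg_ax = { "xchg ax, ax", H_PLAIN, NULL, NULL };

#define E(m, h) { m, h, NULL, NULL }
#define ROW8(b, m, h) \
    [(b)+0] = E(m, h), [(b)+1] = E(m, h), [(b)+2] = E(m, h), [(b)+3] = E(m, h), \
    [(b)+4] = E(m, h), [(b)+5] = E(m, h), [(b)+6] = E(m, h), [(b)+7] = E(m, h)

/* One-byte opcode map; entries not listed stay zeroed, i.e. H_INVALID. */
static const struct insn one_byte_map[256] = {
    ROW8(0x50, "push", H_REG_LOW3),
    ROW8(0x58, "pop",  H_REG_LOW3),
    ROW8(0x70, "j",    H_CC_REL8),
    ROW8(0x78, "j",    H_CC_REL8),
    [0x90] = { "nop", H_PLAIN, &insn_xchg_ax, &insn_pause },
};

static const char *const cc_names[16] = {
    "o", "no", "b", "ae", "e", "ne", "be", "a",
    "s", "ns", "p", "np", "l", "ge", "le", "g"
};

/* Pick the entry for an opcode; a prefix only matters if the entry has an
   alternate for it, so inapplicable prefixes are silently ignored. */
static const struct insn *lookup(uint8_t opcode, unsigned prefixes)
{
    const struct insn *e = &one_byte_map[opcode];
    if ((prefixes & PFX_REP) && e->with_repz)
        return e->with_repz;
    if ((prefixes & PFX_OPSIZE) && e->with_opsize)
        return e->with_opsize;
    return e;
}

/* Format one instruction; operands are placeholders, not real decoding. */
static void format_insn(char *buf, size_t n, uint8_t opcode, unsigned prefixes)
{
    const struct insn *e = lookup(opcode, prefixes);
    switch (e->handler) {
    case H_REG_LOW3:
        snprintf(buf, n, "%s reg%u", e->mnemonic, opcode & 7u);
        break;
    case H_CC_REL8:
        snprintf(buf, n, "%s%s rel8", e->mnemonic, cc_names[opcode & 15u]);
        break;
    case H_PLAIN:
        snprintf(buf, n, "%s", e->mnemonic);
        break;
    default:
        snprintf(buf, n, "(bad)");
        break;
    }
}
```

With this table, 90 formats as nop, F3 90 as pause, 66 90 as xchg ax, ax, and 74 as je rel8, while an F3 in front of 5B is ignored and still gives pop. Keeping the prefix alternates as pointers inside the entry avoids a separate map per prefix combination.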