I'm reading data from machines using the Modbus protocol. If you're not familiar with Modbus, I envy you. More seriously though, Modbus is a low-level protocol that allows you to read "registers" from a machine by specifying a start address and a number of registers to read. A "register" is just a two-byte word. Machines implementing Modbus generally only natively understand how to encode whole-number types (int, short, etc), and floats in their registers. So, if you need to represent another datatype, you have to find a way to represent it as a series of integers and floats on the Modbus-implementing machine's end, and devise some way to reconstruct those individual integer+float values into the actual datatype you need on your end.

This wouldn't be an issue, but the machines I'm communicating with use an unholy combination of little-endian and big-endian. Specifically, ints, shorts, etc are all encoded in big-endian, but floats are encoded in a format that is neither big-endian nor little-endian. For example, if we have a floating-point value x that would be represented as 01 02 03 04 in big-endian, and 04 03 02 01 in little-endian, these machines would represent x as 03 04 01 02. I don't know any name for this format, so I've just been calling it "regswapped big-endian", as it would be big-endian if you just swapped the contents of the first and second registers.

Until now, I've been decoding regswapped floats by just swapping the positions of bytes 1+2 with bytes 3+4 whenever I read a float, before passing it to struct.unpack(">f"). That has worked just fine because the actual data I'm reading has just been integers and floats.
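For reference, the single-float case boils down to something like the following. This is a minimal sketch rather than my production code, and the function name `decode_regswapped_float` is made up for illustration.

```python
import struct


def decode_regswapped_float(data):
    """Decode 4 bytes laid out as 03 04 01 02 into a Python float.

    Hypothetical helper: swaps the two 2-byte registers so the bytes
    become plain big-endian (01 02 03 04), then unpacks them.
    """
    if len(data) != 4:
        raise ValueError("expected exactly 4 bytes")
    big_endian = data[2:4] + data[0:2]  # second register first, then the first
    return struct.unpack(">f", big_endian)[0]
```

This only works when the four bytes are known to be one float, which is exactly the assumption that stops holding once compound types come into play.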

However, the requirements of the data-collection software I wrote were recently expanded. It is no longer sufficient to assume that a float is just a float, and an integer is just an integer, as I now need to support more complex types that are composed of multiple integers and/or floats. For example, a series of six unsigned shorts and a float which together represent a timestamp with offset. If the floats were in big-endian format like every other value the machines record, I would just pass the data to struct.unpack(">HHHHHHf"), and then pass the output from that to datetime.datetime(), but they aren't encoded in big-endian format - I still need to swap bytes 1+2 with bytes 3+4 of any float I read before I can actually decode it. But since I'm reading data that is composed of multiple individual values now, I can't just blindly swap bytes 1+2 with bytes 3+4 - I need to know exactly where any floats are located within the bytes I've read.

The most straightforward way I can think of to handle this would be to just add my own custom format code to the struct module's format specification mini-language to represent a "regswapped float", maybe using the character "r". That way, I could just do struct.unpack(">HHHHHHr") or struct.unpack(">rr"), or whatever else I need, and be done with it. However, it seems that the struct module's format specification mini-language comes directly from Python's underlying C implementation.

So instead, I thought I could create a subclass of struct.Struct that checks for "r" in the format string it's passed, performs any byte-swapping on "r"-type floats as necessary, constructs a new format string replacing all "r"s with "f"s, and unpack()s the data using that new format string. But to be able to do that correctly with any arbitrary format string, I would effectively need to re-implement all the parsing rules for the struct format-specification mini-language, which would be a terrible idea.
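To make the idea concrete, here is a rough sketch of what a restricted version might look like. It is a wrapper rather than a true subclass, the name `RegswapStruct` and the token regex are my own inventions, and it leans on struct.calcsize to work out where each "r" float sits instead of computing offsets by hand. I have not convinced myself that the regex covers every rule of the mini-language, so treat it as an illustration of the approach, not something to rely on.

```python
import re
import struct

# Assumption: a format token is an optional repeat count plus one code character.
_TOKEN = re.compile(r"(\d*)([A-Za-z?])")


class RegswapStruct:
    """Hypothetical wrapper where 'r' means a 4-byte register-swapped float."""

    def __init__(self, fmt):
        order = fmt[:1] if fmt[:1] in "@=<>!" else ""
        real = order          # the format string actually handed to struct
        self._offsets = []    # byte offset of every 'r' float
        for count, code in _TOKEN.findall(fmt[len(order):]):
            if code == "r":
                for _ in range(int(count or 1)):
                    real += "f"
                    # The float just appended ends at calcsize(real).
                    self._offsets.append(struct.calcsize(real) - 4)
            else:
                real += count + code
        self._struct = struct.Struct(real)

    def unpack(self, data):
        buf = bytearray(data)
        for off in self._offsets:
            # Swap the two registers of this float back into big-endian order.
            buf[off:off + 4] = buf[off + 2:off + 4] + buf[off:off + 2]
        return self._struct.unpack(bytes(buf))
```

With this, `RegswapStruct(">HHHHHHr").unpack(data)` would swap only bytes 12 to 15 before unpacking. It still doesn't add "r" to the struct module itself, which is what I was really after.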

So, is there any way for me to add a custom format code, or achieve something like it in Python, without modifying the underlying C?


There are 1 best solutions below

Nick Muise On

The technical answer to my original question as I asked it is "no, there is no way to add new format codes to the struct module's format minilanguage without modifying Python's underlying C implementation".

However, the practical answer to the actual problem I was trying to solve is more positive, if a bit more complicated than what I'd originally hoped for.

For a bit of context, I already had a few objects to handle the process of taking register bytes, decoding them to their underlying primitive value or values, and casting those primitive value(s) to the actual target datatype. Specifically, I'd defined the Decoder class to handle decoding register bytes to tuples of the underlying ints/floats/etc, and the Caster class to handle turning tuples of one or more ints/floats/etc to timestamps/whatever other datatype.

I had originally defined a small set of predefined Decoder objects, each of which took one of a few predefined struct.Structs as an initialization parameter. The Decoder.decode(register_bytes) method would literally just pass the register bytes to the underlying Struct's .unpack() method. And I had a subclass of Decoder called RegswapDecoder that just blindly swapped bytes 1+2 with bytes 3+4, to handle the weird float format. That all worked just fine when all the registers I was reading only represented a single value, but this of course wasn't flexible enough to handle the more complicated formats of the new compound datatypes I need to handle, for the reasons I outlined in my question.

Directly using Structs was inflexible, and that inflexibility is what was killing me, so I refactored the Decoder class to take an arbitrarily-defined "decode function" as an initialization parameter instead of a Struct, and to use that function during decoding. Then, I replaced the predefined Structs I had created with a few predefined functions instead. Technically, most of these decode functions literally just call a predefined Struct's .unpack() method, because that's all they actually need, but the added flexibility of using functions instead of just Structs means that I can handle compound datatypes that include the icky floats without issue. I do have to define a new decode function for each compound type I have to support, but the only actual code I need to write in them beyond calling some Struct's .unpack() method is swapping the positions of whatever bytes represent the icky floats, so they're not too hard to write. If there end up being a ton of compound types that use these icky floats, I'll have a whole ton of repetitive code that just switches bytes 1+2 and 3+4 of those floats, which is kind of a bummer, but realistically, there will only ever be a handful of compound types I need to support, due to the underlying limitations of the Modbus protocol. And as an added bonus, if we ever start using another brand of PLC that mangles data in yet another way... I'll already have the flexibility I need to handle it.
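Roughly, the shape of this is something like the sketch below. It is an illustration rather than the real code: the helper names `swap_registers` and `decode_six_shorts_and_float` are made up for the example, and the Decoder shown here is pared down to just the part that matters.

```python
import struct

# Six unsigned shorts followed by one float, as in the example from my question.
_SIX_SHORTS_AND_FLOAT = struct.Struct(">HHHHHHf")


def swap_registers(data, offset):
    """Return a copy of data with the two 2-byte registers at offset swapped."""
    buf = bytearray(data)
    buf[offset:offset + 4] = buf[offset + 2:offset + 4] + buf[offset:offset + 2]
    return bytes(buf)


def decode_six_shorts_and_float(register_bytes):
    # The float is the last 4 of the 16 bytes, so it starts at offset 12.
    return _SIX_SHORTS_AND_FLOAT.unpack(swap_registers(register_bytes, 12))


class Decoder:
    """Pared-down Decoder: it just delegates to whatever function it is given."""

    def __init__(self, decode_func):
        self._decode_func = decode_func

    def decode(self, register_bytes):
        return self._decode_func(register_bytes)


# Simple types can keep using a Struct's bound .unpack method directly.
ushort_decoder = Decoder(struct.Struct(">H").unpack)
compound_decoder = Decoder(decode_six_shorts_and_float)
```

The tuple that comes out of `compound_decoder.decode(...)` is then handed to the matching Caster, the same as for any other type.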

Considering that I would have had to create a new Struct for each datatype anyways, it's really not a big deal to have to define a decode function for each type. And as a bonus to my new approach, I was able to drop the RegswapDecoder class - the more I think about it, the more I feel like that class was a badly designed shoehorn anyways.

Ultimately, while I still would have preferred to be able to just extend the struct module's format minilanguage, I feel that this solution was a good-enough way to achieve what I actually needed to accomplish.