I'm attempting to read what I believe is a fixed-width file into Python. I haven't typically had to deal with encoding issues before, and this one stumps me a bit.
The file has no extension; I believe it's from a mainframe COBOL system. When I open it in Sublime Text I see this:
5cf0 f1f2 f4f2 f0f2 f401 489c 024c f0f1
f2f5 f2f0 f2f4 0148 9c02 5c00 1c00 0000
0000 0000 0000 0040 4040 4040 4040 4040
4040 4040 4040 4040 4040 4040 4040 4040
... and so on
The best encoding I see for this is cp037 or cp1140:
with open(said_file, encoding='cp1140') as f:
    data = f.read()
This results in somewhat legible material, such as the date of the file. After some brute forcing, it looks like 437 is the length of the fixed-width rows. Here are the first few characters of the first rows:
*01242024\x01çæ\x02<01252024\x01çæ\x02*\x00\x1c\x00\x00\x00... <to len 437, I suppose this is a header>
C\x010026987100\x01\x00\x01çæ\x00æ2... < to len 437 * 2, I suppose this is the actual start of the data> etc
...
Accompanying this file is a .dtd XML file that describes the fixed-width layout. This cedilla (ç) occurs frequently at offset 14 length 3, which the .dtd file describes as "SIGNED,TRAILING_SIGN". I am unsure how exactly to translate this into Python using something like the struct library.
How can I better reverse engineer this data, with the end goal of making it usable in something like pandas? Until I hear otherwise, I strongly doubt that this cedilla actually exists in the data; I suspect it is an artifact of decoding the bytes as text.
I've asked for the COBOL copybooks, but until then I'm interested in hearing about other techniques I can use to reverse engineer this file.
EDIT:
Adding a quick sample of the DTD. I'm trying to be careful since this is sensitive data. There's one entry per "column", and there appear to be a few for groups of columns. Below is an example of a single record item in this file.
...
<SOURCEFIELD BUSINESSNAME = "" DATATYPE="number" FIELDNUMBER = "28" FIELDPROPERTY = "0" FIELDTYPE="ELEMITEM" HIDDEN="NO" KEYTYPE="NOT A KEY" LENGTH="10" LEVEL="5" NAME="CSTCTL_MO_YR" NULLABLE="NULL" OCCURS="0" OFFSET="79" PHYSICALLENGTH="3" PHYSICALOFFSET="69" PICTURETEXT="S9(5)" PRECISION="5" SCALE="0" USAGE="COMP-3" USGE_FLAGS="SIGNED,TRAILING_SIGN">
...
As you have the structure, just use that.
The important thing is to not use any encoding but to read the file as binary. Then split it into its parts and handle the text parts one by one as a bytearray, which you then convert with whatever encoding it has. Apart from "plain text" and "encoded 8-bit EBCDIC" you likely have packed numbers, possibly binary numbers, and potentially also UTF-8 and/or UTF-16, all mixed together.
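One way to use the structure is to pull the byte positions straight out of the DTD. This is only a minimal sketch: it assumes that PHYSICALOFFSET and PHYSICALLENGTH are the zero-based byte offset and the byte length within a record (the sample does not confirm this), and it uses a regular expression because the quoted tag is not closed XML. `parse_sourcefield` and the dictionary keys are made-up names.

```python
import re

# The SOURCEFIELD element quoted in the question, copied as-is.
SAMPLE = (
    '<SOURCEFIELD BUSINESSNAME = "" DATATYPE="number" FIELDNUMBER = "28" '
    'FIELDPROPERTY = "0" FIELDTYPE="ELEMITEM" HIDDEN="NO" KEYTYPE="NOT A KEY" '
    'LENGTH="10" LEVEL="5" NAME="CSTCTL_MO_YR" NULLABLE="NULL" OCCURS="0" '
    'OFFSET="79" PHYSICALLENGTH="3" PHYSICALOFFSET="69" PICTURETEXT="S9(5)" '
    'PRECISION="5" SCALE="0" USAGE="COMP-3" USGE_FLAGS="SIGNED,TRAILING_SIGN">'
)

# Matches NAME="value" pairs, with or without spaces around the equals sign.
ATTR_RE = re.compile(r'(\w+)\s*=\s*"([^"]*)"')


def parse_sourcefield(tag_text):
    """Return the attributes needed to slice one field out of a record."""
    attrs = dict(ATTR_RE.findall(tag_text))
    return {
        "name": attrs["NAME"],
        # Assumption: byte offset and byte length within one record.
        "offset": int(attrs["PHYSICALOFFSET"]),
        "length": int(attrs["PHYSICALLENGTH"]),
        "usage": attrs["USAGE"],
        "picture": attrs["PICTURETEXT"],
        "scale": int(attrs["SCALE"]),
    }


field = parse_sourcefield(SAMPLE)
print(field)
```

Collecting one such dictionary per SOURCEFIELD gives a layout table that can drive the slicing of every record.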
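A minimal sketch of that approach, assuming the 437-byte record length and the cp037 code page mentioned in the question. `split_records` and `text_field` are hypothetical helpers, and the sample bytes are just the start of the hex dump padded with EBCDIC spaces for illustration.

```python
RECORD_LEN = 437  # row length found by brute force in the question (assumption)


def split_records(raw, record_len=RECORD_LEN):
    """Cut a bytes object into fixed-length records; the last one may be short."""
    return [raw[i:i + record_len] for i in range(0, len(raw), record_len)]


def text_field(record, offset, length, encoding="cp037"):
    """Decode a single text slice of a record; never decode the whole record."""
    return record[offset:offset + length].decode(encoding)


# With the real file you would read bytes, not text:
#     with open(said_file, "rb") as f:
#         raw = f.read()
# Here the first bytes of the hex dump stand in for one record.
sample = bytes.fromhex("5cf0f1f2f4f2f0f2f401489c024c").ljust(RECORD_LEN, b"\x40")
records = split_records(sample * 2)

print(len(records))                   # 2
print(text_field(records[0], 1, 8))   # 01242024
```

Only the slices known to be text go through `decode`; packed or binary fields stay as raw bytes until they are converted by their own routine.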
This does look like a sequential "text" file (fixed-length) without a header that contains either binary data or (quite likely) PACKED-DECIMAL (binary-coded decimal) fields. Looking at https://en.wikipedia.org/wiki/EBCDIC (use the correct code page to get non-English characters out), the first bytes may break down like this:
5c: packed 5+ (or an EBCDIC asterisk)
f0f1f2f4f2f0f2f4: 01242024 (either text or numeric display)
01489c: packed 01489+
024c: packed 024+
f0f1f2f5f2f0f2f4: 01252024
01489c: packed 01489+
025c: packed 025+
001c: packed 001+
0000...: some binary zeroes (this could be a computational field, or something else; be aware of byte-ordering issues then!)
and so forth.
In a packed field each half-byte is a digit and the last half-byte is the sign: 0x0c is positive, 0x0d would be negative, and 0x0f unsigned. In a COBOL copybook, 01489+ would be PIC S9(05) PACKED-DECIMAL (it could also be S9(04), in which case the leading zero is only the necessary padding).
In the case of numeric display there may also be an "overpunch" of a sign. Check whether your DTD is enough to get the structure, or just wait for the copybook, and familiarize yourself with the COBOL definitions using the IBM documentation on USAGE.
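The packed-decimal rule described above (one digit per half-byte, sign in the last half-byte) can be sketched in a few lines of Python. `unpack_comp3` is a hypothetical helper, not part of any library, and the `scale` argument is assumed to correspond to the SCALE attribute in the DTD. Besides the sign codes mentioned above, it also accepts the alternate codes 0x0a and 0x0e as positive and 0x0b as negative.

```python
from decimal import Decimal

POSITIVE_SIGNS = {0x0A, 0x0C, 0x0E, 0x0F}  # 0x0F is the "unsigned" code
NEGATIVE_SIGNS = {0x0B, 0x0D}


def unpack_comp3(raw, scale=0):
    """Decode packed-decimal (COMP-3) bytes into a Decimal."""
    if not raw:
        raise ValueError("empty field")
    nibbles = []
    for byte in raw:
        nibbles.append(byte >> 4)    # high half-byte
        nibbles.append(byte & 0x0F)  # low half-byte
    *digits, sign = nibbles          # the last half-byte carries the sign
    if any(d > 9 for d in digits) or sign not in POSITIVE_SIGNS | NEGATIVE_SIGNS:
        raise ValueError(f"not a packed-decimal field: {raw.hex()}")
    value = int("".join(str(d) for d in digits))
    if sign in NEGATIVE_SIGNS:
        value = -value
    return Decimal(value).scaleb(-scale)  # shift by the number of decimal places


# Byte sequences taken from the hex dump in the question.
print(unpack_comp3(bytes.fromhex("01489c")))  # 1489
print(unpack_comp3(bytes.fromhex("025c")))    # 25
print(unpack_comp3(bytes.fromhex("001c")))    # 1
```

A three-byte field holds five digits plus the sign, which matches the PHYSICALLENGTH="3" and PICTURETEXT="S9(5)" of the sample DTD entry. The function raises ValueError on bytes that are not valid packed decimal, which helps catch a wrong offset early.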