Convert EBCDIC file to ASCII using Python 2


I need to convert EBCDIC files to ASCII using Python 2.

A sample extract from the file looks like the below (in Notepad++):

[image: sample extract of the input file as displayed in Notepad++]

I have tried to decode it with 'cp500' and then encode it in 'utf8' in python like below

with open(path, 'rb') as input_file:
    line = input_file.read()
    line = line.decode('cp500').encode('utf8').strip()
    print line

And below

with io.open(path, 'rb', encoding="cp500") as input_file:
    line = input_file.read()
    print line

Also, tried with codecs

with codecs.open(path, 'rb') as input_file:
    count = 0
    line = input_file.read()
    line = codecs.decode(line, 'cp500').encode('utf8')
    print line

I also tried installing and importing the ebcdic module, but it doesn't seem to work properly either. Here is the sample output for the first 58 characters:

[image: sample output for the first 58 characters]

It does transform the data to some human-readable values for some bytes but doesn't seem to be 100 percent in ASCII. For example, the 4th character in the input file is 'P' (after the first three NUL), and if I open the file in hex mode, the hex code for 'P' is 0x50, which maps to character 'P' in ASCII. But the code above gives me the character '&' for this in output, which is the EBCDIC character for hex value 0x50.
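
The mismatch described here can be reproduced with a single byte. This is a minimal sketch, assuming the text portions of the file are in cp500 (the code page is taken from the attempts above and has not been verified against the data). The same byte 0x50 is 'P' when read as ASCII and '&' when read as EBCDIC, so a byte that looks right as 'P' may not be EBCDIC text at all. The sketch uses Python 3 syntax; the decode calls are the same in Python 2.

```python
# One byte, two interpretations.
raw = b"\x50"

as_ascii = raw.decode("ascii")    # 'P'
as_ebcdic = raw.decode("cp500")   # '&'

# 0xF0 is the byte that made the utf8 attempt fail; in cp500 it is the digit '0'.
digit_ebcdic = b"\xf0".decode("cp500")   # '0'

# The two values mentioned in the question: 0x40 is a space, 0xE2 is 'S'.
space_and_s = b"\x40\xe2".decode("cp500")   # ' S'
```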

Also, tried the below code,

with open(path, 'rb') as input_file:
    line = input_file.read()
    line = line.decode('utf8').strip()
    print line

It gives me the below error.

UnicodeDecodeError: 'utf8' codec can't decode byte 0xf0 in position 4: invalid continuation byte

And if I change 'utf8' to 'latin1' in the above code, it generates the same output as the input file shown above, as opened in Notepad++.

Can anyone please help me with how to transform the EBCDIC to ASCII correctly?

Should I build my own mapping dictionary/table to transform the EBCDIC to ASCII, i.e. read the file data as hex codes and look up the corresponding ASCII character for each one? If I do so, hex 0x40 is a space and 0xE2 is 'S' in EBCDIC, but in ASCII 0x40 is '@' and 0xE2 has no mapping at all. Judging by the input data, I need the EBCDIC characters in this case. So should I construct a map by looking at the input data and deciding, for each hex value, whether I want the EBCDIC or the ASCII character?

Or do I need to follow some other way to parse the data correctly?

Note: the non-alphanumeric data is needed as well. There are images at particular places in the input file, encoded in those non-alphanumeric/alphanumeric characters, which we can extract, so I am not sure whether I need to convert that data to ASCII or leave it as is.

Thanks in advance


0
Ramandeep Mehmi On BEST ANSWER

Posting for others how I was able to transform the EBCDIC to ASCII.

I learned that I only needed to convert the non-binary alphanumeric data from EBCDIC to ASCII. To know which data is non-binary alphanumeric, one needs to understand the format/structure of the input file. Since I knew that structure, I knew which fields/bytes needed transformation and transformed only those bytes, leaving the other binary data exactly as it is in the input file.

Earlier I was trying to convert the whole file into ASCII, which was converting the binary data as well, hence distorting the data in conversion. Hence, by understanding the structure/format of the files I converted only the required alphanumeric data to ASCII and processed it. It worked.
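
A minimal sketch of that field-by-field approach follows. The record layout, field names and sample bytes are hypothetical, invented for illustration; the real layout has to come from the file's specification or copybook. Only the fields marked as text are decoded with cp500, and the binary field is passed through untouched. The sketch uses Python 3 syntax.

```python
import struct

# Hypothetical fixed layout, for illustration only:
#   bytes 0-7    name, EBCDIC text
#   bytes 8-11   binary payload, NOT text
#   bytes 12-15  status, EBCDIC text
# Each entry: (field name, start offset, end offset, is_text)
LAYOUT = [
    ("name", 0, 8, True),
    ("payload", 8, 12, False),
    ("status", 12, 16, True),
]


def convert_record(record, codec="cp500"):
    """Decode only the text fields; return binary fields as raw bytes."""
    fields = {}
    for field_name, start, end, is_text in LAYOUT:
        chunk = record[start:end]
        fields[field_name] = chunk.decode(codec).rstrip() if is_text else chunk
    return fields


# Build one sample record: two EBCDIC text fields around four binary bytes.
sample = (
    u"SMITH   ".encode("cp500")
    + struct.pack(">I", 0x50F0E240)
    + u"OK  ".encode("cp500")
)

result = convert_record(sample)

# Decoding the binary field as if it were text silently turns it into
# plausible-looking characters, which is the distortion described above.
distorted = sample[8:12].decode("cp500")
```

Here result["payload"] keeps the original four bytes, while distorted shows what a whole-file decode would have made of them.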

1
Milos Lalovic On

You are reading the file in binary mode so the content in the buffer is in EBCDIC. You need to decode it to ASCII. Try the following:

with open(path, 'rb') as input_file:
    line = input_file.read()
    line = line.decode('utf8').strip()
    print line

The above suggestion was tested on a z/OS machine, but if you are running on an ASCII machine you can try the following instead:

import codecs

# codecs.open decodes the bytes with the given codec while reading
with codecs.open(path, 'rb', 'cp500') as input_file:
    line = input_file.read()
    print line

These suggestions assume you have a text file, but if the file contains binary data mixed with text you will need a different approach as suggested by @bruce-martin.

0
Bruce Martin On

Options

  1. Convert the file to text on the mainframe - they have the tools that understand the formats
  2. You might be able to use Stingray to read the file in python
  3. Write a Cobol program (GNU Cobol) to translate the file
  4. Use java utilities coboltocsv or coboltoxml to convert the file
  5. Java/Jython code with JRecord

ZOS Mainframe Files

The 2 main mainframe file formats

  • FB - all records (lines) are the same length
  • VB - each record starts with a length and is followed by the data. These files can be transferred to other platforms with/without the record length.
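
The two formats above can be split into records along these lines. This is a sketch under assumptions not stated in the answer: the VB reader assumes each record keeps a 4-byte prefix (a 2-byte big-endian length that counts the prefix itself, followed by 2 reserved bytes) and that no block-level prefix is present. The names split_fb and split_vb are illustrative. The sketch uses Python 3 syntax.

```python
import struct


def split_fb(data, record_length):
    """Split fixed-length (FB) data into records of record_length bytes."""
    return [data[i:i + record_length] for i in range(0, len(data), record_length)]


def split_vb(data):
    """Split variable-length (VB) data that still carries its length prefixes.

    Assumption: every record starts with 4 bytes, of which the first two are
    a big-endian length that includes those 4 bytes.
    """
    records = []
    pos = 0
    while pos + 4 <= len(data):
        (length,) = struct.unpack(">H", data[pos:pos + 2])
        if length < 4:
            raise ValueError("invalid record length %d at offset %d" % (length, pos))
        records.append(data[pos + 4:pos + length])  # payload without the prefix
        pos += length
    return records


# Two sample VB records holding the EBCDIC text "ABC" and "HI".
vb_data = (
    b"\x00\x07\x00\x00" + u"ABC".encode("cp500")
    + b"\x00\x06\x00\x00" + u"HI".encode("cp500")
)
vb_text = [record.decode("cp500") for record in split_vb(vb_data)]

# One sample FB buffer with a record length of 4 bytes.
fb_records = split_fb(u"AAAABBBB".encode("cp500"), 4)
```

If the file was transferred without the record lengths, split_vb does not apply and the record boundaries have to be recovered some other way.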

Cobol Files

A Cobol copybook allows you to work out

  • Where fields start and end
  • The format of the field

Some examples of Cobol fields and their representation

In this example I will look at 2 Cobol field definitions and how 4 values are represented in a file

Cobol field definition

         03  fld1               pic s999v99.
         03  fld2               pic s999v99 comp-3.

                            Representation in the file
   Numeric-Value          pic s999v99         pic s999v99 comp-3
       12.34               0123D                 x'01234C'
      -12.34               0123M                 x'01234D'
       12.35               0123E                 x'01235C'
      -12.35               0123N                 x'01235D'
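
The table above can be checked with two small decoders. This is a sketch, assuming EBCDIC zoned decimals where the sign sits in the high nibble of the last byte, and packed decimals where it sits in the last nibble (C for positive, D for negative, as in the table). The names unpack_zoned and unpack_comp3 are illustrative, and scale is the number of digits after the implied decimal point (2 for v99). The sketch uses Python 3 syntax.

```python
from decimal import Decimal

NEGATIVE_SIGNS = (0x0B, 0x0D)  # D is the usual negative sign; B is also accepted


def unpack_zoned(raw, scale):
    """Decode an EBCDIC zoned-decimal field such as pic s999v99."""
    data = bytearray(raw)
    digits = "".join(str(byte & 0x0F) for byte in data)  # low nibble = digit
    sign_nibble = data[-1] >> 4                          # high nibble of last byte
    value = Decimal(int(digits)).scaleb(-scale)
    return -value if sign_nibble in NEGATIVE_SIGNS else value


def unpack_comp3(raw, scale):
    """Decode a packed-decimal field such as pic s999v99 comp-3."""
    nibbles = []
    for byte in bytearray(raw):
        nibbles.extend((byte >> 4, byte & 0x0F))
    sign_nibble = nibbles.pop()                          # last nibble = sign
    value = Decimal(int("".join(str(n) for n in nibbles))).scaleb(-scale)
    return -value if sign_nibble in NEGATIVE_SIGNS else value


# The first two rows of the table, as they would appear in an EBCDIC file.
zoned_pos = unpack_zoned(u"0123D".encode("cp500"), 2)
zoned_neg = unpack_zoned(u"0123M".encode("cp500"), 2)
packed_pos = unpack_comp3(b"\x01\x23\x4c", 2)
packed_neg = unpack_comp3(b"\x01\x23\x4d", 2)
```

Note that the zoned field must be decoded from the raw EBCDIC bytes: once 0123D has been converted to ASCII text, the sign information can no longer be read from the byte's high nibble.
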
1
Jim On

I was trying to convert a COBOL copybook with embedded hex from EBCDIC to ASCII. I found a partial answer below:

Take a look at the codecs module. From the standard encodings table, cp500 is one of the EBCDIC code pages Python supports (cp037 is another), so check which code page your data actually uses. Something like the following should work:

import codecs

with open("EBCDIC.txt", "rb") as ebcdic:
    ascii_txt = codecs.decode(ebcdic.read(), "cp500")
    print(ascii_txt)

As mpez0 noted in the comments, if you're using Python 3, you can condense the code to this:

with open("EBCDIC.txt", "rt", encoding="cp500") as ebcdic:
    print(ebcdic.read())