Change UTF-16 encoding to UTF-8 encoding for files in AWS S3


My main goal is to have AWS Glue move files stored in S3 into a database in RDS. My current issue is that the files I receive are UTF-16 LE encoded, and AWS Glue will only process text files with UTF-8 encoding (see https://docs.aws.amazon.com/glue/latest/dg/glue-dg.pdf, p. 5 footnote). On my local machine, Python can easily change the encoding like this:

from pathlib import Path
path = Path('file_path')
path.write_text(path.read_text(encoding="utf16"), encoding="utf8")

I attempted to implement this in a Glue job as such:

import boto3
from pathlib import Path

bucketname = "bucket_name"
s3 = boto3.resource('s3')
my_bucket = s3.Bucket(bucketname)
subfolder_path = "folder1/folder2"
file_filter = "folder2/file_header_identifier"

for obj in my_bucket.objects.filter(Prefix=file_filter):
    filename = (obj.key).split('/')[-1]
    file_path = Path("s3://{}/{}/{}".format(bucketname, subfolder_path, filename))
    file_path.write_text(file_path.read_text(encoding="utf16"), encoding="utf8")

I'm not getting an error in Glue, but it is not changing the text encoding of my file. And when I try something similar in Lambda, which is probably the wiser service for this, I get an error saying that s3 has no attribute 'Bucket'. I'd prefer to keep all this ETL work in Glue for convenience.
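For what it's worth, the likely reason the Glue job does nothing is that pathlib's Path only operates on the local filesystem, so an "s3://" string is never actually read from or written to S3. A minimal sketch of the direction I'm considering instead, using boto3's get_object/put_object (bucket and key names here are placeholders), would be:

```python
def reencode_utf16_to_utf8(data: bytes) -> bytes:
    # The 'utf-16' codec honours a BOM if present and assumes
    # little-endian otherwise, then re-encodes as UTF-8.
    return data.decode("utf-16").encode("utf-8")

def convert_s3_object(bucket: str, key: str) -> None:
    # Read the object, re-encode it, and write it back in place.
    import boto3  # deferred so the pure helper above works without AWS deps
    s3 = boto3.client("s3")
    body = s3.get_object(Bucket=bucket, Key=key)["Body"].read()
    s3.put_object(Bucket=bucket, Key=key, Body=reencode_utf16_to_utf8(body))
```

I haven't verified this end to end; it's just the shape I'd expect a working version to take.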

I'm very new to AWS so any advice is welcomed.
