Parse quoted-printable encoded content from an .mht file


I am trying to extract all the images from an .mht file using the Nokogiri gem. But because the .mht file is quoted-printable encoded, every image tag I get back contains stray characters:

<img alt='3D"AFC-Logo' src="3D%22https://upload.=" width='3D"75"' height='3D"75"'>
<img src="3D%22https://en.wikipedia.org/static/images/footer/wikimedia-butto=" width='3D"88"' height='3D"31"' alt='3D"Wikimedia'>
<img src="3D%22https://en.wikipedia.org/static/images/footer/poweredby_mediawiki_8=" alt='3D"Powered' width='3D"88"' height='3D"31"'>

This is the link to that .mht file: https://drive.google.com/file/d/1DtbgrFyCEcggAk1nqpZSluNhRt-k3t95/view?usp=sharing

And below is the code that I am using to get all the images from the .mht file:

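require "nokogiri"

# Collect the unique src values of every <img> element in the document.
def get_image_links(html)
  html_doc = Nokogiri::HTML(html)
  nodes = html_doc.xpath("//img[@src]")
  raise "No <img .../> tags!" if nodes.empty?

  nodes.map do |node|
    puts node.to_s # print each tag for debugging
    node.attr("src").strip
  end.uniq
end

# The raw .mht content is passed in without any decoding.
html = File.read("1646037951.mht")
image_links = get_image_links(html)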

I have tried decoding it with .unpack('M').first, but that does not help: it returns the same result as above.

Or maybe Rails has something for this?
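
For reference, here is a minimal sketch of what .unpack('M') does to a quoted-printable fragment when it is applied to the raw text before any HTML parsing. The sample string and the helper name decode_quoted_printable are made up for illustration and are not taken from the linked file. Whether this resolves the problem for the actual .mht is an assumption, since an .mht is a multipart MIME file and only its HTML part may be quoted-printable encoded.

```ruby
# Hypothetical fragment in the style of a quoted-printable HTML part:
# "=3D" encodes a literal "=", and a "=" at the end of a line is a
# soft line break that joins the line with the next one.
QP_SAMPLE = "<img alt=3D\"Logo\" src=3D\"https://example.org/im=\nage.png\" width=3D\"75\">"

# Hypothetical helper: decode a quoted-printable string with String#unpack1.
def decode_quoted_printable(text)
  text.unpack1("M")
end

decoded = decode_quoted_printable(QP_SAMPLE)
puts decoded
# prints: <img alt="Logo" src="https://example.org/image.png" width="75">
```

If decoding the raw text this way works for the real file, the decoded string could then be passed to Nokogiri::HTML in place of the undecoded file content.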
