scrapy can't handle "<" character

46 Views Asked by mik.ro At 07 November 2019 at 13:59

I'm trying to extract text containing "<" (the less-than character). On my localhost everything works fine, but on the server the text from "<" onward gets truncated.

1) hipoksemia tętnicza (PaO<sub>2</sub>/FiO<sub>2</sub> < 300 )

so I receive:

1) hipoksemia t\u0119tnicza (PaO<sub>2</sub>/FiO<sub>2</sub>

There is no problem with scraping the ">" character. Thank you for your help.


There is 1 solution below

Gallaecio On 08 November 2019 at 12:54

A bare "<" is invalid HTML. It should be written as "&lt;".

Scrapy uses Parsel to parse XML/HTML responses. Parsel uses lxml to parse XML/HTML documents. lxml does not handle broken HTML as well as web browsers and other parsers do.

There is an open issue for Parsel to handle these scenarios. It will probably require supporting an alternative to lxml in Parsel, which is not trivial to implement, so it may take a while before that issue is solved.
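Until that is solved, one possible workaround is to escape stray "<" characters yourself before the text reaches the parser. The snippet below is only a minimal sketch under assumptions: the helper name escape_bare_lt and the regular expression are made up for illustration and are not part of Scrapy, Parsel or lxml, and the heuristic (a "<" not followed by a letter, "/", "!" or "?" is treated as literal text) can misfire on unusual markup or inside script blocks.

```python
import re

# Heuristic (an assumption, not a full HTML tokenizer): a "<" that is NOT
# followed by a letter, "/", "!" or "?" cannot start a tag, closing tag,
# comment/doctype or processing instruction, so treat it as literal text.
_BARE_LT = re.compile(r"<(?![A-Za-z/!?])")


def escape_bare_lt(html: str) -> str:
    """Replace stray '<' characters with the '&lt;' entity."""
    return _BARE_LT.sub("&lt;", html)


# The fragment from the question, wrapped in a <p> for illustration.
sample = "<p>1) hipoksemia tętnicza (PaO<sub>2</sub>/FiO<sub>2</sub> < 300 )</p>"
fixed = escape_bare_lt(sample)
print(fixed)
```

In a spider callback, the cleaned text could then be handed to a fresh selector, for example parsel.Selector(text=escape_bare_lt(response.text)), instead of using response.xpath() or response.css() directly. Whether this is enough depends on how the real pages are broken.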
