Trying to parse a UTF-8 RSS feed from a URL, I first tried this:
$flux = simplexml_load_file("https://mamanslouves.com/feed");
foreach ($flux->channel->item as $Item) {
    $title = $Item->title;
    echo $title;
}
This code works, but titles containing accents (éèà) are converted to another charset. It appears that the following code fixes the problem:
$raw = file_get_contents("https://mamanslouves.com/feed");
$flux = simplexml_load_string($raw);
foreach($flux->channel->item as $Item){...}
I would like to understand why.
Going by the discussion I had with MarkusZeller in the comments, I think the answer should contain two components.
First we need to look at the URL you're using. It is not the URL of the file you eventually download. A look in the network tab of the browser developer tools shows this:
There are two permanent redirections (301) before the RSS feed itself is downloaded. Everything is UTF-8 encoded: the XML, and even the file you download. The only thing that isn't UTF-8 is the first redirect, which is iso-8859-1 encoded. You can see this by inspecting the headers in the network tab.
Then we need to consider what simplexml_load_file() does. It needs to figure out the encoding of the file it downloads. There are many places it could get the encoding from: the HTTP headers of the redirects, the HTTP headers of the feed, or the XML content. It is now clear that it uses the first thing it encounters: the HTTP header of the first redirect, which says iso-8859-1. So what is really UTF-8 is read as iso-8859-1, and everything goes wrong from there. The misread characters are then converted to UTF-8, but the result makes no sense, as you saw.
To prove that it is the wrong charset in the first redirect that messes things up, you can get the feed without the redirections. This does return normal accented letters.
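
To see what reading UTF-8 as iso-8859-1 does to accented letters, here is a minimal offline sketch. It does not touch the feed at all: the sample title is made up for illustration, and iconv() merely stands in for whatever conversion the XML parser performs internally, which is an assumption on my part.

```php
<?php
// Hypothetical sample title ("été"), written with \u{...} escapes so the
// string holds UTF-8 bytes regardless of how this file itself is saved.
$title = "\u{E9}t\u{E9}";

// Simulate the misreading: treat every UTF-8 byte as one iso-8859-1
// character, then re-encode those characters as UTF-8.
$garbled = iconv('ISO-8859-1', 'UTF-8', $title);

echo $garbled, PHP_EOL; // prints: Ã©tÃ©

// The damage is reversible as long as no bytes were dropped on the way.
$restored = iconv('UTF-8', 'ISO-8859-1', $garbled);

echo $restored, PHP_EOL; // prints: été
```

Each accented letter takes two bytes in UTF-8, so it turns into two unrelated characters once those bytes are interpreted one at a time.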
The reason that going through file_get_contents() does work is that this function doesn't care about the charset: it just gives you the binary data, which is then later interpreted as a UTF-8 string. Exactly as Markus said.
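
The point about raw bytes can be checked without any network access. In this sketch the inline feed is a hypothetical stand-in for what file_get_contents() would return. With no HTTP headers involved, simplexml_load_string() only has the XML declaration to go by.

```php
<?php
// Hypothetical minimal feed standing in for the downloaded bytes.
// The title ("éèà") uses \u{...} escapes so the string is UTF-8
// regardless of how this file itself is saved.
$raw = '<?xml version="1.0" encoding="UTF-8"?>'
     . '<rss version="2.0"><channel><item>'
     . "<title>\u{E9}\u{E8}\u{E0}</title>"
     . '</item></channel></rss>';

// No HTTP headers here: the parser can only see the XML itself.
$flux = simplexml_load_string($raw);
if ($flux === false) {
    exit("Could not parse the feed" . PHP_EOL);
}

$titles = [];
foreach ($flux->channel->item as $item) {
    // Cast to string to get the text rather than a SimpleXMLElement.
    $titles[] = (string) $item->title;
}

echo implode(PHP_EOL, $titles), PHP_EOL; // prints: éèà
```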