I’ve looked into this some more and it appears what I said above is in fact wrong:
“The original does have the content attribute too, but the HTML inside is correctly encoded in the original but isn’t in ours.”
It turns out, in HTML5, the following is perfectly valid HTML:
<div data-content="<h1>this is ok</h1>"></div>
In older HTML specs, that wouldn’t be valid. So even though the source document uses escaped characters in the attribute value, the HTML fragment
<div data-content="&lt;h1&gt;this is ok&lt;/h1&gt;"></div>
is actually treated the same as the one above by an HTML5 parser. That perhaps explains why when I loaded the feed in the browser, I didn’t see any display issues.
You can verify this by saving the following two HTML documents as test1.html and test2.html, and comparing how your browser treats the attribute value in its inspector view. They will appear exactly the same:
test1.html
<!DOCTYPE html>
<html lang="en">
<head><title>Test 1</title></head>
<body>
<h1>Test 1: Unescaped</h1>
<div data-content="<h2>this is ok</h2>"></div>
</body>
</html>
test2.html
<!DOCTYPE html>
<html lang="en">
<head><title>Test 2</title></head>
<body>
<h1>Test 2: Escaped</h1>
<div data-content="&lt;h2&gt;this is ok&lt;/h2&gt;"></div>
</body>
</html>
They also both validate as HTML5.
And when I select the body element in Firefox’s inspector view, right click and choose to copy the inner HTML, I get the following:
Test 1 inner HTML (Firefox output)
<h1>Test 1: Unescaped</h1>
<div data-content="<h2>this is ok</h2>"></div>
Test 2 inner HTML (Firefox output)
<h1>Test 2: Escaped</h1>
<div data-content="<h2>this is ok</h2>"></div>
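The same comparison can be made outside the browser. Here is a minimal sketch using Python's standard html.parser module, which is only an illustration and not what Full-Text RSS uses. The class name AttrCollector and the helper data_content_values are made up for this example. It feeds both forms of the div through the parser and compares the attribute value each one produces.

```python
from html.parser import HTMLParser


class AttrCollector(HTMLParser):
    """Collects the parsed value of every data-content attribute (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.values = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; character references
        # in the value have already been decoded by the parser.
        for name, value in attrs:
            if name == "data-content":
                self.values.append(value)


def data_content_values(markup):
    """Return the data-content values found in the given markup."""
    parser = AttrCollector()
    parser.feed(markup)
    parser.close()
    return parser.values


# The div from test1.html (unescaped) and from test2.html (escaped).
unescaped = '<div data-content="<h2>this is ok</h2>"></div>'
escaped = '<div data-content="&lt;h2&gt;this is ok&lt;/h2&gt;"></div>'

# Compare what the parser sees in each case.
print(data_content_values(unescaped) == data_content_values(escaped))
```

If your own consumer uses a tolerant HTML parser like this one, both forms should give it the same attribute value.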
We use HTML5-PHP’s HTML5 output in Full-Text RSS, so that’s why it also outputs it this way. But if you’re consuming this with older software or software that’s not HTML5-aware, it will cause problems like the one you’re experiencing.
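As one illustration of a stricter consumer, a plain XML parser rejects the unescaped form outright, because XML does not allow a raw < inside an attribute value. The sketch below uses Python's xml.etree.ElementTree and assumes your software behaves like an XML parser, which may not be the case. The helper name parses_as_xml is made up for this example.

```python
import xml.etree.ElementTree as ET


def parses_as_xml(markup):
    """Return True if a strict XML parser accepts the markup (hypothetical helper)."""
    try:
        ET.fromstring(markup)
        return True
    except ET.ParseError:
        # Raised for input that is not well-formed XML.
        return False


# The same div in its unescaped (HTML5 output) and escaped (libxml-style) forms.
unescaped = '<div data-content="<h2>this is ok</h2>"></div>'
escaped = '<div data-content="&lt;h2&gt;this is ok&lt;/h2&gt;"></div>'

print("unescaped accepted:", parses_as_xml(unescaped))
print("escaped accepted:", parses_as_xml(escaped))
```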
To avoid the problem, we either need to offer a more compatible HTML output or tell Full-Text RSS to strip attribute values that might cause your parser problems. In previous versions we actually defaulted to PHP’s built-in libxml output, which would output the attribute value escaped as it appears in test2.html. For a while we offered HTML5 output as an optional override, but made it the only option in recent versions. We might re-visit that decision if this kind of markup becomes a bigger problem.
In the meantime, what I suggest you do is remove these attributes using a site config file in Full-Text RSS. One example:
site_config/custom/.yahoo.com.txt
strip: //@data-content
You can also write a more general one that will remove any attribute that has a < character in it, on any site:
site_config/custom/global.txt
strip: //@*[contains(., '<')]
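To make the effect of that XPath expression concrete, here is a rough Python sketch that does the equivalent by hand: it walks every element and drops any attribute whose value contains a < character. It only mimics what the rule selects and is not how Full-Text RSS implements strip. The function name and the sample markup are made up for this example, and the sample is written in escaped form so the standard XML parser can load it.

```python
import xml.etree.ElementTree as ET


def strip_attributes_containing(root, needle="<"):
    """Remove every attribute whose value contains `needle`, on every element.

    Hypothetical helper mirroring the selection made by //@*[contains(., '<')].
    Returns a list of (tag, attribute name) pairs that were removed.
    """
    removed = []
    for element in root.iter():
        # Copy the attribute names first so we can delete while looping.
        for name in list(element.attrib):
            if needle in element.attrib[name]:
                removed.append((element.tag, name))
                del element.attrib[name]
    return removed


# Made-up sample: one attribute carries markup, the others are plain text.
sample = (
    '<body>'
    '<div class="teaser" data-content="&lt;h2&gt;this is ok&lt;/h2&gt;"></div>'
    '<p title="plain text">kept</p>'
    '</body>'
)

root = ET.fromstring(sample)
removed = strip_attributes_containing(root)

print(removed)
print(ET.tostring(root, encoding="unicode"))
```

Only the data-content attribute is removed here. The class and title attributes stay because their values contain no < character.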
Hope that’s some help.