Issues with Yahoo Finance

ws420 · November 2, 2019, 7:59am

Hi,

Scraping Yahoo Finance returns some HTML code and CSS inside the content. Is there some parameter that could help to sanitize the content, while preserving links and formatting?

Example: https://finance.yahoo.com/news/vivo-cannabis-host-third-quarter-190000566.html

When posting to WordPress, this content appears inside the post:

Thanks!

fivefilters · November 2, 2019, 11:14am

We can only offer support when it comes to the output of Full-Text RSS itself. When I try loading this URL with Full-Text RSS, it looks okay to me: http://ftr.fivefilters.org/makefulltextfeed.php?url=https%3A%2F%2Ffinance.yahoo.com%2Fnews%2Fvivo-cannabis-host-third-quarter-190000566.html

Are you using the latest version of Full-Text RSS?

ws420 · November 4, 2019, 4:14am

The URL you linked doesn’t look okay when you view its source code. I tested it before posting. There is some HTML and CSS code that shouldn’t be in the post body. http://prntscr.com/ps31hg

Yes, I’m using the latest version.

fivefilters · November 4, 2019, 11:20am

I’m not sure I understand.

If you’re saying it doesn’t look right because there’s HTML encoding when you view source, that’s to be expected in an RSS feed. If there are certain HTML elements that don’t belong in the body, can you highlight those elements? We might be able to help you write a site config file to remove them.

Your first post, however, showed HTML code being reproduced in the output. That is a problem, but that was output in WordPress, not from our Full-Text RSS product. So if that’s the issue, it’s best to take it up with the WordPress plugin developer. When I check the Full-Text RSS output (which I linked) I don’t see HTML code being reproduced.

Generally, we try to help when article content is being missed by Full-Text RSS. But when it’s elements or formatting the user wants to remove, we recommend they write a site config file targeting the site. This can be done if you’re using the self-hosted version of Full-Text RSS. You can remove elements or attributes that you don’t want in the final output using the strip directive. More info at https://help.fivefilters.org/full-text-rss/site-patterns.html

ws420 · November 4, 2019, 8:36pm

Of course I’m expecting some valid HTML code in a standard RSS feed. But look at the HTML code of the link you shared: it is not standard and the code is invalid. All of the text is stuffed inside a “content” attribute and there is a ton of CSS code that shouldn’t be here.

Here is an excerpt from the HTML code of the article you linked above (with HTML entities decoded and the markup beautified):

View original content to download multimedia: <a href="http://www.newswire.ca/en/releases/archive/November2019/01/c8785.html" rel="nofollow noopener" target="_blank">http://www.newswire.ca/en/releases/archive/November2019/01/c8785.html</a></p>
<p class="canvas-atom canvas-text Mb(1.0em) Mb(0)--sm Mt(0.8em)--sm" type="text" content="<style> /* Style Definitions */ span.prnews_span { font-size:8pt; font-family:&quot;Arial&quot;; color:black; } a.prnews_a { color:blue; } li.prnews_li { font-size:8pt; font-family:&quot;Arial&quot;; color:black; } p.prnews_p { font-size:0.62em; font-family:&quot;Arial&quot;; color:black; margin:0in; } .prngen3{ BORDER-TOP:1pt; BORDER-RIGHT:1pt; VERTICAL-ALIGN: TOP; BORDER-BOTTOM:1pt; TEXT-ALIGN: LEFT; PADDING-LEFT:0.50em; BORDER-LEFT:1pt; PADDING-RIGHT:0.50em } .prngen2{ BORDER-TOP:1pt; BORDER-RIGHT:1pt; VERTICAL-ALIGN: TOP; BORDER-BOTTOM:black 1pt solid; TEXT-ALIGN: LEFT; PADDING-LEFT:0.50em; BORDER-LEFT:1pt; PADDING-RIGHT:0.50em } .prntblns{ BORDER-TOP: 1pt; BORDER-RIGHT: 1pt; BORDER-COLLAPSE: collapse; BORDER-BOTTOM: 1pt; BORDER-LEFT: 1pt } .prntac{ TEXT-ALIGN: CENTER } </style>" data-reactid="41">
   <style> /* Style Definitions */ span.prnews_span { font-size:8pt; font-family:"Arial"; color:black; } a.prnews_a { color:blue; } li.prnews_li { font-size:8pt; font-family:"Arial"; color:black; } p.prnews_p { font-size:0.62em; font-family:"Arial"; color:black; margin:0in; } .prngen3{ BORDER-TOP:1pt; BORDER-RIGHT:1pt; VERTICAL-ALIGN: TOP; BORDER-BOTTOM:1pt; TEXT-ALIGN: LEFT; PADDING-LEFT:0.50em; BORDER-LEFT:1pt; PADDING-RIGHT:0.50em } .prngen2{ BORDER-TOP:1pt; BORDER-RIGHT:1pt; VERTICAL-ALIGN: TOP; BORDER-BOTTOM:black 1pt solid; TEXT-ALIGN: LEFT; PADDING-LEFT:0.50em; BORDER-LEFT:1pt; PADDING-RIGHT:0.50em } .prntblns{ BORDER-TOP: 1pt; BORDER-RIGHT: 1pt; BORDER-COLLAPSE: collapse; BORDER-BOTTOM: 1pt; BORDER-LEFT: 1pt } .prntac{ TEXT-ALIGN: CENTER } </style>
</p>
<div data-reactid="42"></div>
<p><strong><a href="https://blockads.fivefilters.org">Let's block ads!</a></strong> <a href="https://blockads.fivefilters.org/acceptable.html">(Why?)</a></p>

This code is invalid and it will break when posting it to WordPress or trying to view the feed with some readers. There’s a whole CSS stylesheet hardcoded inside a “content” attribute. If there is no way to sanitize the content returned by full-text-rss, it means that everything coming from Yahoo News will cause issues with full-text-rss, which would be disappointing because Yahoo News is one of the top sources for news on the web.

According to the W3C validator, the “content” attribute shouldn’t be here, or else the feed is not compatible with all readers.

description should not contain content attribute (11 occurrences)

https://validator.w3.org/feed/check.cgi?url=http%3A%2F%2Fftr.fivefilters.org%2Fmakefulltextfeed.php%3Furl%3Dhttps%253A%252F%252Ffinance.yahoo.com%252Fnews%252Fvivo-cannabis-host-third-quarter-190000566.html

ws420 · November 4, 2019, 8:42pm

Also, here is the original source of the article reposted on Yahoo:

Scraping the original article works perfectly with full-text-rss. This is the expected output. No invalid HTML, no weird tag attributes, no messy CSS stuffed inside “content” attributes, and everything works perfectly with all readers or when posting to WordPress.
http://ftr.fivefilters.org/makefulltextfeed.php?url=https%3A%2F%2Fwww.newswire.ca%2Fnews-releases%2Fvivo-cannabis-to-host-third-quarter-2019-financial-results-conference-call-847770931.html&max=3

The W3C validator reports no content-related errors at all when scraping the original source instead of Yahoo:
https://validator.w3.org/feed/check.cgi?url=http%3A%2F%2Fftr.fivefilters.org%2Fmakefulltextfeed.php%3Furl%3Dhttps%3A%2F%2Fwww.newswire.ca%2Fnews-releases%2Fvivo-cannabis-to-host-third-quarter-2019-financial-results-conference-call-847770931.html%26max%3D3

fivefilters · November 4, 2019, 9:29pm

Thanks for clarifying. The original does have the content attribute too, but the HTML inside is correctly encoded in the original but isn’t in ours. We’re looking into this to see why that is.

Thanks for the report and the clarification.

fivefilters · November 7, 2019, 1:19am

I’ve looked into this some more and it appears what I said above is in fact wrong:

The original does have the content attribute too, but the HTML inside is correctly encoded in the original but isn’t in ours.

It turns out, in HTML5, the following is perfectly valid HTML:

<div data-content="<h1>this is ok</h1>"></div>

In older HTML specs, that wouldn’t be valid. So even though the source document uses escaped characters in the attribute value, the HTML fragment

<div data-content="&lt;h1&gt;this is ok&lt;/h1&gt;"></div>

is actually treated the same as the one above by an HTML5 parser. That perhaps explains why when I loaded the feed in the browser, I didn’t see any display issues.

You can verify this by saving the following two HTML documents as test1.html and test2.html, and comparing how your browser treats the attribute value in its inspector view. They will appear exactly the same:

test1.html

<!DOCTYPE html>
<html lang="en">
  <head><title>Test 1</title></head>
  <body>
    <h1>Test 1: Unescaped</h1>
    <div data-content="<h2>this is ok</h2>"></div>
  </body>
</html>

test2.html

<!DOCTYPE html>
<html lang="en">
  <head><title>Test 2</title></head>
  <body>
    <h1>Test 2: Escaped</h1>
    <div data-content="&lt;h2&gt;this is ok&lt;/h2&gt;"></div>
  </body>
</html>

They also both validate as HTML5.

And when I select the body element in Firefox’s inspector view, right click and choose to copy the inner HTML, I get the following:

Test 1 inner HTML (Firefox output)

<h1>Test 1: Unescaped</h1>
<div data-content="<h2>this is ok</h2>"></div>

Test 2 inner HTML (Firefox output)

<h1>Test 2: Escaped</h1>
<div data-content="<h2>this is ok</h2>"></div>
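
The same comparison can be run outside the browser. The following is only a rough sketch in Python, not part of Full-Text RSS: it uses the standard library's html.parser, which is not a complete HTML5 parser but does decode character references in attribute values. The AttrCollector class and the data_content_values helper are names made up for this illustration.

```python
from html.parser import HTMLParser


class AttrCollector(HTMLParser):
    """Collects the decoded value of every data-content attribute it sees."""

    def __init__(self):
        super().__init__()
        self.values = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; character references in
        # the value have already been decoded by the parser.
        for name, value in attrs:
            if name == "data-content":
                self.values.append(value)


def data_content_values(markup):
    """Hypothetical helper: parse markup and return its data-content values."""
    parser = AttrCollector()
    parser.feed(markup)
    parser.close()
    return parser.values


# The div elements from test1.html and test2.html.
unescaped = '<div data-content="<h2>this is ok</h2>"></div>'
escaped = '<div data-content="&lt;h2&gt;this is ok&lt;/h2&gt;"></div>'

print(data_content_values(unescaped))
print(data_content_values(escaped))
print(data_content_values(unescaped) == data_content_values(escaped))
```

Both calls should yield the same decoded attribute value, which matches what the Firefox inspector shows for the two test documents.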

We use HTML5-PHP’s HTML5 output in Full-Text RSS, so that’s why it also outputs it this way. But if you’re consuming this with older software or software that’s not HTML5-aware, it will cause problems like the one you’re experiencing.

To avoid the problem, we either need to offer a more compatible HTML output or tell Full-Text RSS to strip attribute values that might cause your parser problems. In previous versions we defaulted to PHP’s built-in libxml output, which escaped the attribute value as it appears in test2.html. For a while we offered HTML5 output as an optional override, but we made it the only option in recent versions. We might revisit that decision if this kind of markup becomes a bigger problem.
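
To make the difference between the two output styles concrete, here is a minimal sketch of what "escaped" output means for an attribute value. It is written in Python with the standard library's html.escape purely for illustration. It is not how Full-Text RSS or libxml serialize HTML, and escape_attribute_value is a made-up helper name.

```python
from html import escape


def escape_attribute_value(value):
    """Hypothetical helper: escape &, <, > and quotes so the value is safe
    for consumers that are not HTML5-aware."""
    return escape(value, quote=True)


raw_value = "<h2>this is ok</h2>"

# Serialize the attribute the way test2.html writes it.
compatible = '<div data-content="{}"></div>'.format(escape_attribute_value(raw_value))
print(compatible)
```

An HTML5 parser reads this form and the unescaped form identically, so escaping costs nothing for modern consumers while staying readable for older ones.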

In the meantime, what I suggest you do is remove these attributes using a site config file in Full-Text RSS. One example:

site_config/custom/.yahoo.com.txt

strip: //@data-content

You can also write a more general one that will remove any attribute that has a < character in it, on any site:

site_config/custom/global.txt

strip: //@*[contains(., '<')]
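
For anyone who wants to see what the global rule selects before deploying it, here is a small sketch of the same idea in Python. It is an approximation, not Full-Text RSS code: the standard library's ElementTree does not support this XPath expression, so the sketch walks the tree and applies the same test (attribute value contains a < character) by hand. The sample fragment is a shortened, made-up version of the Yahoo excerpt earlier in the thread, where the attribute is named content, and strip_markup_attributes is a hypothetical helper name.

```python
import xml.etree.ElementTree as ET


def strip_markup_attributes(root):
    """Remove every attribute whose value contains '<', on any element."""
    for element in root.iter():
        # Copy the items first so the attribute dict can be modified safely.
        for name, value in list(element.attrib.items()):
            if "<" in value:
                del element.attrib[name]
    return root


# Shortened sample: the attribute value is escaped so the fragment is
# well-formed XML; the parser decodes it back to a value containing '<'.
fragment = (
    "<div>"
    '<p class="canvas-text" '
    'content="&lt;style&gt; a.prnews_a { color:blue; } &lt;/style&gt;">'
    "Body text</p>"
    "</div>"
)

root = strip_markup_attributes(ET.fromstring(fragment))
print(ET.tostring(root, encoding="unicode"))
```

The class attribute survives because its value has no < in it, while the content attribute carrying the stylesheet is dropped.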

Hope that’s some help.

ws420 · November 9, 2019, 4:03am

Awesome! Thank you so much for the great support, I appreciate it!