The entire feed is put in the description tag

anarcho · July 10, 2023, 2:43am

We have this problem with many feeds
https://ftr.fivefilters.net/makefulltextfeed.php?url=https%3A%2F%2Fniquebecnicanada.anarkhia.org%2F%3Ffeed%3Drss2%26time%3D16889526961&max=3

Using a fresh install of the latest version of Full Text RSS with default config, the feed is broken. The entire feed is put in the description tag.

Any solution for this?

Also, does this plugin support PHP 8.2? We already bought Full-Text RSS and recently bought the latest version update because we are upgrading to 8.2, but Full-Text RSS is throwing warnings and errors about deprecated function usage:

mb_convert_encoding(): handling html entities via mbstring is deprecated; use htmlspecialchars
Warning: Attempt to read property "nodeType" on null in /full-text-rss/libraries/content-extractor/ContentExtractor.php on line 864

fivefilters · July 13, 2023, 11:34am

Hi there, we’re going to have another update out soon to address some further issues people have reported with PHP 8.2. So if you’re seeing errors with PHP 8.2, we recommend running Full-Text RSS on PHP 8.1 for now. Will make sure the error you’ve reported here is looked at too, in case it hasn’t already been addressed in the upcoming release.

As for the URL you reported, the problem is that the source feed is invalid:

Sorry

This feed does not validate.

Full-Text RSS by default accepts both feed URLs and regular article URLs as input. If it can’t parse the HTTP response as a feed, it will try to parse it as a regular web article. That’s why you’re seeing the odd result.

There is a request parameter called accept that can be used to ensure that the input URL is only ever treated as feed XML (multiple items) or as article HTML (single item). If you tell Full-Text RSS to treat the URL as a feed only (&accept=feed), you will see that it complains:

Sorry, couldn’t parse as feed

There’s more information on the accept parameter in the docs: Usage and Request Parameters | FiveFilters.org Docs

accept parameter

Possible values: auto (default), feed, html

Example: makefulltextfeed.php?accept=feed&url=…

Tell Full-Text RSS what it should expect when fetching the input URL. By default Full-Text RSS tries to guess whether the response is a feed or regular HTML page. It’s a good idea to be explicit by passing the appropriate type in this parameter. This is useful if, for example, a feed stops working and begins to return HTML or redirects to an HTML page as a result of site changes. In such a scenario, if you’ve been explicit about the URL being a feed, Full-Text RSS will not parse HTML returned in response. If you pass accept=html (previously html=1), Full-Text RSS will not attempt to parse the response as a feed. This increases performance slightly and should be used if you know that the URL is not a feed.

Note: If excluded, or set to auto, Full-Text RSS first tries to parse the server’s response as a feed, and only if it fails to parse as a feed will it revert to HTML parsing. In the default parse-as-feed-first mode, Full-Text RSS will identify itself as PHP first and only if a valid feed is returned will it identify itself as a browser in subsequent requests to fetch the feed items. In parse-as-html mode, Full-Text RSS will identify itself as a browser from the very first request.
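
As an illustration of the accept parameter described above, here is a minimal sketch that builds a request URL with accept=feed. The helper name buildFullTextUrl() is hypothetical and not part of Full-Text RSS; the endpoint and the url, accept and max parameters are the ones that appear in this thread.

```php
<?php
// Hypothetical helper (an assumption, not part of Full-Text RSS):
// builds a makefulltextfeed.php request URL with an explicit accept value.
function buildFullTextUrl(string $endpoint, string $feedUrl, string $accept = 'feed', int $max = 3): string
{
    // Values documented for the accept parameter: auto (default), feed, html.
    $allowed = ['auto', 'feed', 'html'];
    if (!in_array($accept, $allowed, true)) {
        throw new InvalidArgumentException('accept must be one of: ' . implode(', ', $allowed));
    }

    // http_build_query() percent-encodes the source feed URL so it survives as one parameter.
    $query = http_build_query(
        ['url' => $feedUrl, 'accept' => $accept, 'max' => $max],
        '',
        '&',
        PHP_QUERY_RFC3986
    );

    return $endpoint . '?' . $query;
}

echo buildFullTextUrl(
    'https://ftr.fivefilters.net/makefulltextfeed.php',
    'https://niquebecnicanada.anarkhia.org/?feed=rss2'
), "\n";
// prints: https://ftr.fivefilters.net/makefulltextfeed.php?url=https%3A%2F%2Fniquebecnicanada.anarkhia.org%2F%3Ffeed%3Drss2&accept=feed&max=3
```

With accept=feed, an input that cannot be parsed as a feed produces the "couldn’t parse as feed" error instead of being silently treated as an article page.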

anarcho · July 13, 2023, 3:59pm

Hi, thanks for the support.

Any ETA for the 8.2 update? We had to disable error display in Full-Text RSS because all feeds were broken: the errors and deprecation warnings were being printed into the feed output, so the feeds no longer validated.

As for the other problem, we run a large aggregator with hundreds of sources, and a considerable number of feeds have this problem. Is there any workaround to fix the feed?

It seems the feed we used as an example fails validation only because of a single character on line 562. Why does that block the entire extraction?

For example, if I put the same feed in Feedly then it parses correctly without any error…

Or is there maybe a third-party tool that we could use to fix the feeds and proxy them locally before extracting the fixed feed with Full-Text RSS?

anarcho · July 13, 2023, 4:44pm

I found a dirty workaround…

Looks like the errors are in the description tag, so I made a proxy to fix the feed:

https://direct.anarchistfederation.net/rss_aggregator/fix_invalid_rss_syntax.php?url=https://niquebecnicanada.anarkhia.org/?feed=rss2

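// $_GET values are already URL-decoded by PHP, so urldecode() is not needed.
$url = $_GET['url'] ?? 'https://niquebecnicanada.anarkhia.org/?feed=rss2';
if (!preg_match('#^https?://#i', $url)) { // only http(s); file_get_contents() can also read local files
    http_response_code(400);
    exit('Invalid URL');
}

$feed = file_get_contents($url);
if ($feed === false) {
    http_response_code(502);
    exit('Could not fetch feed');
}

// Empty the elements that hold the invalid characters; titles and links are kept.
$feed = preg_replace('/<description>[\s\S]+?<\/description>/', '<description></description>', $feed);
$feed = preg_replace('/<content:encoded>[\s\S]+?<\/content:encoded>/', '<content:encoded></content:encoded>', $feed);

header('Content-Type: text/xml');
echo trim($feed);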

So now we are left with a feed containing just the URLs and titles, and we can let Full-Text RSS do the job of extracting the content from the cleaned feed without worrying about feed validation.

The fixed feed now validates:
https://validator.w3.org/feed/check.cgi?url=https%3A%2F%2Fdirect.anarchistfederation.net%2Frss_aggregator%2Ffix_invalid_rss_syntax.php%3Furl%3Dhttps%3A%2F%2Fniquebecnicanada.anarkhia.org%2F%3Ffeed%3Drss2

and now the extract works:

https://ftr.fivefilters.net/makefulltextfeed.php?url=https%3A%2F%2Fdirect.anarchistfederation.net%2Frss_aggregator%2Ffix_invalid_rss_syntax.php%3Furl%3Dhttps%253A%2F%2Fniquebecnicanada.anarkhia.org%2F%3Ffeed%3Drss2&max=3

But surely, there must be a better way?

fivefilters · July 14, 2023, 6:46am

Thanks for the update @anarcho,

Full-Text RSS uses SimplePie for feed parsing. Will have to see if there are any options for fixing broken feeds. If you have more URLs you can provide with this issue, will be happy to look at them when considering this.

On a more general note, SimplePie does not try to validate feeds in the same way as the Feed Validator I linked to. There are probably many feeds which fail feed validation but which can be parsed just fine by SimplePie. But feed readers usually rely on XML parsing as part of the process of reading in a feed, and in this case, the feed is invalid because the XML itself is invalid (not well-formed). This results in XML parsing errors.

XML parsers are much less forgiving than HTML parsers, so they usually will not try to continue parsing or resolve problems. I’m guessing that systems where this feed does work clean up the feed and fix errors before parsing. We’ll see if there are options like this that we can explore.
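
To see how a single bad character stops an XML parse, here is a minimal sketch using PHP's libxml functions. The helper xmlWellFormednessErrors() is hypothetical; it is not part of Full-Text RSS or SimplePie, and it only reports what libxml itself says about a document.

```php
<?php
// Hypothetical helper (an assumption, not part of Full-Text RSS or SimplePie):
// returns an empty array if $xml is well-formed, otherwise libxml's error messages.
function xmlWellFormednessErrors(string $xml): array
{
    if ($xml === '') {
        return ['empty document']; // DOMDocument::loadXML() rejects empty input
    }

    // Collect parser errors internally instead of emitting PHP warnings.
    $previous = libxml_use_internal_errors(true);
    libxml_clear_errors();

    $doc = new DOMDocument();
    $doc->loadXML($xml);

    $errors = [];
    foreach (libxml_get_errors() as $error) {
        $errors[] = sprintf('line %d: %s', $error->line, trim($error->message));
    }

    // Restore the previous libxml error-handling state.
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    return $errors;
}

// An unescaped ampersand is enough to make the whole document not well-formed.
$broken = "<rss><channel>\n<title>A & B</title>\n</channel></rss>";
foreach (xmlWellFormednessErrors($broken) as $message) {
    echo $message, "\n";
}
```

The line numbers in the reported errors are the kind of detail worth passing on to a feed publisher.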

In such cases we usually recommend users of the feed contact the feed publisher and alert them to the problem because there will be many feed readers that will not be able to read their feed. Pointing to the feed validator results is usually a good idea because it tells the publisher where the problem is.

In this particular case, the error is in the content of the item (<content:encoded>). Full-Text RSS replaces this with content it retrieves, so we may add a rule that if feed parsing fails, we strip out all <content:encoded> elements in a pre-parsing step and then try to parse again.
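
The pre-parsing idea described above could look roughly like the following. This is only a sketch under stated assumptions: the function names isWellFormed() and prepareFeedForParsing() are hypothetical, the regex-based stripping is one possible approach, and the actual Full-Text RSS implementation may differ.

```php
<?php
// Assumption: an illustrative sketch, not the actual Full-Text RSS code.

// Returns true if libxml can parse $xml without a fatal error.
function isWellFormed(string $xml): bool
{
    if ($xml === '') {
        return false; // DOMDocument::loadXML() rejects empty input
    }
    $previous = libxml_use_internal_errors(true); // suppress PHP warnings
    $ok = (new DOMDocument())->loadXML($xml);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);
    return $ok;
}

// Returns XML that is safe to hand to a feed parser, or null if it cannot be repaired this way.
function prepareFeedForParsing(string $xml): ?string
{
    if (isWellFormed($xml)) {
        return $xml; // nothing to fix
    }

    // Drop every <content:encoded> element together with its (possibly broken) body.
    // The content is replaced by extracted article text later anyway.
    $stripped = preg_replace('~<content:encoded\b[^>]*>.*?</content:encoded>~s', '', $xml);

    if ($stripped !== null && isWellFormed($stripped)) {
        return $stripped;
    }
    return null; // the error is somewhere else in the feed
}
```

Only feeds that fail the first parse are modified, so well-formed feeds pass through unchanged.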

By the way, feeds which have <content:encoded> elements usually already contain the full article text. Out of curiosity I checked some items in this feed, and it does seem to be the case here. If you haven’t considered it already, you could try just plugging the source feed URL into your aggregator as it is. You’ll probably get more reliable results that way.