Trying to get a full extraction from Boing Boing

When I try to add Boing Boing https://boingboing.net to RFT FTR, I’m getting an error:

### This page contains the following errors:

error on line 2 at column 1: Extra content at the end of the document

### Below is a rendering of the page up to the first error.

I suspect this is a problem with the siteconfig for boing boing, but I have no idea how I should fix it. When I try to load it into the point and click extractor, it won’t load the URL at all.

Any suggestions?

Hello @blinkingline, welcome to the FiveFilters forum!

‘RFT’? Or did you mean FTR for Fulltext-RSS?

The HTML structure on Boing Boing changed some time ago, so our existing config for this domain no longer works.

The main problem was the single_page_link line, which no longer matches any useful content.

I fixed that config just now. If you are a self-hoster, please go to /admin/update.php and click the update button. For ftr.fivefilters.net you have to wait for the next full hour until the new config takes effect.

If you have further issues, please send a direct link to the articles you are trying to fetch, not only the domain.

That worked perfectly, thanks so much!

I just noticed this morning that I am no longer getting a full extraction from Boing Boing. I’m using self-hosted Full-Text RSS 3.9.13.

If I look at this article: https://boingboing.net/2025/12/28/famed-actress-brigitte-bardot-dead-at-91.html

In my feed I get the [unable to retrieve full-text content] message and the content stops after the second paragraph and is truncated with a “Read the Rest” link:


I’m guessing I’m missing a class or something but I’m not sure which one it is from looking at the source.

I have no problems with that link on my self-hosted FTR 3.9.13, @blinkingline. Maybe there was a temporary problem on Boing Boing?

The “Read the rest” link is from the original feed. If fetching the full text fails, the original <description> from the feed is used in FTR’s output feed.
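To make that fallback concrete, here is a minimal Python sketch of the behaviour described above. It is not FTR's actual code (FTR is written in PHP); the function name `choose_item_content` and the exact way the note is joined to the description are assumptions for illustration only. The marker text is the message reported earlier in this thread.

```python
from typing import Optional

# Message seen in the output feed when extraction fails (quoted from this thread).
FAILURE_NOTE = "[unable to retrieve full-text content]"


def choose_item_content(full_text: Optional[str], feed_description: str) -> str:
    """Hypothetical helper: prefer extracted full text, else fall back.

    If extraction produced nothing usable, keep the original feed
    <description> and prefix it with the failure note.
    """
    if full_text and full_text.strip():
        return full_text
    # Extraction failed: the reader still gets the (possibly truncated)
    # description that the site itself published in its feed.
    return f"{FAILURE_NOTE} {feed_description}"


if __name__ == "__main__":
    print(choose_item_content(None, "First two paragraphs... Read the rest"))
```

So a truncated item that still ends in “Read the rest” suggests the extraction step failed and the feed's own description was passed through.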

Could you please test the article link directly with FTR? If you get an error, enable the log and post it.

Hrm, I’m still getting the error.

If I do the single link as the feed, this is all I see:

Here’s the debug output, but it doesn’t look like there’s any error there?

* APC is disabled or not available on server
* Supplied URL: https://boingboing.net/2025/12/29/in-case-you-missed-it-sony-just-patented-ai-generated-tutorials.html
* ** Loading class HumbleHttpAgent (humble-http-agent/HumbleHttpAgent.php)
* ** Loading class ContentExtractor (content-extractor/ContentExtractor.php)
* ** Loading class SiteConfig (content-extractor/SiteConfig.php)
* --------
* Attempting to process URL as feed
* ** Loading class SimplePie_HumbleHttpAgent (humble-http-agent/SimplePie_HumbleHttpAgent.php)
* ** Loading class DisableSimplePieSanitize (DisableSimplePieSanitize.php)
* Fetching URL (https://boingboing.net/2025/12/29/in-case-you-missed-it-sony-just-patented-ai-generated-tutorials.html)
* Starting parallel fetch (curl_multi_*)
* Processing set of 1
* ...https://boingboing.net/2025/12/29/in-case-you-missed-it-sony-just-patented-ai-generated-tutorials.html
* ......adding to pool
* . looking for site config for boingboing.net in custom folder
* . looking for site config for boingboing.net in standard folder
* ... found site config in standard folder (boingboing.net.txt)
* Cached site config with key boingboing.net
* Cached site config with key boingboing.net.merged
* Checking fingerprints...
* No fingerprint matches
* . looking for site config for global in custom folder
* . looking for site config for global in standard folder
* ... found site config in standard folder (global.txt)
* Cached site config with key global
* Cached site config with key global.merged.ex
* Appending site config settings from global.txt
* ......user-agent set to: PHP/7.4
* ......referer set to: http://www.google.co.uk/url?sa=t&source=web&cd=1
* Sending request...
* Received responses
* ... site config for boingboing.net.merged already loaded in this request
* Checking fingerprints...
* No fingerprint matches
* ... site config for global.merged.ex already loaded in this request
* Appending site config settings from global.txt
* --------
* Constructing a single-item feed from URL
* ** Loading class FeedWriter (feedwriter/FeedWriter.php)
* --------
* Fetching feed items
* Starting parallel fetch (curl_multi_*)
* Processing set of 1
* ...https://boingboing.net/2025/12/29/in-case-you-missed-it-sony-just-patented-ai-generated-tutorials.html
* ......in memory
* --------
* Processing feed item 1
* Item URL: https://boingboing.net/2025/12/29/in-case-you-missed-it-sony-just-patented-ai-generated-tutorials.html
* ** Loading class FeedItem (feedwriter/FeedItem.php)
* URL already fetched - in memory (https://boingboing.net/2025/12/29/in-case-you-missed-it-sony-just-patented-ai-generated-tutorials.html, effective: https://boingboing.net/2025/12/29/in-case-you-missed-it-sony-just-patented-ai-generated-tutorials.html)
* Done!
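For anyone comparing logs like this one between two installs, a small script can pull out the few lines that usually matter: which site config files were picked up, whether a fingerprint matched, and which user-agent was sent. The sketch below is Python and only relies on the line formats visible in the log above; the function name `summarize_debug_log` is made up for this example, and other FTR versions may word these lines differently.

```python
import re


def summarize_debug_log(log_text: str) -> dict:
    """Collect site config, fingerprint, and user-agent info from a debug log."""
    summary = {"site_configs": [], "user_agent": None, "fingerprint_match": None}
    for line in log_text.splitlines():
        # e.g. "... found site config in standard folder (boingboing.net.txt)"
        found = re.search(r"found site config in (\w+) folder \((.+?)\)", line)
        if found:
            summary["site_configs"].append((found.group(1), found.group(2)))
            continue
        # e.g. "......user-agent set to: PHP/7.4"
        agent = re.search(r"user-agent set to:\s*(.+)", line)
        if agent:
            summary["user_agent"] = agent.group(1).strip()
            continue
        if "No fingerprint matches" in line:
            summary["fingerprint_match"] = False
    return summary


if __name__ == "__main__":
    sample = (
        "* ... found site config in standard folder (boingboing.net.txt)\n"
        "* No fingerprint matches\n"
        "* ... found site config in standard folder (global.txt)\n"
        "* ......user-agent set to: PHP/7.4\n"
    )
    print(summarize_debug_log(sample))
```

Running it against the full log from a working install and a failing one makes it easier to spot whether a custom config or a different user-agent is in play.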

Have you changed anything in the site-specific config, @blinkingline?

../site_config/custom/boingboing.net.txt
should not exist unless you have tried to fix something yourself (if it does, make a backup and remove it)

and

../site_config/standard/boingboing.net.txt
should match the version in the repo.

I had tried something with a custom config but deleted it. The standard site config matches what I have on my install.

I really have no idea (I am not a coder), so this is just wild guessing.
Please open your-ftr-domain/ftr_compatibility_test.php in a browser.

Which PHP version are you on? Is everything enabled in the table? Is anything not enabled in the paragraphs below that table? APC is not needed with only a few users. I have deactivated caching completely.

Everything is good there except that I am not running php-tidy. For that it recommends I use the HTML5 parser for “problematic” feeds. Is there a way to force that?

You did not name your PHP version. Versions above 8.1 may also cause problems at the moment.

You should consider installing php-tidy, or you might run into more problems on other domains.

I don’t know if the parser directive in boingboing.net.txt is meant for this, but you can try adding ONE of these:

parser: html5php
parser: html5lib

I’m on PHP 8.4.5, which as you say might be part of the issue. I’ve installed the php-tidy module, but it doesn’t seem to have corrected the issue.

I’ll give the parser route a go and see if anything improves.

Adding the parser directive to a custom site config produces the same result, unfortunately.