I’ve found it difficult to debug why some feeds fail to extract content on my self-hosted installation, while the same feeds extract fine through your site. For example, this works fine:
https://ftr.fivefilters.net/makefulltextfeed.php?max=1&url=https://www.employeebenefitsblog.com/feed/
On the other hand, my server cannot scrape it. Below is the full debug output, including the raw HTML. I have tried writing a custom config, with no luck. I am on 3.9.11 with a rather old PHP version, 5.6. I have tried various user-agent and referrer configs, also with no luck. I’m happy to provide any details of my setup that might be useful. I do successfully scrape hundreds of sites for our clients. The failures tend to cluster around certain clients, which suggests some type of client-specific blocking. Yet your site can extract some of these.
Any assistance is appreciated.
* APC is disabled or not available on server
* Supplied URL: https://www.employeebenefitsblog.com/feed/
* ** Loading class HumbleHttpAgent (humble-http-agent/HumbleHttpAgent.php)
* ** Loading class ContentExtractor (content-extractor/ContentExtractor.php)
* ** Loading class SiteConfig (content-extractor/SiteConfig.php)
* --------
* Attempting to process URL as feed
* ** Loading class SimplePie_HumbleHttpAgent (humble-http-agent/SimplePie_HumbleHttpAgent.php)
* ** Loading class DisableSimplePieSanitize (DisableSimplePieSanitize.php)
* Fetching URL (https://www.employeebenefitsblog.com/feed/)
* Starting parallel fetch (curl_multi_*)
* Processing set of 1
* ...https://www.employeebenefitsblog.com/feed/
* ......adding to pool
* . looking for site config for employeebenefitsblog.com in custom folder
* ... found site config (employeebenefitsblog.com.txt)
* . looking for site config for employeebenefitsblog.com in standard folder
* ... no site config match for employeebenefitsblog.com
* Cached site config with key employeebenefitsblog.com.merged
* Checking fingerprints...
* No fingerprint matches
* . looking for site config for global in custom folder
* ... found site config (global.txt)
* Cached site config with key global
* . looking for site config for global in standard folder
* ... site config for global already loaded in this request
* . merging config files
* Cached site config with key global.merged.ex
* Appending site config settings from global.txt
* ......user-agent set to: Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:138.0) Gecko/20100101 Firefox/138.0
* ......referer set to: http://www.google.co.uk/url?sa=t&source=web&cd=1
* Sending request...
* Received responses
* ... site config for employeebenefitsblog.com.merged already loaded in this request
* Checking fingerprints...
* No fingerprint matches
* ... site config for global.merged.ex already loaded in this request
* Appending site config settings from global.txt
* --------
* Constructing a single-item feed from URL
* ** Loading class FeedWriter (feedwriter/FeedWriter.php)
* LAST DATE IS NOT SET
* --------
* Fetching feed items
* From these URLs: Array
(
[0] => https://www.employeebenefitsblog.com/feed/
)
* Starting parallel fetch (curl_multi_*)
* Processing set of 1
* ...https://www.employeebenefitsblog.com/feed/
* ......in memory
* --------
* Processing feed item 1
* Item URL: https://www.employeebenefitsblog.com/feed/
* ** Loading class FeedItem (feedwriter/FeedItem.php)
* Fetching URL (https://www.employeebenefitsblog.com/feed/)
* Starting parallel fetch (curl_multi_*)
* Processing set of 1
* ...https://www.employeebenefitsblog.com/feed/
* ......in memory
* No character encoding found, so treating as UTF-8
* Here are the HTTP response headers from the remote server:
HTTP/1.1 200 OK
Content-Type: text/html
Cache-Control: no-cache, no-store
Connection: close
Content-Length: 853
X-Iinfo: 53-245284477-0 0NNN RT(1773867264906 8) q(0 -1 -1 -1) r(0 -1) B12(11,12041106,0) U24
Set-Cookie: visid_incap_2358687=5OKwKHnBT3mQuckoCSMALAARu2kAAAAAQUIPAAAAAABEtrGdlMQMfk3b6szT7L1w; expires=Thu, 18 Mar 2027 07:03:32 GMT; HttpOnly; path=/; Domain=.employeebenefitsblog.com
Set-Cookie: incap_ses_414_2358687=sTiKOfyiVFMlvOeJ+NK+BQARu2kAAAAAr4Wvk+nNEY8lQTw+pSiG4Q==; path=/; Domain=.employeebenefitsblog.com
* Here's the raw HTML (after attempted UTF-8 conversion):
<html style="height:100%"><head><META NAME="ROBOTS" CONTENT="NOINDEX, NOFOLLOW"><meta name="format-detection" content="telephone=no"><meta name="viewport" content="initial-scale=1.0"><meta http-equiv="X-UA-Compatible" content="IE=edge,chrome=1"></head><body style="margin:0px;height:100%"><iframe id="main-iframe" src="/_Incapsula_Resource?SWUDNSAI=31&xinfo=53-245284477-0%200NNN%20RT%281773867264906%208%29%20q%280%20-1%20-1%20-1%29%20r%280%20-1%29%20B12%2811%2c12041106%2c0%29%20U24&incident_id=414000180670461100-1370686680232494261&edet=12&cinfo=0b000000&rpinfo=0&cts=%2fVQBWTPClYI6ShWoqEN45Hd9rXXC64rU%2fvlfDQnbtsPcMa3%2bHXQJ53EkgHrp7gsL&cip=50.18.214.17&mth=GET" frameborder=0 width="100%" height="100%" marginheight="0px" marginwidth="0px">Request unsuccessful. Incapsula incident ID: 414000180670461100-1370686680232494261</iframe></body></html>
Hi @drpudding,
I am not a developer, just guessing.
It could be your very old PHP version, but since 5.6 is the minimum required version, I don’t think it is the problem.
In your log there is an HTML snippet with the content:
Request unsuccessful. Incapsula incident ID: xxx
Incapsula is a content delivery network that includes a web application firewall and bot-protection service.
So I think your servers might be on their blacklist. But then fivefilters.net would most likely also be on that list.
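If it helps, here is a rough way to check whether a fetched body is an Incapsula block page rather than real content. This is only a sketch based on the markers visible in your log (the `_Incapsula_Resource` iframe and the `visid_incap_` / `incap_ses_` cookies), not an official detection method:

```python
# Rough sketch: flag responses that look like an Incapsula challenge page
# rather than real content. The markers below are just the ones visible
# in the debug log above, not an official detection method.
def looks_like_incapsula(body, headers=None):
    markers = ("_Incapsula_Resource", "Incapsula incident ID")
    if any(m in body for m in markers):
        return True
    # Incapsula also sets cookies named visid_incap_* / incap_ses_*
    for header in (headers or []):
        if "visid_incap_" in header or "incap_ses_" in header:
            return True
    return False

# The block page from the log above triggers the check:
blocked = ('<iframe id="main-iframe" src="/_Incapsula_Resource?SWUDNSAI=31">'
           'Request unsuccessful. Incapsula incident ID: 123</iframe>')
print(looks_like_incapsula(blocked))   # True

# A normal article body does not:
print(looks_like_incapsula("<html><body><p>Article text</p></body></html>"))  # False
```

You could run something like this over your fetched bodies to see which clients are consistently behind Incapsula.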
I can fetch the feed on my self-hosted 3.9.13, but only the first three articles show an image; all the others get only a placeholder icon, while on ftr.fivefilters.net all images are placeholders. I don’t think the images are relevant in these articles, but it could be a hint of some timing or blocking behavior.
I don’t think this will help, but you may try these user-agents (activate only one per test).
Crossing fingers
#http_header(User-Agent): Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:143.0) Gecko/20100101 Firefox/143.0
#http_header(user-agent): Mozilla/5.0 (Macintosh; Intel Mac OS X 14.7; rv:140.0) Gecko/20100101 Firefox/140.0
#http_header(User-Agent): Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)
#http_header(User-Agent): Mozilla/5.0 (compatible; Yahoo! Slurp; http://help.yahoo.com/help/us/ysearch/slurp)
#http_header(User-Agent): Mozilla/5.0 (compatible; bingbot/2.0; +http://www.bing.com/bingbot.htm)
#http_header(User-Agent): curl/7.83.1
#http_header(User-Agent): curl/7.54.0
#http_header(User-Agent): PHP/8.4
#http_header(User-Agent): PHP/7.4
#http_header(User-Agent): PHP/5.6
#http_header(User-Agent): Mastodon/4.3.2 (http.rb/5.2.0; +https://mastodon.social/)
#http_header(User-Agent): Mastodon/4.3.2 (http.rb/5.2.0; +https://mastodon.social/) Bot
http_header(referer): https://www.employeebenefitsblog.com/
body: //div[contains(@class, 'et_pb_post_content')]
title: //meta[@property='og:title']/@content
prune: no
tidy: no
test_url: https://www.employeebenefitsblog.com/2025/10/eeoc-enforcement-actions-underscore-employers-religious-accommodation-policies/
Thank you for the quick response. Unfortunately, I had already tried that very same user-agent experiment, having read about it in another forum post. I tried it again with your exact config settings, but still no luck. I usually assume these failures are due to blocking, especially since this client has several other failing feeds on different domains.
If I can bother you with another question: here is another feed that works for you but not for me (a different client):
https://ftr.fivefilters.net/makefulltextfeed.php?max=1&url=https://professionalliabilitymatters.com/feed
When I attempt it, I see the following at the end:
* Starting parallel fetch (curl_multi_*)
* Processing set of 1
* ...https://professionalliabilitymatters.com/feed
* ......in memory
* --------
* Processing feed item 1
* Item URL: https://professionalliabilitymatters.com/feed
* ** Loading class FeedItem (feedwriter/FeedItem.php)
* URL already fetched - in memory (https://professionalliabilitymatters.com/feed, effective: https://professionalliabilitymatters.com/feed)
* Here are the HTTP response headers from the remote server:
* Here's the raw HTML (after attempted UTF-8 conversion):
My question relates to the mention of “in memory.” Our service scrapes all of the feeds we manage every two hours, looking for new content. Do these “in memory” mentions mean that some caching is in play, such that a successive run does not attempt a new scrape if it sees a previous attempt cached? I want to be sure that on successive scrape attempts, even if a feed URL has recently been attempted, it will be attempted again, especially if I have updated the site config file.
I have caching & apc = false in my config.
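For reference, here is a sketch of the lines I mean in config.php; the option names are assumed from the standard Full-Text RSS 3.x config, so please check them against your version:

```php
// Assumed option names from the standard Full-Text RSS config.php;
// verify against your installed version.
$options->caching = false; // disable caching of processed results
$options->apc = false;     // disable APC-based caching
```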
Thanks again!
As I wrote, I am not a dev. Caching is off, as we could see in your first post, so there should be no stale fetch. I think the feed URL was simply fetched a bit earlier within the same request and reused from memory.
You didn’t post the full log, so I found no hints this time.
The feed items link to articles on a different domain. Could you try to fetch the article link only:
/makefulltextfeed.php?max=1&url=https://www.goldbergsegalla.com/blog/professional-liability-matters/conflicts/a-friend-on-the-bench-is-a-conflict-indeed
If you have console access to the server, try to curl the article and inspect whether the result contains the real content or an Incapsula/Cloudflare challenge message:
curl -s -A "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:145.0) Gecko/20100101 Firefox/145.0" -o /tmp/test.txt https://example.com/foo-bar
I tried to fetch the individual articles via FTR, but still no luck.
I ran a cURL request over console/SSH and via PHP and got the same Incapsula response:
[body] => <html style="height:100%"><head><META NAME="ROBOTS" CONTENT="NOINDEX, NOFOLLOW"><meta name="format-detection" content="telephone=no"><meta name="viewport" content="initial-scale=1.0"><meta http-equiv="X-UA-Compatible" content="IE=edge,chrome=1"></head><body style="margin:0px;height:100%"><iframe id="main-iframe" src="/_Incapsula_Resource?SWUDNSAI=31&xinfo=45-118181445-0%200NNN%20RT%281773946775164%2050%29%20q%280%20-1%20-1%20-1%29%20r%280%20-1%29%20B12%2811%2c12041205%2c0%29%20U24&incident_id=361000200753999034-653764554427794733&edet=12&cinfo=0b000000&rpinfo=0&cts=kt1V0FKJEZ5VUUAqyoDk9faHQ76P1DanuYyDkmxjv%2fg8sb3rUZwlVfti8zleiHL8&cip=52.52.114.225&mth=GET" frameborder=0 width="100%" height="100%" marginheight="0px" marginwidth="0px">Request unsuccessful. Incapsula incident ID: 361000200753999034-653764554427794733</iframe></body></html>
[http_code] => 200
We are on AWS servers, which I believe trigger more blocking than wherever the fivefilters.net requests run from.