@blinkingline , I don’t get this error. Neither on my own 3.9.13 nor on the public 3.10.
If you can identify the problematic article, activate the ‘debug’ output and post that content. Or post the debug of the feed fetch. But please post it as a code block or file attachment, so we can read it easily.
And you should mask sensitive information, like your public server name, before posting it here.
Some websites block user agents like ‘curl’, ‘PHP’, or ‘GoogleBot’, so the current config file sets a common browser user agent to get the content. Some websites, like this one, make it easy for us to bypass their blocking just by using a different user agent.
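Judging from the debug output below (this is a reconstruction from the log, not an authoritative copy of the shipped file), the relevant part of global.txt looks something like:

```
# global.txt (excerpt, reconstructed from the debug log)
http_header(User-Agent): Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:133.0) Gecko/20100101 Firefox/133.0
http_header(Referer): http://www.google.co.uk/url?sa=t&source=web&cd=1
```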
Using the URL from above and just looking for a single article, here’s what the debug looks like:
* APC is disabled or not available on server
* Supplied URL: http://phys.org/rss-feed/space-news/
* ** Loading class HumbleHttpAgent (humble-http-agent/HumbleHttpAgent.php)
* ** Loading class ContentExtractor (content-extractor/ContentExtractor.php)
* ** Loading class SiteConfig (content-extractor/SiteConfig.php)
* --------
* Attempting to process URL as feed
* ** Loading class SimplePie_HumbleHttpAgent (humble-http-agent/SimplePie_HumbleHttpAgent.php)
* ** Loading class DisableSimplePieSanitize (DisableSimplePieSanitize.php)
* Fetching URL (http://phys.org/rss-feed/space-news/)
* Starting parallel fetch (curl_multi_*)
* Processing set of 1
* ...http://phys.org/rss-feed/space-news/
* ......adding to pool
* . looking for site config for phys.org in custom folder
* . looking for site config for phys.org in standard folder
* ... found site config in standard folder (phys.org.txt)
* Cached site config with key phys.org
* Cached site config with key phys.org.merged
* Checking fingerprints...
* No fingerprint matches
* . looking for site config for global in custom folder
* . looking for site config for global in standard folder
* ... found site config in standard folder (global.txt)
* Cached site config with key global
* Cached site config with key global.merged.ex
* Appending site config settings from global.txt
* ......user-agent set to: Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:133.0) Gecko/20100101 Firefox/133.0
* ......referer set to: http://www.google.co.uk/url?sa=t&source=web&cd=1
* Sending request...
* Received responses
* Redirect detected. Valid URL: https://phys.org/rss-feed/space-news/
* ** Loading class CookieJar (humble-http-agent/CookieJar.php)
* Following redirects #1...
* Starting parallel fetch (curl_multi_*)
* Processing set of 1
* ...https://phys.org/rss-feed/space-news/
* ......adding to pool
* ... site config for phys.org.merged already loaded in this request
* Checking fingerprints...
* No fingerprint matches
* ... site config for global.merged.ex already loaded in this request
* Appending site config settings from global.txt
* ......user-agent set to: Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:133.0) Gecko/20100101 Firefox/133.0
* ......referer set to: http://www.google.co.uk/url?sa=t&source=web&cd=1
* Sending request...
* Received responses
* ... site config for phys.org.merged already loaded in this request
* Checking fingerprints...
* No fingerprint matches
* ... site config for global.merged.ex already loaded in this request
* Appending site config settings from global.txt
* ** Loading class FeedWriter (feedwriter/FeedWriter.php)
* --------
* Fetching feed items
* Starting parallel fetch (curl_multi_*)
* Processing set of 1
* ...https://phys.org/news/2025-03-dark-universe-telescope.html
* ......adding to pool
* ... site config for phys.org.merged already loaded in this request
* Checking fingerprints...
* No fingerprint matches
* ... site config for global.merged.ex already loaded in this request
* Appending site config settings from global.txt
* ......user-agent set to: Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:133.0) Gecko/20100101 Firefox/133.0
* ......referer set to: http://www.google.co.uk/url?sa=t&source=web&cd=1
* Sending request...
* Received responses
* ... site config for phys.org.merged already loaded in this request
* Checking fingerprints...
* No fingerprint matches
* ... site config for global.merged.ex already loaded in this request
* Appending site config settings from global.txt
* --------
* Processing feed item 1
* Item URL: https://phys.org/news/2025-03-dark-universe-telescope.html
* ** Loading class FeedItem (feedwriter/FeedItem.php)
* URL already fetched - in memory (https://phys.org/news/2025-03-dark-universe-telescope.html, effective: https://phys.org/news/2025-03-dark-universe-telescope.html)
* Done!
I am not a dev, but I see no reason for the failure in the log. Could you open a console prompt on your server and try to cURL the article? Check whether you get the content or some server message instead.
curl -A "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:133.0) Gecko/20100101 Firefox/133.0" http://phys.org/news/2025-03-wolf-rayet-pinwheel-star-reveals.html
Maybe also with https instead of http.
Is your server using some kind of VPN or DNS filter like Pi-hole?
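To compare http and https quickly, the check can be wrapped in a small helper (a minimal sketch; the function name is mine and the user agent is the one from the config above):

```shell
#!/bin/sh
# Fetch a URL with the browser User-Agent from the config and print only
# the HTTP status code, so http and https responses are easy to compare.
status_for() {
  url="$1"
  curl -s -o /dev/null -w '%{http_code}' \
    -A "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:133.0) Gecko/20100101 Firefox/133.0" \
    "$url"
}
# Example (run against the live site):
# status_for "http://phys.org/news/2025-03-wolf-rayet-pinwheel-star-reveals.html"
# status_for "https://phys.org/news/2025-03-wolf-rayet-pinwheel-star-reveals.html"
```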
That command returns this bit of HTML when I use https (it returns nothing with plain http):
<p>This request seems a bit unusual, so we need to confirm that you're human. Please press and hold the button until it turns completely green. Thank you for your cooperation!</p><div class="clearfix"><button id="holdButton"><span>Press and Hold</span><div class="progress-bar" id="progressBar"></div></button><p id="status">Press and hold the button</p></div><p>If you believe this is an error, please contact our <a href="https://sciencex.com/help/" rel="nofollow">support team</a>.</p><hr><p><small>MYDROPLETSIPADDRESS : d5a3642f-fe6f-4b2c-9217-53380913</small></p></div></body></html>
I’m running this from a basic Digital Ocean droplet, no extra VPN or filters.
I had feared something like this. Is there something special about your connection that makes phys.org think you’re not human? A VPN, DNS proxy, or other server software that might have gotten your IP blacklisted? Maybe IPv6 only?
But maybe we’re in luck. Please try the following user agents in turn, in each case both with and without the referer in the last line:
Remove the ‘#’ in front of a line to activate it.
These all use https, simply because when I curl with plain http it fails completely.
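A test block like this in the site config (for example a custom phys.org.txt; the file name and placement are assumptions on my part) would then look something like:

```
# custom/phys.org.txt (hypothetical example)
# Uncomment exactly one User-Agent line per test run:
#http_header(User-Agent): Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:138.0) Gecko/20100101 Firefox/138.0
#http_header(User-Agent): Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)
# Last line: try each case with and without this referer:
#http_header(Referer): http://www.google.co.uk/url?sa=t&source=web&cd=1
```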
Case 1: #http_header(User-Agent): Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:138.0) Gecko/20100101 Firefox/138.0
No joy. Each entry has [unable to retrieve full-text content] followed by a blurb.
Case 2: #http_header(User-Agent): Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)
This works as expected!
Case 3: #http_header(User-Agent): Mozilla/5.0 (compatible; Yahoo! Slurp; http://help.yahoo.com/help/us/ysearch/slurp)
This also works.
Case 4: #http_header(User-Agent): Mozilla/5.0 (compatible; bingbot/2.0; +http://www.bing.com/bingbot.htm)
Another success.
Case 5: #http_header(User-Agent): curl/7.83.1
Unable to retrieve any content at all.
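For reference, the cases above can be re-checked from a shell with a small probe (a sketch of my own, not part of the tool; it looks for the “Press and Hold” marker from the challenge page quoted earlier, since the block page may well come back with a normal 200 status):

```shell
#!/bin/sh
# Report whether a given User-Agent receives real content or the
# "Press and Hold" challenge page seen earlier in this thread.
probe() {
  ua="$1"
  body=$(curl -s -A "$ua" "https://phys.org/news/2025-03-dark-universe-telescope.html")
  case "$body" in
    *"Press and Hold"*) echo "$ua -> blocked" ;;
    *)                  echo "$ua -> ok" ;;
  esac
}
# Examples (run against the live site):
# probe "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
# probe "curl/7.83.1"
```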