unable to retrieve full-text content at phys.org

I’m trying to get the space news feed from phys.org.

Feed is Space News - Space, Astronomy, Space Exploration

When I try to look at the feed I get some of the content, but it is prefixed with [unable to retrieve full-text content].

When I look at the feed, it looks valid, save for a weird thing about the <channel> tag.

Debugging the feed shows the <item><link> to point to an article like "Wolf-Rayet 104 'pinwheel' star reveals a surprise (and some relief)", which also does not retrieve. The existing phys.org site config seems to only have a user-agent in it.

@blinkingline , I don’t get this error. Neither on my own 3.9.13 nor on the public 3.10.

If you can identify the problematic article, activate the ‘debug’ output and post that content, or post the debug of the feed fetch. But please post it as a code block or file attachment, so we can read it easily.
And you should mask sensitive information, like your public server name, before you post it here.

Some websites block user agents like ‘curl’, ‘PHP’, or ‘GoogleBot’, so the current config file sets a common browser user agent so we get the content. Some websites, like this one, make it easy for us to bypass their blocking just by using a different user agent.
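For reference, in a Full-Text RSS site config the user agent is set with an `http_header` line; the existing phys.org config apparently sets a browser string like the one visible in the debug output further down. This is just the general shape, your installed `phys.org.txt` may differ:

```
http_header(User-Agent): Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:133.0) Gecko/20100101 Firefox/133.0
```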

What’s exceptionally weird is that it’s not a single article, it’s everything in the feed.

Using the URL from above and just looking for a single article, here’s what the debug looks like:

* APC is disabled or not available on server
* Supplied URL: http://phys.org/rss-feed/space-news/
* ** Loading class HumbleHttpAgent (humble-http-agent/HumbleHttpAgent.php)
* ** Loading class ContentExtractor (content-extractor/ContentExtractor.php)
* ** Loading class SiteConfig (content-extractor/SiteConfig.php)
* --------
* Attempting to process URL as feed
* ** Loading class SimplePie_HumbleHttpAgent (humble-http-agent/SimplePie_HumbleHttpAgent.php)
* ** Loading class DisableSimplePieSanitize (DisableSimplePieSanitize.php)
* Fetching URL (http://phys.org/rss-feed/space-news/)
* Starting parallel fetch (curl_multi_*)
* Processing set of 1
* ...http://phys.org/rss-feed/space-news/
* ......adding to pool
* . looking for site config for phys.org in custom folder
* . looking for site config for phys.org in standard folder
* ... found site config in standard folder (phys.org.txt)
* Cached site config with key phys.org
* Cached site config with key phys.org.merged
* Checking fingerprints...
* No fingerprint matches
* . looking for site config for global in custom folder
* . looking for site config for global in standard folder
* ... found site config in standard folder (global.txt)
* Cached site config with key global
* Cached site config with key global.merged.ex
* Appending site config settings from global.txt
* ......user-agent set to: Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:133.0) Gecko/20100101 Firefox/133.0
* ......referer set to: http://www.google.co.uk/url?sa=t&source=web&cd=1
* Sending request...
* Received responses
* Redirect detected. Valid URL: https://phys.org/rss-feed/space-news/
* ** Loading class CookieJar (humble-http-agent/CookieJar.php)
* Following redirects #1...
* Starting parallel fetch (curl_multi_*)
* Processing set of 1
* ...https://phys.org/rss-feed/space-news/
* ......adding to pool
* ... site config for phys.org.merged already loaded in this request
* Checking fingerprints...
* No fingerprint matches
* ... site config for global.merged.ex already loaded in this request
* Appending site config settings from global.txt
* ......user-agent set to: Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:133.0) Gecko/20100101 Firefox/133.0
* ......referer set to: http://www.google.co.uk/url?sa=t&source=web&cd=1
* Sending request...
* Received responses
* ... site config for phys.org.merged already loaded in this request
* Checking fingerprints...
* No fingerprint matches
* ... site config for global.merged.ex already loaded in this request
* Appending site config settings from global.txt
* ** Loading class FeedWriter (feedwriter/FeedWriter.php)
* --------
* Fetching feed items
* Starting parallel fetch (curl_multi_*)
* Processing set of 1
* ...https://phys.org/news/2025-03-dark-universe-telescope.html
* ......adding to pool
* ... site config for phys.org.merged already loaded in this request
* Checking fingerprints...
* No fingerprint matches
* ... site config for global.merged.ex already loaded in this request
* Appending site config settings from global.txt
* ......user-agent set to: Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:133.0) Gecko/20100101 Firefox/133.0
* ......referer set to: http://www.google.co.uk/url?sa=t&source=web&cd=1
* Sending request...
* Received responses
* ... site config for phys.org.merged already loaded in this request
* Checking fingerprints...
* No fingerprint matches
* ... site config for global.merged.ex already loaded in this request
* Appending site config settings from global.txt
* --------
* Processing feed item 1
* Item URL: https://phys.org/news/2025-03-dark-universe-telescope.html
* ** Loading class FeedItem (feedwriter/FeedItem.php)
* URL already fetched - in memory (https://phys.org/news/2025-03-dark-universe-telescope.html, effective: https://phys.org/news/2025-03-dark-universe-telescope.html)
* Done!

And this is what the resulting render looks like:

I am not a dev, but I see no reason in the log. Could you open a console prompt on your server and try to cURL the article? Check whether you get the content or some server message instead.

curl -A "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:133.0) Gecko/20100101 Firefox/133.0" http://phys.org/news/2025-03-wolf-rayet-pinwheel-star-reveals.html 

Maybe also with https instead of http.

Is your server using some kind of VPN or DNS filter like Pi-hole?

That curl gives back this bit of HTML when I use https (it returns nothing with plain http):

<p>This request seems a bit unusual, so we need to confirm that you're human. Please press and hold the button until it turns completely green. Thank you for your cooperation!</p><div class="clearfix"><button id="holdButton"><span>Press and Hold</span><div class="progress-bar" id="progressBar"></div></button><p id="status">Press and hold the button</p></div><p>If you believe this is an error, please contact our <a href="https://sciencex.com/help/" rel="nofollow">support team</a>.</p><hr><p><small>MYDROPLETSIPADDRESS : d5a3642f-fe6f-4b2c-9217-53380913</small></p></div></body></html>

I’m running this from a basic Digital Ocean droplet, no extra VPN or filters.

I had feared something like this. Is there something special about your connection that makes phys.org think you’re not human? A VPN, DNS proxy, or other server software that might have gotten your IP blacklisted? Maybe IPv6 only?

But maybe we’re in luck. Please try the following user agents in turn, in each case in combination with and without the referer in the last line:

Remove the ‘#’ in front of a line to activate it.

#http_header(User-Agent): Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:138.0) Gecko/20100101 Firefox/138.0
#http_header(User-Agent): Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)
#http_header(User-Agent): Mozilla/5.0 (compatible; Yahoo! Slurp; http://help.yahoo.com/help/us/ysearch/slurp)
#http_header(User-Agent): Mozilla/5.0 (compatible; bingbot/2.0; +http://www.bing.com/bingbot.htm)
#http_header(User-Agent): curl/7.83.1
#http_header(User-Agent): curl/7.54.0
#http_header(User-Agent): PHP/8.4
#http_header(User-Agent): PHP/7.4
#http_header(User-Agent): Mastodon/4.3.2 (http.rb/5.2.0; +https://mastodon.social/)
#http_header(User-Agent): Mastodon/4.3.2 (http.rb/5.2.0; +https://mastodon.social/) Bot

#http_header(referer): https://phys.org/

You could maybe also try https links to phys.org instead of http.

Please report your findings.
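If it’s easier to test from the command line first, here is a small sketch of a helper that reports the HTTP status for a given user agent. The URL and agent strings are the ones from this thread; the `probe` function name is just for this sketch, and results will vary by server IP:

```shell
#!/bin/sh
# probe URL USER_AGENT -> prints "<http-status>  <user-agent>"
# -L follows redirects, -A sets the user agent, -w prints the final status code.
probe() {
    code=$(curl -s -o /dev/null -w '%{http_code}' -L -A "$2" "$1")
    printf '%s  %s\n' "$code" "$2"
}

# Example runs against the article from this thread (uncomment to use):
# URL="https://phys.org/news/2025-03-wolf-rayet-pinwheel-star-reveals.html"
# probe "$URL" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
# probe "$URL" "curl/7.83.1"
# probe "$URL" "Mastodon/4.3.2 (http.rb/5.2.0; +https://mastodon.social/)"
```

A 200 with article HTML versus a challenge page (or nothing at all) should line up with what the FTR debug shows for the same agent.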

So that is not a local server. It could be that these well-known server IP ranges are on a blacklist that phys.org blocks :frowning:

These are each using https, only because when I do a curl with plain http it fails completely.

Case 1:
#http_header(User-Agent): Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:138.0) Gecko/20100101 Firefox/138.0
No joy. Each entry has [unable to retrieve full-text content] followed by a blurb.

Case 2:
#http_header(User-Agent): Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)
This works as expected!

Case 3:
#http_header(User-Agent): Mozilla/5.0 (compatible; Yahoo! Slurp; http://help.yahoo.com/help/us/ysearch/slurp)
This also works.

Case 4:
#http_header(User-Agent): Mozilla/5.0 (compatible; bingbot/2.0; +http://www.bing.com/bingbot.htm)
Another success.

Case 5:
#http_header(User-Agent): curl/7.83.1
Unable to retrieve any content at all.

Case 6:
#http_header(User-Agent): curl/7.54.0
Same result as Case 5.

Case 7:
#http_header(User-Agent): PHP/8.4
Same as Case 5.

Case 8:
#http_header(User-Agent): PHP/7.4
Same as Case 5.

Case 9:
#http_header(User-Agent): Mastodon/4.3.2 (http.rb/5.2.0; +https://mastodon.social/)
Works as expected.

Case 10:
#http_header(User-Agent): Mastodon/4.3.2 (http.rb/5.2.0; +https://mastodon.social/) Bot
Works as expected.

Case 11:
#http_header(referer): https://phys.org/
Same as Case 5.

So looks like there are some viable options.

Case 11 should not be tested separately; the referer should be used in combination with a user agent.

So I will change the config now to use the Mastodon bot as the user agent.
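Assuming the same site-config syntax as the test lines above, the updated phys.org config would then contain roughly this (a sketch; the actual live file may differ, and the referer, if used at all, goes alongside the user agent rather than on its own):

```
http_header(User-Agent): Mastodon/4.3.2 (http.rb/5.2.0; +https://mastodon.social/)
# optionally, in combination with the user agent:
# http_header(referer): https://phys.org/
```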


Have you tested this header with your FTR too?

You may now update the site patterns; the new config is live.


Really appreciate the help here!
