bizjournals.com.txt

This one isn’t working. Could you take a look at it? Thanks!

The site doesn’t respond with content when we make a request from our servers, so there’s very little we can do, I’m afraid.

I was afraid that might be the answer. Thanks for checking!

The problem is that the site seems to detect that the content is not being fetched with a browser and sends an ‘I am not a robot’ page. Try it with curl and you’ll see what you get:

curl -A 'Mozilla/5.0 (Windows NT 10.0; rv:103.0) Gecko/20100101 Firefox/103.0' 'https://www.bizjournals.com/sanfrancisco/inno/stories/news/2022/09/13/terawatt-electric-vehicles-infrastructure-energy.html'

Sometimes this happens with a browser, too. Unfortunately, I don’t think that we can prevent this behaviour. :frowning:

With curl I got the feed after setting the user agent to Feedly/1.0, but sometimes I got an error page saying ‘The requested URL was not found on this server’. This also happens in a web browser every third or fourth try.
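
For reference, this is roughly the command I used. The feed path below is only a placeholder; substitute the actual bizjournals feed address you’re using:

curl -A 'Feedly/1.0' 'http://feeds.bizjournals.com/your-feed-path'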

I can’t manage to load the feed when adding a user agent to the site config. I tried it with and without a referer.

http_header(user-agent): Feedly/1.0
# http_header(user-agent): Feedly/1.0(+http://www.feedly.com/fetcher.html; like FeedFetcher-Google)
http_header(referer): http://feeds.bizjournals.com/

I even symlinked bizjournals.com.txt to feeds.bizjournals.com.txt

ARRRRG. Now I removed the user agent, commented out the body and date directives, and got full-text content for the feed… sometimes. I need to reload the FTR page 4 or 5 times to get the content.

So the source feed isn’t that stable.

If you run your own FTR server, or at least a web server reachable from the internet, you could write a cron script that fetches the feed with curl, checks that the response does NOT contain ‘The requested URL was not found on this server’, and saves it to a path on your web server. Then point FTR at that web server’s address.
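
A rough sketch of what I mean, assuming bash with curl, and with made-up feed URL, output path and server name that you’d have to adapt:

#!/bin/bash
# Placeholder feed URL and webroot path - adjust to your setup
FEED_URL='http://feeds.bizjournals.com/your-feed-path'
OUT='/var/www/html/bizjournals-feed.xml'

# Fetch the feed into a temporary file
TMP=$(mktemp)
curl -sf -A 'Feedly/1.0' "$FEED_URL" -o "$TMP"

# Only overwrite the saved copy if we didn't get the error page
if [ -s "$TMP" ] && ! grep -q 'The requested URL was not found on this server' "$TMP"; then
    mv "$TMP" "$OUT"
else
    rm -f "$TMP"
fi

Run it from cron, e.g. every 15 minutes (*/15 * * * * /path/to/fetch-bizjournals-feed.sh), and point FTR at http://your-server/bizjournals-feed.xml instead of the original feed.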

Thanks for the info, Holger. Some sites unfortunately can be quite tricky to handle.

I’m guessing that with this site, when it fails to return the content you want, you get some undesirable text as the item content instead - e.g. ‘Are you a robot?’. Is that right? Or maybe it affects the entire feed.

One alternative to the workaround you suggested might be to update the site config file to target the content with an XPath selector that only matches when the page loads successfully, and to prevent Full-Text RSS from falling back to auto-detection when that selector fails to match. You can do the latter with the following site config directive:

autodetect_on_failure: no
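
Combined with a body directive it could look like this. The XPath below is only a guess at what the bizjournals article markup might use; you’d need to inspect a successfully loaded article to find the right selector:

body: //div[contains(@class, 'article__content')]
autodetect_on_failure: no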

You’ll also probably want to combine this with the &exc=1 query string parameter on the generated feed URL to tell Full-Text RSS to remove items from the output where article extraction has failed. (There’s a field in the form when generating a full-text feed where this can be specified too: “If extraction fails: remove item from feed”.)
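
The generated feed URL would then look something like this, where the host and feed path are placeholders for your own Full-Text RSS installation and the source feed:

http://your-ftr-server/makefulltextfeed.php?url=feeds.bizjournals.com%2Fyour-feed-path&exc=1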


I haven’t tried that much, since this isn’t one of my news sites; I just wanted to help user SixthStreet. But it would make no sense for them to do this on their original feed, so I think it only happens on the individual articles themselves.
