wsj.com won’t work in self-hosted deployment

For some reason, when I try to create a feed from WSJ.com: World News in my self-hosted Full-Text RSS, it doesn’t extract the whole articles.

I compared with the results from FiveFilters’ hosted version, and it is able to pull the whole thing.

See some screenshots of both scenarios, mine and FiveFilters’.


Thank you

Fabio

Worth mentioning: I’m all up to date.

I did some debugging and it appears that my hosted FTR isn’t fetching the content from the WSJ URLs…

On the left is my hosted one; on the right is FiveFilters’ hosted one. You can clearly see that FiveFilters’ is fetching content, whereas in mine that’s completely skipped for some reason.

Any ideas??

My self-hosted FTRSS delivers large articles. Are you sure that you have no extra config files wsj.com.txt or (feeds.a.)dj.com.txt in your site_config/standard or /custom folder?

I have FiveFilters’ site patterns from their GitHub, which do include a wsj.com.txt file.

Is there a way to change/configure the user agents and referrers?

Sorry for the confusion. I read your question on my smartphone and answered from there. Your own log also shows the user agent and referrer being changed. So no solution from my side, sorry.

Just saw: your version check is from fivefilters.org, not your own server.

That is actually from my server… when you go to the self-hosted FTR webpage and click “check for updates”, it takes you to fivefilters.org, passing your version and site config date in the GET request.

Are you able to get full articles from WSJ on your self-hosted instance?

I tested your link on my phone with my self-hosted FTR and got very large articles with many photos. I did not check whether the articles were complete.

Hm, I am at home now and checked again. I am just a user, so I’m just trying to figure this out.

Your referrer for the item link is set to Google, while FiveFilters’ hosted instance and mine are set to the item link itself:

* ......user-agent set to: Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:85.0) Gecko/20100101 Firefox/85.0
* ......referer set to: https://www.wsj.com/amp/articles/russia-launches-new-drone-attacks-as-partnership-with-iran-deepens-11670666867

I don’t know where this is set, because in wsj.com.txt the user_agent and referrer entries differ from these and are commented out.

Are you sure that you made no changes to the original wsj.com.txt file? Maybe download it from GitHub again and replace your version.
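
One way to check whether the referrer makes a difference is to request the article directly with each user-agent/referer pair and compare the responses. Below is a minimal PHP sketch using the cURL extension, run outside of Full-Text RSS. The function names `build_curl_options` and `fetch_with_headers` are made up for illustration and are not part of FTR.

```php
<?php
// Hypothetical helpers (not part of Full-Text RSS) for comparing responses
// to the same URL under different user-agent/referer pairs.

/** Build the cURL options for one request. */
function build_curl_options(string $url, string $userAgent, string $referer): array
{
    return [
        CURLOPT_URL            => $url,
        CURLOPT_USERAGENT      => $userAgent,
        CURLOPT_REFERER        => $referer,
        CURLOPT_RETURNTRANSFER => true, // return the body instead of printing it
        CURLOPT_FOLLOWLOCATION => true, // follow redirects
        CURLOPT_TIMEOUT        => 20,   // seconds
    ];
}

/** Fetch the URL and report the HTTP status code and body length. */
function fetch_with_headers(string $url, string $userAgent, string $referer): array
{
    $ch = curl_init();
    curl_setopt_array($ch, build_curl_options($url, $userAgent, $referer));
    $body = curl_exec($ch);
    $status = curl_getinfo($ch, CURLINFO_RESPONSE_CODE);
    return [
        'status' => $status,
        'length' => is_string($body) ? strlen($body) : 0,
    ];
}
```

Calling `fetch_with_headers` once with the article URL as the referer and once with a Google referer, then comparing the two `length` values, should show whether the target site answers differently depending on the referrer.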

Hi, I’m not sure what you mean… In the stage where the list of feeds is being retrieved, both my logs and FiveFilters’ use the same user agent and referrer (www.google.co.uk) (see the green rectangle in the screenshot).

The issue seems to be that, in my hosted FTR, the code never performs the steps shown in the red rectangle. It stops at the “* ** Loading class Readability (readability/Readability.php)” line and then skips to “* Attempting to extract content”.

I’m sure I’m using the original wsj.com.txt file… I didn’t even copy the file manually to the web server; I just used the “check for updates” link in the Admin UI, and FTR itself downloaded all the files from GitHub to the “/site_config/standard” folder.

Thanks for the help by the way.

Yes, but in the red rectangle there is a second user agent/referrer pair. This seems to come from one of the three classes loaded at the beginning. The green part is reading the feed; the red part reads the article from the source.

I understand. Unfortunately, my hosted FTR never reaches even the first line of the red section, meaning it doesn’t even begin fetching the URL content.

Hi Fabio, can I ask where your server is located? And one thing you can try is to pass &debug=rawhtml in the querystring to see the HTTP response, including content, sent back from the server.

I wonder if you’re getting a different response from their server. That can sometimes explain the difference users experience. Simply uploading Full-Text RSS to a different server (e.g. Hetzner Cloud, Digital Ocean) or a different server location (US or Europe) can affect how the target site responds.

It might not be either of those of course, but if you have an easy way to try running on a different server, I’d suggest trying that before debugging too much.
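
For reference, the debug request can be built along these lines. This is a small PHP sketch: the base address is a placeholder for your own install, `build_debug_url` is a hypothetical helper, and the `url` parameter name is assumed to match the one your feed URLs already use. Only `debug=rawhtml` comes from the suggestion above.

```php
<?php
// Hypothetical helper: append the feed URL and a debug mode to the
// address of a self-hosted makefulltextfeed.php.
function build_debug_url(string $endpoint, string $feedUrl, string $debug = 'rawhtml'): string
{
    // http_build_query() URL-encodes the values; '&' is passed explicitly
    // so the result does not depend on the arg_separator.output ini setting.
    $query = http_build_query(['url' => $feedUrl, 'debug' => $debug], '', '&');
    return $endpoint . '?' . $query;
}

// Placeholder addresses; replace with your own server and feed.
echo build_debug_url(
    'http://localhost/full-text-rss/makefulltextfeed.php',
    'https://example.com/feed.xml'
), PHP_EOL;
```

Opening the resulting address in a browser should show the raw HTTP response that the target site sent back to your server.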

Hi, thanks for the response. I’m hosting FTR in my home lab, in Texas, USA. I tried the debug=rawhtml yesterday when I was troubleshooting it and FTR just isn’t retrieving the content from the URL it finds in the feed.

I also tried WSJ from my desktop web browser, and the page loads successfully, all of it (after the paywall removers “cleanse” the pages).

Fabio

Hi Fabio,

I haven’t been able to reproduce this problem running Full-Text RSS locally.

One thing you could try is to change the HTTP method to see if that’s a cause.

If you want to try that, follow the steps below:

  1. Open makefulltextfeed.php

  2. Find the line:

    $http = new HumbleHttpAgent($_req_options);
    
  3. Change it to:

    $http = new HumbleHttpAgent($_req_options, HumbleHttpAgent::METHOD_CURL_MULTI);
    
  4. Save and load the feed again.

You might also want to try:

    $http = new HumbleHttpAgent($_req_options, HumbleHttpAgent::METHOD_FILE_GET_CONTENTS);

@fivefilters I’m having the same problem as @fabio. I also have a self-hosted setup and am located in the US. WSJ used to work great until a few months ago. There has been no change on my end; it just stopped pulling full articles one day. I have tried a copy of my setup on a different server that uses Starlink instead of Comcast and have had the same issue. Using debug=rawhtml on extract.php appears to show that WSJ is responding with only the stub article (what a non-subscriber would receive). However, from a browser on the same server, I can log in to WSJ and view articles just fine. My guess is that credentials need to be passed; is there a way to do this?

We troubleshot the problem, and the issue is related to FTR running anywhere in the US: for some reason it gets a response from the WSJ servers that differs from the one received by an FTR instance anywhere in Europe.

I created a VM on a VPS in Germany and installed FTR on it, and it works perfectly. I’m using that temporarily while waiting to see if they can change the WSJ site settings file to handle the different scenarios.

Try DigitalOcean; they give you a $200 credit valid for 2 months.

Fabio

@fabio, @Delgreco007, we’ve updated the site config file for wsj.com. Can you please try updating your site config files and see if this fixes it for you? This one should work even if your server is in the US. (You can find the changed file here: ftr-site-config/wsj.com.txt at master · fivefilters/ftr-site-config · GitHub)

That works great. Thank you!

It works great, thank you!
