Site pattern for wsj.com not working?

Hello, it appears that the site patterns for the Wall Street Journal (WSJ.com: World News) aren’t working. I tried my self-hosted FTR as well as the Five Filters one, and both are only able to extract the text for some items but not others.

Any help appreciated

Just had a short look at it. It seems that nearly every article has a paywall (now). Has WSJ changed something here recently? The article pages only contain the title, two or three sentences and maybe one image. The rest is not loaded in advance, as it is on some other paywall sites.

As I have no subscription, I can’t test this, but you could try exporting a cookie from your browser while logged in and pasting it into your site_config wsj.com.txt:
http_header(cookie): COOKIENAME=COOKIECONTENT
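
If the logged-in browser session sets more than one cookie, they should all fit on the same line. A minimal sketch, assuming the standard Cookie header syntax (name=value pairs separated by a semicolon and a space); SESSIONCOOKIE and TOKENCOOKIE are made-up placeholder names, not real WSJ cookie names, and this is untested:

```
# wsj.com.txt (fragment); cookie names and values below are placeholders
http_header(cookie): SESSIONCOOKIE=VALUE1; TOKENCOOKIE=VALUE2
```

Login cookies usually expire after a while, so the value would probably need to be refreshed from time to time.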

Nevertheless, the currently used selectors for the article body don’t exist in the source. We need to have a deeper look into that. Maybe I’ll find some time during the weekend.

@fabio, could you please post direct URLs of articles that work and, if you find some, URLs of free articles with no paywall?

You are a WSJ subscriber, aren’t you? If not, you won’t get much more than the original description from the original feed.

Meanwhile I found the problem. WSJ changed their URL scheme, and the single_page_link no longer works on any article whose URL is not of the form /articles/headline-title-a1234.

That was a hard one for me, but I finally got around it once I understood how if_page_contains works.
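
For anyone who has not used it yet: as far as I understand it, if_page_contains makes the single_page_link directive directly above it conditional, so the link is only followed when the XPath matches something in the page. A rough sketch of the idea, not the actual change I submitted; both XPath expressions are hypothetical placeholders:

```
# Hypothetical fragment; the XPath expressions are illustrative only
# Follow the single-page link ...
single_page_link: //a[@class='full-article']
# ... but only if the canonical URL of the page uses the /articles/ scheme
if_page_contains: //link[@rel='canonical'][contains(@href, '/articles/')]
```

Pages that do not match the condition skip the single_page_link and are extracted as they are.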

My PR1192 has to be approved by a dev. After this has happened, wsj should work again.

Hello HolgerAusB, thanks for the update!

It seems that site patterns are kind of a game of cat and mouse with the site owners: they’re always changing things on their end that require adjustments on the patterns side!

Have a great week!
