WSJ fails extraction

Delgreco007 · September 21, 2023, 12:53am

@fivefilters WSJ is once again failing extraction, both in my self-hosted deployment and on ftr.fivefilters.org.

I have tried copying the session cookie (djcs_session) from a logged-in browser. This seems to work in other browsers, but it does not work in Postman or when added to a custom pattern in FTR.

HolgerAusB · September 21, 2023, 4:31pm

The trick of simply redirecting the page to wsj.com/amp/articles/foo-bar-bfa123a5 no longer works.
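For anyone unfamiliar with the redirect trick: it amounted to inserting /amp in front of the article path. A minimal sketch, assuming article URLs of the form /articles/<slug>; the helper name to_amp_url is hypothetical and not part of FTR, and as this thread notes, the /amp/ URL no longer returns the full content.

```python
from urllib.parse import urlsplit, urlunsplit


def to_amp_url(url: str) -> str:
    """Rewrite /articles/<slug> to /amp/articles/<slug>; leave other URLs unchanged."""
    parts = urlsplit(url)
    if parts.path.startswith("/articles/"):
        # SplitResult is a namedtuple, so _replace returns a modified copy.
        parts = parts._replace(path="/amp" + parts.path)
    return urlunsplit(parts)


if __name__ == "__main__":
    # Example slug taken from the post above.
    print(to_amp_url("https://www.wsj.com/articles/foo-bar-bfa123a5"))
```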

A browser plugin that I use is still able to fetch the full article. In addition to the redirect, it uses a rule that I could not rebuild for FTR:

allow_cookies: 1,
block_regex: /(cdn\.cxense\.com\/|cdn\.ampproject\.org\/v\d\/amp-(access|subscriptions)-.+\.js)/,
useragent: "googlebot"
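To see which requests that block_regex would catch, here is a small sketch that ports the pattern to Python's re module (the escaped slashes of the JavaScript literal are not needed there). The sample URLs are assumptions chosen for illustration, not taken from the plugin, and is_blocked is a hypothetical helper name.

```python
import re

# Same pattern as the plugin rule above, written as a Python raw string.
BLOCK_REGEX = re.compile(
    r"(cdn\.cxense\.com/|cdn\.ampproject\.org/v\d/amp-(access|subscriptions)-.+\.js)"
)


def is_blocked(url: str) -> bool:
    """Return True if the URL matches the plugin's block pattern."""
    return BLOCK_REGEX.search(url) is not None


# Illustrative URLs only (assumed, not observed on wsj.com).
SAMPLE_URLS = [
    "https://cdn.ampproject.org/v0/amp-access-0.1.js",
    "https://cdn.ampproject.org/v0/amp-subscriptions-0.1.js",
    "https://cdn.cxense.com/cx.js",
    "https://cdn.ampproject.org/v0.js",
    "https://cdn.ampproject.org/v0/amp-carousel-0.1.js",
]

if __name__ == "__main__":
    for url in SAMPLE_URLS:
        print("blocked" if is_blocked(url) else "allowed", url)
```

Under these assumptions the pattern targets only the AMP access/subscriptions components and the Cxense CDN, while the core AMP runtime and other components are left alone.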

@fivefilters, any ideas?

fivefilters · September 25, 2023, 5:49pm

Hmm, will have to look into this to see if there’s anything we can do.

fivefilters · September 28, 2023, 5:57pm

@HolgerAusB I had a look and I’m seeing the same as you, no /amp/ support any more (it just redirects). Previously the /amp/ version would contain the full content. Do you know what the browser plugin is doing? I can’t really tell from the snippet you pasted. Adding a user agent string like that makes no difference in my testing.

HolgerAusB · September 28, 2023, 6:27pm

It seems to block some scripts via the block_regex part, but I do not really know what the trick is.

fabio · October 6, 2023, 10:21pm

Any news? I was doing some digging on the Magnolia paywall remover and found this:

I can’t exactly figure out what the JavaScript is doing to the DOM, but it appears there are multiple ways to handle the paywall removal depending on the article URL (whether it contains “livecoverage”, “articles”, etc.). It does seem to use AMP in some way, but again, I’m not proficient enough in JS to tell exactly what it does.