Extraction fails for SOME articles from the Financial Times

fabio · August 29, 2023, 9:13pm

Hello, I’m using this feed from the Financial Times (News Feed) and it behaves oddly. Some articles get extracted properly (green rectangle in the screenshot), while others (red rectangle) show the paywall content.

Does anyone know why?

HolgerAusB · August 30, 2023, 4:44am

That is weird! While trying this several times I sometimes get the subscription warning and a few minutes later I get the full content of the very same article.

It seems that they deliver several alternating HTML formats. Even when I open an article whose full text was fetched successfully, I can’t find the excerpt text in the original source.

@fabio If you are a subscriber, you can try exporting your cookies while logged in to FT and putting them in your ft.com.txt (self-hosted installs only):
http_header(cookie): cookiename=content
or multiple cookies:
http_header(cookie): name1=content1; name2=content2
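
A minimal Python sketch of how exported cookies could be turned into the single-line format shown above. The helper name `cookie_directive` is hypothetical and not part of FTR; it only joins name/value pairs with `; ` as in the multi-cookie example.

```python
def cookie_directive(cookies: dict[str, str]) -> str:
    """Build an http_header(cookie) site-config line from name/value pairs.

    Hypothetical helper for illustration; not part of Full-Text RSS.
    """
    if not cookies:
        raise ValueError("at least one cookie is required")
    # Cookies are joined with "; " exactly as in the multi-cookie example.
    pairs = "; ".join(f"{name}={value}" for name, value in cookies.items())
    return f"http_header(cookie): {pairs}"


print(cookie_directive({"name1": "content1", "name2": "content2"}))
```

The printed line can then be pasted into ft.com.txt.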

fabio · August 30, 2023, 3:43pm

Unfortunately I’m not a subscriber of FT

I was doing some digging and it appears that the FT servers return different responses depending on the cookie contents. If the cookie is “right”, the server returns the actual article content, whereas if it is just a “visitor” cookie, it returns only the paywall HTML without the article text…

Please bear with me, as I don’t have a very good understanding of how site patterns work or what they can do… Is there some way to have the site pattern “fake” a good cookie, e.g. by faking a valid referrer, or some other technique…?

I use this Chrome plugin that removes paywalls (magnolia1234 / Bypass Paywalls Chrome Clean · GitLab) and in one of their scripts for FT.com (contentScript.js · master · magnolia1234 / Bypass Paywalls Chrome Clean · GitLab) they remove parts of the DOM, prevent certain JS files from being loaded, etc…

Can any of these approaches work with FTR’s site patterns…?

Thank you

fabio · August 30, 2023, 3:49pm

Btw here’s a screenshot of the same FT article. On the left side, it is loaded with the paywall Chrome extension, which allows the full article to load; on the right, you can see it loading in an incognito tab. I tried to highlight (green rectangles) the main differences I saw.

Screenshot

HolgerAusB · August 30, 2023, 4:07pm

I can’t read the cookie content in your screenshot. Try pasting it in place of CONTENT:

http_header(referer): https://www.ft.com
http_header(cookie): ft-access-decision-policy=CONTENT

Do you have a link to that browser extension?

fabio · August 30, 2023, 4:19pm

Sure… magnolia1234 / Bypass Paywalls Chrome Clean · GitLab

fabio · August 30, 2023, 4:33pm

I also tried adding
http_header(referer): https://www.ft.com
http_header(cookie): ft-access-decision-policy=CONTENT

to the site pattern ft.com.txt, but I still get the same behavior.

HolgerAusB · August 30, 2023, 5:15pm

You should have replaced ‘CONTENT’ with the cookie value you found. But that didn’t help here either.

I had success with the cookie ‘FTAllocation’:

http_header(referer): https://www.ft.com
http_header(cookie): FTAllocation=abc.................789

Use the value from your own test with your browser extension. I don’t know whether it expires, which would matter if you have an RSS aggregator that automatically fetches new articles.

fabio · August 30, 2023, 5:44pm

That didn’t work for me… Is there any way to spoof the user agent?

fabio · August 30, 2023, 5:45pm

Using the browser extension, it forces the user agent to

Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)

HolgerAusB · August 31, 2023, 3:50am

I see, I just checked it. The referer is the problem; it makes things worse. And just to make sure I explained it right:

Go to ft.com in your browser, use your extension, load one article in full text, extract the 36-character cookie value with the name FTAllocation and paste that value after the =, without quotation marks. This is just an example; you need to paste your own value:
http_header(cookie): FTAllocation=12345678-9abc-def0-1234-56789abcdef0

Don’t set the referer. Full example:

http_header(User-Agent): Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:109.0) Gecko/20100101 Firefox/116.0
http_header(cookie): FTAllocation=12345678-9abc-def0-1234-56789abcdef0

I just noticed that this 36-character value seems to be the same as the ID in the article URL itself. But not every value can be used universally for all articles.
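
To test a header combination outside FTR, a request with the same two headers can be built in Python. This is only a sketch: the function name `build_request` is hypothetical, the URL is the example article from this thread, and the cookie value is the placeholder from the example above, not a working one.

```python
import urllib.request


def build_request(url: str, user_agent: str, cookie: str) -> urllib.request.Request:
    """Build a request carrying only User-Agent and Cookie (no Referer)."""
    return urllib.request.Request(
        url,
        headers={"User-Agent": user_agent, "Cookie": cookie},
    )


req = build_request(
    "https://www.ft.com/content/f47fafd0-079d-4bc9-bbf5-19e55d5649ad",
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:109.0) Gecko/20100101 Firefox/116.0",
    "FTAllocation=12345678-9abc-def0-1234-56789abcdef0",  # placeholder value
)
print(req.header_items())

# To actually fetch the page, uncomment the lines below:
# with urllib.request.urlopen(req, timeout=30) as resp:
#     html = resp.read().decode("utf-8", errors="replace")
```

Whether the response contains the article or the paywall still has to be checked by reading the returned HTML.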

HolgerAusB · August 31, 2023, 4:17am

@fivefilters is this a valid declaration for a cookie?

http_header(cookie): FTAllocation=@=substring-after(//a[contains(@href, 'location=/content/')]/@href , '/content/')

It works in my first tests here, and it would make the site config universal.

It should take the URL, e.g. www.ft.com/content/f47fafd0-079d-4bc9-bbf5-19e55d5649ad and extract the last part to set cookie to: FTAllocation=f47fafd0-079d-4bc9-bbf5-19e55d5649ad

EDIT:
I think it’s not valid code, but FT seems to accept nearly any content for that cookie name (at the moment). Is there a way to set a cookie dynamically based on parts of the URL or on the HTML code? The latter would mean that the page has to be loaded twice, I think.
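
The URL-to-cookie mapping described in this post can be sketched in Python, outside of FTR. This only illustrates the intended transformation; the function name `ft_allocation_cookie` is hypothetical, and nothing here implies that site patterns support such a dynamic value.

```python
import re
from typing import Optional
from urllib.parse import urlparse

# A 36-character ID (32 hex digits plus 4 hyphens) following /content/.
CONTENT_ID = re.compile(
    r"/content/([0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12})",
    re.IGNORECASE,
)


def ft_allocation_cookie(url: str) -> Optional[str]:
    """Return 'FTAllocation=<id>' for an FT article URL, or None if no ID is found."""
    if "://" not in url:
        url = "https://" + url  # allow scheme-less URLs like www.ft.com/content/...
    match = CONTENT_ID.search(urlparse(url).path)
    return f"FTAllocation={match.group(1)}" if match else None


print(ft_allocation_cookie("www.ft.com/content/f47fafd0-079d-4bc9-bbf5-19e55d5649ad"))
```

For the example URL, the result matches the cookie given above.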

fabio · August 31, 2023, 5:33pm

It’s weird, the cookie approach works for SOME articles, but others continue hitting the paywall.

Isn’t it strange that the same server will behave differently depending on which article you’re trying to get access to…? Some articles get extracted just fine regardless of any cookies being set, while others get the paywall response from the server…

I can’t make sense of it…

HolgerAusB · September 1, 2023, 3:34am

It could be a geolocation thing, too. I am from Germany. What is your country?

fabio · September 1, 2023, 4:22pm

I’m located in the south of the USA
