Determining Soft vs. Hard Paywall & Content Extraction Queries

Hello!

I’m quite new to the world of web page structure and content extraction, but I’ve found the Full-Text RSS tool to be incredibly valuable. However, I’ve run into a question about how to differentiate between sites that have a soft paywall and those that have a hard paywall, and subsequently how to extract the content.

For instance, an article on the NZZ site, “China: Immobilienkrise und Geisterstadt von Country Garden”, seems to have a paywall, yet with the existing site config I can extract its full content. On the other hand, an article on oltnertagblatt.ch, “Berset-Nachfolge: SP-Kandidaten auf dem Prüfstand”, for which no site config exists, only shows its full text after logging in or via known paywall removers.

Given my limited knowledge, I was hoping to get some guidance on:

  1. How can one ascertain whether a site is using a soft paywall (where content is loaded but hidden, and can often be accessed by certain means) vs. a hard paywall (where the content isn’t loaded at all unless you’re a subscriber)?
  2. If a site is using a soft paywall, what would be the recommended steps to check whether content extraction is feasible, and how would I go about it, especially for sites without an existing site config?

I truly appreciate any insights, guidance, or resources you can provide. Thank you in advance!

Best regards,
Yannick

There is no easy way to find out whether a site uses a pay-curtain or a hard paywall. You can only read the HTML code, try out some standard tricks, and maybe analyze the output of your paywall blocker.
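For the ‘loaded but hidden’ case, “read the HTML code” can be as simple as this rough sketch, assuming Python with the requests library (URL and phrase are placeholders you have to fill in yourself): fetch the page and search the raw HTML for a sentence you know sits behind the paywall, e.g. copied from an archive copy.

import requests

URL = "https://example.com/article"            # placeholder: the article you want to test
PHRASE = "a sentence from behind the paywall"  # placeholder: take it from an archive copy

html = requests.get(URL, timeout=30).text
if PHRASE in html:
    print("text is delivered but hidden -> soft pay-curtain")
else:
    print("text is not in the HTML -> possibly a hard paywall")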

For NZZ I didn’t even notice at first that there is a paywall, because there is no advertising (might be my Pi-hole). I only saw the difference when I opened a second browser with a paywall blocker, where FTR (Full-Text RSS) didn’t show the full content either. Your NZZ link is a free article. I tried this one instead:
https://www.nzz.ch/international/indien-erfreut-sich-am-erfolg-seiner-mondmission-chandrayaan-3-ld.1754300

A good place to start is always to set a user_agent and a referrer. You can also check whether prune: yes|no and/or tidy: yes|no makes a difference.

For both sites, this line helps:
http_header(User-Agent): Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)

An unfinished config for oltnertagblatt.ch.txt:

# pretend to be Googlebot and send the site itself as referrer
http_header(User-Agent): Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)
http_header(referer): https://www.oltnertagblatt.ch

# the main article container
body: //div[@class='article']

# strip the headline meta block, reading-progress bar and 'top articles' box
strip_id_or_class: headline__meta
strip_id_or_class: progressbar__wrapper
strip_id_or_class: pageelement--top-articles

# strip inline SVG icons
strip: //svg

# disable automatic pruning and HTML Tidy clean-up
prune: no
tidy: no

test_url: https://www.oltnertagblatt.ch/schweiz/berset-nachfolge-das-bundesrats-jekami-der-sp-jositsch-ist-mit-seinen-ambitionen-nicht-allein-der-andere-blick-aufs-kandidaten-karussell-ld.2508415?reduced=true

@fivefilters How do I prevent this placeholder image from being used when an article has no image? <img class="ff-og-image-inserted"...

EDIT:
OK, I found out that the image was not generic; it belongs to the article. But generally I’d like to know whether it is possible to prevent FTR from taking the image from the og:image meta tag when there are no images in the article.

And: it seemed that replace_string didn’t work in this case; I tried to remove the ‘Abo+’ overlay on the picture by stripping &wmark=aboplus from the image link. (It works, it was just a typo on my part.)


@yanbam This may not last long. Other chmedia.ch websites don’t deliver full-text content with that user agent, so they may block this trick here too.

I’ve already identified a shared fingerprint for about 40-50 of their 80 sites, so we could cover those with just one config. But as 99% of all articles are paywalled…

@HolgerAusB, thank you immensely for your efforts. Even if the configuration doesn’t hold up in the end, your assistance has been invaluable. I’ve gained a deeper understanding of the extraction process and how configuration files operate.

Beginners start here:

A little late, but to add to what @HolgerAusB has said…

I wouldn’t think of a soft paywall as necessarily one where the content is loaded but merely hidden. Sometimes that is the case, and viewing the source HTML of the page will tell you. But often it’s the HTTP request that determines whether you get the full content or not. Here are some examples:

  1. Cookies: a news site uses a cookie to track visitors and when they see the 10th request come in for the same user, they send a paywalled page back instead of the full content. When you use Full-Text RSS on such sites, you’ll likely never encounter a paywall because Full-Text RSS by default does not send cookies.

  2. Referrer: this tells the site where you clicked on their link. Some sites will let users clicking on Google search results see the full content (it’s not just good will, there’s a history to this which we wrote briefly about here: https://www.fivefilters.org/2019/soft-paywalls/)

  3. User-agent: this tells the site which service/browser/device is making the request. It could be used by some sites to provide the full content (e.g. to Google’s indexers) and not to others. Not sure if this is used much to influence the paywall.

It’s easy to make requests and provide whatever data you want for the above, so publishers know it’s not foolproof, but for those who only want a soft paywall, that’s all they may rely on. So if you’re curious, you’ll just have to play around and see.
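To probe these signals yourself, here is a minimal sketch, assuming Python with the requests library (the URL is a placeholder, not a tested example). It fetches the same article with different headers and compares status code and response size; a clearly larger response for one variant hints at which signal the paywall keys on.

import requests

URL = "https://example.com/some-article"  # placeholder: the article you want to test

# header variants to compare; they mirror the signals discussed above
variants = {
    "plain": {},
    "googlebot-ua": {"User-Agent": "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"},
    "google-referer": {"Referer": "https://www.google.com/"},
}

for name, headers in variants.items():
    # each requests.get() call sends no cookies, so cookie-based
    # metering cannot bleed between variants
    resp = requests.get(URL, headers=headers, timeout=30)
    print(f"{name}: status={resp.status_code}, length={len(resp.text)}")

Response length is only a rough signal; diffing the returned HTML, or searching it for a known paywalled sentence, is more reliable.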

Probably the quickest way to determine if the paywall is a soft one is to copy the URL from your browser’s address bar and open a new private window (CTRL+Shift+P on Firefox) and paste it there. That will ensure no cookies are sent. If the site relied on cookie tracking for its paywall display, it may now show you the full content.

If you want to experiment with HTTP headers, using the browser’s developer tools you can see HTTP requests and re-send them with edited headers. That might help you figure out if a particular HTTP header affects the paywall.

Should also add that some sites will use a method that looks like a soft paywall, but is in fact a hard one. They’ll publish a piece and make it fully public - hoping to get shares on social media for example - and then later on enforce a hard paywall to try and get new subscribers. You’ll often be able to read these articles on archive sites like the Wayback Machine or archive.today, archived before the hard paywall was enforced.

I don’t know if we’ve documented this, but you can do it with this (from the changelog):

New site config directive: insert_detected_image: yes/no (default yes) - places image in og:image in the body if no other images extracted
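So in a site config, turning the fallback off is just one line (taken directly from the directive above):

insert_detected_image: no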

It’s also possible to pass &images=0 as a query string parameter to Full-Text RSS, but that will remove all images.
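For example, assuming a self-hosted instance (the host and path here are placeholders, and ENCODED-ARTICLE-URL stands for the URL-encoded article link):

http://example.org/full-text-rss/makefulltextfeed.php?url=ENCODED-ARTICLE-URL&images=0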
