I’m quite new to the world of web structure and content extraction but have found the Full-Content RSS tool to be incredibly valuable. However, I’ve stumbled across a question regarding how to differentiate between sites that have a soft paywall versus those that have a hard paywall, and subsequently, how to extract the content.
For instance, the NZZ article "China: Immobilienkrise und Geisterstadt von Country Garden" seems to be behind a paywall, yet with the existing site.config I can extract its full content. On the other hand, the Oltnertagblatt article "Berset-Nachfolge: SP-Kandidaten auf dem Prüfstand", for which no site.config exists, shows its full text only after login or with known paywall removers.
Given my limited knowledge, I was hoping to get some guidance on:
- How can one ascertain whether a site is using a soft paywall (where content is loaded but hidden, and can often be accessed by certain means) vs. a hard paywall (where the content isn’t loaded at all unless you’re a subscriber)?
- If a site is using a soft paywall, what would be the recommended steps to check whether content extraction is feasible, and how would I go about doing it, especially for sites without an existing site.config?
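To make the first question concrete, here is a rough sketch of the kind of check I have in mind. It assumes I already have the raw HTML as the server delivers it without JavaScript (for example from a plain HTTP fetch) and a phrase from late in the article that only a subscriber would normally see. The names `served_text` and `guess_paywall_type` are my own and purely illustrative, not part of Full-Content RSS.

```python
from html.parser import HTMLParser


class _TextCollector(HTMLParser):
    """Collects every text node, including text in visually hidden elements."""

    def __init__(self):
        super().__init__()
        self.chunks = []
        self._skip_depth = 0  # > 0 while inside <script> or <style>

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        # CSS hiding (display:none etc.) is ignored on purpose: hidden text
        # still counts as "loaded". Script and style bodies are skipped.
        if not self._skip_depth:
            self.chunks.append(data)


def served_text(raw_html):
    """Return all markup text from the HTML with whitespace normalised."""
    parser = _TextCollector()
    parser.feed(raw_html)
    parser.close()
    return " ".join(" ".join(parser.chunks).split())


def guess_paywall_type(raw_html, probe_phrase):
    """Guess 'soft' if the probe phrase was delivered in the markup, else 'hard'.

    Assumption: probe_phrase comes from deep inside the article body.
    Caveats: 'soft' is also returned when there is no paywall at all, and
    'hard' is also returned when the text is embedded in a script tag or
    loaded later by JavaScript.
    """
    probe = " ".join(probe_phrase.split()).lower()
    if probe in served_text(raw_html).lower():
        return "soft"
    return "hard"
```

If this returns "soft", my understanding is that the article body is somewhere in the delivered markup, so the remaining work would be finding the element that wraps it. If it returns "hard", I would still need to rule out content that is embedded in a script tag or loaded by JavaScript before concluding the text is never sent.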
I truly appreciate any insights, guidance, or resources you can provide. Thank you in advance!