Help with custom patterns

ungovernable · October 10, 2020, 10:47pm

I’m trying to scrape this feed: https://www.itsgoingdown.org/feed
But the result is messed up

So I created this config file:
/full-text-rss/site_config/custom/.itsgoingdown.org.txt
with this content:
body: //section[@class='red-border']
autodetect_on_failure: no

But it’s not doing anything and the result is the same, even with caching disabled. What am I doing wrong?

fivefilters · October 11, 2020, 11:02am

Hi there, two things I notice that need to be changed:

So I created this config file:
/full-text-rss/site_config/custom/.itsgoingdown.org.txt

Full-Text RSS will only look for .itsgoingdown.org.txt if the URL of the feed item being processed points to a sub-domain such as https://something.itsgoingdown.org/... If the URL is https://itsgoingdown.org/... or https://www.itsgoingdown.org/..., then the site config file should be named itsgoingdown.org.txt (without the preceding dot).
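To make the naming rule concrete, here is a minimal Python sketch of it. It only illustrates the rule as described above and is not Full-Text RSS's actual lookup code. The function name expected_config_name and its base_domain parameter are hypothetical, and the real software may check other file names that are not modelled here.

```python
from urllib.parse import urlsplit


def expected_config_name(item_url, base_domain):
    """Return the site config file name the rule above implies (illustrative only)."""
    host = (urlsplit(item_url).hostname or "").lower()
    # Bare domain or www. prefix: file name has no leading dot.
    if host in (base_domain, "www." + base_domain):
        return base_domain + ".txt"
    # Any other sub-domain: file name starts with a dot.
    if host.endswith("." + base_domain):
        return "." + base_domain + ".txt"
    # The URL does not belong to this domain at all.
    return None


print(expected_config_name("https://www.itsgoingdown.org/feed", "itsgoingdown.org"))
print(expected_config_name("https://something.itsgoingdown.org/post", "itsgoingdown.org"))
```

The first call prints itsgoingdown.org.txt and the second prints .itsgoingdown.org.txt, which is why the dotted file name is never picked up for this feed.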

with this content:
body: //section[@class='red-border']

If I look at the first URL in the feed you provided, it’s currently this article.

If I open ‘view source’ in my browser and search for “red-border”, I only find one instance in a section element:

<section class="col-lg-6 col-md-6 col-sm-12 red-border double front">

If this is the element you’re trying to extract, the XPath selector //section[@class='red-border'] will not work, because it only matches <section> elements whose class attribute is exactly that value, i.e., <section class="red-border">.

You will need to use something like the following:

body: //section[contains(@class, 'red-border')]

This selector should work fine on this site, but using contains() in this way means you could also be selecting elements such as <section class="red-border-round double front">, which might not be what you want. The closest XPath equivalent of the CSS selector section.red-border is the following expression:

body: //section[contains(concat(' ',normalize-space(@class),' '),' red-border ')]
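The difference between the three selectors can be sketched in plain Python by applying the same string tests to a class attribute value. This is only an emulation of the XPath predicates, not an XPath engine; the function names are made up for illustration, and normalize_space approximates the XPath 1.0 function of the same name.

```python
import re


def normalize_space(value):
    # Approximates XPath normalize-space(): trim and collapse whitespace runs.
    return " ".join(part for part in re.split(r"[ \t\r\n]+", value) if part)


def exact_match(class_attr, name):
    # Emulates [@class='name']: the whole attribute must equal the name.
    return class_attr == name


def substring_match(class_attr, name):
    # Emulates [contains(@class, 'name')]: any substring hit counts.
    return name in class_attr


def token_match(class_attr, name):
    # Emulates [contains(concat(' ', normalize-space(@class), ' '), ' name ')].
    return " " + name + " " in " " + normalize_space(class_attr) + " "


site_class = "col-lg-6 col-md-6 col-sm-12 red-border double front"
other_class = "red-border-round double front"

for check in (exact_match, substring_match, token_match):
    print(check.__name__, check(site_class, "red-border"), check(other_class, "red-border"))
```

On the class value from the page, the exact comparison fails while the other two succeed. On the red-border-round example, only the substring test matches, which is the false positive the longer expression avoids.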

Hope that’s some help.

ungovernable · October 14, 2020, 9:20am

Thank you for the great support! Very helpful and detailed response