Missing text in extracted body

anarcho · November 7, 2023, 5:07pm

Hi,

We use full-text-RSS to extract article content on an RSS aggregator website. A bug was reported to us today:

It seems like anarchistischefoderation is only grabbing the main part of indymedia germany articles, with the class “field-name-body”, and not the “introduction/abstract” with the class “field-name-field-abstract”. People often write a separate abstract that is NOT included in the main article part. That means when only the article part is grabbed, the first part of the article is often missing, because people usually cut it out to put it in the abstract/summary/introduction…

Here one example:
the indymedia germany article WITH abstract/summary/introduction: (B) Demo am 25.11. Gegen Tom Schwarz: Frauenschläger aus der City jagen! | de.indymedia.org
The grabbed article at anarchistischefoderation.de WITHOUT the abstract, only main part: (B) Demo am 25.11. Gegen Tom Schwarz: Frauenschläger aus der City jagen! – 🏴 Anarchistische Föderation

We tried to fix the issue by creating a custom config file, de.indymedia.org.txt, and directly targeting the elements that contain the full article text including the introduction, like this:

`body: //div[@id='#block-system-main']
body: //div[@class='content']
body: //div[@class='node']
body: //div[@class='node-artikel.content']
`

But the result is the same and the introduction is still missing. What are we doing wrong?

Thanks!

HolgerAusB · November 7, 2023, 6:11pm

Thank you for trying it yourself first. That is a good way to understand how it works.

Here are some tips you should know:

The first matching body selector wins. Any body selectors after it are then ignored.

To concatenate different parts of the source page, you can use the pipe symbol | (see the example below).

The XPath form //div[@class='foobar'] only matches when foobar is the ONLY value of the class attribute, which is not the case here. Instead you need (mind the pipe):
body: //div[contains(@class, 'field-name-field-abstract')] | //div[contains(@class, 'field-name-body')]
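
Two side notes on these selectors, offered as a sketch rather than something tested against the live page. First, contains() is a plain substring test, so it would also match a class such as field-name-body-teaser (a hypothetical name, used only for illustration). If that ever causes false matches, the usual XPath 1.0 idiom for matching one whole class token is:

```
# stricter variant: match the class as a whole space-separated token
body: //div[contains(concat(' ', normalize-space(@class), ' '), ' field-name-field-abstract ')] | //div[contains(concat(' ', normalize-space(@class), ' '), ' field-name-body ')]
```

Second, XPath is not CSS. In //div[@id='#block-system-main'] the # is taken literally as part of the id value, and 'node-artikel.content' is read as one class string containing a dot, so neither matches. Assuming the element's id really is block-system-main, the XPath form would be:

```
# assumption: the page has a div whose id is exactly "block-system-main"
body: //div[@id='block-system-main']
```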

If you are now missing the image (it is outside the abstract class), you can add a selector for field-name-field-bild to the body selector above, either between the two parts or in front of them.
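
A minimal sketch of that, assuming the image wrapper is a div carrying the class field-name-field-bild (not verified here):

```
# assumption: the image sits in a div with class field-name-field-bild
body: //div[contains(@class, 'field-name-field-bild')] | //div[contains(@class, 'field-name-field-abstract')] | //div[contains(@class, 'field-name-body')]
```

An XPath 1.0 union is a node set, which processors normally return in document order, so the position of the image term inside the expression most likely does not change where the image ends up in the extracted article.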

Personally, I prefer to use a single selector that returns the whole article and then strip the unnecessary parts.

body: //div[contains(@class, 'node-artikel')]

### remove fringe

# easy mode: this matches even when class has more than one value
strip_id_or_class: field-name-field-lizenzliste
# OR the full xpath
strip: //div[contains(@class, 'field-name-field-lizenzliste')]

test_url: https://de.indymedia.org/node/315930

This example is incomplete; you may want to add more strip rules.

If you want to use body: //div[@class='content'] from your example, you additionally need to prevent some of the automatic cleanup by adding prune: no on a line of its own:

body: //div[@class='content']
prune: no

You will need more strip rules there too.

Try it out and come back here to get more help or report success. You may also share your config in our repository on GitHub, or post it here instead.

anarcho · November 7, 2023, 6:23pm

Thank you for the fast answer! It works perfectly with this custom config.