Unable to get second page from https://www.pfefferminzia.de/strategie-diese-fehler-in-der-altersvorsorge-sollten-sie-vermeiden/


Article: Diese Fehler in der Altersvorsorge sollten Sie vermeiden

next_page_link: //li[@class='next']/a
body: //p[@class='teaser'] | //div[@class='artical-content-area']

We get the error ‘This article appears to continue on subsequent pages which we could not extract’. Any ideas what’s happening? It’s the same for every multipage article on that site.


I don’t know why this happens; I am just a user. However, the link to the next page appears a second time in the HTML source, as a link element with rel='next'. I didn’t need a body statement: FTR found everything necessary without any config, except for the next page link. Other articles on this site may still need the body identifier.

Additionally, I stripped the “mehr-zum-thema” box.

Complete working config is:

next_page_link: //link[@rel='next']
strip_id_or_class: mehr-zum-thema-outer

test_url: https://www.pfefferminzia.de/strategie-diese-fehler-in-der-altersvorsorge-sollten-sie-vermeiden/
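
For anyone who wants to see what that next_page_link expression matches, here is a minimal sketch in Python. The markup, the example.com URLs and the helper name find_next_page are all made up for illustration; the real page will look different, and this is not how FTR itself is implemented.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urljoin

# Hypothetical, simplified markup; the real page on pfefferminzia.de will differ.
SAMPLE_PAGE = """<html>
  <head>
    <title>Example article</title>
    <link rel="canonical" href="https://example.com/article/"/>
    <link rel="next" href="/article/2/"/>
  </head>
  <body>
    <div class="artical-outer"><p class="teaser">Teaser text</p></div>
  </body>
</html>"""


def find_next_page(markup, base_url):
    """Return the absolute URL of the <link rel="next"> element, or None."""
    root = ET.fromstring(markup)
    # ElementTree only supports a subset of XPath; ".//link[@rel='next']"
    # plays the role of the config's //link[@rel='next'] here.
    link = root.find(".//link[@rel='next']")
    if link is None or not link.get("href"):
        return None
    # Resolve relative hrefs against the URL of the current page.
    return urljoin(base_url, link.get("href"))


print(find_next_page(SAMPLE_PAGE, "https://example.com/article/"))
```

The sample has to be well-formed XML because ElementTree is not an HTML parser; for real pages a tolerant HTML parser would be needed.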

Please let me know if this works and if you have additional tweaks.
Will you publish your config on GitHub so others can benefit from it? If not, I can do that for you.

OK, I just saw that the article teaser is missing, so we do need a body identifier after all. Yours did not include the teaser either, so here is my suggestion:

body: //div[@class='artical-outer']
#body: //p[@class='teaser'] | //div[@class='artical-content-area']
author: //div[@class='artical-author-info']/p/strong

next_page_link: //link[@rel='next']

strip_id_or_class: mehr-zum-thema-outer
strip_id_or_class: artical-download
strip_id_or_class: reading-time-ad-word

test_url: https://www.pfefferminzia.de/strategie-diese-fehler-in-der-altersvorsorge-sollten-sie-vermeiden/
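
To illustrate what the body and strip_id_or_class lines are meant to do together, here is a rough Python sketch. The markup and the helper name extract_body are invented for the example, and the substring matching on id and class is my assumption about how strip_id_or_class behaves, not a copy of FTR's code.

```python
import xml.etree.ElementTree as ET

# Hypothetical markup; only the class names are taken from the config above.
SAMPLE = """<html><body>
<div class="artical-outer">
  <p class="teaser">Teaser text.</p>
  <div class="artical-content-area"><p>Body text.</p></div>
  <div class="mehr-zum-thema-outer"><p>Related links.</p></div>
  <div class="artical-download"><p>Download box.</p></div>
</div>
</body></html>"""

STRIP = ("mehr-zum-thema-outer", "artical-download", "reading-time-ad-word")


def should_strip(element, needles):
    """True if the element's id or class contains one of the substrings."""
    haystack = (element.get("id") or "") + " " + (element.get("class") or "")
    return any(needle in haystack for needle in needles)


def extract_body(markup, strip=STRIP):
    """Select the body container, drop unwanted boxes, return plain text."""
    root = ET.fromstring(markup)
    body = root.find(".//div[@class='artical-outer']")
    if body is None:
        return None
    # Collect (parent, child) pairs first so the tree is not mutated
    # while it is being iterated.
    doomed = [
        (parent, child)
        for parent in body.iter()
        for child in parent
        if should_strip(child, strip)
    ]
    for parent, child in doomed:
        parent.remove(child)
    # Collapse whitespace so the result is easy to compare.
    return " ".join("".join(body.itertext()).split())


print(extract_body(SAMPLE))
```

In this sample the teaser and the article text survive, while the related-links and download boxes are dropped.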

I fixed another small problem and have now uploaded my config to GitHub: PR1068

Thanks, that worked. I didn’t spot the next page link in the header.