Unable to get second page from https://www.pfefferminzia.de/strategie-diese-fehler-in-der-altersvorsorge-sollten-sie-vermeiden/

extractor_fan · March 17, 2023, 4:39pm

Hi,

Article: Diese Fehler in der Altersvorsorge sollten Sie vermeiden

next_page_link: //li[@class='next']/a
body: //p[@class='teaser'] | //div[@class='artical-content-area']

We get the error ‘This article appears to continue on subsequent pages which we could not extract’. Any ideas what’s happening? It’s the same for any multipage article on that site.

Thanks!

HolgerAusB · March 17, 2023, 5:37pm

I don’t know why this happens; I am just a user. But the link to the next page appears a second time in the HTML source. I didn’t need a body statement: FTR found all the necessary things without any config, except for the next page link. Maybe other articles on this site need the body identifier.

Additionally, I stripped the “mehr-zum-thema” box.

Complete working config is:

next_page_link: //link[@rel='next']
strip_id_or_class: mehr-zum-thema-outer

test_url: https://www.pfefferminzia.de/strategie-diese-fehler-in-der-altersvorsorge-sollten-sie-vermeiden/

Please let me know if this works and if you have additional tweaks.
Will you provide your config on GitHub so others can benefit from it? If not, I can do this for you.
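
To illustrate what the next_page_link expression above selects, here is a minimal Python sketch. It is not how FTR works internally; it only evaluates the same XPath idea against a made-up, simplified page. The markup, the example.com URL, and the helper name find_next_page are all assumptions for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified page markup (not copied from the real site).
PAGE = """<html>
  <head>
    <title>Example article</title>
    <link rel="next" href="https://example.com/article/2/" />
  </head>
  <body>
    <div class="artical-content-area"><p>Page one text.</p></div>
  </body>
</html>"""


def find_next_page(markup):
    """Return the href of the first <link rel="next"> element, or None."""
    root = ET.fromstring(markup)
    # ElementTree supports a subset of XPath; ".//link[@rel='next']"
    # corresponds to the "//link[@rel='next']" expression in the config.
    link = root.find(".//link[@rel='next']")
    return link.get("href") if link is not None else None


next_url = find_next_page(PAGE)
print(next_url)
```

ElementTree needs well-formed markup, so for real pages an HTML-tolerant parser would be required; the sketch only shows which element the expression targets.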

HolgerAusB · March 17, 2023, 6:01pm

OK, I just saw that the article teaser is missing, so we do indeed need a body identifier. But yours did not include the teaser, so here is my suggestion:

body: //div[@class='artical-outer']
#body: //p[@class='teaser'] | //div[@class='artical-content-area']
author: //div[@class='artical-author-info']/p/strong

next_page_link: //link[@rel='next']

strip_id_or_class: mehr-zum-thema-outer
strip_id_or_class: artical-download
strip_id_or_class: reading-time-ad-word

test_url: https://www.pfefferminzia.de/strategie-diese-fehler-in-der-altersvorsorge-sollten-sie-vermeiden/
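
As a rough illustration of the strip_id_or_class lines above, here is a Python sketch. It assumes the directive removes every element whose id or class contains the given string; the markup and the function are hypothetical and do not reproduce FTR's own code or the site's real HTML.

```python
import xml.etree.ElementTree as ET

# Hypothetical article wrapper, loosely modelled on the class names in the config.
ARTICLE = """<div class="artical-outer">
  <p class="teaser">Teaser text.</p>
  <div class="reading-time-ad-word">3 min</div>
  <div class="artical-content-area"><p>Body text.</p></div>
  <div class="mehr-zum-thema-outer"><a href="#">Related</a></div>
</div>"""


def strip_id_or_class(root, needles):
    """Remove elements whose id or class contains any of the given strings."""
    # Collect (parent, child) pairs first so the tree is not modified
    # while it is being iterated.
    doomed = []
    for parent in root.iter():
        for child in parent:
            label = child.get("id", "") + " " + child.get("class", "")
            if any(needle in label for needle in needles):
                doomed.append((parent, child))
    for parent, child in doomed:
        parent.remove(child)
    return root


root = strip_id_or_class(
    ET.fromstring(ARTICLE),
    ["mehr-zum-thema-outer", "artical-download", "reading-time-ad-word"],
)
remaining = [el.get("class") for el in root]
print(remaining)
```

In this sketch the teaser and the content area survive inside the outer wrapper, which is the effect the config above aims for.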

HolgerAusB · March 19, 2023, 8:35am

I fixed another small problem and have now uploaded my config to GitHub: PR1068

extractor_fan · March 21, 2023, 9:07am

Thanks, that worked. I didn’t spot the next page link in the header.