Next page problem

Hi,
Unfortunately I can’t find a way to load the following pages using the next_page_link directive. Where am I going wrong? Thank you.

single_page_link: //a[contains(@href, 'slide=1')]
next_page_link: //*[@id="__next"]
body: //div[contains(concat(' ',normalize-space(@class),' '),' content ')]
test_url: https://guestofaguest.com/new-york/restaurants/the-10-trendiest-outdoor-brunch-spots-in-nyc?slide=1

Already tried out:

  • //*[@id="__next"]
  • /html/body/div[1]/div[1]/div/div[1]/div[1]/div[2]/div/div/div[1]/article/header/div[4]/div/a[2]/div
  • /html/body/div[1]/div[1]/div/div[1]/div[1]/div[2]/div/div/div[1]/article/header/div[4]/div/a[2]
  • //span[@class='fa fa-2x fa-fw fa-vc fa-angle-right']/a
  • //div[@class='fa fa-2x fa-fw fa-vc fa-angle-right']

A couple things:

  1. You shouldn’t expect next_page_link to work if single_page_link is present and matches. single_page_link is supposed to be the link that loads the entire article on one page, e.g. a print view or single-page view. It was always intended as a better alternative to next_page_link (if the site offers a single-page view, of course). In this case, it looks like you’re using it to match the first page rather than a single-page view.

  2. Based on a quick look at the source of the page, I think your next_page_link XPath expression needs to select based on the inner child element. Something like:

    //a[contains(@href, 'slide=') and ./div[contains(@class, 'fa-angle-right')]]/@href
    

That should match the right link, given markup like this:

<div>
<a href="the-10-trendiest-outdoor-brunch-spots-in-nyc?slide=3"><div class="fa fa-angle-left"></div></a>
<a href="the-10-trendiest-outdoor-brunch-spots-in-nyc?slide=5"><div class="fa fa-angle-right"></div></a>
</div>

See http://www.xpathtester.com/xpath/8cfbbec801d826a75a1545fd66d59b74 (click ‘Test’)
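
For anyone who wants to check the selection logic locally, here is a minimal sketch in Python using only the standard library. ElementTree's XPath subset has no contains() function, so the same conditions are written out as plain Python rather than run as XPath. The function name find_next_href is just an illustrative choice, and the markup is the sample from above. It only handles well-formed markup like the sample, so a real page would need a proper HTML parser.

```python
import xml.etree.ElementTree as ET

# Sample markup from the post above (well-formed, so an XML parser can read it).
SAMPLE = """
<div>
<a href="the-10-trendiest-outdoor-brunch-spots-in-nyc?slide=3"><div class="fa fa-angle-left"></div></a>
<a href="the-10-trendiest-outdoor-brunch-spots-in-nyc?slide=5"><div class="fa fa-angle-right"></div></a>
</div>
"""


def find_next_href(markup):
    """Return the href of the first <a> whose href contains 'slide=' and that
    has a direct child <div> with 'fa-angle-right' in its class, else None."""
    root = ET.fromstring(markup.strip())
    for a in root.iter("a"):
        href = a.get("href", "")
        if "slide=" not in href:
            continue
        # findall("div") only looks at direct children, like ./div in the XPath.
        for div in a.findall("div"):
            if "fa-angle-right" in div.get("class", ""):
                return href
    return None


print(find_next_href(SAMPLE))
```

On the sample this picks the slide=5 link (the one wrapping the right-pointing arrow) and skips the slide=3 link.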

Thank you - the page offers a single page view function but unfortunately this does not work ;-(. If I change the post link to makefulltextfeed.php?url=https://guestofaguest.com/new-york/restaurants/the-10-trendiest-outdoor-brunch-spots-in-nyc?slide=1, it works; otherwise it doesn’t. So the problem is the combination of single_page_link and next_page_link. What is the reason for this, and are there alternatives?

single_page_link: //a[contains(@href, 'slide=1')]
next_page_link: //a[contains(@href, 'slide=') and ./div[contains(@class, 'fa-angle-right')]]/@href
body: //div[contains(concat(' ',normalize-space(@class),' '),' content ')]
test_url: https://guestofaguest.com/new-york/restaurants/the-10-trendiest-outdoor-brunch-spots-in-nyc?slide=1

Suggestion - couldn’t you make it so that the custom config is processed in order and the results are cached and merged at the end? With this source:

  1. you would have to read the first page
  2. then check whether a single-page or next-page link exists
  3. if so, read all of those pages as well
  4. merge everything together.

body: //div[contains(concat(' ',normalize-space(@class),' '),' item-content ')]
single_page_link: //a[contains(@href, 'slide=1')]
next_page_link: //a[contains(@href, 'slide=') and ./div[contains(@class, 'fa-angle-right')]]/@href
body: //div[contains(concat(' ',normalize-space(@class),' '),' content ')]
test_url: https://guestofaguest.com/new-york/restaurants/the-10-trendiest-outdoor-brunch-spots-in-nyc
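
The four steps suggested above could be sketched roughly like this in Python. This is only an illustration of the proposed order of processing, not how Full-Text RSS works internally. The names collect_pages, fetch, find_single, find_next and the in-memory PAGES dictionary are all made up for the example, and the link finders stand in for whatever the single_page_link and next_page_link expressions would return.

```python
def collect_pages(start_url, fetch, find_single, find_next, max_pages=20):
    """Read the first page, follow a single-page link once if there is one,
    then keep following next-page links, and merge all bodies in order."""
    seen = {start_url}
    page = fetch(start_url)                      # step 1: read the first page
    bodies = [page["body"]]

    single = find_single(page)                   # step 2: look for a single-page link
    if single and single not in seen:
        seen.add(single)
        page = fetch(single)
        bodies.append(page["body"])

    while len(bodies) < max_pages:               # step 3: read all following pages
        nxt = find_next(page)
        if not nxt or nxt in seen:               # stop at the end or on a loop
            break
        seen.add(nxt)
        page = fetch(nxt)
        bodies.append(page["body"])

    return "\n\n".join(bodies)                   # step 4: merge everything


# Hypothetical stand-in for the site: each page knows its body and its links.
PAGES = {
    "article": {"body": "intro", "single": "article?slide=1", "next": None},
    "article?slide=1": {"body": "slide 1", "single": "article?slide=1", "next": "article?slide=2"},
    "article?slide=2": {"body": "slide 2", "single": "article?slide=1", "next": "article?slide=3"},
    "article?slide=3": {"body": "slide 3", "single": "article?slide=1", "next": None},
}

merged = collect_pages(
    "article",
    fetch=PAGES.__getitem__,
    find_single=lambda p: p["single"],
    find_next=lambda p: p["next"],
)
print(merged)
```

The seen set and max_pages cap are there so that a link pointing back to an already visited page (as slide=1 does here) cannot cause an endless loop.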

The problem here is that you’re using single_page_link for a different purpose than it was intended for. We do that ourselves in some site config files (e.g. to bypass cookie warnings which require users to click a ‘continue’ button). But it was intended for articles split across multiple pages. In that scenario you’re usually offered a ‘single page’ view and/or ‘next page/previous page’ links. By following the ‘single page’ link, we expect to be getting the full content, so it doesn’t make sense to then look for next page/previous page links on the resulting page.

What’s needed here is something that works in a similar way to single_page_link, but is instead intended as a simple ‘follow this link’ directive with no assumption as to what the new page will contain. That’s unfortunately not available at the moment.

Thanks for the feedback - understood. It would be great if there were such logic, so that the sequence in the config determines what is processed and how. That would cover most scenarios and give much more flexibility. Thanks.