NYT Listicle is missing all content

When I try to send the article below, I get the big top image with the books and no other content. The entire book list is missing :frowning:

-Using Push to Kindle browser extension 
-URL: https://www.nytimes.com/interactive/2024/books/best-books-21st-century.html

Not sure if you need a subscription to check this out, but here is a share link if you need it to see the entire article:

@fivefilters I can’t fix that without a test system. I tried with my wallabag and the wallabagger browser addon, which works similarly to the P2K addon, but the results are weird.

NYT uses several different layouts, so there are already 7 different body selectors in the config. The first one is a concatenating selector, which could prevent the following selectors from working.

With all of them deactivated, I get all the real content, BUT I also get a large section with consent stuff, cookie management and other things that don’t belong to the article text. Sadly, it sits above the real content. I used the gift-code URL.

I tried several selectors that should work, but the result is always the same:

# tested only one body selector at a time
body: //main[1]
body: //main[@id='site-content']
body: //article[1]
body: //article[@id='interactive']
body: //section[@id='best-books-21st-century']
# and even, to test only for the first book:
body: //section[@id='book-100']

And of course, I tried to strip the overlay fringe, but that does not work either:

strip: //script
strip: //*[contains(@class, 'overlay')]
strip_id_or_class: fides-overlay
strip_id_or_class: fides-modal-footer
strip_id_or_class: app
strip_id_or_class: fides-modal

When using the plain URL without the gift-code part together with an unlocker plugin, I don’t get the consent stuff, but a literal JSON/CSS block above the content: {"css":["https://static01.nytimes.com/new.... The content that follows also contains all 100 books, even with body: //section[@id='book-100'], which should return only the first book.
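To make the expectation concrete, here is a minimal sketch of what that selector should do on a static DOM. The markup is a hypothetical, well-formed stand-in (only the ids come from the selectors above), and it uses Python's built-in ElementTree, which needs `.//` where the site config uses `//`. It is not how wallabag or P2K evaluate the config.

```python
import xml.etree.ElementTree as ET

# Hypothetical, heavily simplified stand-in for the page; only the ids are
# taken from the selectors tested above.
SAMPLE = """
<html><body>
  <div id="fides-overlay">Cookie consent</div>
  <main id="site-content">
    <section id="best-books-21st-century">
      <section id="book-100"><h2>Book 100</h2></section>
      <section id="book-99"><h2>Book 99</h2></section>
    </section>
  </main>
</body></html>
"""


def select_text(markup: str, xpath: str) -> list[str]:
    """Return the stripped text of every node matched by the XPath."""
    root = ET.fromstring(markup)
    return ["".join(node.itertext()).strip() for node in root.findall(xpath)]


# The narrow selector matches exactly one section.
single = select_text(SAMPLE, ".//section[@id='book-100']")
# The wrapper selector matches one node that contains every book.
whole = select_text(SAMPLE, ".//section[@id='best-books-21st-century']")

print(single)
print(whole)
```

If the real output still contains every book with the narrow selector, one possible explanation is that the text is not being taken from the node the selector matches.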

I don’t know how P2K handles this. But since I have to wait for the next full hour before I can check the result after uploading a new version, testing this way could cause problems for other users/customers on regular NYT pages for hours or days.


Hi @HolgerAusB,

I need to revisit this, but I think we handle NYT content via the JSON. There’s no way for us to express this with XPath, so I think we look for JSON that contains the content: if we find it, we extract the content from the JSON; if we don’t, we continue with the site config file.

NYT is pretty much the only site we do this for, if I remember correctly.
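A rough sketch of that JSON-first, config-second order might look like the following. Everything specific here is an assumption for illustration: the `page-data` script id, the `{"body": [{"text": ...}]}` shape, and the function names are hypothetical and not the actual P2K code or the real NYT payload.

```python
import json
import re
import xml.etree.ElementTree as ET
from typing import Optional

# Hypothetical marker for the embedded data; the real page may differ.
JSON_BLOCK = re.compile(
    r'<script type="application/json" id="page-data">(.*?)</script>', re.DOTALL
)


def extract_from_json(markup: str) -> Optional[str]:
    """Return article text from an embedded JSON block, or None if unusable."""
    match = JSON_BLOCK.search(markup)
    if match is None:
        return None
    try:
        data = json.loads(match.group(1))
    except json.JSONDecodeError:
        return None
    if not isinstance(data, dict):
        return None
    # Hypothetical shape: {"body": [{"text": "..."}, ...]}
    parts = [
        item["text"]
        for item in data.get("body", [])
        if isinstance(item, dict) and "text" in item
    ]
    return "\n".join(parts) or None


def extract_with_selectors(markup: str, selectors: list[str]) -> Optional[str]:
    """Try each body selector in order; the first one that matches wins."""
    root = ET.fromstring(markup)
    for xpath in selectors:
        node = root.find(xpath)
        if node is not None:
            return "".join(node.itertext()).strip()
    return None


def extract(markup: str, selectors: list[str]) -> Optional[str]:
    """Prefer embedded JSON; fall back to the site config selectors."""
    return extract_from_json(markup) or extract_with_selectors(markup, selectors)


selectors = [".//main[@id='site-content']", ".//article"]
with_json = (
    '<html><body><script type="application/json" id="page-data">'
    '{"body": [{"text": "Book 100"}, {"text": "Book 99"}]}'
    "</script><main>Ignored</main></body></html>"
)
without_json = '<html><body><main id="site-content">Plain article</main></body></html>'

from_json = extract(with_json, selectors)
from_config = extract(without_json, selectors)
print(from_json)
print(from_config)
```

In a flow like this, the body selectors never run once usable JSON is found, which would be consistent with selector changes having no visible effect on such pages.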
