The Cut Article Sent Text with No Headings

The etiquette guide here only sent the main text without the numbers/section headings when I used Push to Kindle. I basically got a wall of text that skips between subjects with no obvious reason.

Thanks for taking a look! I know there is a lot going on in this particular article.

Side-by-side comparison (original article vs. what showed up on my Kindle):

I am not a dev to Fulltext-RSS. I could partly solve your problem by adding a third body-identifyer body: //section[@class='body']. The sub headings then will shown but there is a line break between the number and the sub-heading:

grafik

I can’t find a way to re-write the html to prevent this, which might be original by css or script in the browser:

<h2 class="clay-subheader" data-editable="text" data-uri="www.thecut.com/_components/clay-subheader/instances/cldm04gt6004v3b6vczei3idn@published">
  <style></style>
  <span class="ordered-list-item"><p class="list-item-text">1.</p></span>
    You don’t have to read everyone’s book.
</h2>

You now have to wait, until @fivefilters or other dev approved my PR #1045 already has been approved within minutes. Thanks Jérémy.

Thanks for reporting this @Ana, and thanks @HolgerAusB for the improvements in extraction.

The extraction looks a little better for the Kindle now @Ana, so please feel free to try sending it again please let us know if you still have content issues.

@fivefilters I could manage to prevent the line-break between the list-number and the sub-heading. by raplacing the <p class...> with <span>.

But I was quite surprised that also the closing </p> behind it was automatically/magically replaced by a </span>. Is that safe?

find_string: <p class="list-item-text">
replace_string: <span>

See PR #1047

Response explaining why that happens here:

:slight_smile:

Yes, @fivefilters. But (because of my English) I still don’t know, if it is OK for you to use this for minor cosmetic changes as in my suggestion/PR above (it’s allready merged, by the way).

Or should I only use this for neccessary changes like in frmplus.de.txt line 17+18, where I transform a not shown image galery (carausel) to single images (see 1st test_url).

I think you noticed that I try to implement requests from other users here and that I become a regular contributor on github. So I need to know for future commits, when I should use these ‘incomplete’ tag-replacements and when not.

Hi @HolgerAusB, generally speaking, our approach to site config files is (1) to prioritise capture of the entire article text content, and then (2) as a secondorary priority to improve formatting/image preservation. With changes like the one we’re discussing, I’m always a little wary that rules intended for (2) might cause issues for (1).

So I’d say if using this approach, where we rely on the HTML parser to fix incorrect markup that we introduce, for a better final result, falls into category 1, we should do it, but if it falls into 2, maybe we don’t. For minor comsetic changes, my recommendation is that you don’t use this method, just because it is so dependent on HTML parsing rules, which themselves are trying to make the best of a bad situation. If the source HTML changes in such a way that the parser chooses to apply different rules for the opening and closing tags, we may end up with unpredictable results.

No need to change anything already committed. And in the future there should be site config rules to better handle such cases with relying on replace_string(), once those are in place, these kind of changes can be made without worrying about the parser.

Finally, thank you for all the help with site config files here and on Github. Very much appreciated.