The Cut Article Sent Text with No Headings

Ana · February 9, 2023, 5:35am

The etiquette guide here only sent the main text without the numbers/section headings when I used Push to Kindle. I basically got a wall of text that skips between subjects with no obvious reason.

Thanks for taking a look! I know there is a lot going on in this particular article.

Side-by-side comparison (original article vs. what showed up on my Kindle):

HolgerAusB · February 9, 2023, 2:58pm

I am not a dev to Fulltext-RSS. I could partly solve your problem by adding a third body-identifyer body: //section[@class='body']. The sub headings then will shown but there is a line break between the number and the sub-heading:

grafik

I can’t find a way to re-write the html to prevent this, which might be original by css or script in the browser:

<h2 class="clay-subheader" data-editable="text" data-uri="www.thecut.com/_components/clay-subheader/instances/cldm04gt6004v3b6vczei3idn@published">
  <style></style>
  <span class="ordered-list-item"><p class="list-item-text">1.</p></span>
    You don’t have to read everyone’s book.
</h2>

~~You now have to wait, until @fivefilters or other dev approved~~ my PR #1045 already has been approved within minutes. Thanks Jérémy.

fivefilters · February 9, 2023, 3:31pm

Thanks for reporting this @Ana, and thanks @HolgerAusB for the improvements in extraction.

The extraction looks a little better for the Kindle now @Ana, so please feel free to try sending it again please let us know if you still have content issues.

HolgerAusB · February 11, 2023, 3:39am

@fivefilters I could manage to prevent the line-break between the list-number and the sub-heading. by raplacing the <p class...> with <span>.

But I was quite surprised that also the closing </p> behind it was automatically/magically replaced by a </span>. Is that safe?

find_string: <p class="list-item-text">
replace_string: <span>

See PR #1047

fivefilters · February 16, 2023, 3:03pm

Response explaining why that happens here:

github.com/fivefilters/ftr-site-config

replace_string for <tag> also replaces </tag>. That is nice, but is this safe?

opened 08:54PM - 11 Feb 23 UTC

HolgerAusB

@fivefilters, @j0k3r : I just stumbled upon a nice feature when using `replace_s…tring` to change html-tags. If I change a html tag to a different tag, the matching closing-tag is also changed magically in Fulltext-RSS. So my questinion is: Will this work with other projects using this site_config as well? graby, wallabag, others? Or might it cause problems, if someone does not explicitly replace the closing tag too. Or might there be problems with FTR/P2K too? Example: Original html: ``` <h2 class="clay-subheader" ...> <span class="ordered-list-item"> <p class="list-item-text">1.</p> </span> You don’t have to read everyone’s book. </h2> ``` site-config: ``` find_string: <p class="list-item-text"> replace_string: <anydifferenttag> ``` result: ``` <h2 class="clay-subheader" ...> <span class="ordered-list-item"> <anydifferenttag>1.</anydifferenttag> </span> You don’t have to read everyone’s book. </h2> ``` single-line replacement also works as well as with tag fragments (missing parameters and `>`): ``` replace_string(<h2): <someuglytag ``` result: ``` <someuglytag class="clay-subheader" ...> <span class="ordered-list-item"> <p class="list-item-text">1.</p> </span> You don’t have to read everyone’s book. </someuglytag> ```

HolgerAusB · February 16, 2023, 4:51pm

Yes, @fivefilters. But (because of my English) I still don’t know, if it is OK for you to use this for minor cosmetic changes as in my suggestion/PR above (it’s allready merged, by the way).

Or should I only use this for neccessary changes like in frmplus.de.txt line 17+18, where I transform a not shown image galery (carausel) to single images (see 1st test_url).

I think you noticed that I try to implement requests from other users here and that I become a regular contributor on github. So I need to know for future commits, when I should use these ‘incomplete’ tag-replacements and when not.

fivefilters · February 19, 2023, 10:09am

Hi @HolgerAusB, generally speaking, our approach to site config files is (1) to prioritise capture of the entire article text content, and then (2) as a secondorary priority to improve formatting/image preservation. With changes like the one we’re discussing, I’m always a little wary that rules intended for (2) might cause issues for (1).

So I’d say if using this approach, where we rely on the HTML parser to fix incorrect markup that we introduce, for a better final result, falls into category 1, we should do it, but if it falls into 2, maybe we don’t. For minor comsetic changes, my recommendation is that you don’t use this method, just because it is so dependent on HTML parsing rules, which themselves are trying to make the best of a bad situation. If the source HTML changes in such a way that the parser chooses to apply different rules for the opening and closing tags, we may end up with unpredictable results.

No need to change anything already committed. And in the future there should be site config rules to better handle such cases with relying on replace_string(), once those are in place, these kind of changes can be made without worrying about the parser.

Finally, thank you for all the help with site config files here and on Github. Very much appreciated.