Text body is missing

Jean-Louis · February 12, 2024, 1:16pm

Hello,
I am trying to retrieve an article from a public website, “Cell”, which is a scientific journal on biology.
Most of the text has disappeared: I got one diagram and the list of references, but the body of the article is missing. The URL is:
https://www.cell.com/cell/fulltext/S0092-8674(22)01377-0
Is there anything I can try to improve the result?
Thanks & best regards Jean-Louis

HolgerAusB · February 12, 2024, 3:31pm

@Jean-Louis, welcome and thank you for reporting this. The HTML source code of that URL does not contain any content. The real content is loaded by JavaScript in a real browser. P2K cannot do this itself.
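To illustrate what "does not contain any content" means, here is a minimal Python sketch that counts the visible text in a static HTML string, ignoring scripts and styles. The two sample pages are hypothetical and are not the real cell.com markup; this is only a rough check, not how P2K itself decides anything.

```python
from html.parser import HTMLParser


class VisibleTextCounter(HTMLParser):
    """Counts characters of text found outside <script> and <style>."""

    def __init__(self):
        super().__init__()
        self._skip_depth = 0    # > 0 while inside <script> or <style>
        self.visible_chars = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth == 0:
            self.visible_chars += len(data.strip())


def visible_text_length(html: str) -> int:
    """Return the number of visible text characters in the static HTML."""
    parser = VisibleTextCounter()
    parser.feed(html)
    parser.close()
    return parser.visible_chars


# Hypothetical page shell: the article would only arrive later via JavaScript.
js_shell = '<html><body><div id="app"></div><script>loadArticle();</script></body></html>'
# Hypothetical static page: the article text is already in the HTML.
static_page = "<html><body><article><p>Cells divide by mitosis.</p></article></body></html>"

print(visible_text_length(js_shell))
print(visible_text_length(static_page))
```

A server-side extractor that never runs JavaScript only ever sees something like the first page, which is why a real browser (and the browser plugin) is needed here.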

Normally this could be done with the browser plugin you find at Push to Kindle Installation. Desktop browsers only!

In this case, additional tweaks need to be made on the P2K server side. The main problem is that I am only a normal user and have no direct access to the P2K server. I try to write site-dependent configs for ‘wallabag’ and hope that they work for P2K as well, which they do most of the time.

But your article is very, very long, so wallabag runs into a timeout while trying to process it. Could you provide me with the URL of a significantly shorter article on cell.com?

Maybe I can look into that issue on the weekend.

Jean-Louis · February 12, 2024, 4:28pm

Good afternoon HolgerAusB,

Thanks for your reply.

Yes, indeed they are quite long. Here are some shorter ones with the same issue:
https://www.cell.com/cell/fulltext/S0092-8674(23)01349-1
https://www.cell.com/cell/fulltext/S0092-8674(23)01402-2

Is there any way to tell “Push to Kindle” to start working at a given word, e.g. “Summary”, and to stop at a given word, e.g. “Acknowledgment”, to improve the result?

What I have tried is to click on “pastepad” to open an editor, then copy and paste into it the browser content I want to see on my Kindle, and then remove some unneeded text corresponding to the references.
This works quite well, except that all diagrams and figures are lost.

Have a nice day.
With Best Regards Jean-Louis

HolgerAusB · February 13, 2024, 12:19am

Yes, that is the way it works. But not by choice of the user. Without a site-dependent config, the P2K server tries to predict the relevant parts itself. For about 50 percent of websites, that works in a more or less usable way.

Most of the time a config file delivers better results, and sometimes we don’t get any result without one.

If you are interested, you’ll find these configs on GitHub. The same configs are also used by other products like Full-Text RSS or wallabag.

Just to be clear: The decision as to what can and cannot be seen from an article is made on the server and always applies to all users and all products.

Removing the citations and cross-references would make the result smaller, but scientists, journalists, book authors, etc. would probably miss them.

I’ll try to write a config at the weekend. But Cell is using Cloudflare to prevent crawlers from fetching their content, which makes things difficult. No promises.

Jean-Louis · February 14, 2024, 8:14am

Good morning HolgerAusB

Thanks for your reply.

I have read some examples on GitHub, such as “wsj.com.txt”, but they are not obvious!

When using the “Print Edit WE” extension for Firefox, I select the frame containing the body of the main text, then click on “delete except”, and all surrounding frames disappear in two clicks. Is there something like “unstrip” or “keep only” in the config files?

Issue: when I load “Push to Kindle” to export this work, I am forced to close the page, and all manual editing is lost.
So I copy and paste to another page, or save as HTML and reload it, before using “Push to Kindle”.

Have a nice week…

With Best regards Jean-Louis

HolgerAusB · February 16, 2024, 10:30pm

Hi @Jean-Louis,

That’s the way it works. With body: <xpath> we select the part of the page we want to keep, and then we strip unwanted elements within this body. Here you find a small introduction.
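The "select a body, then strip inside it" idea can be sketched in Python. This is only an illustration under stated assumptions: the sample page, the two paths, and the helper name extract_body are hypothetical, and Python's ElementTree (with its limited XPath subset) stands in for the real extraction engine, which is not written in Python.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed sample page; the real cell.com markup differs.
PAGE = """
<html><body>
  <div id="nav">Menu</div>
  <div id="article">
    <p>First paragraph.</p>
    <div class="ad">Buy now</div>
    <p>Second paragraph.</p>
  </div>
  <div id="footer">Footer</div>
</body></html>
""".strip()


def extract_body(page: str, body_path: str, strip_path: str) -> ET.Element:
    """Keep only the element matched by body_path, minus strip_path matches."""
    root = ET.fromstring(page)

    # Step 1 ("body"): everything outside this element is discarded.
    body = root.find(body_path)
    if body is None:
        raise ValueError("body path matched nothing")

    # Step 2 ("strip"): remove unwanted elements inside the body.
    # ElementTree has no parent pointers, so we scan each possible parent.
    unwanted = {id(el) for el in body.findall(strip_path)}
    for parent in list(body.iter()):
        for child in list(parent):
            if id(child) in unwanted:
                parent.remove(child)
    return body


article = extract_body(PAGE, ".//div[@id='article']", ".//div[@class='ad']")
print(" ".join("".join(article.itertext()).split()))
```

The navigation and footer disappear because they are outside the selected body, which is the "keep only" behaviour asked about; the ad disappears because it is stripped from inside it.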

I just made a config for cell.com. But as I wrote before: as Cell is using Cloudflare’s anti-bot service, you need to use the Push to Kindle browser plugin in a desktop browser. So please try again.

Unfortunately, the result for the big article from your first post doesn’t contain any images, and I don’t know why. For the two shorter articles, the P2K plugin fetches all images.

You will see some ugly line breaks before every citation footnote. I could not manage to keep these footnotes on the same line as the preceding text. I could remove the footnotes entirely, but that would not help other users (students, scientists) who may need them when archiving the article, e.g. with wallabag.

Maybe @fivefilters has some ideas about the missing images from https://www.cell.com/cell/fulltext/S0092-8674(22)01377-0 or about the citation footnotes?

Jean-Louis · February 18, 2024, 7:13pm

Hi HolgerAusB,

Thanks for your cell.com file.
I am indeed using “push to kindle” Firefox plugin and I get the images.

The large number of line breaks each time there is a citation makes it difficult to read.
So even if I agree that references are useful in general, having them this way is useless: I think nobody can use them like this. So I think it would be better to drop all of them, if you agree (and if it is easy to implement).

With Best regards Jean-Louis

fivefilters · February 20, 2024, 4:45am

Thanks for the site config, @HolgerAusB! I had a look at the footnote issue and my suggestion would be to try string replacement to change the block element <div> to an inline element <span>. I haven’t tested this, but I’m thinking something like:

replace_string(<div class="dropBlock reference-citations"><sup>):<span><sup>
replace_string(</sup><div):</sup><span><div

HolgerAusB · February 20, 2024, 6:01am

@fivefilters I already tried similar replacements, which I unfortunately deleted from my config.

The problem is that your replacement and mine only work for single footnotes, not when there are two or more comma-separated footnotes ^{2, 3, 4}

I tested that with wallabag, and I think that wallabag, as archiving software, should keep these footnotes, so the config has to work there. The main problem is that wallabag automatically strips span elements.

Your replace code also cuts off the article of the first test_url after footnote 5 (in wallabag).

HolgerAusB · February 21, 2024, 1:30am

@fivefilters I have now found a dirty workaround for wallabag (self-hosters only). I set a custom.css in my wallabag:

.nolinebreak {
    display: inline;   /* or: display: inline-block; */
}

and then I added that class to the div via config-rule:

replace_string(<div class="dropBlock reference-citations">):<div class="dropBlock reference-citations nolinebreak">
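As a rough illustration of what this rule does, here is a Python sketch that assumes replace_string behaves like a plain literal find-and-replace on the raw HTML. The sample fragments are hypothetical and may not match the real cell.com markup.

```python
FIND = '<div class="dropBlock reference-citations">'
REPLACE = '<div class="dropBlock reference-citations nolinebreak">'


def add_nolinebreak_class(html: str) -> str:
    """Replace every literal occurrence of FIND; no HTML parsing is involved."""
    return html.replace(FIND, REPLACE)


# Hypothetical fragment with one citation wrapper.
sample = 'as shown before<div class="dropBlock reference-citations"><sup>2</sup></div>, and'
print(add_nolinebreak_class(sample))

# A literal match is strict: a wrapper whose class attribute is written
# differently (hypothetical variant) is left untouched.
variant = '<div class="reference-citations dropBlock"><sup>3</sup></div>'
print(add_nolinebreak_class(variant) == variant)
```

Because only the opening tag changes and the element stays a div, the closing tag still matches and the markup remains well-formed.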

I don’t know the difference between inline and inline-block, but both versions worked here.

So, is there any class in P2K’s CSS that does little more than ‘display: inline[-block]’, which we could inject via a rule? Or could you add such a CSS rule, please?

Jean-Louis · February 24, 2024, 9:20am

Good morning HolgerAusB,

Based on this exchange I have tried wallabag, but I get an error for both URLs:
wallabag can’t retrieve contents for this article. Please troubleshoot this issue. …

Does “my” wallabag take into account “your” new cell.com.txt file from GitHub?
I am using “https://app.wallabag.it/” with a new account I created today.

Thanks & best Regards Jean-Louis

HolgerAusB · February 24, 2024, 12:23pm

Unfortunately, wallabag has no workflow for automatically updating site configs from the FiveFilters repo. That only happens when they release a new version. The last one was in early January.

Self-hosters of wallabag may copy the site-config to
/<path-to-wallabag>/vendor/j0k3r/graby-site-config/

After each change to the config you need to clear wallabag’s cache; on my Debian system that is:
sudo -u www-data /<path-to-wallabag>/bin/console --env=prod cache:clear

Don’t ask me how that works for a dockerized wallabag.

For your CSS, you should add a file custom.css to the folder <path-to-wallabag>/web

fivefilters · February 25, 2024, 5:01am

Thanks @HolgerAusB. It would be interesting to test and see if the Kindle will respect that CSS rule. I’ll see if I can create a file to experiment with.

If it doesn’t work I’ll experiment with some string replacement to see if we can remove the wrapping <div> elements which cause the new lines. I think it’s an odd choice from the site to use block-level <div> tags around inline <sup> elements and then use CSS to override the block-level display.

HolgerAusB · February 25, 2024, 5:38pm

I don’t think I’ve given it enough thought. P2K probably sends the HTML to the Kindle as it is.

This means that there should be a corresponding rule in Amazon’s CSS that we can reuse.

Alternatives:
Can we send a custom.css file with the e-book? Or inject a <style> in the <head> of the page? Or write the style attribute directly into the <div class="..." style="display: inline-block;">?
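To make the last two alternatives concrete, here is a small Python sketch of both as literal string edits on the HTML. The helper names and the sample page are hypothetical, and whether the Kindle or Amazon's conversion honours either form is exactly the open question; the sketch only shows what the resulting markup would look like.

```python
CITATION_DIV = '<div class="dropBlock reference-citations">'
STYLED_DIV = '<div class="dropBlock reference-citations" style="display: inline-block;">'


def inject_style_block(html: str, css: str) -> str:
    """Insert a <style> element just before the first </head>."""
    return html.replace("</head>", f"<style>{css}</style></head>", 1)


def inline_style(html: str) -> str:
    """Write the display rule directly onto each citation <div>."""
    return html.replace(CITATION_DIV, STYLED_DIV)


# Hypothetical page; the real article markup may differ.
page = (
    "<html><head><title>t</title></head><body>text"
    '<div class="dropBlock reference-citations"><sup>2</sup></div>'
    "</body></html>"
)

print(inject_style_block(page, ".nolinebreak { display: inline; }"))
print(inline_style(page))
```

The style-block variant still needs the nolinebreak class on the div to have any effect, while the inline-style variant is self-contained but repeats the rule on every citation.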