incomplete text retrieval

bugmenot · April 2, 2023, 1:41pm

HolgerAusB · April 2, 2023, 7:21pm

I am not fivefilters staff, but I tried to write a site_config for you. As I don't speak Russian, it is hard for me to tell where the article starts and where it ends. Could you please give us the first and the last sentence of the article that you expect to see in the Push to Kindle excerpt?

I have already managed to scrape either most of the text or the images, but not both. What confuses me is that the text gets cut off when the rule that rewrites the image markup is active. The article then ends with который их обрамляет and the middle image of the gallery of small images that follows, which does not seem necessary for the article, does it?

The other sections of polka.academy seem to use a different structure, so I will need a while to figure that out. This could take until the weekend, sorry.

HolgerAusB · April 5, 2023, 2:46pm

@fivefilters A first version of the site_config is ready for review in PR1074.

I could not manage to strip the footer part after ‘Другие Статьи’ (“Other Articles”).

I tried:

body: //main/div/div
strip: //main/div[1]/div[2]

and

body: //main/div/div[1]
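
As a rough illustration of what the strip rule above is meant to do, here is a minimal Python sketch using the standard library's xml.etree.ElementTree. The markup is a hypothetical, heavily simplified stand-in for the page, and this is not how FTR itself is implemented; it only shows the second div under the first div of main being removed.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified markup: first inner div = article, second = footer teaser.
html = """<main><div>
  <div><p>Article text</p></div>
  <div><p>Related articles teaser</p></div>
</div></main>"""

main = ET.fromstring(html)

# Rough equivalent of "strip: //main/div[1]/div[2]", relative to <main>.
container = main.find("./div[1]")
footer = container.find("./div[2]")
if footer is not None:
    container.remove(footer)

# Collect the paragraph texts that are left after stripping.
remaining = [p.text for p in main.iter("p")]
print(remaining)
```

If the real page nests the blocks differently, the positional predicates would select something else, which is one reason a class-based selector can be more robust.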

I also want to check other pages and sections of the website at the weekend.

@bugmenot It would still be helpful to know where the article ends.

fivefilters · April 5, 2023, 3:37pm

@HolgerAusB Thanks! Will look into why the strip XPath didn’t match.

HolgerAusB · April 11, 2023, 2:29pm

Hi @fivefilters, I tried to find out myself but couldn’t get to the bottom of it. I extracted the <main> part of the source that I fetched with curl, wrapped it in html and body tags, and pasted it into XPather. Only //* matches anything; neither //main nor //div nor anything else finds a match, which confused me a lot. My idea was that the Cyrillic letters and/or a wrong doctype might be part of the problem.

Nonetheless, stripping the header part already works in FTR:
strip: //main/div[1]/div[1]/div[1]

but not the footer (teaser to other/similar articles):
strip: //main/div[1]/div[2]

EDIT 1:
OK, just found the problem with xpather. There are four blocks in the article, similar to

<i class="_2Yu0T _3_dH7">
  <svg xmlns="http://www.w3.org/2000/svg" width="30" height="15" viewBox="0 0 30 15">
    <path fill-rule="evenodd" fill="#333" d="M21.467 14.998l-.784-.8L27.726 8H0V7h27.733L20.683.796l.784-.8 8.523 7.5L21.467 15z">
    </path>
  </svg>
</i>

After renaming the svg attribute xmlns to xmlnx, xpather matches all the XPath queries I want.

So is it possible that FTR has the same problem as xpather?

Unfortunately, I can’t replace ‘xmlns’ or ‘<svg’ before the body definition or the strip: //main/div[1]/div[2] rule is applied, which means that the fix or workaround has to be in your code.

EDIT 2:
I think the attribute value has to be in single quotation marks rather than the double quotes used on the page:

wrong: <svg xmlns="http://www.w3.org/2000/svg"...
correct: <svg xmlns='http://www.w3.org/2000/svg'...

The search queries only work with the second example.

fivefilters · April 13, 2023, 1:48pm

Still haven’t had a chance to look, but will do. This could be a namespace issue, or maybe some kind of parsing issue.

HolgerAusB · April 13, 2023, 4:17pm

‘namespace’ was a good hint to google for, thank you! xpather matches the queries when they are written in a different form, even with the namespace declaration still in the svg tag:

//*[name()='main']/*[name()='div']/*[name()='div'][1]
# or
//*[local-name()='main']/*[local-name()='div']/*[local-name()='div'][1]

But unfortunately FTR still does not find that part in the source with this new query.
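
To make the namespace hint more concrete, here is a minimal Python sketch with the standard library's xml.etree.ElementTree. It does not reproduce the behaviour of xpather or FTR; it only illustrates, on hypothetical simplified markup, that a default xmlns declaration changes which names match inside the svg subtree when a namespace-aware XML parser is used. The `{*}` wildcard assumes Python 3.8 or newer.

```python
import xml.etree.ElementTree as ET

SVG_NS = "http://www.w3.org/2000/svg"

# Hypothetical, simplified stand-in for the page markup discussed above.
doc = """<html><body><main><div>
  <div>header</div>
  <div>footer</div>
  <i><svg xmlns="http://www.w3.org/2000/svg" width="30" height="15"><path d="M0 0"/></svg></i>
</div></main></body></html>"""

root = ET.fromstring(doc)

# Elements outside the svg subtree are in no namespace, so plain names match.
divs = root.findall(".//main/div/div")

# svg and its children are in the SVG namespace, so the plain name finds nothing.
plain_svg = root.findall(".//svg")

# Matching needs the namespace-qualified name ...
qualified_svg = root.findall(".//{" + SVG_NS + "}svg")

# ... or a namespace wildcard (Python 3.8+), similar in spirit to local-name().
wildcard_path = root.findall(".//{*}path")

print(len(divs), len(plain_svg), len(qualified_svg), len(wildcard_path))
```

In this parser the declaration only affects the svg element and its descendants, so a tool in which //main stops matching is presumably handling the markup differently.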

But I have now realized that the HTML changed at some point, so I managed to strip the footer with:
strip: //div[@class='_1AXbp _2QZGT'][last()]/div[@class='_2ezKP _1Abfj _3AP5M'][last()]

I also did some more cleanup in PR1087.

But of course you should still look into that namespace issue when you find some time.

bugmenot · April 14, 2023, 11:36pm

The desired text starts with В издательстве Individuum выходит книга and ends with Необходимы, я бы даже сказал.

bugmenot · April 14, 2023, 11:43pm

It seems to me now that the text of the article is retrieved as expected; only the images are not preserved. At the same time, some other articles produce an empty result (1, 2).

HolgerAusB · April 15, 2023, 5:54am

@bugmenot The Kindle preview looks good for all three links: plenty of text and plenty of images.

Maybe you tried before my new configs arrived at pushtokindle.com?

But for the last link P2K delivers an incomplete preview. This article is very, very, very long and has a huge number of images. @fivefilters, is there any limit on the length of articles or of the preview? FTR does deliver the full content.

bugmenot · April 15, 2023, 10:12pm

Here’s how the previews look for me. For the article from the OP only the first image is displayed:

The other two are just empty:

bugmenot · April 15, 2023, 10:15pm

Oh I see, the preview seems to look as expected when the “Use browser-retrieved content” option is disabled.

fivefilters · April 19, 2023, 8:20pm

@bugmenot We’ll try to make that switch more visible in a future update. It can often help with retrieval problems.

@HolgerAusB Thank you for the site config. As for limits, there is no hard limit on what we process at the moment, but large documents can take a long time to process, especially if they have a lot of images, as in this case. There are also size limits on how big the final EPUB can be when it’s sent by email. In this case, sending with images failed for me, but after I disabled images in Push to Kindle, it sent fine.

FigrHed · June 14, 2023, 11:40am

Hi, I had the same problem with this article:

It only gave me the last two paragraphs.
Never had this problem with P2K before!
Tried on two separate devices with the same result.

HolgerAusB · June 14, 2023, 2:12pm

The result is site-dependent. For some sources P2K needs a config file; for other sites it works without one.
I have now written a config that might fix that. Please wait until a dev accepts my PR1092.

HolgerAusB · June 15, 2023, 6:01am

@FigrHed Config should now be live, please reload. The text should be complete now. Please report if something is missing or if some crap should be removed.

@fivefilters Unfortunately, some of the images are duplicated in the P2K preview, while everything is perfect in FTR. As I can’t test this in P2K, could you look into it? Maybe by stripping //noscript?

fivefilters · June 19, 2023, 9:36pm

@HolgerAusB, that does happen occasionally when sites use lazy loading. Push to Kindle browser extensions send the HTML after JS has been executed. So the HTML being processed isn’t exactly the same as the one FTR sees. Hopefully improved in future versions. At the moment users can click ‘Edit’ and toggle the ‘Use browser-retrieved content’ switch. Then Push to Kindle will make a server request for the content and the results should look closer to FTR.
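
For readers wondering how lazy loading can lead to doubled images, here is a minimal Python sketch of the //noscript idea raised above. The markup is a hypothetical example of a page after JavaScript has run (a loaded image plus its noscript fallback), and the code is only an illustration with the standard library, not what Push to Kindle or FTR actually does.

```python
import xml.etree.ElementTree as ET

# Hypothetical post-JavaScript markup: the lazy image is loaded,
# and the <noscript> fallback still carries a second copy.
page = """<article>
  <p>Text</p>
  <img src="photo.jpg"/>
  <noscript><img src="photo.jpg"/></noscript>
</article>"""

root = ET.fromstring(page)
before = len(root.findall(".//img"))

# Rough equivalent of "strip: //noscript": remove every noscript element.
for parent in list(root.iter()):
    for child in list(parent):
        if child.tag == "noscript":
            parent.remove(child)

after = len(root.findall(".//img"))
print(before, after)
```

Whether this is the right fix depends on the site: on pages fetched without JavaScript, the noscript block may hold the only usable copy of the image.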