Hi @bugmenot,
I am not fivefilters staff, but I tried to write a site_config for you. As I do not speak Russian, it is hard for me to tell where the article starts and where it ends. Could you please give us the first and the last sentence of the article that you expect to see in the Push to Kindle excerpt?
I already managed to scrape either most of the text or the images, but not both. What confuses me is that the text gets cut off when the rule that rewrites the image markup is active. The article then ends with ‘который их обрамляет’ (“which frames them”) and the middle image of the following gallery of small images, which does not seem necessary for the article, does it?
The other sections of polka.academy seem to use a different structure, so I will need a while to figure that out. This could take until the weekend, sorry.
@fivefilters, a first version of the site_config is ready for review in PR1074.
I could not manage to strip the footer part after ‘Другие Статьи’ (“Other articles”).
body: //main/div/div
strip: //main/div/div
I also want to check other pages and sections of the website at the weekend.
@bugmenot, it would still be helpful to know where the article ends.
@HolgerAusB Thanks! Will look into why the strip XPath didn’t match.
Hi @fivefilters, I tried to find out myself but couldn’t get to the bottom of it. I extracted the <main> part of the source that I fetched with curl, wrapped it in html and body tags, and pasted that into XPather. Only //* matches anything; neither //div nor anything else finds a match, which confused me a lot. My guess was that the Cyrillic letters and/or a wrong doctype could be part of the problem.
Nonetheless, stripping the header part in FTR already works:
but not the footer (teaser to other/similar articles):
OK, just found the problem with xpather. There are four blocks in the article, similar to
<i class="_2Yu0T _3_dH7"> <svg xmlns="http://www.w3.org/2000/svg" width="30" height="15" viewBox="0 0 30 15"> <path fill-rule="evenodd" fill="#333" d="M21.467 14.998l-.784-.8L27.726 8H0V7h27.733L20.683.796l.784-.8 8.523 7.5L21.467 15z"> </path> </svg> </i>
After renaming the svg attribute to xmlnx, XPather matches all the XPath queries I want.
So is it possible that FTR has the same problem as xpather?
Unfortunately I can’t replace ‘xmlns’ or ‘<svg’ before the body definition or the strip: //main/div/div rule is applied, which means there would have to be a fix or workaround in your code.
I think the attribute value has to be in single quotation marks rather than the double quotes used here:
The search queries only work with the second example.
Still haven’t had a chance to look, but will do. This could be a namespace issue, or maybe some kind of parsing issue.
‘namespace’ was a good hint to search for, thank you! XPather matches the queries when they are written in a different form, even with the wrong namespace in the svg tag:
//*[name()='main']/*[name()='div']/*[name()='div']
# or
//*[local-name()='main']/*[local-name()='div']/*[local-name()='div']
But unfortunately FTR still does not find that part in the source with this new query.
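For anyone following along, the namespace effect can be reproduced outside XPather. This is only a minimal Python sketch using the standard library's namespace-aware xml.etree.ElementTree. It is not the parser FTR or XPather uses, and the markup is a shortened, made-up stand-in for the <main> part of the page:

```python
import xml.etree.ElementTree as ET

SVG_NS = "http://www.w3.org/2000/svg"

# Shortened, hypothetical stand-in for the <main> part of the page source.
SAMPLE = (
    "<main><div><div>"
    '<i class="_2Yu0T _3_dH7">'
    '<svg xmlns="http://www.w3.org/2000/svg" width="30" height="15">'
    '<path fill="#333" d="M0 0"/>'
    "</svg></i>"
    "</div></div></main>"
)

root = ET.fromstring(SAMPLE)


def count(path):
    """Number of elements below the root that match an ElementTree path."""
    return len(root.findall(path))


# An unprefixed name only matches elements that are in no namespace.
plain_path = count(".//path")
# The same element is found once its namespace is spelled out ...
qualified_path = count(".//{%s}path" % SVG_NS)
# ... or ignored with the {*} wildcard (Python 3.8+), similar to local-name().
any_ns_path = count(".//{*}path")
# Elements outside the svg are not in the SVG namespace in this parser.
plain_div = count(".//div")

print(plain_path, qualified_path, any_ns_path, plain_div)
```

In this parser only the elements inside the svg end up in the SVG namespace, so a plain div query still matches. The sketch therefore illustrates why name() or local-name() helps for namespaced elements, but it does not explain why XPather rejects //div as well.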
I have now realized that the HTML changed at some point, so I managed to strip the footer with:
strip: //div[@class='_1AXbp _2QZGT'][last()]/div[@class='_2ezKP _1Abfj _3AP5M'][last()]
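To spell out how I read that strip rule: within each parent, take the last child div with the first class string, and inside it remove the last child div with the second class string. Below is a rough Python sketch of that reading. The sample page structure and the helper names are made up for illustration, and this is not how FTR evaluates the rule internally:

```python
import xml.etree.ElementTree as ET

OUTER = "_1AXbp _2QZGT"
INNER = "_2ezKP _1Abfj _3AP5M"

# Hypothetical, heavily simplified page structure.
SAMPLE = (
    "<main>"
    f'<div class="{OUTER}"><p>Article text</p></div>'
    f'<div class="{OUTER}">'
    f'<div class="{INNER}">Gallery</div>'
    f'<div class="{INNER}">Teaser to other articles</div>'
    "</div>"
    "</main>"
)


def last_child_div(parent, css_class):
    """Last direct child <div> whose class attribute equals css_class exactly."""
    matches = [c for c in parent if c.tag == "div" and c.get("class") == css_class]
    return matches[-1] if matches else None


def strip_footer(root):
    """Remove what the strip XPath would select; return the number of removals.

    The root element itself is never treated as a candidate, only its descendants.
    """
    removed = 0
    for parent in list(root.iter()):
        outer = last_child_div(parent, OUTER)
        if outer is None:
            continue
        inner = last_child_div(outer, INNER)
        if inner is not None:
            outer.remove(inner)
            removed += 1
    return removed


root = ET.fromstring(SAMPLE)
removed = strip_footer(root)
remaining = ET.tostring(root, encoding="unicode")
print(removed)
print(remaining)
```

Note that the class comparison is an exact string match, just like @class='...' in the XPath, so the rule stops matching as soon as the site changes or reorders those generated class names.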
I also did some more cleanup in PR1087.
But of course you should still look into that namespace issue when you find some time.
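The desired text starts with ‘В издательстве Individuum выходит книга’ (“A book is coming out from the Individuum publishing house”) and ends with ‘Необходимы, я бы даже сказал.’ (“Necessary, I would even say.”).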
It seems to me now that the text of the article is retrieved as expected; only the images are not preserved. At the same time, some other articles produce an empty result (1, 2).
@bugmenot The Kindle preview looks good for all three links: plenty of text and plenty of images.
Maybe you tried before my new configs arrived at
But for the last link P2K delivers an incomplete preview. This article is very, very long and has a huge number of images. @fivefilters, is there any limit on the length of articles or of the preview? FTR itself does deliver the full content.
Here’s how the previews look for me. For the article from the OP only the first image is displayed:
The other two are just empty:
Oh, I see: the preview looks as expected when the “Use browser-retrieved content” option is disabled.
@bugmenot We’ll try to make that switch more visible in a future update. It can often help with retrieval problems.
@HolgerAusB Thank you for the site config. As for limits, there is no hard limit on what we process at the moment, but large documents can take a long time to process, especially if they have a lot of images, as in this case. There are also size limits on how big the final EPUB can be when it’s sent by email. In this case, sending with images failed for me, but after I disabled images in Push to Kindle, it sent okay.