Partial grab from wp.pl

gawkla · January 8, 2023, 9:39am

Only a small part of each article is grabbed from wp.pl.

example:

HolgerAusB · January 9, 2023, 9:07pm

Welcome. Please use the categories for your posts. I assume you mean Fulltext-RSS, don't you?

If you are self-hosting Fulltext-RSS, place a file named .wp.pl.txt (don't forget the leading dot!) in your site_config/custom folder. The only content of this file is body: //article.

Unfortunately I don't speak Polish, so I don't know where the article really ends and what is unwanted user comments, advertisements or other rubbish.

So maybe you could give us more examples and name the start and end of the good article AND the rubbish. Then we might be able to write a config for the hosted service.

gawkla · January 9, 2023, 9:41pm

Thanks! Sorry for the dumb questions, I am new here.
Could you tell me where the site_config/custom folder is? I will try your solution. I am using the Firefox / Chrome extension to send web articles to my Kindle.

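This article is parsed well by Firefox 'reader mode'. The last sentences should be: "Jednak do dziś, kiedy latają samoloty, boję się. Zostało mi to z tyłu głowy, że zaraz będzie ostrzał." (In English, roughly: "But to this day, when planes are flying, I am afraid. It has stayed in the back of my head that shelling is about to start.")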

HolgerAusB · January 9, 2023, 9:58pm

You can do this yourself ONLY when you are self-hosting Fulltext-RSS (FTR) on your own server! Your question about the path suggests that you are not self-hosting, otherwise you would know, I think. The folder is in the root path of your FTR installation.

If you are using the service at http://ftr.fivefilters.org/ you have to wait until someone has the time to look over more details of the site config. Maybe I will find some time on the weekend. After approval by the owner, this will then go live.

For our example, the extracted article does indeed end with the given sentence. I just need to kill some advertisement placeholders / 'Reklama' icons.

Could you send some more example links from different subdomains/sections for testing? wp.pl has more subdomains. Or do you need it just for wiadomosci.wp.pl?

fivefilters · January 12, 2023, 2:30pm

Thanks @HolgerAusB and @gawkla,

We’ll have a separate forum for Push to Kindle issues soon, so we can better handle the separate issues. But Push to Kindle also uses the site config rules, and we’ve now added one for this site, which should fix the extraction problem: ftr-site-config/.wp.pl.txt at master · fivefilters/ftr-site-config · GitHub

@gawkla, if you’re able to, please try using Push to Kindle again on this article and let us know if you still have a problem.

As for removing the ad placeholder images: I don't think this is easy to do because of their reliance on React and generated class names (a sad trend). The image URL itself isn't that different from the other images, so we opted to just ignore them.

HolgerAusB · January 12, 2023, 3:22pm

I had not investigated the placeholder images yet. Thanks for looking at it, @fivefilters

HolgerAusB · January 12, 2023, 3:30pm

@fivefilters, I don't know if it is reliable, but the adverts seem to start with a div whose @class contains two class names, each 7 characters long, while other classes have 2x8 or more. I tried to use a strip: //div[string-length()... but couldn't manage to form a correct XPath.

HolgerAusB · January 12, 2023, 6:35pm

Found it:
strip: //div[string-length(@class) = 15]

I also found some more things to strip. PR follows soon.
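
To illustrate what that predicate selects, here is a minimal Python sketch using only the standard library. It is not how FTR evaluates XPath; it just mimics the string-length(@class) = 15 test. The class names and markup are made up for illustration (two 7-character names plus one space give 7 + 1 + 7 = 15), and the ClassLengthFinder helper is hypothetical.

```python
from html.parser import HTMLParser


class ClassLengthFinder(HTMLParser):
    """Collect class values of <div> tags whose class attribute has a given length."""

    def __init__(self, length):
        super().__init__()
        self.length = length
        self.matches = []

    def handle_starttag(self, tag, attrs):
        if tag != "div":
            return
        cls = dict(attrs).get("class")
        # Like XPath string-length(@class), count every character,
        # including the space between the two class names.
        if cls is not None and len(cls) == self.length:
            self.matches.append(cls)


# Made-up markup, NOT taken from wp.pl.
sample = (
    "<article>"
    '<div class="abcdefg hijklmn">ad placeholder</div>'    # 7 + 1 + 7 = 15
    '<div class="abcdefgh ijklmnop">article text</div>'    # 8 + 1 + 8 = 17
    '<div class="fifteencharsxyz">also 15 chars</div>'     # one 15-character name
    "</article>"
)

finder = ClassLengthFinder(15)
finder.feed(sample)
print(finder.matches)
```

In this sketch the third div matches as well, although it has a single class name: the test only looks at the total length of the attribute value, which is why I am not sure how reliable it is.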

gawkla · January 12, 2023, 7:57pm

@HolgerAusB Many thanks! It looks far better. I have checked a few more articles from this site and it really looks like the full content can be parsed.
It still needs some polish around the advertisements, but the most important thing is that the full content is parsed!

fivefilters · January 13, 2023, 1:43pm

@gawkla Great!

@HolgerAusB Thanks for the additional changes. I commented out the ad placeholder stripping because it feels a little too brittle and unreliable (as you commented) relying on the class attribute string length. It does work, but in these situations we usually like to err on the side of allowing a few undesirable elements through rather than risk desired content being stripped in the future due to changes to the source site. (Also kept your more targeted removals in the header without removing the header itself.)

HolgerAusB · January 26, 2023, 2:33am

@fivefilters what do you think about stripping the advert placeholder image by img size?

This one works:
strip: //img[@width='56' and @height='45']
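
As a rough illustration of what this rule matches, here is a small Python sketch using only the standard library. It is not FTR code; it only mimics the @width='56' and @height='45' test, which compares the attribute values as strings. The file names, the markup and the ImgSizeFilter helper are made up for the example.

```python
from html.parser import HTMLParser


class ImgSizeFilter(HTMLParser):
    """Sort <img> tags into stripped/kept by exact width and height attributes."""

    def __init__(self, width, height):
        super().__init__()
        self.size = (width, height)
        self.stripped = []
        self.kept = []

    def handle_starttag(self, tag, attrs):
        if tag != "img":
            return
        a = dict(attrs)
        # Both attributes must be present and equal to the target strings.
        if (a.get("width"), a.get("height")) == self.size:
            self.stripped.append(a.get("src"))
        else:
            self.kept.append(a.get("src"))


# Made-up markup, NOT taken from wp.pl.
sample = (
    '<img src="placeholder.png" width="56" height="45">'
    '<img src="photo.jpg" width="800" height="450">'
    '<img src="chart.png">'
)

img_filter = ImgSizeFilter("56", "45")
img_filter.feed(sample)
print("stripped:", img_filter.stripped)
print("kept:", img_filter.kept)
```

Images with other dimensions, or with no width and height attributes at all, are left alone, so only an image with exactly this size would be removed.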

fivefilters · January 27, 2023, 12:40pm

@HolgerAusB That sounds good!

HolgerAusB · January 27, 2023, 12:48pm

PR 1039

fivefilters · January 27, 2023, 12:56pm

Thank you! Merged.