More and more often missing h2 Headlines and Weblinks

didi · August 29, 2019, 10:06am

More and more often I have the problem that only parts of the content are extracted. In this case the headlines are missing. So far I haven’t found a way to include them via config. Is this a bug?

Config: body: //div[@id='content-left'] on its own, or combined with //div[@class='body-text__paragraph-header'], with prune: no

Next Problem - missing Website Links - Source: https://www.tripsavvy.com/the-best-museums-in-dallas-4767608

Config: tidy: no
body: //div[contains(concat(' ',normalize-space(@class),' '),' chop-content ')] | //ul[contains(concat(' ',normalize-space(@class),' '),' ordered-list__list ')] | //div[@class='mntl-sc-block-location__website']

Thanks Dieter

didi · September 3, 2019, 9:27am

Is this a bug or my mistake? An answer would be nice. Thanks.

fivefilters · September 3, 2019, 11:03pm

Hi Dieter,

I’ve created a site config file for Thrillist and it works for me here in preserving the content. I used prune: no and added a few rules to strip out elements: https://github.com/fivefilters/ftr-site-config/blob/master/thrillist.com.txt
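As a rough illustration of that approach (this is not the contents of the linked file), a site config that turns pruning off and then strips unwanted elements explicitly might look like the sketch below. The body expression is the one from Dieter's post; the strip targets (`social-share`, `related-articles`, `//aside`) are hypothetical placeholders and would need to be replaced with whatever the real page uses.

```
# Sketch only: strip targets below are hypothetical placeholders
body: //div[@id='content-left']

# Keep everything inside the matched body instead of letting
# the pruning step drop elements such as headlines
prune: no

# With pruning off, unwanted elements have to be removed explicitly
strip_id_or_class: social-share
strip_id_or_class: related-articles
strip: //aside
```

The trade-off is that with prune: no nothing is discarded automatically, so each piece of clutter needs its own strip rule.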

You say you’re using prune: no too, so I’m not sure why it’s not working for you. Perhaps the file is not named correctly. Can you enable debugging to see if it’s being loaded?

I haven’t looked at the second one, but I would say this is a pruning issue too. If you add prune: no (I don’t see it in the config you posted) to the site config file, it should keep those elements.
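For the second site, a minimal sketch of that suggestion is the posted body expression with prune: no added and tidy left out. It assumes the file is saved under the site's hostname (for example tripsavvy.com.txt) and it has not been tested against the page.

```
# tripsavvy.com.txt (file name assumed, untested sketch)
body: //div[contains(concat(' ',normalize-space(@class),' '),' chop-content ')] | //ul[contains(concat(' ',normalize-space(@class),' '),' ordered-list__list ')] | //div[@class='mntl-sc-block-location__website']

# Keep the matched elements (including the website links) intact
prune: no

test_url: https://www.tripsavvy.com/the-best-museums-in-dallas-4767608
```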

The use of tidy isn’t really necessary in recent versions of Full-Text RSS as we’re defaulting to the bundled HTML5 parser, which handles difficult markup better than PHP’s libxml. We used tidy before as it would often help prepare HTML that the old parser by itself wouldn’t be able to handle.

Let me know if you still have trouble with these.

didi · September 4, 2019, 7:48am

Perfect, thank you. I had set prune to no, but maybe in the wrong combination; I had tried a lot of things. Regards, Dieter