More and more often I have the problem that only parts of the content are extracted. In this case the headlines are missing. So far I haven’t found a way to include them via config. Is this a bug?
Config: body: //div[@id='content-left'] on its own or in combination with //div[@class='body-text__paragraph-header'], with prune: no
Next problem: missing website links. Source: https://www.tripsavvy.com/the-best-museums-in-dallas-4767608
Config: tidy: no
body: //div[contains(concat(' ',normalize-space(@class),' '),' chop-content ')] | //ul[contains(concat(' ',normalize-space(@class),' '),' ordered-list__list ')] | //div[@class='mntl-sc-block-location__website']
Is this a bug or my fault? An answer would be nice. Thanks.
I’ve created a site config file for Thrillist and it works for me here in preserving the content. I used prune: no and added a few rules to strip out elements: https://github.com/fivefilters/ftr-site-config/blob/master/thrillist.com.txt
You say you’re using prune: no too, so I’m not sure why it’s not working for you. Perhaps the file is not named correctly. Can you enable debugging to see if it’s being loaded?
I haven’t looked at the second one, but I would say this is a pruning issue too. If you add prune: no (I don’t see it in the config you posted) to the site config file, it should keep those elements.
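As a rough, untested sketch, the second config with prune: no added might look something like the following. The file name tripsavvy.com.txt is an assumption, following the same hostname-based naming as thrillist.com.txt, and the test_url line is an optional addition that was not in the posted config:

```
# tripsavvy.com.txt (assumed file name, matching the site's hostname)
# body rules copied from the posted config, with straight quotes
body: //div[contains(concat(' ',normalize-space(@class),' '),' chop-content ')] | //ul[contains(concat(' ',normalize-space(@class),' '),' ordered-list__list ')] | //div[@class='mntl-sc-block-location__website']

# keep elements that pruning would otherwise remove
prune: no
tidy: no

# optional: the article from the report, for testing the config
test_url: https://www.tripsavvy.com/the-best-museums-in-dallas-4767608
```

Whether this keeps the website links depends on the page markup, so treat it as a starting point rather than a confirmed fix.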
The use of tidy isn’t really necessary in recent versions of Full-Text RSS as we’re defaulting to the bundled HTML5 parser, which handles difficult markup better than PHP’s libxml. We used tidy before as it would often help prepare HTML that the old parser by itself wouldn’t be able to handle.
Let me know if you still have trouble with these.
PERFECT - thank you. I had set prune to no, but maybe in the wrong combination; I had tried a lot. Regards, Dieter