I am trying to extract some elements of a page. These elements do not appear on the original domain whose rss feed I am trying to pull. My original domain and feed is:
But when I run it through FiveFilters full text feed I get problems. Here is the FiveFilters feed:
If you look at the FiveFilters feed, there are empty bullets between the text bullets in the articles in the feed.
I am trying to grab these empty <li> elements and remove them using custom extraction rules. So far I have not had any success despite trying multiple rules. Part of my confusion is that I do not know whether I should be dealing with the original source or the FiveFilters source.
Please help walk me through what I should do to try and clean up this source using extraction rules.
This looks like it’s a result of our parsing. A few points that might help you:
- Try disabling Tidy in the site config file and enabling HTML5 parsing:
- The extraction rules (except for find_string and replace_string) apply to the parsed HTML, so if altering how the document is parsed (as suggested above) doesn’t work, you should be able to remove the empty list elements with something like:
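As a rough sketch of the two points above, assuming the usual site config directives (tidy, parser, strip) and an XPath that is only an illustration, the site config file for the domain might contain something like the following. The XPath matches any li with no visible text, so it would also remove list items that hold only an image; tighten it if that matters for your source.

```
# Skip HTML Tidy and parse with the HTML5 parser instead of libxml
tidy: no
parser: html5php

# Remove list items that contain no text (illustrative XPath, adjust as needed)
strip: //li[not(normalize-space())]
```

The strip rule runs against the parsed document, so it is worth trying the tidy and parser lines on their own first and adding the strip line only if empty items remain.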
Let me know how you get on - happy to help if you still have trouble.
Thanks for your help on this!
It worked to get rid of the extraneous li tags by adding the tidy and parser rules.
A new problem arose in that the 1/2 fraction character now reads as &12; instead of 1/2:
&12; cup vegetable oil
Would there be a fix for that as well?
Thanks for the report. I notice that the fraction appears correctly when it’s not parsed by HTML5-PHP, so this looks to me like a bug in the HTML5-PHP parser we use: https://github.com/Masterminds/html5-php (they might have fixed it in a newer release, so we’ll have to test and see). The fraction should be encoded as ½, not &12;
In the meantime you can try something like this:
And see if that helps.
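For illustration only, and assuming the find_string and replace_string directives mentioned earlier (which act on the raw HTML before it is parsed), one possible workaround of this kind swaps the named entity for the literal character so the parser never has to decode it. The entity name here is an assumption about what the source page contains:

```
# Replace the named entity with the literal character before parsing
find_string: &frac12;
replace_string: ½
```

Whether the literal character survives intact depends on the page's encoding, so check the output feed after adding it.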
This worked perfectly without the tidy and parser changes.
This is a splendid bit of code. I am quite pleased to have found it.
Thanks Josen, glad to hear it.