Problem with Extraction Rules

desk-user · August 4, 2014, 1:38am

I am trying to extract some elements of a page. These elements do not appear on the original domain whose RSS feed I am trying to pull. My original feed is:

http://www.missourilife.com/blogs/show-me-flavor/index.rss

But when I run it through FiveFilters full text feed I get problems. Here is the FiveFilters feed:

http://rss.shoutcloudstudios.com/makefulltextfeed.php?url=www.missourilife.com%2Fblogs%2Fshow-me-flavor%2Findex.rss&max=20&links=preserve&exc=1&submit=Create+Feed

If you look at the FiveFilters feed there are empty bullets “<li></li>” between the text bullets in the articles in the feed.

I am trying to grab these empty <li> elements and remove them using custom extraction rules. So far I have not had any success despite trying multiple rules. Part of my confusion is that I do not know if I should be dealing with the original source or the FiveFilters source.

Please help walk me through what I should do to try and clean up this source using extraction rules.

Thanks!

fivefilters · August 4, 2014, 9:53am

Hi Josen,

This looks like it’s a result of our parsing. A few points that might help you:

Try disabling Tidy in the site config file and enabling HTML5 parsing:

tidy: no
parser: html5php

The extraction rules (except for find_string and replace_string) apply to the parsed HTML, so if altering how the document is parsed (as suggested above) doesn’t work, you should be able to remove the empty list elements with something like:

strip: //li[not(node())]

Let me know how you get on - happy to help if you still have trouble.
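For anyone wanting to see what that XPath selects, here is a rough Python sketch using only the standard library. It is not how Full-Text RSS applies the rule internally; `strip_empty_li` is a hypothetical helper, and because `xml.etree.ElementTree` does not support the `not(node())` predicate, the check is emulated by hand (no text and no child elements). It also assumes the input is well-formed XHTML.

```python
import xml.etree.ElementTree as ET


def strip_empty_li(xhtml: str) -> str:
    """Remove <li> elements with no child nodes, roughly like //li[not(node())].

    Hypothetical helper for illustration. Caveats: the default ElementTree
    parser discards comments, so an <li> holding only a comment is treated
    as empty here, and any text following a removed <li> is dropped with it.
    """
    root = ET.fromstring(xhtml)
    # Snapshot the tree first so removals do not disturb iteration.
    for parent in list(root.iter()):
        for child in list(parent):
            is_empty = not child.text and len(child) == 0
            if child.tag == "li" and is_empty:
                parent.remove(child)
    return ET.tostring(root, encoding="unicode")


sample = "<ul><li>1 cup flour</li><li></li><li>2 eggs</li><li></li></ul>"
cleaned = strip_empty_li(sample)
```

Note that an `<li>` containing only whitespace still has a text node, so neither the XPath nor this sketch removes it.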

desk-user · August 4, 2014, 2:01pm

Keyvan,

Thanks for your help on this!

It worked to get rid of the extraneous li tags by adding the tidy and parser rules.

A new problem arose in that the 1/2 fraction character now reads as &12; instead of 1/2:

&12; cup vegetable oil

Would there be a fix for that as well?

Josen Ruiseco

fivefilters · August 4, 2014, 2:26pm

Hi Josen,

Thanks for the report. I notice that the fraction appears correctly when it’s not parsed by HTML5-PHP, so this looks to me like a bug in the HTML5-PHP parser we use - https://github.com/Masterminds/html5-php (they might have fixed it in a newer release, so we’ll have to test and see). The fraction should be encoded as ½ not &12;

In the meantime you can try something like this:

replace_string(&frac12;): ½
replace_string(&12;): ½

And see if that helps.
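As a rough illustration of what a find-and-replace rule like this amounts to, here is a small Python sketch. It assumes the replacement is a plain string substitution on the text; `apply_replace_rules` and the `rules` dictionary are made-up names for this example, not part of Full-Text RSS.

```python
def apply_replace_rules(html: str, rules: dict) -> str:
    """Apply each find -> replace pair as a literal string substitution.

    Hypothetical helper: it mimics the effect of replace_string rules,
    not the actual Full-Text RSS implementation.
    """
    for find, replacement in rules.items():
        html = html.replace(find, replacement)
    return html


# The broken entity from this thread, mapped to the intended character.
rules = {"&12;": "½"}
fixed = apply_replace_rules("<li>&12; cup vegetable oil</li>", rules)
```

Text that does not contain any of the search strings passes through unchanged, so a rule like this is safe to leave in place if the parser bug is later fixed.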

desk-user · August 4, 2014, 2:32pm

Never mind…

This worked perfectly without the tidy and parser changes.

strip: //li[not(node())]

This is a splendid bit of code. I am quite pleased to have found it.

Josen

Josen Ruiseco

fivefilters · August 4, 2014, 2:40pm

Thanks Josen, glad to hear it.