Problem with Extraction Rules

I am trying to extract some elements of a page. These elements do not appear on the original domain whose RSS feed I am trying to pull. My original feed is:

http://www.missourilife.com/blogs/show-me-flavor/index.rss

But when I run it through FiveFilters full text feed I get problems. Here is the FiveFilters feed:

http://rss.shoutcloudstudios.com/makefulltextfeed.php?url=www.missourilife.com%2Fblogs%2Fshow-me-flavor%2Findex.rss&max=20&links=preserve&exc=1&submit=Create+Feed

If you look at the FiveFilters feed, there are empty bullets “<li></li>” between the text bullets in the articles in the feed.

I am trying to grab these empty <li> elements and remove them using custom extraction rules. So far I have not had any success despite trying multiple rules. Part of my confusion is that I do not know if I should be dealing with the original source or the FiveFilters source.

Please help walk me through what I should do to try and clean up this source using extraction rules.

Thanks!

  • Hi Josen,

    This looks like it’s a result of our parsing. A few points that might help you:

    • Try disabling Tidy in the site config file and enabling HTML5 parsing:

    tidy: no
    parser: html5php

    • The extraction rules (except for find_string and replace_string) apply to the parsed HTML, so if altering how the document is parsed (as suggested above) doesn’t work, you should be able to remove the empty list elements with something like:

    strip: //li[not(node())]

    Let me know how you get on - happy to help if you still have trouble.
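To see what the strip rule above matches outside of Full-Text RSS, here is a minimal Python sketch using only the standard library. It is an approximation, not how Full-Text RSS itself applies the rule: ElementTree does not support `not(node())` in its XPath subset, so the check is written by hand, and the sample markup and the helper name `strip_empty_li` are made up for illustration.

```python
import xml.etree.ElementTree as ET

def strip_empty_li(root):
    """Remove <li> elements that have no child nodes; return how many were removed."""
    removed = 0
    for parent in list(root.iter()):
        for child in list(parent):
            # No child elements and no text: roughly what XPath's not(node()) selects.
            # (ElementTree discards comments when parsing, so a comment-only <li> also counts as empty here.)
            if child.tag == "li" and len(child) == 0 and child.text is None:
                parent.remove(child)  # note: this also drops any tail text after the <li>
                removed += 1
    return removed

# Hypothetical sample resembling the feed's recipe lists.
sample = "<ul><li>1 cup flour</li><li></li><li>2 eggs</li><li/><li> </li></ul>"
root = ET.fromstring(sample)
removed = strip_empty_li(root)
result = ET.tostring(root, encoding="unicode")
print(removed, result)
```

Note that an `<li>` holding only whitespace still contains a text node, so `//li[not(node())]` leaves it alone; a rule such as `//li[not(normalize-space())]` would be one way to catch those as well.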

    Keyvan,

    Thanks for your help on this!

    It worked to get rid of the extraneous li tags by adding the tidy and parser rules.

    A new problem arose in that the 1/2 fraction character now reads as &12; instead of 1/2:

    &12; cup vegetable oil

    Would there be a fix for that as well?

    Josen Ruiseco

    Hi Josen,

    Thanks for the report. I notice that the fraction appears correctly when it’s not parsed by HTML5-PHP, so this looks to me like a bug in the HTML5-PHP parser we use - https://github.com/Masterminds/html5-php (they might have fixed it in a newer release, so we’ll have to test and see). The fraction should be encoded as &frac12; (½), not &12;

    In the meantime you can try something like this:

    replace_string(&frac12;): ½
    replace_string(&12;): ½

    And see if that helps.
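As a rough illustration of what string replacement rules like these do, the Python sketch below applies literal find/replace pairs to raw markup before any parsing takes place. It assumes the source page writes the fraction as the `&frac12;` entity; the function name `apply_replacements` and the sample line are hypothetical and are not part of Full-Text RSS.

```python
import html

def apply_replacements(raw_html, rules):
    """Apply literal (find, replace) pairs to raw markup, in order."""
    for find, replace in rules:
        raw_html = raw_html.replace(find, replace)
    return raw_html

# Assumed rules: turn the named entity, and the broken "&12;" form, into the literal character.
rules = [("&frac12;", "½"), ("&12;", "½")]

cleaned = apply_replacements("<li>&frac12; cup vegetable oil</li>", rules)
repaired = apply_replacements("<li>&12; cup vegetable oil</li>", rules)
print(cleaned)
print(repaired)

# For comparison, Python's own entity decoder maps &frac12; to the same character.
decoded = html.unescape("&frac12;")
```

Replacing the entity with the literal character before parsing means the parser never has to decode it, which is why this can sidestep an entity-handling bug.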

    Never mind…

    This worked perfectly without the tidy and parser changes.

    strip: //li[not(node())]

    This is a splendid bit of code. I am quite pleased to have found it.

    Josen

    Josen Ruiseco

    Thanks Josen, glad to hear it. :)