How to remove repeated text from a filtered feed?

Looking at the standard site_config folder, I saw that one of my favourite feeds (http://www.anandtech.com/rss/) already has a profile. However, the profile comes from Instapaper and contains a directive that FTR doesn’t support (next_page_link), so the filtered feed only shows the first page of multi-page articles.

So I did a little work and developed the site_config below using single_page_link, but there is a frustrating problem with the site’s “PRINT THIS ARTICLE” page: it displays the feed description as well as the article. Because the feed description is repeated in the article, the result is a duplicated section at the beginning of the filtered feed.

I’ve tried to work around it with “body: substring-after(//div, 'Page 1')”, but that doesn’t appear to be evaluated. In Notepad++ with the XPatherizerNPP plugin, it evaluates to the string beginning at the first page of the article (eliminating the description from the feed).

Any suggestions for alternative ways to eliminate the chunk of repeated text (in the test URL, that’s everything before “It Was a Monster Mash…”)?

full-text-rss/site_config/custom/anandtech.com.txt

single_page_link: concat('http://www.anandtech.com/print/', substring-after(//meta[@property='og:url']/@content, '/show/'))

author: //a[@class='b'][1]
date: substring-after(substring-before(//div, 'Posted in'), ' on ')
strip: //h2
strip_image_src: /content/images/globals/

test_url: http://www.anandtech.com/show/5812/eurocom-monster-10-clevos-little-monster/

Ian, thanks a lot for this.

These things can get a little tricky at times, and there’s usually more than one way to do it. :)

Here’s what I’d suggest:

Replace your strip: //h2 line with

strip: //h2[contains(., 'Page 1')]/preceding::p
strip: //h2

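All the strip expressions are evaluated in the order they appear. Usually that’s not important, but in this case it does matter: the first expression uses the “Page 1” h2 as its anchor, so it has to run before the second expression removes the h2 elements.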
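
To illustrate why the order matters, here is a rough Python sketch using only the standard library. It does not use FTR or real XPath: the sample markup is a made-up, flat stand-in for the print page, and the two helper functions (hypothetical names) only imitate what the two strip lines would do on that simple structure.

```python
import xml.etree.ElementTree as ET

# Hypothetical stand-in for the print page: the repeated feed
# description, then the "Page 1" heading, then the article text.
SAMPLE = (
    "<div>"
    "<p>Feed description (repeated)</p>"
    "<h2>Page 1</h2>"
    "<p>It Was a Monster Mash...</p>"
    "</div>"
)


def strip_p_before_page1(root):
    """Imitates //h2[contains(., 'Page 1')]/preceding::p on flat markup."""
    children = list(root)
    for i, el in enumerate(children):
        if el.tag == "h2" and "Page 1" in (el.text or ""):
            # Remove every <p> that comes before the anchor heading.
            for earlier in children[:i]:
                if earlier.tag == "p":
                    root.remove(earlier)
            return


def strip_h2(root):
    """Imitates //h2 on flat markup: removes every <h2>."""
    for el in list(root):
        if el.tag == "h2":
            root.remove(el)


def apply_strips(strips):
    """Applies the strip functions in the given order, like FTR does."""
    root = ET.fromstring(SAMPLE)
    for strip in strips:
        strip(root)
    return [el.text for el in root]


good_order = apply_strips([strip_p_before_page1, strip_h2])
bad_order = apply_strips([strip_h2, strip_p_before_page1])

print(good_order)  # ['It Was a Monster Mash...']
print(bad_order)   # ['Feed description (repeated)', 'It Was a Monster Mash...']
```

With the headings stripped first, the anchor is already gone when the second step runs, so the repeated description survives.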

Hope that helps.

Quick update: in case the site publishes articles with 10 pages or more, better to use:

strip: //h2[. = 'Page 1']/preceding::p

My previous suggestion with contains() would also match “Page 10”, because contains() is a substring test.
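
The difference between the two predicates can be sketched with plain Python string operations. This is only an analogy for the XPath 1.0 behaviour, and the list of heading texts is invented for the example.

```python
# Invented heading texts for a hypothetical article with ten or more pages.
headings = ["Page 1", "Page 2", "Page 10"]

# contains(., 'Page 1') is a substring test, so "Page 10" matches too.
contains_matches = [h for h in headings if "Page 1" in h]

# . = 'Page 1' compares the whole string value, so only "Page 1" matches.
exact_matches = [h for h in headings if h == "Page 1"]

print(contains_matches)  # ['Page 1', 'Page 10']
print(exact_matches)     # ['Page 1']
```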

Keyvan,

What can I say - that is exactly the outcome I was looking for - perfect, thank you. I’m going to have to pick up a good book on XPath and find out more of those “other ways” ;)

Ian

Ian B

Glad to hear it! :)

Word of warning, though: many books and online references cover XPath 2.0. PHP does not yet, as far as I know, support XPath 2.0, so if you do look anything up, make sure it’s XPath 1.0 compatible.