Combining two XPATHs in body: (site configuration for


I’ve been trying to edit a custom config site for and the body: is not working. Here’s my config:

single_page_link: //link[contains(@href, 'm.theregister')]
strip: //div[@class='wptl btm']
body: //div[contains(@class,'article_head')]//h2 | //div[@id='body']

The strip is one I’d recommend adding to the custom config in the next release of FTR. I haven’t yet had occasion to test the single_page_link (but the Reg doesn’t seem to use /PRINT/ anymore, so the default site config for it will need updating anyway). The body is what is not working: it gets the body of the articles (//div[@id='body']) fine, but not the subtitle (//div[contains(@class,'article_head')]//h2).

I’ve tried instead the more complex XPATH generated by
//div[contains(concat(' ',normalize-space(@class),' '),' article_head ')]//h2 | //div[@id='body']

Or the even simpler XPATH:
//div[@class="article_head multi_page"]//h2 | //div[@id='body']

It doesn’t make any difference: the subtitle is never included. Am I making a stupid mistake, and if so what is it?

Hi David,

Thanks for the detailed info.

I’ve looked into this, and I hope the response below helps. I should add that I haven’t tested my suggestions yet, so I can’t claim they work.

I think the issue here is the use of single_page_link. If the XPath expression you’ve listed here matches (having just looked at, it looks to me like it will), then Full-Text RSS will issue a new request for the URL matched. The site config file which held the single_page_link directive might no longer be the same site config file used to process the response Full-Text RSS gets after issuing the new request. Or if it is, the HTML returned in the new response might not match the XPath expressions in the site config file.

In this case it looks to me like it’s both these issues:

  1. The single_page_link expression returns a URL with - the site config file will no longer be used here as that only matches URLs beginning with or So you’ll need a new site config file called

  2. The XPath expression you’re using in the body directive is targeting the HTML of a regular article, not the HTML structure found on articles on You’ll notice on the subtitle is not marked up the same way it is on the main article page - so the second half of your XPath expression matches, but not the first.
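
A quick way to confirm point 2 is to run each half of the union on its own against the page that actually gets fetched. Below is a minimal sketch using Python's standard-library ElementTree. It only supports a subset of XPath (no `|` and no `contains()`), so the halves are checked separately and the class is matched exactly. Both HTML snippets are made-up stand-ins for illustration, not the real markup of either page.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified markup: NOT the real pages, just stand-ins.
REGULAR_PAGE = """
<html><body>
  <div class="article_head multi_page"><h2>Subtitle</h2></div>
  <div id="body"><p>Article text</p></div>
</body></html>
"""

SINGLE_PAGE = """
<html><body>
  <div id="article"><h2>Subtitle</h2></div>
  <div id="body"><p>Article text</p></div>
</body></html>
"""

# The two halves of the body expression, adapted to ElementTree's
# XPath subset (leading "." and an exact class match).
HALVES = [
    ".//div[@class='article_head multi_page']//h2",
    ".//div[@id='body']",
]


def count_matches(page, xpaths):
    """Return {xpath: number of matching elements} for each expression."""
    root = ET.fromstring(page)
    return {xp: len(root.findall(xp)) for xp in xpaths}


regular = count_matches(REGULAR_PAGE, HALVES)
single = count_matches(SINGLE_PAGE, HALVES)

for label, result in (("regular", regular), ("single page", single)):
    for xp, n in result.items():
        print(f"{label:12} {n} match(es) for {xp}")
```

With these stand-in pages, the subtitle half finds nothing on the second page while the body half still matches, which is the same symptom described above. Against real pages you would want a full XPath engine so the original expressions can be used unchanged.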

I don’t have time to submit a site config fix for this right now, so I hope this is enough for you to fix it yourself. Otherwise let us know and we’ll try to create these two site config files to handle this better.

And thanks for letting us know about the extraction issue with

I should also add that enabling debug mode (the debug checkbox on the form) will show you a lot of what’s happening behind the scenes. It’s often useful in catching things like this as it will show if the single page link expression matches and results in a new request, the new URL requested, the site config files used, etc…

Thank you. I got it to work. This is what I put in

strip: //div[@class='wptl btm']
body: //div[@id='article']//h2 | //div[@id='body']

You may want to make it clearer in the documentation that applies to both and but not to any subdomain. I read in the support centre an explanation that would apply to just for and not to its www subdomain so I wrongly assumed that if the config file did not start with a dot it applied to all subdomains.

(Point taken using the debug option, which I used on other feeds and should have used this time as well.)


Hi David, glad you got it to work. Thanks for the update.

And yeah, the site config file naming is very confusing, unfortunately. Might have been simpler having match and and all subdomains, unless a more specific site config file was found.

But just to clarify, the way things are now: will match but not and not will only match and
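
To make the naming rule a bit more concrete, here is a rough sketch of it in Python. This is only my reading of the rule as described in this thread, not the actual Full-Text RSS code, and real behaviour may differ between versions. The function name `config_matches` and the `example.org` hostnames are hypothetical, chosen purely for illustration.

```python
def config_matches(config_name, host):
    """Return True if a site config file name would apply to a hostname.

    Assumed rules (as read from this thread, not taken from the source code):
      - "example.org.txt"  applies to example.org and www.example.org only
      - ".example.org.txt" applies to other subdomains, but not to
        example.org or www.example.org
    """
    host = host.lower()
    pattern = config_name.lower()
    if pattern.endswith(".txt"):
        pattern = pattern[:-4]  # drop the file extension

    if pattern.startswith("."):
        bare = pattern[1:]
        # Leading dot: subdomains only, excluding the bare and www hosts.
        return host.endswith(pattern) and host not in (bare, "www." + bare)

    # No leading dot: the bare domain and its www form only.
    return host in (pattern, "www." + pattern)


for name in ("example.org.txt", ".example.org.txt"):
    for h in ("example.org", "www.example.org", "m.example.org"):
        print(f"{name:18} {h:16} {config_matches(name, h)}")
```

Under these assumed rules, a config file without the leading dot would never be picked up for a mobile subdomain, which is why a separate file was needed here.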

Oh, and thanks for the site config file for and changes to - we’ve added these to our Github repository now.