Combining two XPATHs in body: (site configuration for theregister.co.uk)

desk-user · July 6, 2015, 4:26pm

Hi,

I’ve been trying to edit a custom site config for theregister.co.uk and the body: directive is not working. Here’s my config:

single_page_link: //link[contains(@href, 'm.theregister')]
strip: //div[@class='wptl btm']
body: //div[contains(@class,'article_head')]//h2 | //div[@id='body']

The strip is one I’d recommend adding to the site config in the next release of FTR. The single_page_link I haven’t yet had occasion to test (but the Reg doesn’t seem to use /PRINT/ anymore, so the default site config for it will need updating anyway). The body is what is not working: it gets the body of the articles (//div[@id='body']) fine, but not the subtitle (//div[contains(@class,'article_head')]//h2).

I’ve tried instead the more complex XPATH generated by http://siteconfig.fivefilters.org:
//div[contains(concat(' ',normalize-space(@class),' '),' article_head ')]//h2 | //div[@id='body']

Or the even simpler XPATH:
//div[@class="article_head multi_page"]//h2 | //div[@id='body']

It doesn’t make any difference, the subtitle is never included. Am I making a stupid mistake, and if so what is it?

fivefilters · July 6, 2015, 9:17pm

Hi David,

Thanks for the detailed info.

I’ve looked into this, and I hope the response below helps. I should add that I haven’t tested my suggestions yet, so I can’t claim they work.

I think the issue here is the use of single_page_link. If the XPath expression you’ve listed here matches (having just looked at theregister.co.uk, it looks to me like it will), then Full-Text RSS will issue a new request for the matched URL. The site config file that held the single_page_link directive might not be the one used to process the response to that new request. And even if it is, the HTML returned in the new response might not match the XPath expressions in the site config file.

In this case it looks to me like it’s both these issues:

1. The single_page_link expression returns a URL with m.theregister.co.uk - the site config file theregister.co.uk.txt will no longer be used here as that only matches URLs beginning with theregister.co.uk or www.theregister.co.uk. So you’ll need a new site config file called m.theregister.co.uk.txt
2. The XPath expression you’re using in the body directive is targeting the HTML of a regular theregister.co.uk article, not the HTML structure found on articles on m.theregister.co.uk. You’ll notice on m.theregister.co.uk the subtitle is not marked up the same way it is on the main article page - so the second half of your XPath expression matches, but not the first.
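
To illustrate that last point, here is a rough Python sketch of why one half of a union expression can match while the other silently returns nothing. The two page snippets are hypothetical, heavily simplified stand-ins, not the real theregister.co.uk markup. Python's built-in ElementTree supports neither the | union operator nor contains(), so the sketch evaluates each half separately and uses an exact class match instead.

```python
import xml.etree.ElementTree as ET

# Hypothetical, heavily simplified markup: NOT the real theregister.co.uk HTML.
DESKTOP_PAGE = """<html><body>
  <div class="article_head"><h2>Subtitle</h2></div>
  <div id="body"><p>Article text</p></div>
</body></html>"""

MOBILE_PAGE = """<html><body>
  <div id="article">
    <h2>Subtitle</h2>
    <div id="body"><p>Article text</p></div>
  </div>
</body></html>"""

# Each half of the union, written in ElementTree's limited XPath subset.
SUBTITLE_DESKTOP = ".//div[@class='article_head']//h2"
SUBTITLE_MOBILE = ".//div[@id='article']//h2"
BODY = ".//div[@id='body']"


def count_matches(markup, xpath):
    """Return how many elements in the markup the XPath expression selects."""
    root = ET.fromstring(markup)
    return len(root.findall(xpath))


for name, page in (("desktop", DESKTOP_PAGE), ("mobile", MOBILE_PAGE)):
    counts = {xp: count_matches(page, xp)
              for xp in (SUBTITLE_DESKTOP, SUBTITLE_MOBILE, BODY)}
    print(name, counts)
```

In this toy setup the desktop subtitle expression finds nothing on the mobile-style page while the body expression still matches, which is the same symptom: a union never fails outright, it just returns whatever its matching halves select.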

I don’t have time to submit a site config fix for this right now, so I hope this is enough for you to fix it yourself. Otherwise let us know and we’ll try to create these two site config files to handle this better.

And thanks for letting us know about the extraction issue with theregister.co.uk

fivefilters · July 6, 2015, 9:23pm

I should also add that enabling debug mode (the debug checkbox on the form) will show you a lot of what’s happening behind the scenes. It’s often useful in catching things like this as it will show if the single page link expression matches and results in a new request, the new URL requested, the site config files used, etc…

desk-user · July 6, 2015, 9:49pm

Thank you. I got it to work. This is what I put in m.theregister.co.uk.txt:

strip: //div[@class='wptl btm']
body: //div[@id='article']//h2 | //div[@id='body']

You may want to make it clearer in the documentation that theregister.co.uk.txt applies to both theregister.co.uk and www.theregister.co.uk but not to any other subdomain. I read an explanation in the support centre that .theregister.co.uk.txt would apply just to theregister.co.uk and not to its www subdomain, so I wrongly assumed that a config file name not starting with a dot applied to all subdomains.

(Point taken using the debug option, which I used on other feeds and should have used this time as well.)

David

fivefilters · July 6, 2015, 9:57pm

Hi David, glad you got it to work. Thanks for the update.

And yeah, the site config file naming is very confusing, unfortunately. It might have been simpler to have example.org.txt match example.org and all subdomains, unless a more specific site config file was found.

But just to clarify, the way things are now:

.example.com.txt will match anything.example.com but not www.example.com and not example.com

example.com.txt will only match example.com and www.example.com
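
As a rough illustration of those two rules, here is a small Python sketch. It is not how Full-Text RSS actually selects site config files; the function name config_matches is made up for this example, and treating deeper subdomains such as a.b.example.com as matches for the dotted form is an assumption.

```python
def config_matches(config_name, host):
    """Apply the naming rules described above (illustrative only).

    config_name is a site config file name such as 'example.com.txt';
    host is the hostname of the URL being processed.
    """
    pattern = config_name.lower()
    if pattern.endswith(".txt"):
        pattern = pattern[:-4]
    host = host.lower()

    if pattern.startswith("."):
        # Dotted form: any subdomain except www, and never the bare domain.
        # Assumption: deeper subdomains (a.b.example.com) also match.
        base = pattern[1:]
        return host.endswith("." + base) and host != "www." + base

    # Plain form: only the exact host or its www variant.
    return host in (pattern, "www." + pattern)


for name, host in [
    ("theregister.co.uk.txt", "www.theregister.co.uk"),
    ("theregister.co.uk.txt", "m.theregister.co.uk"),
    ("m.theregister.co.uk.txt", "m.theregister.co.uk"),
    (".example.com.txt", "anything.example.com"),
    (".example.com.txt", "www.example.com"),
]:
    print(name, host, config_matches(name, host))
```

Under these rules theregister.co.uk.txt does not cover m.theregister.co.uk, which is why the separate m.theregister.co.uk.txt file was needed.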

fivefilters · July 6, 2015, 10:09pm

Oh, and thanks for the site config file for m.theregister.co.uk and changes to theregister.co.uk - we’ve added these to our GitHub repository now.