What’s the difference between
First of all, be careful when copying and pasting from pages that turn straight quotes into fancy (curly) quotes, which XPath won’t recognise. I just noticed that in the first
strip:... line when pasting the above, but I’m going to assume that’s not the problem you’re facing.
We tend to use . rather than text() because the latter is too granular for much of what we do with Full-Text RSS. It selects text nodes, and the problem is that you can’t be sure how the parser has constructed the DOM. So when looking at
<p>This is a sentence</p> you can’t assume that it’s one element with a single text node inside. The text inside the
<p> element might be constructed from a single text node or from several. In either case, if you convert it back to XML/HTML it’ll look the same, but how it’s represented in the DOM might not be what you expect.
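To make the point concrete, here is a minimal sketch using Python's standard-library xml.dom.minidom (chosen only for illustration; it is not the parser Full-Text RSS uses). It builds the same element two ways and shows that the serialised markup is identical even though the number of text nodes differs.

```python
from xml.dom import minidom

# Case 1: parse the markup. For this short input minidom yields one text node.
parsed = minidom.parseString("<p>This is a sentence</p>").documentElement

# Case 2: build the same element by hand from two adjacent text nodes.
doc = minidom.Document()
built = doc.createElement("p")
built.appendChild(doc.createTextNode("This is "))
built.appendChild(doc.createTextNode("a sentence"))

# Both serialise to the same markup...
parsed_xml = parsed.toxml()
built_xml = built.toxml()

# ...but the DOM behind them is different.
parsed_count = len(parsed.childNodes)
built_count = len(built.childNodes)

print(parsed_xml, parsed_count)
print(built_xml, built_count)
```

An expression that depends on a particular text node can behave differently on these two trees, even though nothing in the HTML source tells them apart.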
Using . in XPath in the
contains() function avoids that, because it treats everything as if it were a single string, regardless of how many text nodes the element contains.
It’s also useful in situations when the element you want to target (in this case to remove) contains text which appears in child elements. For example:
<p><strong>Advertising:</strong> Read this!</p>
If you use
strip: //p[contains(text(), "Advertising:")]
it won’t match, because there’s no single text node directly underneath the
<p> element containing that text. But you can do:
strip: //p[contains(., "Advertising:")]
And it’ll treat everything contained within the
<p> element as a string and test it against the given string.
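The difference can be emulated without an XPath engine. The sketch below again uses Python's xml.dom.minidom; the helpers first_direct_text and string_value are hypothetical names written for this illustration. They mimic XPath 1.0 behaviour, where a node-set passed to contains() is reduced to the string-value of its first node, while . yields the concatenation of all descendant text.

```python
from xml.dom import minidom


def string_value(node):
    """Concatenate all descendant text, like '.' converted to a string in XPath."""
    if node.nodeType in (node.TEXT_NODE, node.CDATA_SECTION_NODE):
        return node.data
    return "".join(string_value(child) for child in node.childNodes)


def first_direct_text(element):
    """Return the first direct text-node child, which is what
    contains(text(), ...) ends up testing in XPath 1.0."""
    for child in element.childNodes:
        if child.nodeType == child.TEXT_NODE:
            return child.data
    return ""


p = minidom.parseString(
    "<p><strong>Advertising:</strong> Read this!</p>"
).documentElement

# Emulates //p[contains(text(), "Advertising:")]
matches_text = "Advertising:" in first_direct_text(p)

# Emulates //p[contains(., "Advertising:")]
matches_dot = "Advertising:" in string_value(p)

print(matches_text, matches_dot)
```

The only direct text node under the paragraph is " Read this!", so the text() version has nothing to match against, while the . version sees "Advertising: Read this!".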
You can experiment with this example on the XPath tester site: http://www.xpathtester.com/xpath/38a2b7b8abb9eb70d83a231559e8cfc2
If a website is using subdomains for sections, e.g. news.abcdef.com, sports.abcdef.com, politics.abcdef.com, then I save the file as .abcdef.com and that matches all the subdomains. Is that right?
Yes, but remember the ‘.txt’ at the end: save it as .abcdef.com.txt
It’s also important to remember that while this file will match those subdomains, it won’t match
abcdef.com - you’ll have to save a copy of the file as
abcdef.com.txt for it to match the last two.
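The naming rule described above can be sketched as a small function. This is only an illustration of the two cases mentioned in this thread, not Full-Text RSS's actual lookup code: config_matches is a hypothetical name, any special handling of www. is not modelled, and matching deeper subdomains (e.g. a.b.abcdef.com) against the leading-dot form is an assumption.

```python
def config_matches(filename: str, host: str) -> bool:
    """Return True if a site config file name would apply to host (sketch only)."""
    if not filename.endswith(".txt"):
        return False  # the '.txt' extension is required
    pattern = filename[: -len(".txt")]
    if pattern.startswith("."):
        # Leading dot: applies to subdomains, but not to the bare domain.
        return host.endswith(pattern) and len(host) > len(pattern)
    # No leading dot: applies to exactly this host.
    return host == pattern


for name in (".abcdef.com.txt", "abcdef.com.txt"):
    for host in ("news.abcdef.com", "sports.abcdef.com", "abcdef.com"):
        print(name, host, config_matches(name, host))
```

Under these assumptions a site needs both files if the same rules should apply to the bare domain and to its subdomains.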
We clarified this recently here: https://github.com/fivefilters/ftr-site-config#file-naming
sport.example.com.txt to target just that sub-domain:
.example.com.txt will not match