What’s the difference between
strip: //*[contains(text(),’-----’)]
and
strip: //p[contains(.,'-----')]
First of all, be careful copying and pasting from pages where the quotes get turned into fancy quotes, which won’t be recognised by XPath. I just noticed that in the first strip: line when pasting the above (with plain straight quotes it would read strip: //*[contains(text(),'-----')]), but I’m going to assume that’s not the problem you’re facing.
We use . over text() because the latter is too granular for much of what we do with Full-Text RSS. It selects text nodes. The problem is that you can’t be sure how the parser has constructed the DOM. So you can’t assume, when looking at <p>This is a sentence</p>, that it’s one element with a single text node inside. The text inside the <p> element might be constructed from more than one text node or a single text node. In either case, if you convert it back to XML/HTML, it’ll look the same. But how it’s represented in the DOM might not be as you expect.
Using . in the contains() function avoids that, because it treats everything as if it were a single string, regardless of how many text nodes the element contains.
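To see the difference concretely, here’s a quick sketch using Python’s lxml (purely for illustration; the XPath 1.0 behaviour is the same in any compliant implementation). A comment in the middle of the text is one easy way for a parser to end up with two text nodes where the serialised HTML looks like a single sentence:

```python
from lxml import etree

# The comment splits what looks like one sentence into two text nodes:
# 'This is a ' and 'sentence'.
p = etree.fromstring('<p>This is a <!-- comment -->sentence</p>')

print(p.xpath('text()'))
# ['This is a ', 'sentence']

# In XPath 1.0, contains(text(), ...) only tests the first text node,
# so this is False even though the sentence is obviously there.
print(p.xpath('contains(text(), "a sentence")'))   # False

# contains(., ...) tests the element's full string value, so this is True.
print(p.xpath('contains(., "a sentence")'))        # True
```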
It’s also useful in situations where the element you want to target (in this case, to remove) contains text which appears inside child elements. For example:
<div id="article">
<p>Para 1</p>
<p>Para 2</p>
<p>Para 3</p>
<p><strong>Advertising:</strong> Read this!</p>
</div>
If you use
strip: //p[contains(text(), "Advertising:")]
it won’t match, because there’s no single text node directly underneath the <p> element containing that text. But you can do:
strip: //p[contains(., "Advertising:")]
And it’ll treat everything contained within the <p> element as a string and test it against the given string.
You can experiment with this example on the XPath tester site: http://www.xpathtester.com/xpath/38a2b7b8abb9eb70d83a231559e8cfc2
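If you’d rather check this locally, here’s a rough equivalent of that test using Python’s lxml (again, just for illustration — any XPath 1.0 engine will give the same results):

```python
from lxml import html

snippet = '''
<div id="article">
<p>Para 1</p>
<p>Para 2</p>
<p>Para 3</p>
<p><strong>Advertising:</strong> Read this!</p>
</div>
'''
doc = html.fromstring(snippet)

# "Advertising:" only appears inside the <strong> child, so no text node
# directly under any <p> contains it - nothing is selected here.
print(doc.xpath('//p[contains(text(), "Advertising:")]'))
# []

# The last <p>'s string value is "Advertising: Read this!", so it matches.
print(doc.xpath('//p[contains(., "Advertising:")]'))
# [<Element p at 0x...>]
```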
If a website is using subdomains for sections, e.g. news.abcdef.com, sports.abcdef.com, politics.abcdef.com, then I save the file as .abcdef.com and that matches all the subdomains. Is that right?
Yes, but remember the ‘.txt’ at the end: save it as .abcdef.com.txt
It’s also important to remember that while this file will match those subdomains, it won’t match www.abcdef.com or abcdef.com - you’ll have to save a copy of the file as abcdef.com.txt for it to match those last two.
We clarified this recently here: GitHub - fivefilters/ftr-site-config: Site-specific article extraction rules to aid content extractors, feed readers, and 'read later' applications.
File naming
Use example.com.txt for example.com and www.example.com.
Use .example.com.txt for any subdomain of example.com (e.g. news.example.com, sports.example.com).
Use sport.example.com.txt to target just that sub-domain.
Note: .example.com.txt will not match www.example.com or example.com.
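To make those matching rules concrete, here’s a rough sketch in Python of which filenames would be tried for a given hostname. This is not the actual Full-Text RSS lookup code, and the real implementation may differ — it’s just an illustration of the behaviour described above (the exact-match file assumed to be tried before the wildcard one):

```python
# Rough sketch of the file-name matching rules described above.
# NOT the actual Full-Text RSS implementation - just an illustration.

def candidate_filenames(host: str) -> list[str]:
    """Return site config filenames to try for a hostname, most specific first."""
    host = host.lower()
    if host.startswith('www.'):
        host = host[4:]           # www.example.com is treated like example.com
    candidates = [host + '.txt']  # exact file, e.g. sport.example.com.txt
    parts = host.split('.')
    # Wildcard files like .example.com.txt cover any subdomain of example.com,
    # but are not tried for the bare domain or www.
    for i in range(1, len(parts) - 1):
        candidates.append('.' + '.'.join(parts[i:]) + '.txt')
    return candidates

print(candidate_filenames('news.abcdef.com'))
# ['news.abcdef.com.txt', '.abcdef.com.txt']
print(candidate_filenames('www.abcdef.com'))
# ['abcdef.com.txt']  <- .abcdef.com.txt is not tried here
```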