What’s the difference between
strip: //*[contains(text(),’-----’)]
and
strip: //p[contains(.,'-----')]
First of all, be careful copying and pasting from pages where the quotes get turned into fancy quotes, which won’t be recognised by XPath. I just noticed that in the first strip: line when pasting the above (with plain straight quotes it would read strip: //*[contains(text(),'-----')]), but I’m going to assume that’s not the problem you’re facing.
We use . over text() because the latter is too granular for much of what we do with Full-Text RSS. It selects text nodes. The problem is that you can’t be sure how the parser has constructed the DOM. So you can’t assume, when looking at <p>This is a sentence</p>, that it’s one element with a single text node inside. The text inside the <p> element might be constructed from more than one text node or a single text node. In either case, if you convert it back to XML/HTML, it’ll look the same. But how it’s represented in the DOM might not be as you expect.
Using . in the contains() function avoids that, because it treats everything as if it were a single string, regardless of how many text nodes the element contains.
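To see the difference concretely, here’s a quick sketch using Python’s lxml (purely for illustration; the XPath 1.0 behaviour is the same in any compliant implementation). A comment in the middle of the text is one easy way for a parser to end up with two text nodes where the serialised HTML looks like a single sentence:

```python
from lxml import etree

# The comment splits what looks like one sentence into two text nodes:
# 'This is a ' and 'sentence'.
p = etree.fromstring('<p>This is a <!-- comment -->sentence</p>')

print(p.xpath('text()'))
# ['This is a ', 'sentence']

# In XPath 1.0, contains(text(), ...) only tests the first text node,
# so this is False even though the sentence is obviously there.
print(p.xpath('contains(text(), "a sentence")'))   # False

# contains(., ...) tests the element's full string value, so this is True.
print(p.xpath('contains(., "a sentence")'))        # True
```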
It’s also useful in situations where the element you want to target (in this case, to remove) contains text which appears inside child elements. For example:
<div id="article">
<p>Para 1</p>
<p>Para 2</p>
<p>Para 3</p>
<p><strong>Advertising:</strong> Read this!</p>
</div>
If you use
strip: //p[contains(text(), "Advertising:")]
it won’t match, because there’s no single text node directly underneath the <p> element containing that text. But you can do:
strip: //p[contains(., "Advertising:")]
And it’ll treat everything contained within the <p> element as a string and test it against the given string.
You can experiment with this example on the XPath tester site: http://www.xpathtester.com/xpath/38a2b7b8abb9eb70d83a231559e8cfc2
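If you’d rather check this locally, here’s a rough equivalent of that test using Python’s lxml (again, just for illustration — any XPath 1.0 engine will give the same results):

```python
from lxml import html

snippet = '''
<div id="article">
<p>Para 1</p>
<p>Para 2</p>
<p>Para 3</p>
<p><strong>Advertising:</strong> Read this!</p>
</div>
'''
doc = html.fromstring(snippet)

# "Advertising:" only appears inside the <strong> child, so no text node
# directly under any <p> contains it - nothing is selected here.
print(doc.xpath('//p[contains(text(), "Advertising:")]'))
# []

# The last <p>'s string value is "Advertising: Read this!", so it matches.
print(doc.xpath('//p[contains(., "Advertising:")]'))
# [<Element p at 0x...>]
```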
If a website is using subdomains for sections, e.g. news.abcdef.com, sports.abcdef.com, politics.abcdef.com, then I save the file as .abcdef.com and that matches all the subdomains. Is that right?
Yes, but remember the ‘.txt’ at the end: save it as .abcdef.com.txt
It’s also important to remember that while this file will match those subdomains, it won’t match www.abcdef.com or abcdef.com - you’ll have to save a copy of the file as abcdef.com.txt for it to match those last two.
We clarified this recently here: GitHub - fivefilters/ftr-site-config: Site-specific article extraction rules to aid content extractors, feed readers, and 'read later' applications.
File naming
Use example.com.txt for example.com and www.example.com.
Use .example.com.txt for any subdomain of example.com (e.g. news.example.com, sports.example.com).
Use sport.example.com.txt to target just that sub-domain.
Note: .example.com.txt will not match www.example.com or example.com.
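To make those matching rules concrete, here’s a rough sketch in Python of which filenames would be tried for a given hostname. This is not the actual Full-Text RSS lookup code, and the real implementation may differ — it’s just an illustration of the behaviour described above (the exact-match file assumed to be tried before the wildcard one):

```python
# Rough sketch of the file-name matching rules described above.
# NOT the actual Full-Text RSS implementation - just an illustration.

def candidate_filenames(host: str) -> list[str]:
    """Return site config filenames to try for a hostname, most specific first."""
    host = host.lower()
    if host.startswith('www.'):
        host = host[4:]           # www.example.com is treated like example.com
    candidates = [host + '.txt']  # exact file, e.g. sport.example.com.txt
    parts = host.split('.')
    # Wildcard files like .example.com.txt cover any subdomain of example.com,
    # but are not tried for the bare domain or www.
    for i in range(1, len(parts) - 1):
        candidates.append('.' + '.'.join(parts[i:]) + '.txt')
    return candidates

print(candidate_filenames('news.abcdef.com'))
# ['news.abcdef.com.txt', '.abcdef.com.txt']
print(candidate_filenames('www.abcdef.com'))
# ['abcdef.com.txt']  <- .abcdef.com.txt is not tried here
```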