Can I remove all text after a certain point?

extractor_fan · May 31, 2022, 9:32am

In this Ticker Report article, "Madison Square Garden Entertainment Corp. (NYSE:MSGE) Stock Position Cut by Private Advisor Group LLC", the text ‘Featured Stories’ and the list of articles after it are all inside the div with itemProp = articleBody.

I want to extract text up to (but not including) ‘Featured Stories’. I’ve been messing around with the descendant axis but can’t figure it out. Is something like this even possible?

strip: //p[contains(.,'Featured Stories')]/descendant::*

or

strip: //*[contains(text(),'Featured Stories')]/descendant::*

fivefilters · May 31, 2022, 9:15pm

Hi there,

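I think what you’re after is following-sibling::*, which selects the sibling elements that come after the <p>Featured Stories</p> element. descendant::* only selects elements nested inside that <p>, so it never reaches the list that follows it.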

strip: //p[contains(.,'Featured Stories')]/following-sibling::*

See interactive example.

But while this will remove the following elements, it will leave the <p>Featured Stories</p> element in place.
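
If it helps to see that behaviour outside Full-Text RSS, here is a rough Python sketch using only the standard library. The sample markup and the function name strip_following_siblings are made up for illustration, and this is not how Full-Text RSS itself is implemented. ElementTree's XPath subset has no following-sibling axis, so the sketch walks the children by hand to get the same effect.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified stand-in for the article markup.
HTML = """
<div itemprop="articleBody">
  <p>First paragraph of the article.</p>
  <p>Second paragraph of the article.</p>
  <p>Featured Stories</p>
  <ul><li>Unrelated story A</li><li>Unrelated story B</li></ul>
</div>
"""


def strip_following_siblings(root, marker):
    """Mimic //p[contains(., marker)]/following-sibling::* by removing
    every element that comes after a matching <p> under the same parent."""
    for parent in list(root.iter()):
        children = list(parent)
        for index, child in enumerate(children):
            if child.tag == "p" and marker in "".join(child.itertext()):
                for later in children[index + 1:]:
                    parent.remove(later)
                break  # everything after the first match is already gone


root = ET.fromstring(HTML)
strip_following_siblings(root, "Featured Stories")

# The <ul> is gone, but the marker paragraph is still the last child.
remaining = [(el.tag, "".join(el.itertext()).strip()) for el in root]
print(remaining)
```

The last entry printed is still the ‘Featured Stories’ paragraph, which is the leftover described above.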

To remove that too, you can do something like this:

strip: //p[contains(.,'Featured Stories')] | //p[contains(.,'Featured Stories')]/following-sibling::*

See interactive example.

Full-Text RSS allows multiple strip XPath expressions and processes them in the order they appear, so you could also do something like this:

# First remove the elements that follow the <p> element:
strip: //p[contains(.,'Featured Stories')]/following-sibling::*
# Next remove the <p> element itself
strip: //p[contains(.,'Featured Stories')]
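
One thing to watch with the two-rule version is that the order matters. Here is a small Python sketch of why, assuming each strip rule is applied to the document as the previous rule left it. The markup and helper names (strip_following, strip_marker, run) are hypothetical stand-ins for the two rules, not Full-Text RSS code.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified article body.
HTML = (
    "<div><p>Body text.</p><p>Featured Stories</p>"
    "<p>Story A</p><p>Story B</p></div>"
)
MARKER = "Featured Stories"


def is_marker(el):
    return el.tag == "p" and MARKER in "".join(el.itertext())


def strip_following(root):
    # Stands in for: //p[contains(.,'Featured Stories')]/following-sibling::*
    for parent in list(root.iter()):
        children = list(parent)
        for index, child in enumerate(children):
            if is_marker(child):
                for later in children[index + 1:]:
                    parent.remove(later)
                break


def strip_marker(root):
    # Stands in for: //p[contains(.,'Featured Stories')]
    for parent in list(root.iter()):
        for child in list(parent):
            if is_marker(child):
                parent.remove(child)


def run(rules):
    """Apply the rules in order to a fresh copy of the sample document."""
    root = ET.fromstring(HTML)
    for rule in rules:
        rule(root)
    return ["".join(el.itertext()) for el in root]


siblings_first = run([strip_following, strip_marker])
marker_first = run([strip_marker, strip_following])
print(siblings_first)  # ['Body text.']
print(marker_first)    # ['Body text.', 'Story A', 'Story B']
```

If the <p> is removed first, the following-sibling rule has nothing left to match, so the trailing stories stay in place. That is why the sibling rule comes first above.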

Hope that’s some help.

extractor_fan · June 1, 2022, 2:49pm

Amazing! Thanks so much.