For this article, Madison Square Garden Entertainment Corp. (NYSE:MSGE) Stock Position Cut by Private Advisor Group LLC - Ticker Report, the text ‘Featured Stories’ and the list of articles after it are all inside the div with itemProp = articleBody.
I want to extract text up to (but not including) ‘Featured Stories’. I’ve been messing around with descendant but can’t figure it out. Is something like this even possible?
strip: //p[contains(.,'Featured Stories')]/descendant::*
or
strip: //*[contains(text(),'Featured Stories')]/descendant::*
Hi there,
I think what you’re after is following-sibling::*, to select the elements that follow the <p>Featured Stories</p> element.
strip: //p[contains(.,'Featured Stories')]/following-sibling::*
See interactive example.
But while this will remove the following elements, it will leave the <p>Featured Stories</p> element in place.
To remove that too, you can do something like this:
strip: //p[contains(.,'Featured Stories')] | //p[contains(.,'Featured Stories')]/following-sibling::*
See interactive example.
Full-Text RSS allows multiple strip XPath expressions and processes them in the order they appear, so you could also do something like this:
# First remove the elements that follow the <p> element:
strip: //p[contains(.,'Featured Stories')]/following-sibling::*
# Next remove the <p> element itself
strip: //p[contains(.,'Featured Stories')]
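If the phrase ‘Featured Stories’ might also turn up in a paragraph elsewhere on the page, the same two rules can be limited to the article body. This is only a sketch: it assumes the container can be matched as //div[@itemprop='articleBody'] (the attribute name may need different casing depending on how the HTML is parsed) and that the <p> is a direct child of that div.

```
# Assumption: the article container matches @itemprop='articleBody'
# and the 'Featured Stories' <p> sits directly inside it.
# First remove the elements that follow the <p> element:
strip: //div[@itemprop='articleBody']/p[contains(.,'Featured Stories')]/following-sibling::*
# Next remove the <p> element itself
strip: //div[@itemprop='articleBody']/p[contains(.,'Featured Stories')]
```

The order matters in the same way as above: the siblings are removed while the <p> is still there to anchor the expression.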
Hope that’s some help.