How to prevent the stripping of //header

Panda · June 3, 2021, 12:20am

I’m trying to write a config file to include the initial image in posts of Bill’s feed such as
http://www.astickadogandaboxwithsomethinginit.com/its-that-audiomo-time-of-the-year/

Using the debug option and after several tries (the experimental xpath generation tool was useful) I do have a body XPath that is matched:

* Stripping 1 elements (strip: //nav)
* Stripping 2 elements (strip: //header)
* Body matched
* ...XPath match: //div[@id='primary']//div[@class='entry-content'] | //img[contains(concat(' ',normalize-space(@class),' '),' attachment-featured-image ')]

However looking at the generated html I don’t see the img at all. I guess that it is stripped by the stripping of //header indicated in the debug information. I haven’t asked it to strip that! I have set prune: no but that didn’t help. I can’t see any unstrip command.

So how can I prevent this stripping of //header so that the img can be included? (Or am I on a completely wrong path to achieve this?)

fivefilters · June 3, 2021, 12:46pm

looking at the generated html I don’t see the img at all. I guess that it is stripped by the stripping of //header indicated in the debug information. I haven’t asked it to strip that! I have set prune: no but that didn’t help. I can’t see any unstrip command.

You’re right, it’s getting stripped because the site is generated by WordPress, so Full-Text RSS loads and applies the WordPress.com site config file to do additional filtering. If you look further down in the debug info you should see something like this:

* Checking fingerprints... 
* Found match: fingerprint.wordpress.com
* ... [snipped]
* Appending site config settings from fingerprint.wordpress.com (fingerprint match)
* ... [snipped]
* Appending site config settings from global.txt

The WordPress.com site config file contains the strip: //header rule, so that’s where that’s coming from.

So how can I prevent this stripping of //header so that the img can be included?

There are a few options you can try:

One option is to edit the wordpress.com.txt site config file to remove the header stripping. But at this point we’re not sure whether that would negatively affect other WordPress sites, so we wouldn’t advise it.

Another option is to use string replacement in your astickadogandaboxwithsomethinginit.com.txt site config file to rename the <header> tags so the strip rule from the WordPress site config file doesn’t remove the element:

replace_string(<header): <div
replace_string(</header): </div
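
To illustrate why the rename helps, here is a rough Python sketch of the idea. It is not how Full-Text RSS is implemented: the page markup is a made-up simplification, and rename_header_tags, strip_headers and image_survives are hypothetical helper names used only for this example. It assumes the string replacement runs on the raw HTML before any strip rules are applied.

```python
import xml.etree.ElementTree as ET

# Hypothetical, heavily simplified page markup (the real page is more complex).
PAGE = (
    "<html><body>"
    "<header><img class='attachment-featured-image' src='featured.jpg'/></header>"
    "<div id='primary'><div class='entry-content'><p>Post text</p></div></div>"
    "</body></html>"
)


def rename_header_tags(html: str) -> str:
    """Mimic the two replace_string rules with plain string replacement."""
    return html.replace("<header", "<div").replace("</header", "</div")


def strip_headers(root: ET.Element) -> int:
    """Mimic 'strip: //header' by removing every <header> element.

    Returns the number of elements removed.
    """
    removed = 0
    for parent in list(root.iter()):
        for child in list(parent):
            if child.tag == "header":
                parent.remove(child)
                removed += 1
    return removed


def image_survives(html: str) -> bool:
    """Parse the markup, apply the header strip, and check whether an <img> remains."""
    root = ET.fromstring(html)
    strip_headers(root)
    return root.find(".//img") is not None


print(image_survives(PAGE))                      # False: img removed along with <header>
print(image_survives(rename_header_tags(PAGE)))  # True: <header> became <div>, so nothing matched
```

Once the tags are renamed, the //header rule has nothing left to match, so the image stays in the document and the body XPath can pick it up.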

Another option is to tell Full-Text RSS not to load additional extraction rules (e.g. via the fingerprint method or the global site config file) when it’s processing pages from astickadogandaboxwithsomethinginit.com. To do that, you can add the following line to your site config file:

autodetect_on_failure: no

This is perhaps a little overkill for what you’re trying to achieve - it will also disable automatic article and title extraction if the XPath expressions in your site config file don’t match. So if the site gets redesigned, you will have to update the site config file to get results again.

We’ve written more about how the fingerprint/global site config files get applied in our documentation pages, under site patterns.

Hope that’s some help.

Panda · June 3, 2021, 1:09pm

Thank you for your detailed explanation and suggested options.

It does all make sense now. Your suggestion to replace the header tag with a div one worked first time.

It does indeed make much more sense to leave the handling of WP sites in general to you as you’ll most likely do a much better job of it.