Looking at the generated HTML, I don’t see the img at all. I guess it is being removed by the stripping of //header shown in the debug information. I haven’t asked it to strip that! I have set prune: no but that didn’t help, and I can’t see any unstrip command.
You’re right, it’s getting stripped because the site is generated by WordPress, so Full-Text RSS loads and applies the WordPress.com site config file to do additional filtering. If you look further down in the debug info you should see something like this:
* Checking fingerprints...
* Found match: fingerprint.wordpress.com
* ... [snipped]
* Appending site config settings from fingerprint.wordpress.com (fingerprint match)
* ... [snipped]
* Appending site config settings from global.txt
The WordPress.com site config file contains the strip: //header rule, so that’s where that’s coming from.
So how can I prevent this stripping of //header so that the img can be included?
There are a few options you can try:
One option is to edit the wordpress.com.txt site config file to remove the header stripping. But at this point we’re not sure whether that would negatively affect other WordPress sites, so we wouldn’t advise it.
Another option is to use string replacement in your astickadogandaboxwithsomethinginit.com.txt site config file to rename the <header> tags so the strip rule from the WordPress site config file doesn’t remove the element:
replace_string(<header): <div
replace_string(</header): </div
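To see why the rename works, here is a rough Python sketch of the idea. It is not Full-Text RSS code. It assumes the string replacements run on the raw HTML before any strip rules are applied. The sample page markup and the helper names (`apply_replace_strings`, `strip_elements`) are made up for illustration.

```python
import xml.etree.ElementTree as ET


def apply_replace_strings(html, rules):
    """Apply plain find/replace pairs to the raw markup, in order."""
    for find, replace in rules:
        html = html.replace(find, replace)
    return html


def strip_elements(html, tag):
    """Remove every descendant element with the given tag name."""
    root = ET.fromstring(html)
    for parent in list(root.iter()):
        for child in list(parent):
            if child.tag == tag:
                parent.remove(child)
    return ET.tostring(root, encoding="unicode")


# Hypothetical page: the image we want to keep sits inside <header>.
PAGE = '<article><header><img src="cover.jpg" /></header><p>Body text</p></article>'

# Same pairs as the two replace_string lines above.
RULES = [("<header", "<div"), ("</header", "</div")]

# Without the rename, stripping header takes the image with it.
stripped_directly = strip_elements(PAGE, "header")

# With the rename, no header element is left for the strip rule to match.
renamed = apply_replace_strings(PAGE, RULES)
stripped_after_rename = strip_elements(renamed, "header")
```

In this sketch the image is lost in `stripped_directly` but survives in `stripped_after_rename`, because by the time the strip step runs the element is a div rather than a header.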
Another option is to tell Full-Text RSS not to load additional extraction rules (e.g. via the fingerprint method or the global site config file) when it’s processing pages from astickadogandaboxwithsomethinginit.com. To do that, you can add the following line to your site config file:
autodetect_on_failure: no
This is perhaps a little overkill for what you’re trying to achieve - it will also disable automatic article and title extraction if the XPath expressions in your site config file don’t match. So if the site gets redesigned, you will have to update the site config file to get results again.
We’ve written more about how the fingerprint/global site config files get applied in our documentation pages, under site patterns.
Hope that’s some help.