Full-Text RSS always skipping images

I tested many feeds and 75% of them return no images… Could you please provide complete documentation for writing the config file for each website?

Hi there,

By default there is a cleanup (pruning) process which runs on the identified content block. Depending on how the images are marked up in HTML, they may be flagged as unrelated to the content and stripped.

The simplest way to avoid this is to add the following rule to the appropriate site config file:

prune: no

Sometimes that line on its own is enough to preserve the images (Full-Text RSS will try to detect the content block but will not clean it up).

Hope that helps. Let me know if you still have trouble.

Thanks for your answer…
Adding prune: no just shows the text with its original styling, but the problem is that the image is not within the text content. Here is an example:


Content Text... Content Text... Content Text... Content Text... ------------------------

Thanks for the support.

hypercyber

Hi again,

If the image is not within the main content block, you will have to create a site config file telling Full-Text RSS what to extract. There are two examples in our site patterns documentation: http://help.fivefilters.org/customer/portal/articles/223153-site-patterns

For example, to get it to extract two div elements (these can be anywhere in the document):

body: //div[@id='image'] | //div[@id='content']

In these cases it’s a good idea to tell it not to do automatic cleanup (prune: no) to ensure nothing mysteriously disappears.
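As a rough sketch, the two rules could sit together in one site config file like this. The ids image and content are just the placeholder values from the example above, not the ids of any real page, and the # comment line assumes the usual site config comment syntax:

```
# Hypothetical site config: the ids below are placeholders, replace them with the ids used on the page you are extracting from
body: //div[@id='image'] | //div[@id='content']
prune: no
```

The | in the body rule is the XPath union operator, so both div elements are selected wherever they appear in the document.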

We’ll be putting up more detailed examples for common extraction issues that have been reported to us soon.

If you’ve purchased Full-Text RSS from us and have not yet requested your free site config file, you can email us details of the page you’re trying to extract from and we’ll write one for you.

I sent an email but still got no answer…

hypercyber

Should have received a reply now.

Thanks a lot for the site config.

Hypercyber

I’m using the script now, without any specific site config. I just use the script, and it extracts full texts pretty well.

So, since I have the same image extraction problem, should I create a config file for a given website and put just prune: no inside it?

I don’t want to customize the script for every website I crawl. I just want to set the script to get images and let it keep using its default settings to extract the full text.

So, for any site, I just create a file like afrikeo.txt (or whatever) and put prune: no inside?

Nino

Hi Nino,

For many sites, images will be preserved even with pruning enabled. However, pruning can sometimes remove relevant images (depending on the structure of the HTML page).

If the image is inside the detected content block (the HTML element we extract) but it’s not appearing in the Full-Text RSS output, then the prune: no line on its own should ensure the image is preserved. And if that’s the case, you can create a custom site config file with just that line. So if it’s example.org you’re extracting from, the custom config file should be called example.org.txt and placed inside the site_config/custom/ folder.
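A minimal sketch of such a file, assuming the site is example.org (the # line is only a comment, and assumes the usual site config comment syntax):

```
# Saved as site_config/custom/example.org.txt
prune: no
```

With only this line in place, Full-Text RSS still tries to detect the content block automatically; it just skips the cleanup step.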

There is currently no way to disable pruning by default. So if images are removed due to pruning, you’ll have to disable it on a site-by-site basis.

Of course, disabling pruning will not help if the image itself is not within the content block we extract. For these cases you’ll need to specify which element(s) should be extracted (see the earlier reply for an example).

Hope that’s some help.

Many thanks for this post… it is very useful.

adnan