Does "Refresh Feed" interact with "Clean HTML" and "Extract Full Text"?

mcw · March 5, 2023, 5:07pm

I am trying to use the Webpage to RSS feature in Feed Control to generate feeds of image galleries. The typical case is a main webpage that hosts the most current gallery, where I try to scrape all the images on that page into the item description (since Feed Creator only pulls the first matching image, not all of them, for the Image selector).

Another use case is where the site lists its galleries as links. In that case, the link would be the item selector and I would use Extract Full Text.

In both of these cases, Clean HTML is getting enabled by default when I go from the Preview to setting up the feed in Feed Control. I can’t toggle that setting off until I save the feed, then edit the feed settings.

Clean HTML seems to be stripping out the images in both of these cases, giving me what little text content there is on the gallery page itself. That’s a bummer.

When I go to edit the feed and turn Clean HTML off, then Save, and hit Refresh Feed, it seems not to do anything. It doesn’t replace the stripped items that were initially generated with new, unstripped, uncleaned items.

Is this intentional? The help articles I’ve found seem to say that other features (enabling RSS, testing webhooks, wanting to test for new items that have been added recently, etc) will cause the feed to refresh itself. This page implies that it should work after editing any feed settings.

I would love for this to work with Clean HTML and Extract Full Text so that I can see what the feed content will be at the time I’m building and previewing it.

Please let me know if I’m mistaken, and there’s a way to do this after all?

fivefilters · March 5, 2023, 5:51pm

At the moment, “Clean HTML” and “Extract Full Text” are options that are checked when a new item is being pulled in by Feed Control. If you change these options, you will have to delete existing feed items and then refresh the feed (so Feed Control pulls in the items again and applies the new settings). We’ll try to make this easier to understand in a future update, or perhaps improve the processing so it happens automatically when you change the settings.
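As a rough illustration of why a refresh alone leaves existing items unchanged, here is a toy Python model. It is not Feed Control's actual implementation: the `ToyFeed` class, its method names, and the regex standing in for "Clean HTML" are all hypothetical, chosen only to show settings being applied at the moment an item is first pulled in.

```python
import re
from dataclasses import dataclass, field


@dataclass
class ToyFeed:
    """Hypothetical feed whose settings apply only when an item is first ingested."""
    clean_html: bool = True
    items: dict = field(default_factory=dict)  # item URL -> stored description

    def refresh(self, source: dict) -> None:
        """Pull in items from `source`, skipping any URL that is already stored."""
        for url, html in source.items():
            if url in self.items:
                continue  # existing items are not reprocessed
            self.items[url] = self._process(html)

    def delete_items(self) -> None:
        """Forget stored items so the next refresh ingests them again."""
        self.items.clear()

    def _process(self, html: str) -> str:
        if self.clean_html:
            # Crude stand-in for cleaning (an assumption): drop <img> tags.
            return re.sub(r"<img\b[^>]*>", "", html)
        return html


url = "https://example.com/gallery"
source = {url: '<p>Gallery</p><img src="a.jpg">'}

feed = ToyFeed(clean_html=True)
feed.refresh(source)
stripped = feed.items[url]

feed.clean_html = False
feed.refresh(source)  # no effect: the item already exists
after_refresh_only = feed.items[url]

feed.delete_items()
feed.refresh(source)  # item is ingested again with the new setting
after_delete_and_refresh = feed.items[url]
```

In this model, changing `clean_html` and refreshing leaves the stored item stripped; only deleting the items first lets the new setting take effect.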

When it comes to missing images, it’s a good idea to test the page URL with Full-Text RSS first to see how well it extracts the content you’re after. Our article extractor works better on web articles than other types of content, so if there are missing images, it might be because of that rather than the Clean HTML settings.
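If you want to check this outside the browser, one option is to compare the images in the original page HTML with those in the extracted content. The following is a minimal sketch using only Python's standard library; the function names `image_sources` and `missing_images` are made up for this example, and how you obtain the two HTML strings is left to you.

```python
from html.parser import HTMLParser


class ImageCollector(HTMLParser):
    """Collects the src attribute of every <img> tag it sees."""

    def __init__(self):
        super().__init__()
        self.sources = []

    def handle_starttag(self, tag, attrs):
        # Also called for self-closing tags such as <img ... />.
        if tag == "img":
            src = dict(attrs).get("src")
            if src:
                self.sources.append(src)


def image_sources(html: str) -> list:
    """Return the src values of all <img> tags in `html`, in document order."""
    parser = ImageCollector()
    parser.feed(html)
    parser.close()
    return parser.sources


def missing_images(original_html: str, extracted_html: str) -> list:
    """Return image sources present in the original but absent after extraction."""
    extracted = set(image_sources(extracted_html))
    return [src for src in image_sources(original_html) if src not in extracted]


original = '<div><img src="a.jpg"><img src="b.jpg" /></div>'
extracted = '<p>Some text</p><img src="a.jpg">'
lost = missing_images(original, extracted)
```

Note that this compares `src` values literally, so it will report false positives if the extractor rewrites relative image URLs to absolute ones.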

If it is Full-Text RSS that’s not returning the images, we might be able to improve extraction by adding site-specific extraction rules, but we’ll need the page URL to see if that’s possible.

You’ll have to delete the existing fetched feed items (“Delete feed items” in the actions menu), and then refresh the feed.

mcw · March 5, 2023, 6:27pm

Thanks for the quick reply!

“Clean HTML” and “Extract Full Text” are options that are checked when a new item is being pulled in by Feed Control. If you change these options, you will have to delete existing feed items and then refresh the feed (so Feed Control pulls in the items again and applies the new settings).

Aha! I thought “deleting items” would mean it would never re-index those items, not just “reset”. I’ll try that!

… it’s a good idea to test the page URL with Full-Text RSS first to see how well it extracts the content you’re after. Our article extractor works better on web articles than other types of content, so if there are missing images, it might be because of that rather than the Clean HTML settings.

How do you do this? I don’t see an option anywhere in Feed Control prior to preview?

fivefilters · March 5, 2023, 8:01pm

You can test article URLs using Full-Text RSS here: http://ftr.fivefilters.org

1. Enter the article URL in the first field
2. Click ‘Create feed’ to see results
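If you have several gallery URLs to test, it may be quicker to build the test links yourself. The sketch below assumes the form submits to a `makefulltextfeed.php` endpoint that takes the article address in a `url` query parameter; that endpoint name and parameter are assumptions here, not something confirmed in this thread, so check them against the form before relying on it. The helper name `full_text_test_url` is made up for this example.

```python
from urllib.parse import urlencode

FTR_BASE = "http://ftr.fivefilters.org"
# Assumption: endpoint name and "url" parameter; verify against the form.
ENDPOINT = "makefulltextfeed.php"


def full_text_test_url(article_url: str, base: str = FTR_BASE) -> str:
    """Build a link that asks Full-Text RSS to extract `article_url`."""
    query = urlencode({"url": article_url})  # percent-encodes the article URL
    return f"{base.rstrip('/')}/{ENDPOINT}?{query}"


test_link = full_text_test_url("https://example.com/gallery?id=1")
print(test_link)
```

Opening the printed link in a browser should be equivalent to filling in the first field and clicking ‘Create feed’, if the assumption about the endpoint holds.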