Processing backlogs

desk-user · March 23, 2015, 5:58pm

I have a ton of archived HTML files that I’d like to run through Full-Text RSS for content extraction only. How can I do this efficiently? The files are all on disk, and there’s no direct URL to hit (aside from the local pathnames). I started poking through the source and it looks like I could use the ContentExtractor class directly, but I see that makefulltextfeed.php does quite a bit more than just call process on the HTML.

Is there a blessed way to handle this type of task?

fivefilters · March 25, 2015, 3:59pm

Hi Joseph,

In Full-Text RSS 3.3 we introduced the inputhtml parameter. To use it, submit a POST request to the extract.php endpoint with the HTML you have, and Full-Text RSS will try to extract the content. You can try it out via Mashape if you want to test how well it’ll work: https://www.mashape.com/fivefilters/full-text-rss#!

Here’s the description for the inputhtml parameter:

If you already have the HTML, you can pass it here. We will not make any HTTP requests for the content if this parameter is used. Note: The input HTML should be UTF-8 encoded. And you will still need to give us the URL associated with the content (the URL may determine how the content is extracted, if we have extraction rules associated with it).

Note: you can submit a fake URL if you don’t have one, e.g. example.org, but the results might not be as good - especially if we have custom extraction rules associated with the sites your archived HTML came from.
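
For a backlog of files on disk, the POST request described above could be scripted roughly as follows. This is only a minimal sketch, not an official client: the endpoint address, the archive directory layout, and the helper names (build_request, extract, extract_all) are assumptions made for illustration. Only the extract.php endpoint and the url and inputhtml parameters come from the description above. The response is returned as raw text because its format is not covered here.

```python
import urllib.parse
import urllib.request
from pathlib import Path

# Assumption: a self-hosted Full-Text RSS install. Change this to your own location.
ENDPOINT = "http://localhost/full-text-rss/extract.php"


def build_request(html_path, source_url="example.org", endpoint=ENDPOINT,
                  encoding="utf-8"):
    """Build a POST request carrying one archived HTML file (hypothetical helper)."""
    # Decode using the file's own encoding; urlencode() percent-encodes the
    # text as UTF-8, which is what the inputhtml parameter expects.
    html = Path(html_path).read_text(encoding=encoding)
    body = urllib.parse.urlencode({"url": source_url, "inputhtml": html})
    return urllib.request.Request(endpoint, data=body.encode("ascii"), method="POST")


def extract(html_path, source_url="example.org"):
    """Send one file and return the raw response body as text."""
    with urllib.request.urlopen(build_request(html_path, source_url)) as resp:
        return resp.read().decode("utf-8")


def extract_all(directory, source_url="example.org"):
    """Yield (path, raw response) for every .html file in a directory."""
    for path in sorted(Path(directory).glob("*.html")):
        yield path, extract(path, source_url)
```

If you know which site a file was archived from, pass that site's real URL as source_url instead of the example.org placeholder, so that any site-specific extraction rules can apply. Files that are not stored as UTF-8 can be read by passing their actual encoding to build_request.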