Questions - Cache & Curl

Feeder · April 30, 2024, 7:39pm

I’m interested in purchasing Full-Text RSS, but I have a few questions.

Caching - Is caching performed at the feed (XML output) level or at the article level? I ask because I have a sizable feed that does not order its items, so I must process every entry each time. Does this tool support article-level caching (as opposed to caching only the generated feed output)? Article-level caching would avoid hammering the target web server with the same requests over and over.
Does this tool support curl-impersonate or similar? If not, is it constructed in such a way that I could modify the code to use command-line curl binaries?
Is there a docker-compose.yml for the latest version?

fivefilters · May 1, 2024, 6:26am

Hi there,

At the moment caching is performed for the feed output. If you are processing article URLs and not feeds, then it’s effectively caching each article. We’ll have more caching options in the next version of Full-Text RSS. For the kind of feed you describe, it’s probably best to sort and filter before processing with Full-Text RSS.
No, we don’t use curl-impersonate. The HTTP library uses either PHP’s HTTP extension, or curl, or file_get_contents as fallback. I’m not sure how easy it would be to adapt it to work with curl-impersonate (we don’t make command-line calls when making HTTP requests).
No Docker support yet, but it’s coming with the next version.
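
For readers wondering what article-level caching looks like in practice, here is a minimal Python sketch. It is not how Full-Text RSS works internally; the `ArticleCache` class, the `extract_all` helper, and the injected `fetch` callable are all hypothetical names used for illustration. The idea is simply to key the cache on the article URL, so an unordered feed can be reprocessed in full while only unseen (or expired) articles trigger a request.

```python
import hashlib
import json
import time
from pathlib import Path
from typing import Callable, Dict, Iterable, Optional


class ArticleCache:
    """File-backed cache keyed by article URL (hypothetical helper)."""

    def __init__(self, directory: str, ttl_seconds: float = 86400.0) -> None:
        self.directory = Path(directory)
        self.directory.mkdir(parents=True, exist_ok=True)
        self.ttl_seconds = ttl_seconds

    def _path(self, url: str) -> Path:
        # Hash the URL so any URL maps to a safe, fixed-length file name.
        digest = hashlib.sha256(url.encode("utf-8")).hexdigest()
        return self.directory / (digest + ".json")

    def get(self, url: str) -> Optional[str]:
        path = self._path(url)
        if not path.exists():
            return None
        entry = json.loads(path.read_text(encoding="utf-8"))
        # Treat entries older than the TTL as missing.
        if time.time() - entry["stored_at"] > self.ttl_seconds:
            return None
        return entry["content"]

    def put(self, url: str, content: str) -> None:
        entry = {"url": url, "stored_at": time.time(), "content": content}
        self._path(url).write_text(json.dumps(entry), encoding="utf-8")


def extract_all(
    urls: Iterable[str],
    cache: ArticleCache,
    fetch: Callable[[str], str],
) -> Dict[str, str]:
    """Return content for every URL, calling fetch only on cache misses."""
    results: Dict[str, str] = {}
    for url in urls:
        content = cache.get(url)
        if content is None:
            content = fetch(url)  # the only place a network request would happen
            cache.put(url, content)
        results[url] = content
    return results
```

With this shape, the order of items in the feed does not matter: every entry is looked up on each run, but the target web server is contacted only for articles that are not already cached.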

Feeder · May 12, 2024, 10:16pm

Thanks for the feedback! I ended up purchasing a copy of this software for personal use to show my support. It’s clear much thought and effort went into developing this over the years.

It took some time, but I was finally able to build ‘my’ optimal experience by combining a number of open-source projects – leaving this here in case anyone is interested:

RSS-Bridge - Used this as the base, for the following reasons:

Excellent caching interface, so that I could do article and/or feed level caching in any way I choose.
Easy-to-use frontend, especially suited for feeds. The framework provides a common frontend while letting me customize the inputs (e.g., checkboxes, form fields).
Ability to create RSS feeds for sites that don’t have any – the application exposes interfaces that can be used to do so.
Ability to modify existing RSS feeds in any way that you like using the interfaces provided.

Curl-Impersonate - For me, this was quite important for accessing sites that have some level of bot protection.

Full-Text RSS - For content extraction.

Wonderfully written and capable of extracting full text content from raw HTML.
An amazing and thoughtful design that allows config-files per site.

Headless Browser - For extracting HTML from dynamic (JavaScript-rendered) websites. This is resource intensive, as we know, but it is my fallback when everything else fails.

By combining the best capabilities of each project above and layering code between them, I ended up with my near-perfect solution. Once AI models are tuned to run on low-end hardware, I will look to pull those in as well.
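
The layering described above, where cheaper fetch methods are tried first and the headless browser is the last resort, could be sketched roughly as follows in Python. This is an illustrative sketch, not the actual setup: the function names are hypothetical, the `curl_chrome116` wrapper name is an assumption about how curl-impersonate is installed locally, and the headless-browser step is left as a callable supplied by the caller.

```python
import subprocess
from typing import Callable, List, Optional, Sequence, Tuple

# A fetcher takes a URL and returns HTML, or None if it could not get any.
Fetcher = Callable[[str], Optional[str]]


def curl_command(url: str, binary: str = "curl_chrome116") -> List[str]:
    """Build the argument list for a command-line curl binary.

    The default binary name is an assumption; substitute whichever
    curl or curl-impersonate wrapper is installed on the system.
    """
    return [binary, "--silent", "--show-error", "--location",
            "--fail", "--max-time", "30", url]


def fetch_with_curl_binary(url: str, binary: str = "curl_chrome116") -> Optional[str]:
    """Fetch a page by shelling out to curl; return None on any failure."""
    try:
        completed = subprocess.run(
            curl_command(url, binary),
            capture_output=True,
            encoding="utf-8",
            errors="replace",
            timeout=60,
            check=False,
        )
    except (OSError, subprocess.TimeoutExpired):
        # Binary missing, not executable, or the call took too long.
        return None
    if completed.returncode != 0 or not completed.stdout.strip():
        return None
    return completed.stdout


def fetch_with_fallbacks(
    url: str,
    fetchers: Sequence[Tuple[str, Fetcher]],
) -> Tuple[Optional[str], Optional[str]]:
    """Try each (name, fetcher) pair in order; stop at the first that returns HTML."""
    for name, fetch in fetchers:
        html = fetch(url)
        if html:
            return name, html
    return None, None
```

A caller would pass the fetchers in order of cost, for example `[("curl", fetch_with_curl_binary), ("headless", my_headless_fetch)]`, where `my_headless_fetch` is a placeholder for whatever browser automation is in use. The returned HTML could then be handed to a content extractor such as Full-Text RSS.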

Thanks!

fivefilters · May 28, 2024, 1:30pm

Thank you for this. Nice to hear that you’ve combined these tools to get the results you want!