Apply autodetect_on_failure setting through the request


#1

Is it possible to set the autodetect_on_failure setting as a request parameter, or by some means other than a global or site-specific config file? The use case is that I want to first try to extract content with the default “yes” setting and, if extraction fails, try again with it turned off. I find that turning it off frequently cures extraction issues.


#2

I’m not sure I understand the use case. If extraction fails in the default mode (autodetect_on_failure is on), it will also fail when it’s off.

The thinking behind the autodetect_on_failure setting is that in some situations users will have very precise rules for what should be extracted from a given site - especially if that’s not a typical article site. In that situation, autodetect mode will not return anything useful for the user, so there needs to be a mechanism that forces Full-Text RSS to use only the extraction rules in the site config file, without attempting to detect the desired content by itself when those rules do not match.

But that doesn’t appear to be what you’re describing here.

I should add that you can pass autodetect_on_failure: no using the siteconfig request parameter. See https://help.fivefilters.org/full-text-rss/usage.html#_1-article-extraction
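As a rough illustration of the retry approach described in #1, the sketch below builds two request URLs: one with default behaviour and one that passes autodetect_on_failure: no through the siteconfig parameter. The host name, the makefulltextfeed.php endpoint path, and the helper name build_request_url are assumptions made for this example; only the siteconfig parameter and the rule text come from this thread. Check the usage docs linked above for the endpoint and parameters your installation actually supports.

```python
from urllib.parse import urlencode

# Hypothetical base URL of a self-hosted Full-Text RSS install (assumption).
BASE_URL = "https://ftr.example.com/makefulltextfeed.php"


def build_request_url(base_url, source_url, autodetect=True):
    """Build a request URL, optionally disabling autodetect for this request.

    When autodetect is False, the rule is sent via the siteconfig
    request parameter instead of a config file on the server.
    """
    params = {"url": source_url}
    if not autodetect:
        params["siteconfig"] = "autodetect_on_failure: no"
    # urlencode percent-encodes the colon and turns the space into "+".
    return base_url + "?" + urlencode(params)


feed = "http://www.ustrademonitor.com/feed/"
first_try = build_request_url(BASE_URL, feed)
second_try = build_request_url(BASE_URL, feed, autodetect=False)
print(first_try)
print(second_try)
```

A client would request first_try and fall back to second_try only if the first response contains no usable content.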


#3

Thanks for the tip about submitting rules in a request using siteconfig. I missed that in the docs. That works.
Regarding my use case, I have found that some of my feeds return “unable to retrieve…” unless I create a config for the domain and add autodetect_on_failure: no. Upon further investigation, I have found that I can remove some rules in my /custom/global.txt file and extract correctly. For example, this feed is one such case: http://www.ustrademonitor.com/feed/

A sample article from that feed:
https://www.ustrademonitor.com/2019/04/ustr-releases-annual-special-301-intellectual-property-report/

When I remove these lines from /custom/global.txt, it pulls fine:

strip_id_or_class: breadcrumb
strip_id_or_class: breadcrumbs

This happens even though there is no breadcrumb class in the source. But there is this body class: lxb_af-main-content-breadcrumbs-hide_breadcrumbs. Related?

I’m not sure whether these two lines are also the culprits for other feeds that behave the same way. I can test that.

So when setting autodetect_on_failure: no in a site config, does that also turn off rules in my /custom/global.txt? That would strongly indicate that the issue is related to rules there.

Thanks for the prompt response!


#4

At the moment strip_id_or_class will remove elements where breadcrumb appears anywhere in the class or id attribute value. So in your example above the second breadcrumbs entry is redundant. But we might change this in the future - there’s been talk that this isn’t obvious or particularly useful.
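To make the substring behaviour concrete, here is a small sketch of the matching rule as described above. It is not the actual Full-Text RSS implementation, and would_strip is a hypothetical helper written for this example; it only mimics “the pattern appears anywhere in the class or id value”. Under that rule, the body class quoted in #3 matches breadcrumb, which would presumably explain why the whole body was stripped.

```python
def would_strip(attr_value, patterns):
    """Return True if any pattern occurs anywhere in the attribute value.

    Mimics substring matching on a class or id attribute; this is an
    illustration, not the real Full-Text RSS code.
    """
    return any(pattern in attr_value for pattern in patterns)


# Body class taken from the sample article discussed in #3.
body_class = "lxb_af-main-content-breadcrumbs-hide_breadcrumbs"

# "breadcrumb" alone already matches, so "breadcrumbs" adds nothing.
print(would_strip(body_class, ["breadcrumb"]))
print(would_strip(body_class, ["breadcrumb", "breadcrumbs"]))
print(would_strip("main-content", ["breadcrumb", "breadcrumbs"]))
```

The first two calls match and the last one does not, which also shows why the second rule in global.txt is redundant.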

“So when setting autodetect_on_failure: no in a site config, does that also turn off rules in my /custom/global.txt? That would strongly indicate that the issue is related to rules there.”

Yes, if autodetect_on_failure is off, global.txt rules will not apply.