[unable to retrieve full-text content] for many sites

cxy007 · December 10, 2019, 4:14pm

How do I use these? Most of the sites I want show this error. For example:
https://www.kens5.com/article/news/local/police-standoff/273-7b2ee18d-0bfc-412d-a0bb-5240b8c9714c
https://www.kark.com/news/local-news/hot-springs-man-arrested-in-connection-to-sunday-morning-apartment-shooting/
https://www.easttexasmatters.com/crime/houston-man-arrested-in-nacogdoches-with-10-pounds-of-cocaine-two-stolen-guns/
https://13wham.com/news/local/sunday-night-shooting-in-rochester

Do I need to change some config to fix this? I am using the API calls for now. Do I need to host my own copy to make it work? Thanks!

fivefilters · December 10, 2019, 11:45pm

Some of these sites appear to be blocking access to EU users and servers.

For developers we do offer a hosted Full-Text RSS service running on servers outside the EU: https://rapidapi.com/fivefilters/api/full-text-rss-us

If you’re running it yourself or are thinking about hosting it yourself, you should install Full-Text RSS on servers in the US or Canada and some of those access denied messages will go away.
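For anyone calling a self-hosted copy from code, here is a minimal sketch of building the request URL. It assumes the install exposes makefulltextfeed.php at its root and takes the article address in a `url` query parameter; the base URL below is a made-up placeholder and `build_request_url` is just an illustrative helper name, so check both against your own installation.

```python
from urllib.parse import urlencode


def build_request_url(base_url: str, article_url: str) -> str:
    """Return a makefulltextfeed.php request URL for a single article.

    Assumption: the article address is passed in the `url` query parameter.
    """
    endpoint = base_url.rstrip("/") + "/makefulltextfeed.php"
    # urlencode percent-escapes the article URL so it survives as one parameter
    return endpoint + "?" + urlencode({"url": article_url})


# Hypothetical base URL: replace with wherever Full-Text RSS is installed
BASE_URL = "https://example.com/full-text-rss"

print(build_request_url(
    BASE_URL,
    "https://13wham.com/news/local/sunday-night-shooting-in-rochester",
))
```

Fetching the resulting URL from a US or Canadian server is what decides whether the geo-blocked sites respond; the URL itself is the same either way.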

cxy007 · December 11, 2019, 12:49am

Thank you very much. I am in the US and will try your new endpoint. Quick question: what usually causes parsing to fail? What percentage of sites block this server? If we host our own, is there some config we can use to enable more sites manually?
Thanks again!

fivefilters · December 11, 2019, 4:48pm

There can be any number of reasons why we cannot extract content from a given web page. It might have to do with not being able to access the server (either because it’s down or because it blocks our request). It might be content that’s behind a paywall or a GDPR/cookie wall. It might be that we get a response but the way the content is presented results in bad article extraction.

In some cases we can get around these with site config files, but in others we simply can’t.

We don’t have any numbers on who blocks the servers we use. In some of the examples above, where the site has chosen to block access to EU IPs, you should find that running the software on a US server will get you the content you need. But we encourage you to test for yourself with the sites that matter to you.
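One rough way to run that kind of test is to collect the extracted text for each site you care about and flag the ones that came back with the error from the thread title. This is only a sketch: it assumes the marker string appears verbatim in the returned content, and `find_failed` and the sample data are made up for illustration.

```python
from typing import Dict, List

# Error text from the thread title; assumed to appear verbatim in failed results
FAILURE_MARKER = "[unable to retrieve full-text content]"


def find_failed(results: Dict[str, str]) -> List[str]:
    """Return the URLs whose extracted content contains the failure marker."""
    return [url for url, content in results.items() if FAILURE_MARKER in content]


# Made-up sample data: article URL -> content returned for it
sample = {
    "https://example.com/story-a": "Police said the standoff ended peacefully.",
    "https://example.com/story-b": "[unable to retrieve full-text content]",
}

print(find_failed(sample))
```

A failed URL here only tells you extraction did not work, not why. Blocking, paywalls, cookie walls and bad extraction can all look the same from the outside.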

cxy007 · December 11, 2019, 6:13pm

Thank you very much for the tips. I just purchased the self-hosted version and set it up on my site successfully. I have one more question: when it gets data for the description and some other fields, it truncates at a certain length and shows "..." at the end, even though the raw data is there and longer. Can I configure it to use a longer length instead of truncating?
Thanks!

fivefilters · December 11, 2019, 6:49pm

We don’t currently let you set that in the config file, but if you open makefulltextfeed.php you’ll be able to search for the get_excerpt function. It’s set to use 55 words, which you can change by editing the function definition:

get_excerpt($text, $num_words=55, $more=null) {

We hope to move this number to the config file in a future release.
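To show what a word limit like that does, here is a small Python sketch of word-based truncation. It is not the PHP code from makefulltextfeed.php; the function name, the `"..."` suffix and the whitespace splitting are assumptions made for illustration only.

```python
def excerpt(text: str, num_words: int = 55, more: str = "...") -> str:
    """Keep the first num_words words and append `more` if anything was cut."""
    words = text.split()
    if len(words) <= num_words:
        # Short enough: return the text unchanged apart from whitespace cleanup
        return " ".join(words)
    return " ".join(words[:num_words]) + more


print(excerpt("one two three four", num_words=2))
print(excerpt("one two three four", num_words=10))
```

Raising the default (55 in the function signature above) is what makes the descriptions longer; text under the limit is returned without the trailing dots.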