Page doesn't render

Hi,
I’m currently testing various site configurations for my new project, and I’ve run into an issue: some sites don’t render.
I get the message “[unable to retrieve full-text content]” together with JSON output whose fields are all empty.
Moreover, when I try to select a body section with http://siteconfig.fivefilters.org/, the site shows the message “You look like a bot, go away.”
What’s going on?
Sites and pages I tested:
https://www.eurointegration.com.ua/news/2019/05/23/7096521/ (and any other page from the site)
https://vybory.pravda.com.ua/news/2019/05/23/7149993/ (and any other page from the site)
Custom XPath configurations don’t work either :frowning: (an example of the kind of config I tried is shown below)
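For illustration, a custom config along these lines is what I mean (the XPath expressions here are placeholders, not the site’s real markup):

eurointegration.com.ua.txt:

title: //h1
body: //div[@class='post']
strip: //div[@class='share']
test_url: https://www.eurointegration.com.ua/news/2019/05/23/7096521/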

Please help me find a solution for pages like these.
Thanks in advance for a quick reply.

Also, I see errors in the logs for what may be ALL non-Latin pages:
2019/05/25 13:15:15 [error] 1058#1058: 2074604 FastCGI sent in stderr: "PHP message: PHP Notice: https://***** is invalid XML, likely due to invalid characters. XML error: ******* (a variety of errors from SimplePie) in /var/www//ftr/libraries/simplepie/library/SimplePie.php on line 1496" while reading response header from upstream
(most Latin-alphabet sites work fine)

I installed the newest SimplePie library (v1.5.2), with the same result :frowning:

Added:
But all the XML errors are gone after installing SimplePie library v1.3.2 :open_mouth:

Added again:
SimplePie’s parse() function uses HTML entities, which breaks the parsing in my tests.
I replaced it with the one from v1.3.2 and all the errors are gone.

Added:
The SimplePie 1.4 and 1.5.2 demos (without FTR) work like a charm and extract the feeds without any errors.

So it’s not a problem with the above-mentioned feeds.
IMHO, it’s a problem in FTR.
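For reference, the standalone check I mean (the “demo without ftr”) was a minimal sketch along these lines; the feed URL is a placeholder for the site’s real RSS URL:

<?php
// Minimal standalone SimplePie test, bypassing Full-Text RSS entirely.
require_once 'simplepie/autoloader.php';

$feed = new SimplePie();
$feed->set_feed_url('https://www.eurointegration.com.ua/rss/'); // placeholder feed URL
$feed->enable_cache(false); // no caching while testing
$feed->init();

if ($feed->error()) {
    echo 'SimplePie error: ' . $feed->error() . "\n";
} else {
    echo 'Parsed OK: ' . $feed->get_title() . ' (' . $feed->get_item_quantity() . " items)\n";
}

Run this way against the non-Latin feeds, SimplePie parses them without any invalid-XML notices.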

Thanks for those problem URLs. Some sites implement systems that try to detect automated access, either by examining IP addresses or by looking for clues in the request headers. You can sometimes get past these measures by running Full-Text RSS from a server whose IP address isn’t blocked (if IP blocking is the cause), or by changing the request headers Full-Text RSS sends. In the site config files we allow you to override the User-Agent string with something like:

http_header(user-agent): Mozilla.... 

But this will require some trial and error, and if the sites in question update their blocking rules, you might find you’re faced with similar messages in the future.
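For example, a custom site config for one of your URLs might start something like this (the header values are just a sketch to experiment with, not values we know will work for these particular sites):

eurointegration.com.ua.txt:

http_header(user-agent): Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/74.0.3729.169 Safari/537.36
http_header(referer): https://www.google.com/
body: //div[@class='post']

Custom site config files go in the site_config/custom/ folder and are named after the site’s hostname.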

Oh, and thanks for letting us know about the error log messages you’re seeing. We’ll take a look to see if we can reproduce this.

Hello,
I tried your solution and put in real headers from browsers (Safari, Chrome, Mozilla).
It’s not working, alas; a sketch of how I tested is below.
Maybe another way exists?
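Here is roughly how I tested from the server itself (a sketch; the header values are copied from my browser, and the URL is one of the pages above):

<?php
// Probe run from the same server as Full-Text RSS, to see what the site
// returns when the request carries browser-like headers.
$url = 'https://www.eurointegration.com.ua/news/2019/05/23/7096521/';
$context = stream_context_create(['http' => [
    'method' => 'GET',
    'header' => implode("\r\n", [
        'User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/74.0.3729.169 Safari/537.36',
        'Accept: text/html,application/xhtml+xml',
        'Accept-Language: en-US,en;q=0.9',
    ]),
]]);
$html = @file_get_contents($url, false, $context);
if ($html === false) {
    echo "Request failed or was blocked\n";
} else {
    echo $http_response_header[0] . "\n"; // HTTP status line
    echo substr($html, 0, 300) . "\n";    // first bytes of the response body
}

Even with these headers I get the same result, so I suspect the block may not be header-based.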

Hello,
I’d be very glad if you could suggest working headers for the sites listed above.
All my attempts have ended without success :frowning:
Thank you in advance!