More on: Different results in extract.php vs makefulltextfeed.php


#1

This is the URL I’m trying to extract content from: https://www.cbc.ca/news/world/iraq-violence-video-games-banned-1.5102126?cmp=rss

makefulltextfeed.php returns the correctly extracted content, while extract.php returns an empty string in the “content” key of its JSON output. Any ideas what’s going on here?

Adding the debug parameter to the extract.php query gives me the following output:

* APC is disabled or not available on server
* Supplied URL: https://www.cbc.ca/news/world/iraq-violence-video-games-banned-1.5102126?cmp=rss 
* ** Loading class HumbleHttpAgent (humble-http-agent/HumbleHttpAgent.php) 
* ** Loading class ContentExtractor (content-extractor/ContentExtractor.php) 
* ** Loading class SiteConfig (content-extractor/SiteConfig.php) 
* -------- 
* Constructing a single-item feed from URL 
* ** Loading class FeedWriter (feedwriter/FeedWriter.php) 
* -------- 
* Fetching feed items 
* Starting parallel fetch (curl_multi_*)
 * Processing set of 1 
* ...https://www.cbc.ca/news/world/iraq-violence-video-games-banned-1.5102126?cmp=rss 
* ......adding to pool 
* . looking for site config for cbc.ca in custom folder 
* ... found site config (cbc.ca.txt) 
* Cached site config with key cbc.ca 
* . looking for site config for cbc.ca in standard folder 
* ... site config for cbc.ca already loaded in this request 
* . merging config files 
* Cached site config with key cbc.ca.merged 
* Checking fingerprints... 
* No fingerprint matches 
* . looking for site config for global in custom folder 
* . looking for site config for global in standard folder 
* ... found site config in standard folder (global.txt) 
* Cached site config with key global 
* Cached site config with key global.merged.ex 
* Appending site config settings from global.txt 
* ......user-agent set to: Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/60.0.3112.113 Safari/537.36 
* ......referer set to: http://www.google.co.uk/url?sa=t&source=web&cd=1 
* Sending request... 
* Received responses 
* ... site config for cbc.ca.merged already loaded in this request 
* Checking fingerprints... 
* No fingerprint matches 
* ... site config for global.merged.ex already loaded in this request 
* Appending site config settings from global.txt 
* -------- 
* Processing feed item 1 
* Item URL: https://www.cbc.ca/news/world/iraq-violence-video-games-banned-1.5102126?cmp=rss 
* ** Loading class FeedItem (feedwriter/FeedItem.php) 
* URL already fetched - in memory (https://www.cbc.ca/news/world/iraq-violence-video-games-banned-1.5102126?cmp=rss, effective: https://www.cbc.ca/news/world/iraq-violence-video-games-banned-1.5102126?cmp=rss) 
* Done!

As you can probably see, the output ends abruptly. Adding ‘rawhtml’ or ‘parsedhtml’ to the debug parameter in the query on extract.php yields the same output. Any ideas?


#2

Update: Changing this line in extract.php:

// Don't process URL as feed
$_POST['accept'] = 'html';

to this:

// Don't process URL as feed
$_POST['accept'] = 'auto';

fixes the issue, for whatever reason.
Might be something worth investigating.


#3

I couldn’t reproduce this using Full-Text RSS 3.9.5. Both regular output and JSON output from extract.php produced content.

If the change you describe worked for you, it could well be because with &accept=auto we use a User-Agent HTTP header to identify as PHP (because in auto mode we assume the URL we fetch first is a feed). In &accept=html mode, we use a browser User-Agent HTTP header. So if the site returns content differently depending on the value of this header, that could explain why your change produced results for you. (You don’t mention which version of Full-Text RSS you use, so it could be that these values differ in the version you’re using.)
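
One way to check the User-Agent theory outside Full-Text RSS is to request the same URL with two different User-Agent headers and compare what comes back. The following is only a rough sketch using Python's standard library, not part of Full-Text RSS. The browser string is copied from the debug output above; the "PHP/7.2" value and the function names are assumptions made for illustration, and the PHP-style string your version actually sends may differ.

```python
import hashlib
import urllib.request

# Illustrative User-Agent values. The browser string comes from the debug
# output in this thread; "PHP/7.2" is an assumed placeholder, not necessarily
# what any Full-Text RSS version sends.
USER_AGENTS = {
    "browser": (
        "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
        "(KHTML, like Gecko) Chrome/60.0.3112.113 Safari/537.36"
    ),
    "php": "PHP/7.2",
}


def fetch(url, user_agent, timeout=20):
    """Fetch url with the given User-Agent header and return the raw body."""
    request = urllib.request.Request(url, headers={"User-Agent": user_agent})
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.read()


def compare_user_agents(url, user_agents=USER_AGENTS, fetcher=fetch):
    """Return {label: (body length, short SHA-256 digest)} per User-Agent."""
    results = {}
    for label, user_agent in user_agents.items():
        body = fetcher(url, user_agent)
        results[label] = (len(body), hashlib.sha256(body).hexdigest()[:12])
    return results


def responses_differ(results):
    """True if at least two User-Agents received different response bodies."""
    return len({digest for _, digest in results.values()}) > 1


# Example (performs real HTTP requests, so it is left commented out):
# results = compare_user_agents(
#     "https://www.cbc.ca/news/world/iraq-violence-video-games-banned-1.5102126?cmp=rss"
# )
# for label, (length, digest) in results.items():
#     print(f"{label}: {length} bytes, sha256 {digest}")
# print("Responses differ:", responses_differ(results))
```

Dynamic pages can differ slightly between any two requests, so a different digest alone proves little. A large gap in body length, or an empty body for one of the two headers, is the more useful signal.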


#4

I have tested this with both version 3.9.5 and version 3.9.1, and seen the issue in both versions. Thanks for the input about the User-Agent. I’ll test this with different User-Agent strings in the site patterns and report back.
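
For anyone following along, a User-Agent override in a site config file would look roughly like the fragment below. This assumes the http_header(user-agent) directive of the Full-Text RSS site config format and that the line goes into the custom cbc.ca.txt file mentioned in the debug output. The value shown is only a placeholder to swap for each string being tested.

```
# custom site config for cbc.ca (placeholder User-Agent value)
http_header(user-agent): PHP/7.2
```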


#5

Update: Yep, it is indeed the User-Agent string that makes the difference.
How strange!
And even stranger that it’s not an issue on your installation of FTR, but it is on mine.


#6

Yes, it can be a little difficult debugging issues when sites respond differently depending on the User-Agent or other HTTP headers.