This is the URL I’m trying to extract content from: https://www.cbc.ca/news/world/iraq-violence-video-games-banned-1.5102126?cmp=rss
makefulltextfeed.php returns the correctly extracted content, while extract.php has an empty string in the “content” key of the JSON it returns. Any ideas what’s going on here?
Adding the debug parameter to the query on extract.php gives me the following output:
* APC is disabled or not available on server
* Supplied URL: https://www.cbc.ca/news/world/iraq-violence-video-games-banned-1.5102126?cmp=rss
* ** Loading class HumbleHttpAgent (humble-http-agent/HumbleHttpAgent.php)
* ** Loading class ContentExtractor (content-extractor/ContentExtractor.php)
* ** Loading class SiteConfig (content-extractor/SiteConfig.php)
* --------
* Constructing a single-item feed from URL
* ** Loading class FeedWriter (feedwriter/FeedWriter.php)
* --------
* Fetching feed items
* Starting parallel fetch (curl_multi_*)
* Processing set of 1
* ...https://www.cbc.ca/news/world/iraq-violence-video-games-banned-1.5102126?cmp=rss
* ......adding to pool
* . looking for site config for cbc.ca in custom folder
* ... found site config (cbc.ca.txt)
* Cached site config with key cbc.ca
* . looking for site config for cbc.ca in standard folder
* ... site config for cbc.ca already loaded in this request
* . merging config files
* Cached site config with key cbc.ca.merged
* Checking fingerprints...
* No fingerprint matches
* . looking for site config for global in custom folder
* . looking for site config for global in standard folder
* ... found site config in standard folder (global.txt)
* Cached site config with key global
* Cached site config with key global.merged.ex
* Appending site config settings from global.txt
* ......user-agent set to: Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/60.0.3112.113 Safari/537.36
* ......referer set to: http://www.google.co.uk/url?sa=t&source=web&cd=1
* Sending request...
* Received responses
* ... site config for cbc.ca.merged already loaded in this request
* Checking fingerprints...
* No fingerprint matches
* ... site config for global.merged.ex already loaded in this request
* Appending site config settings from global.txt
* --------
* Processing feed item 1
* Item URL: https://www.cbc.ca/news/world/iraq-violence-video-games-banned-1.5102126?cmp=rss
* ** Loading class FeedItem (feedwriter/FeedItem.php)
* URL already fetched - in memory (https://www.cbc.ca/news/world/iraq-violence-video-games-banned-1.5102126?cmp=rss, effective: https://www.cbc.ca/news/world/iraq-violence-video-games-banned-1.5102126?cmp=rss)
* Done!
As you can probably see, the output ends abruptly. Adding ‘rawhtml’ or ‘parsedhtml’ to the debug parameter in the query on extract.php yields the same output. Any ideas?
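In case it helps anyone reproduce this, here is a minimal Python sketch of how the extract.php requests could be built and how the empty “content” key could be checked. The base URL is a hypothetical local install, and the `url` query parameter name is an assumption on my part; only the `debug` values come from what I tried above.

```python
import json
from urllib.parse import urlencode

# Hypothetical base URL of a self-hosted install; replace with your own.
BASE = "http://localhost/full-text-rss"
ARTICLE = "https://www.cbc.ca/news/world/iraq-violence-video-games-banned-1.5102126?cmp=rss"


def build_extract_url(base, article_url, debug=None):
    """Build an extract.php request URL.

    The `url` parameter name is an assumption; `debug` is the
    parameter mentioned in the post.
    """
    params = {"url": article_url}
    if debug is not None:
        params["debug"] = debug
    return f"{base}/extract.php?{urlencode(params)}"


def content_is_empty(response_body):
    """Return True if the JSON body has a missing or blank "content" value."""
    data = json.loads(response_body)
    return not (data.get("content") or "").strip()


if __name__ == "__main__":
    # Print one request URL per debug mode tried in the post.
    for mode in (None, "rawhtml", "parsedhtml"):
        print(build_extract_url(BASE, ARTICLE, mode))
```

Fetching each printed URL and passing the response body to `content_is_empty` would show whether the problem is the same across debug modes; the sketch deliberately makes no network calls itself.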