Question: In what cases does debug=rawhtml return nothing?

My YouTube extraction hasn’t been working for the past few weeks, and I’ve been trying to figure out what could be wrong. Running makefulltextfeed.php with debug=rawhtml returns something like this:

* APC is disabled or not available on server
* Supplied URL: https://www.youtube.com/watch?v=JOZ8l6c8zh0
* ** Loading class HumbleHttpAgent (humble-http-agent/HumbleHttpAgent.php)
* ** Loading class ContentExtractor (content-extractor/ContentExtractor.php)
* ** Loading class SiteConfig (content-extractor/SiteConfig.php)
* --------
* Attempting to process URL as feed
* ** Loading class SimplePie_HumbleHttpAgent (humble-http-agent/SimplePie_HumbleHttpAgent.php)
* ** Loading class DisableSimplePieSanitize (DisableSimplePieSanitize.php)
* Fetching URL (https://www.youtube.com/watch?v=JOZ8l6c8zh0)
* Starting parallel fetch (curl_multi_*)
* Processing set of 1
* ...https://www.youtube.com/watch?v=JOZ8l6c8zh0
* ......adding to pool
* . looking for site config for youtube.com in custom folder
* ... found site config (youtube.com.txt)
* Cached site config with key youtube.com
* . looking for site config for youtube.com in standard folder
* ... site config for youtube.com already loaded in this request
* . merging config files
* Cached site config with key youtube.com.merged
* Checking fingerprints...
* No fingerprint matches
* . looking for site config for global in custom folder
* . looking for site config for global in standard folder
* ... found site config in standard folder (global.txt)
* Cached site config with key global
* Cached site config with key global.merged.ex
* Appending site config settings from global.txt
* ......user-agent set to: PHP/7.2
* ......referer set to: http://www.google.co.uk/url?sa=t&source=web&cd=1
* Sending request...
* Received responses
* ... site config for youtube.com.merged already loaded in this request
* Checking fingerprints...
* No fingerprint matches
* ... site config for global.merged.ex already loaded in this request
* Appending site config settings from global.txt
* --------
* Constructing a single-item feed from URL
* ** Loading class FeedWriter (feedwriter/FeedWriter.php)
* --------
* Fetching feed items
* Starting parallel fetch (curl_multi_*)
* Processing set of 1
* ...https://www.youtube.com/watch?v=JOZ8l6c8zh0
* ......in memory
* --------
* Processing feed item 1
* Item URL: https://www.youtube.com/watch?v=JOZ8l6c8zh0
* ** Loading class FeedItem (feedwriter/FeedItem.php)
* URL already fetched - in memory (https://www.youtube.com/watch?v=JOZ8l6c8zh0, effective: https://www.youtube.com/watch?v=JOZ8l6c8zh0)
* Failed to extract, so skipping (due to exclude on fail parameter)
* Done!

Wondering what could be going wrong.
Thanks in advance for any insight.

It looks like Google may be blocking the request from your server. I get the same result when I try with our free hosted service at http://ftr.fivefilters.org, but when I try locally with Full-Text RSS, or on a different server, I get the HTML output.

Thanks for your reply.
That’s a bummer.

So, just for future reference: when I run a URL through Full-Text RSS with debug=rawhtml and no HTML is returned in the output, does that generally mean there was no response from the server? (That could be for any of several reasons, such as blocking or the server being down.)

Also, one more question (I hope this will be my last on this issue): is there any way to make Full-Text RSS return a preset HTML string as the content, based on the given URL?

For example, say I pass this URL to FTR: https://www.youtube.com/watch?v=Nc8DLi12Y-Q
Can I make FTR parse the video ID Nc8DLi12Y-Q out of that URL’s query string and create the following output for the content?

<iframe width="560" height="315" src="https://www.youtube.com/embed/Nc8DLi12Y-Q" frameborder="0" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture" allowfullscreen></iframe>

This way, we can completely avoid having to get any response from Google’s YouTube servers.

And if this is something we can define in site patterns, we can address any changes YouTube makes on their end in the future.
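To make the idea concrete, here is a minimal sketch in Python of the transformation I have in mind. It is not an existing Full-Text RSS feature: the function name `youtube_embed_html` and the list of accepted hostnames are my own assumptions, and it only handles plain `/watch?v=...` URLs.

```python
from html import escape
from urllib.parse import parse_qs, urlparse


def youtube_embed_html(url, width=560, height=315):
    """Build iframe markup from a YouTube watch URL without fetching it.

    Hypothetical helper for illustration; returns None when the URL is not
    a /watch URL on an accepted host or has no "v" query parameter.
    """
    parsed = urlparse(url)
    host = (parsed.hostname or "").lower()
    # Assumed set of hostnames; extend as needed.
    if host not in ("youtube.com", "www.youtube.com", "m.youtube.com"):
        return None
    if parsed.path != "/watch":
        return None
    video_ids = parse_qs(parsed.query).get("v")
    if not video_ids:
        return None
    # Escape the ID so an odd value cannot break out of the src attribute.
    src = "https://www.youtube.com/embed/" + escape(video_ids[0], quote=True)
    return (
        f'<iframe width="{width}" height="{height}" src="{src}" frameborder="0" '
        'allow="accelerometer; autoplay; clipboard-write; encrypted-media; '
        'gyroscope; picture-in-picture" allowfullscreen></iframe>'
    )


print(youtube_embed_html("https://www.youtube.com/watch?v=Nc8DLi12Y-Q"))
```

Because the markup is built from the URL alone, no request to YouTube is needed, which is the whole point when the server is being blocked.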

I would say generally yes.

Not at the moment. But you can get creative with single_page_link, find_string and replace_string. See, for example, how we currently handle YouTube links: https://github.com/fivefilters/ftr-site-config/blob/master/youtube.com.txt

The problem here is that the main goal of Full-Text RSS is to extract text content from article pages. Beyond YouTube and similar video sites, where all that’s required in the response is an iframe built by inserting a query string parameter value into a string template, I think there are very few situations where a solution like this would be useful.

Thanks for your response.

But you can get creative with single_page_link, find_string and replace_string.

I don’t see any way to solve this one using any of those. But, thanks!

I understand that YouTube and other similar video sites may not be a priority for FTR, but you’d be surprised how many people subscribe to YouTube in their RSS readers, so it’d be a nice-to-have feature. Either way, I’m happy to use my own workaround and not rely on FTR for this at all.

Thanks again for taking the time to write back. I really appreciate it.

Sorry, should’ve made clear that this example doesn’t help if the request to the server is being blocked, as is the case here. Simply wanted to highlight use of the find_string/replace_string to produce an iframe element that Full-Text RSS can extract.
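As a rough illustration of the idea, the sketch below mimics paired find/replace rules in Python, assuming they behave as plain literal substitutions applied in order to the fetched HTML before extraction. The page fragment and the rule pair are invented for this example; they are not what YouTube serves and not what youtube.com.txt contains, and the helper name `apply_find_replace` is hypothetical.

```python
def apply_find_replace(html, pairs):
    """Apply each (find, replace) pair to html, in order, as a literal substitution."""
    for find, replace in pairs:
        html = html.replace(find, replace)
    return html


# Hypothetical fetched fragment and rule pairs, invented for illustration only.
raw_html = '<div class="player" data-video-id="Nc8DLi12Y-Q"></div>'
pairs = [
    ('<div class="player" data-video-id="', '<iframe src="https://www.youtube.com/embed/'),
    ('"></div>', '"></iframe>'),
]

print(apply_find_replace(raw_html, pairs))
# prints: <iframe src="https://www.youtube.com/embed/Nc8DLi12Y-Q"></iframe>
```

Note that the rules only have something to rewrite once the page HTML has actually been fetched, which is why this approach cannot help when the request itself is blocked.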

We will see if there’s a nice way of supporting what you’ve suggested here in Full-Text RSS. We appreciate you posting about it.

Simply wanted to highlight use of the find_string/replace_string to produce an iframe element that Full-Text RSS can extract.

Thanks. I have been using FTR since around 2012, and have come to love it and know it well. (I know it as well as someone who doesn’t write or build it can know it. :) )

We will see if there’s a nice way of supporting what you’ve suggested here in Full-Text RSS.

Thanks again!
