As you can see, there is nothing in the content key of the JSON dictionary received from extract.php, but the XML received from makefulltextfeed.php contains the correctly extracted iframe for the YouTube URL. Any ideas what’s going on here?
So, this seems to be due to XSS filtering, which makes sense.
But passing xss=0 as a query string parameter in a GET request to extract.php doesn’t seem to disable XSS filtering. Using the debug parameter, I have verified that even after passing xss=0, the content is still loaded and run through htmLawed. Any ideas what I could be doing wrong here?
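For example, a request like this (hypothetical host and placeholder video URL) still shows htmLawed being applied in the debug output:

```
http://localhost/full-text-rss/extract.php?url=https%3A%2F%2Fwww.youtube.com%2Fwatch%3Fv%3DVIDEO_ID&xss=0&debug
```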
The only fix I have been able to find is to modify the if/else statements at the bottom of the extract.php file, which is not ideal. So, any help will be highly appreciated.
Thanks for highlighting this, and sorry for the confusion.
I think the reason XSS filtering is enabled by default for the extract.php endpoint and not for makefulltextfeed.php is that, for the latter, we expect most users will be running the output through feed readers, which will filter the content according to their own security policies. But we of course want to give users the option of overriding the XSS filtering for the extract.php endpoint. We’ll take a look at this for the next release.
Just an update, since you said you will look into this.
It looks like extract.php (even with the file modified to remove XSS filtering) is giving me different results from what makefulltextfeed.php gives me.
makefulltextfeed.php always gives me the iframe for the actual video, while extract.php always ends up giving me an iframe pointing to Google’s login page, looking something like this:
We’ll fix the xss bug in the next release so that &xss=0 actually disables XSS processing in our extract.php endpoint. In the meantime, you can apply the fix yourself by replacing the relevant code in extract.php.
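As a rough illustration (assumed variable names, not necessarily the exact code that will ship), the idea is to make the htmLawed step conditional on the xss parameter:

```php
<?php
// Sketch only: assumed variable names, not the actual extract.php source.
// Run the extracted content through htmLawed unless the caller has
// explicitly disabled XSS filtering with &xss=0.
$xss_filter = !isset($_GET['xss']) || $_GET['xss'] !== '0';
if ($xss_filter) {
    // htmLawed's safe mode strips potentially dangerous elements,
    // including iframes, which is why the YouTube embed disappears.
    $content = htmLawed($content, array('safe' => 1, 'deny_attribute' => 'style'));
}
```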
As for the different result you got when using extract.php, that has to do with how YouTube responds to the User-Agent string in the HTTP request.
When you use the &accept=html parameter in Full-Text RSS (as we do automatically in the extract.php endpoint), we treat the request slightly differently. Here’s our description for this parameter:
Tell Full-Text RSS what it should expect when fetching the input URL. By default Full-Text RSS tries to guess whether the response is a feed or a regular HTML page. It’s a good idea to be explicit by passing the appropriate type in this parameter. This is useful if, for example, a feed stops working and begins to return HTML or redirects to an HTML page as a result of site changes. In such a scenario, if you’ve been explicit about the URL being a feed, Full-Text RSS will not parse HTML returned in response. If you pass accept=html (previously html=1), Full-Text RSS will not attempt to parse the response as a feed. This increases performance slightly and should be used if you know that the URL is not a feed.
Note: If excluded, or set to auto, Full-Text RSS first tries to parse the server’s response as a feed, and only if it fails to parse as a feed will it revert to HTML parsing. In the default parse-as-feed-first mode, Full-Text RSS will identify itself as PHP first and only if a valid feed is returned will it identify itself as a browser in subsequent requests to fetch the feed items. In parse-as-html mode, Full-Text RSS will identify itself as a browser from the very first request.
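In other words, the user agent for the first request is chosen roughly like this (a simplified sketch with hypothetical names, not the actual Full-Text RSS source):

```php
<?php
// Simplified sketch of the behaviour described in the note above.
// Hypothetical names; not the actual Full-Text RSS source.
$accept = isset($_GET['accept']) ? $_GET['accept'] : 'auto';
if ($accept === 'html') {
    // Parse-as-html mode: identify as a browser from the very first request.
    $userAgent = 'Mozilla/5.0 (Windows NT 10.0; Win64; x64)';
} else {
    // Auto mode: identify as PHP first; switch to a browser-style
    // user agent only after a valid feed has been returned.
    $userAgent = 'PHP/' . PHP_VERSION;
}
```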
That last highlighted line in the parameter description is what’s causing the issue here. It’s possible to override how we present ourselves to a site by setting a custom user agent in the site config file for that site. We’ve just done this for YouTube.com, so if you update the site config file (and make the XSS change above), the issue should be resolved:
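For reference, a user-agent override in a site config file looks something like this (the value below is a placeholder; the exact config shipped for YouTube.com may differ):

```
# youtube.com.txt (placeholder user-agent value; the shipped config may differ)
http_header(user-agent): Mozilla/5.0 (Windows NT 10.0; Win64; x64)
```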
Awesome!
Thanks for taking the time to respond. I really appreciate it.
Sorry for missing that last bit in the documentation; it all makes sense now.
Sorry again, and thanks again!