Different results in extract.php vs makefulltextfeed.php

kunalsood · February 22, 2019, 10:17am

The URL I’m trying to extract from is: https://www.youtube.com/watch?v=E6umTIBZub8

The result I get from makefulltextfeed.php is:
<?xml version="1.0" encoding="utf-8"?>
<?xml-stylesheet type="text/xsl" href="css/feed.xsl"?>

<atom:link rel="self" href="…" />
<atom:link rel="alternate" title="Source URL" href="https://www.youtube.com/watch?v=E6umTIBZub8" />
<atom:link rel="related" title="Subscribe to feed" href="…" />
Why the Oscars took 91 years to nominate a superhero movie for Best Picture
https://www.youtube.com/watch?v=E6umTIBZub8
Content extracted from https://www.youtube.com/watch?v=E6umTIBZub8

https://www.youtube.com/watch?v=E6umTIBZub8
Why the Oscars took 91 years to nominate a superhero movie for Best Picture
https://www.youtube.com/watch?v=E6umTIBZub8
<iframe id="video" width="480" height="270" src="https://www.youtube.com/embed/E6umTIBZub8?feature=oembed" frameborder="0" allow="accelerometer; autoplay; encrypted-media; gyroscope; picture-in-picture" allowfullscreen></iframe>
<dc:format>text/xml</dc:format>
<dc:identifier>http://www.youtube.com/oembed?format=xml&url=https%3A%2F%2Fwww.youtube.com%2Fwatch%3Fv%3DE6umTIBZub8</dc:identifier>

The result I get from extract.php, however, is:
{"title":"YouTube","excerpt":"","date":null,"author":null,"language":"en-US","url":"https://www.youtube.com/watch?v=E6umTIBZub8","effective_url":"https://www.youtube.com/watch?v=E6umTIBZub8","domain":"youtube.com","word_count":0,"og_url":null,"og_title":null,"og_description":null,"og_image":null,"og_type":null,"twitter_card":null,"twitter_site":null,"twitter_creator":null,"twitter_image":null,"twitter_title":null,"twitter_description":null,"content":""}

As you can see, the content key of the JSON returned by extract.php is empty, whereas the XML returned by makefulltextfeed.php contains the correctly extracted iframe for the YouTube URL. Any ideas what’s going on here?
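
For anyone wanting to check for this symptom programmatically, the empty result can be detected by parsing the extract.php response and looking at the content and word_count fields. The following is a minimal Python sketch, not part of Full-Text RSS; the helper name extraction_is_empty is hypothetical and the sample body is an abbreviated copy of the response above.

```python
import json

# Abbreviated copy of the extract.php response quoted above.
response_body = '{"title": "YouTube", "excerpt": "", "word_count": 0, "content": ""}'


def extraction_is_empty(body: str) -> bool:
    """Return True when the JSON response carries no extracted content."""
    data = json.loads(body)
    # An empty "content" string together with a zero word count means
    # nothing usable was extracted from the page.
    return not data.get("content") and data.get("word_count", 0) == 0


print(extraction_is_empty(response_body))  # prints True
```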

kunalsood · February 24, 2019, 5:44am

So, this seems to be due to XSS filtering, which makes sense.
But passing xss=0 as a query string parameter in a GET request to extract.php doesn’t seem to disable XSS filtering. I have verified this using the debug parameter: even with xss=0, it still loads and runs the content through htmLawed. Any ideas what I could be doing wrong here?

The only fix I have been able to find is to modify the if/else statements at the bottom of the extract.php file, which is not ideal. So any help will be highly appreciated.
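
The kind of GET request described above can be sketched as follows. This is a hedged Python example rather than anything from Full-Text RSS itself: the base URL is a hypothetical install location, the url parameter name and the value used for debug are assumptions, and only xss=0 is taken directly from this thread.

```python
from urllib.parse import urlencode

# Hypothetical install location; replace with your own Full-Text RSS URL.
BASE = "http://localhost/full-text-rss/extract.php"


def build_extract_url(target: str, xss: bool = True, debug: bool = False) -> str:
    """Build a GET URL for extract.php with optional xss=0 and debug flags."""
    params = {"url": target}  # "url" as the parameter name is an assumption
    if not xss:
        params["xss"] = "0"  # ask the endpoint to skip XSS filtering
    if debug:
        params["debug"] = "1"  # the value "1" is an assumption
    return BASE + "?" + urlencode(params)


print(build_extract_url("https://www.youtube.com/watch?v=E6umTIBZub8", xss=False))
```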

fivefilters · March 3, 2019, 4:59pm

Thanks for highlighting this, and sorry for the confusion.

I think the reason XSS filtering is enabled by default for the extract.php endpoint but not for makefulltextfeed.php is that we expect most users of the latter to run its output through feed readers, which filter the content according to their own security policies. But of course we want to give users the option of overriding XSS filtering for the extract.php endpoint. We’ll take a look at this for the next release.

kunalsood · March 4, 2019, 3:47am

I understand. Thanks for taking a look!

kunalsood · March 13, 2019, 12:00pm

Just an update, since you said you will look into this.

It looks like extract.php (even with the file modified to remove XSS filtering) is giving me different results from makefulltextfeed.php.

makefulltextfeed.php always gives me the iframe for the actual video, while extract.php always ends up giving me an iframe pointing to Google’s login page, something like this:

<iframe src=\"https:\/\/accounts.google.com\/ServiceLogin?uilel=3&amp;service=youtube&amp;passive=true&amp;hl=en&amp;continue=https%3A%2F%2Fwww.youtube.com%2Fsignin%3Ffeature%3Dpassive%26hl%3Den%26next%3D%252Fsignin_passive%26action_handle_signin%3Dtrue%26app%3Ddesktop\" style=\"display: none\"><\/iframe>

I just thought this might be worth investigating as well.

PS: This is what my site config file for youtube.com looks like:

title: //title
body: //iframe

find_string: <html>&lt;iframe 
replace_string: <iframe id="video" 

find_string: &gt;&lt;/iframe&gt;</html>
replace_string: ></iframe>

single_page_link: //link[@type='text/xml+oembed']/@href

prune: no
tidy: no

test_url: http://www.youtube.com/watch?v=F6gLH0r3iVU
test_url: https://www.youtube.com/watch?v=E6umTIBZub8
test_url: https://www.youtube.com/watch?v=wF3nhile788

kunalsood · March 14, 2019, 9:56am

Last update (I hope).

It seems that commenting out the following line in extract.php makes YouTube URLs work as expected:

$_POST['accept'] = 'html';

So it would seem that Full-Text RSS processes URLs differently when it is explicitly told to expect HTML than when it does not know what to expect.

fivefilters · March 16, 2019, 3:48pm

Hi there, and thanks again for reporting back.

We’ll fix the XSS bug in the next release so that &xss=0 will actually disable XSS processing in our extract.php endpoint. Until then, you can apply the fix yourself by using this code in extract.php instead of what’s there at the moment:

// Enable XSS filtering (unless explicitly disabled)
if (isset($_POST['xss']) && $_POST['xss'] === '0') {
    $_POST['xss'] = '0';
} elseif (isset($_GET['xss']) && $_GET['xss'] === '0') {
    $_GET['xss'] = '0';
} else {
    $_POST['xss'] = '1';
}
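
To make the effect of that branching concrete, here is a small Python model of the same decision. It is only an illustrative sketch, not Full-Text RSS code: the function name xss_filtering_enabled is hypothetical, and it assumes the rest of the application treats an xss value of exactly '0' in either the GET or POST parameters as "filtering disabled", which the snippet implies but this thread does not show.

```python
def xss_filtering_enabled(get: dict, post: dict) -> bool:
    """Model the branching above: filtering stays on unless xss is exactly '0'."""
    if post.get("xss") == "0":
        return False  # explicitly disabled via POST
    if get.get("xss") == "0":
        return False  # explicitly disabled via GET
    return True  # every other case falls through to xss = '1'


# extract.php?url=...&xss=0 now leaves filtering off
print(xss_filtering_enabled({"xss": "0"}, {}))  # prints False
# a missing or unrecognised value keeps filtering on
print(xss_filtering_enabled({}, {}))  # prints True
```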

As for the different result you got when using extract.php, that has to do with how YouTube responds to the User-Agent string in the HTTP request.

When you use the &accept=html parameter in Full-Text RSS (as we do automatically in the extract.php endpoint), we treat the request slightly differently. Here’s our description for this parameter:

Tell Full-Text RSS what it should expect when fetching the input URL. By default Full-Text RSS tries to guess whether the response is a feed or regular HTML page. It’s a good idea to be explicit by passing the appropriate type in this parameter. This is useful if, for example, a feed stops working and begins to return HTML or redirects to an HTML page as a result of site changes. In such a scenario, if you’ve been explicit about the URL being a feed, Full-Text RSS will not parse HTML returned in response. If you pass accept=html (previously html=1), Full-Text RSS will not attempt to parse the response as a feed. This increases performance slightly and should be used if you know that the URL is not a feed.

Note: If excluded, or set to auto, Full-Text RSS first tries to parse the server’s response as a feed, and only if it fails to parse as a feed will it revert to HTML parsing. In the default parse-as-feed-first mode, Full-Text RSS will identify itself as PHP first and only if a valid feed is returned will it identify itself as a browser in subsequent requests to fetch the feed items. In parse-as-html mode, Full-Text RSS will identify itself as a browser from the very first request.
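
The identification rule in that note can be summarised in a few lines. This is a hedged Python sketch of the documented behaviour only, not Full-Text RSS’s own code: the function name and the labels "php" and "browser" are hypothetical, and it covers just the first request for accept=auto (or omitted) and accept=html, because that is all the note specifies.

```python
def first_request_identity(accept: str = "auto") -> str:
    """Return how the first HTTP request identifies itself, per the note above."""
    if accept == "html":
        # parse-as-html mode: identify as a browser from the very first request
        return "browser"
    if accept == "auto":
        # parse-as-feed-first mode: identify as PHP on the first request
        return "php"
    # Other values are not described in the note, so the sketch refuses to guess.
    raise ValueError(f"behaviour for accept={accept!r} is not covered here")


print(first_request_identity("html"))  # prints browser
print(first_request_identity())        # prints php
```

Because extract.php sets accept to html automatically, its first request falls into the "browser" case, while a default makefulltextfeed.php request falls into the "php" case.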

The last part of that note, about how Full-Text RSS identifies itself in each mode, is what’s causing the issue here. It’s possible to override how we present ourselves to a site by setting a custom user agent in the site config file for that site. We’ve just done this for YouTube.com, so if you update the site config file (and make the XSS change above) the issue should be resolved.

Thanks again for reporting this.

kunalsood · March 17, 2019, 4:15am

Awesome!
Thanks for taking the time to respond. I really appreciate it.
Sorry for missing that last bit in the documentation, it all makes sense now.
Sorry again, and thanks again!

fivefilters · March 18, 2019, 10:45am

No problem. Happy to help, and good to hear about issues like this.