Remove JavaScript

desk-user · October 29, 2014, 9:02am

This is my test url:
http://www.themalaymailonline.com/videos/showbiz/watch/cover-media-fast-furious-7-gets-new-title?utm_source=twitterfeed&utm_medium=twitter

When I run it in Full-Text RSS 3.4 and 3.3, I get this:

"); } /** Insert fallback. / document.getElementById(“botr_ZaeSi6jM_blsyVPO4_div”).innerHTML = “”; / Initialize player **/ jwplayer.key = “QeTEtuWqz5Ac/DEzUNv1dtNHXhvFenaaqdjqGw==”; jwplayer(“botr_ZaeSi6jM_blsyVPO4_div”).setup({ advertising: { client: “googima”, schedule: { preroll_1: { tag: “http://pubads.g.doubleclick.net/gampad/ads?sz=480x270&iu=/32246135/Video-480x270&ciu_szs&impl=s&gdfp_req=1&env=vp&output=xml_vast3&unviewed_position_start=1&url=[referrer_url]&correlator=[timestamp]&cust_params=Video-480x270%3Dlinearpreroll&”, offset: “pre”, skipoffset: 5 }, preroll_2: { tag: “http://pubads.g.doubleclick.net/gampad/ads?sz=480x270&iu=/32246135/Video-480x270-preroll-1&ciu_szs&impl=s&gdfp_req=1&env=vp&output=xml_vast3&unviewed_position_start=1&url=[referrer_url]&correlator=[timestamp]&cust_params=Video-480x270-preroll-1%3Dlinearpreroll&”, offset: “pre” } } }, analytics: {“enabled”: true}, aspectratio: “16:9”, autostart: false, controls: true, displaytitle: false, fallback: true, flashplayer: “http://assets-jp.jwpsrv.com/player/6/6104906/jwplayer.flash.swf”, ga: {“idstring”: “title”}, height: 270, html5player: “http://assets-jp.jwpsrv.com/player/6/6104906/jwplayer.html5.js”, image: “http://content.jwplatform.com/thumbs/ZaeSi6jM-1280.jpg”, logo: {“link”: “http://www.themalaymailonline.com/videos/”, “position”: “top-right”, “margin”: “10”, “hide”: false, “file”: “http://assets-jp.jwpsrv.com/watermarks/NRW97gK2.png”}, playlist: “http://content.jwplatform.com/jw6/ZaeSi6jM.xml”, plugins: {“http://assets-jp.jwpsrv.com/player/6/6104906/ping.js”: {“pixel”: “http://content.jwplatform.com/ping.gif”}}, primary: “flash”, repeat: false, stagevideo: false, stretching: “uniform”, width: “100%” });
Fans of the Fast & Furious franchise have good reason to be excited this week: the promotional campaign for the seventh movie in the series has started its engines. And the first bit of news is the title of the upcoming sequel, now called simply Furious 7. — Cover Media

How can I remove the JavaScript code?

I tried ‘strip: //script’ in the extraction rules, but it doesn’t work.

P.S. This problem does not occur here:
http://ftr.fivefilters.org/makefulltextfeed.php?url=http://www.themalaymailonline.com/videos/showbiz/watch/cover-media-fast-furious-7-gets-new-title?utm_source=twitterfeed&utm_medium=twitter

desk-user · November 7, 2014, 1:15pm

Hi, any answer for this?

vhanded

fivefilters · November 7, 2014, 1:22pm

Hi there,

Sorry for the slow response.

This looks like an issue related to HTML parsing. If it works on our installation but not on yours, check the compatibility test file to see whether Tidy is enabled. The default parser (libxml) often gets confused when parsing script elements, so parts of the JS code contained in those elements end up being treated as a regular content block. Running Tidy on the HTML before it’s passed to libxml for parsing often fixes these issues, so if it’s not enabled, you can ask your server admin or host to enable it. Alternatively, you can tell Full-Text RSS to parse the HTML with an HTML5 parser, which handles such cases better than libxml. To do that, add &parser=html5php to the querystring. This should prevent JavaScript snippets from leaking into the extracted content.
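As a rough illustration of the querystring change described above, here is a minimal Python sketch that builds a request URL with the parser parameter appended. The installation URL (`http://example.com/full-text-rss`) and the helper name `build_ftr_url` are made up for the example; only `makefulltextfeed.php`, `url` and `parser=html5php` come from this thread.

```python
from urllib.parse import urlencode


def build_ftr_url(base, article_url, parser=None):
    """Build a makefulltextfeed.php request URL, optionally naming a parser."""
    params = {"url": article_url}
    if parser:
        # Sent as &parser=... in the querystring.
        params["parser"] = parser
    # urlencode percent-encodes the article URL so its own ? and & survive.
    return f"{base}/makefulltextfeed.php?{urlencode(params)}"


# Hypothetical installation URL; replace with your own.
BASE = "http://example.com/full-text-rss"
ARTICLE = (
    "http://www.themalaymailonline.com/videos/showbiz/watch/"
    "cover-media-fast-furious-7-gets-new-title"
)

request_url = build_ftr_url(BASE, ARTICLE, parser="html5php")
print(request_url)
```

Encoding the article URL matters for test URLs like the one in this thread, whose own `utm_source` and `utm_medium` parameters would otherwise be read as parameters of the Full-Text RSS request.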

Let us know if you still have trouble.

Best, Keyvan

desk-user · November 11, 2014, 11:38am

Hi Keyvan,

I tried tidy: yes, but I still get the same result.

Then I tried the HTML5 parser. It works in that the JavaScript is ignored. However, it also ignores all my strip and body extraction rules.

Is this a bug, or did I miss something?

vhanded

desk-user · November 11, 2014, 11:38am

These are my extraction rules:

strip: //span[@class='img-caption']
strip: //span[@class='quiet small caption']
strip: //p[@class='date quiet']
body: //div[@class='video']
parser: html5lib

vhanded

fivefilters · November 11, 2014, 11:51am

Hi there, tidy: yes has no effect if Tidy is not actually available on your server. I can’t see if it is or isn’t, but you will be able to check if you access the compatibility test file (linked from the index page of your installation of Full-Text RSS).

As for your extraction rules, you might be encountering problems because the page relies on Javascript to load the video (and perhaps other elements). Full-Text RSS cannot access elements that are loaded by Javascript. The best way to test this is to view the page source (raw HTML sent back by the server) and look for those elements you’re trying to extract. If you can’t see them there, then they won’t be available to Full-Text RSS either. Many sites modify the document through Javascript, inserting additional elements, classes, loading videos, etc. When you inspect the document using a DOM inspector, you’re seeing the final result (after Javascript has been executed). If there are elements that have appeared because of Javascript, those won’t be available to Full-Text RSS. Another way to quickly test for this is to disable Javascript in your browser and try reloading the page (if I do that to your test URL, I see that the video disappears).
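The "view the page source" check can also be scripted. The following is only a sketch, using Python's standard library, that looks for an element with a given tag and exact class value in raw HTML, which is roughly what an XPath test such as `//div[@class='video']` requires. The names `ClassFinder`, `in_raw_html` and `fetch_raw_html`, and the sample markup, are invented for illustration and are not part of Full-Text RSS.

```python
from html.parser import HTMLParser
from urllib.request import urlopen


class ClassFinder(HTMLParser):
    """Records whether a start tag with an exact class attribute was seen."""

    def __init__(self, tag, class_value):
        super().__init__()
        self.tag = tag
        self.class_value = class_value
        self.found = False

    def handle_starttag(self, tag, attrs):
        # Exact string comparison, like @class='...' in XPath.
        if tag == self.tag and dict(attrs).get("class") == self.class_value:
            self.found = True


def in_raw_html(raw_html, tag, class_value):
    """Return True if the raw (pre-JavaScript) HTML contains the element."""
    finder = ClassFinder(tag, class_value)
    finder.feed(raw_html)
    return finder.found


def fetch_raw_html(url):
    """Fetch the HTML as the server sends it, without running any scripts."""
    with urlopen(url) as response:
        return response.read().decode("utf-8", errors="replace")


# Made-up sample standing in for a page source.
sample = '<div class="video"><p class="date quiet">29 Oct</p></div>'
print(in_raw_html(sample, "div", "video"))   # element present in raw HTML
print(in_raw_html(sample, "div", "player"))  # element absent
```

If `in_raw_html(fetch_raw_html(url), ...)` reports that an element is missing while a DOM inspector shows it, the element was most likely added by JavaScript.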

You can also try using our simple site config creator, which loads page content without Javascript: http://siteconfig.fivefilters.org

Hope that’s some help.

desk-user · November 11, 2014, 3:39pm

Hi Keyvan,

I will check if I have Tidy available or not.

As for the JavaScript part, I am fully aware that Full-Text RSS cannot access content loaded by JavaScript.

However, that is not my problem. I can see the content in the page source.

My problem is that if I set parser=html5parser, my extraction rules are completely ignored. Full-Text RSS extracts the content as if there were no extraction rules.

vhanded

fivefilters · November 12, 2014, 6:04pm

Hi there,

My earlier reply was based on the test URL you supplied - http://www.themalaymailonline.com/videos/showbiz/watch/cover-media-fast-furious-7-gets-new-title - that’s why I assumed you were trying to extract the video on that page.

If it’s another issue related to the HTML5 parser, can you please give us the following:

URL of your Full-Text RSS installation
URL of the article you’re processing
Name of the site config file you’ve created for the URL

You can email these to us at help@fivefilters.org if you’d rather not share the URL here. I’ll be happy to take a look then. My only guess at this point is that the parser is producing a different DOM tree to the one your XPath expression expects. If you use &debug=parsedhtml in the URL, you might see how this differs.
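To spot where the two DOM trees differ, one option is to save the &debug=parsedhtml output once with the default parser and once with the HTML5 parser, then diff the two. Below is a minimal Python sketch of that comparison. The helper name `diff_parsed_html` and both sample outputs are invented for illustration; the real debug output will look different.

```python
import difflib


def diff_parsed_html(default_output, html5_output):
    """Return unified-diff lines between two saved parsed-HTML dumps."""
    return list(
        difflib.unified_diff(
            default_output.splitlines(),
            html5_output.splitlines(),
            fromfile="default parser",
            tofile="html5php parser",
            lineterm="",
        )
    )


# Made-up dumps: here the second parser wraps the target in an extra element.
default_dump = '<div class="video">\n<p>Text</p>\n</div>'
html5_dump = (
    '<div class="video-wrap">\n'
    '<div class="video">\n<p>Text</p>\n</div>\n'
    '</div>'
)

for line in diff_parsed_html(default_dump, html5_dump):
    print(line)
```

Lines prefixed with `+` or `-` in the result mark the places where an XPath expression written against one tree may no longer match the other.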