Error using "parser: html5lib"

Hi,

As an experiment, I have full-text-rss 3.1 running on AppFog. I was testing a feed (feeds.labnol.org/labnol) that displayed some garbage where an ad is removed near the start of each article:

';}google_adnum=google_adnum … google_skip=google_adnum;

As AF doesn’t support Tidy, I thought I’d try the html5lib parser in its place via a custom site_config, but when I do that and attempt to generate a feed, I get this error report:

Warning: DOMDocument::createElementNS(): Namespace Error in /mnt/var/vcap.local/dea/apps/full-text-rss-0-3a8575ed7db507ee8c5fa36f1b61ff57/app/libraries/html5/TreeBuilder.php on line 3160

Fatal error: Call to a member function hasAttribute() on a non-object in /mnt/var/vcap.local/dea/apps/full-text-rss-0-3a8575ed7db507ee8c5fa36f1b61ff57/app/libraries/html5/TreeBuilder.php on line 3164

Any ideas?

I should also mention that the feed renders perfectly (without the garbage) using the full-text-rss instance on the five-filters site and on another 3.1 instance.

Ian

Hi Ian, yes, unfortunately AppFog does not have the Tidy package installed. The html5lib parser also has its own issues, as you’ve discovered.

To fix this particular html5lib problem, please use the patch submitted in this thread: https://code.google.com/p/html5lib/issues/detail?id=131. The direct link to the patch is https://code.google.com/p/html5lib/issues/attachmentText?id=131&aid=-1161059195050282731&name=html5libphp_treebuilder_custom_namespaces.patch. If you’re not familiar with patch files: edit your copy of TreeBuilder.php (in libraries/html5/), adding the lines that begin with ‘+’ and removing the lines that begin with ‘-’. This will fix the issue for this site and many other sites, but I think there are still issues with the HTML5 parser, which will hopefully get fixed.

Another way to handle badly parsed JavaScript elements, without Tidy and without html5lib, is to include the following rules in your site config file:

find_string: <script
replace_string: <div style="display:none;"
find_string: </script>
replace_string: </div>

This turns script elements into hidden div elements (which Full-Text RSS removes). This seems to work on the few sites where we’ve seen JS snippets sneaking in due to bad parsing. We haven’t tested it extensively, but if it works well, you could easily move these rules into the global.txt config file and have them applied to everything.
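
To make the effect of those rules concrete, here is a rough Python sketch of the same substitution applied to a made-up snippet. It is only an illustration, not how Full-Text RSS does it internally (Full-Text RSS is written in PHP). The names RULES and apply_rules and the sample markup are hypothetical, and the sketch assumes the second rule pair maps the closing </script> tag to </div>.

```python
# Illustration only: mimics find_string/replace_string pairs as plain,
# case-sensitive substring replacements applied in order.
# RULES and apply_rules are hypothetical names, not part of Full-Text RSS.

RULES = [
    ("<script", '<div style="display:none;"'),  # opening tag becomes a hidden div
    ("</script>", "</div>"),                    # assumed closing-tag rule
]


def apply_rules(html, rules=RULES):
    """Apply each (find, replace) pair to the raw HTML string, in order."""
    for find, replace in rules:
        html = html.replace(find, replace)
    return html


# Made-up sample resembling an inline ad script inside an article body.
sample = (
    "<p>Intro</p>"
    '<script type="text/javascript">google_adnum = 1;</script>'
    "<p>Body</p>"
)

result = apply_rules(sample)
print(result)
```

Since the replacement happens on the raw HTML before parsing, the parser only ever sees a hidden div, and any existing attributes on the script tag simply carry over to it. Note that this sketch is case-sensitive, so an uppercase <SCRIPT tag would not be matched by it.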

Hope that’s some help.