Confusing Readability result - div removed


#1

I am trying to better understand how Readability (?) is handling content extraction. Here is a test link:
http://ftr.fivefilters.org/makefulltextfeed.php?url=http://www.cloozle.com/testing/link1.html

If you look at the source of the page passed in the url param, you will notice that div #2 is missing from the parsedhtml, though it is present in the rawhtml. It’s identical to div #1, except that it has a smaller amount of text. It’s identical to #3, except that #3 uses a p tag, not a div.

Can someone shed some light on what is going on here, and whether this can be rendered without losing that div? I suspect I might be able to make a custom config file, but I’d prefer something that does not require this.

Thanks!


#2

The Readability code we use is based on the work done by Arc90. It’s similar to the code used by the ‘Reader’ mode features of browsers (in fact, Apple borrowed the same code for use in Safari when they first introduced the feature). It works by using a set of heuristics to score elements and prune those it thinks are unlikely to be content. For example, elements that contain a lot more link text than plain text won’t score as well. You’d have to study the code to find out why it’s removing that particular element.
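To give a rough idea of what a link-based heuristic like this looks like, here is a minimal Python sketch of a "link density" measure: the share of an element's text that sits inside links. It is only an illustration, not the actual Readability code. The names `LinkDensityParser` and `link_density` are made up for this example, and the real scoring combines several more signals.

```python
from html.parser import HTMLParser


class LinkDensityParser(HTMLParser):
    """Counts text characters inside and outside <a> tags (illustrative only)."""

    def __init__(self):
        super().__init__()
        self.anchor_depth = 0   # > 0 while we are inside an <a> element
        self.link_chars = 0     # characters of text found inside links
        self.total_chars = 0    # characters of text found anywhere

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self.anchor_depth += 1

    def handle_endtag(self, tag):
        if tag == "a" and self.anchor_depth > 0:
            self.anchor_depth -= 1

    def handle_data(self, data):
        # Ignore surrounding whitespace so indentation does not skew the ratio.
        n = len(data.strip())
        self.total_chars += n
        if self.anchor_depth > 0:
            self.link_chars += n


def link_density(html_fragment):
    """Return the fraction of text in html_fragment that is link text (0.0 to 1.0)."""
    parser = LinkDensityParser()
    parser.feed(html_fragment)
    parser.close()
    if parser.total_chars == 0:
        return 0.0
    return parser.link_chars / parser.total_chars
```

A fragment such as `<div>abcd<a href="x">efgh</a></div>` has half of its text inside a link, so its density is 0.5. An extractor built on this idea would treat elements with a high ratio as less likely to be article content.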

Often, when the article is being detected correctly but elements that should be preserved are being removed, you can write a site config file to disable pruning. Your site config file could contain just one line:

prune: no

It’s worth trying this to see if it helps.


#3

I’ve started to dissect Readability and its logic more closely. prune: no is something I already set globally, so it won’t help in this case, but I’m sure I can find the line that does. I may just leave it as is. We’ve been using this tool unchanged for a while now and I’ve not had any complaints about missing important markup, so I don’t want to update it based on an edge case.

Thanks for the reply.


#4

Thanks for the update. Glad to hear it’s otherwise working as expected.