Parsed vs Raw HTML - please clarify


When I extract an article (not feed) that has this raw html markup:
<div class="foo">I'm a div</div>

It is converted to this parsed html:
<p>I'm a div</p>

Is that something that can be turned off?

I have tried both makefulltextfeed.php and extract.php, with parameters like these:

    $params = array(
            'max'=> 1,
            'accept' => 'html',

UPDATE: I see that Readability does this…
Turn all divs that don't have children block level elements into p's

That explains it. I can disable it, but I don’t see any pref for this setting.
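
For anyone else reading, here is a rough sketch of what that rule amounts to. It is written in Python rather than PHP, and it is not Readability's actual code: the function name divs_to_paragraphs and the BLOCK_TAGS list are my own assumptions for illustration, and I drop the attributes only to match the output I saw above.

```python
import xml.etree.ElementTree as ET

# Assumed list of tags treated as "block level" for this sketch;
# the real list used by Readability may differ.
BLOCK_TAGS = {"a", "blockquote", "dl", "div", "img", "ol", "p", "pre", "table", "ul"}


def divs_to_paragraphs(markup):
    """Turn every <div> that has no block-level descendants into a <p>.

    Expects a well-formed HTML/XML fragment (hypothetical helper, not part
    of Full-Text RSS or Readability).
    """
    # Wrap the fragment so it always has a single root element.
    root = ET.fromstring("<root>" + markup + "</root>")
    for el in root.iter("div"):
        has_block_child = any(
            child.tag in BLOCK_TAGS for child in el.iter() if child is not el
        )
        if not has_block_child:
            el.tag = "p"
            # Drop attributes such as class="foo" to mirror the observed output.
            el.attrib.clear()
    return "".join(ET.tostring(child, encoding="unicode") for child in root)


print(divs_to_paragraphs('<div class="foo">I\'m a div</div>'))
```

A div that wraps another div or a list is left alone, since it has block-level content inside it; only the innermost "text only" divs get rewritten.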


Hi there,

Glad you found what was causing it. I haven’t tested this, but I think if you create a site config file for the site you’re interested in and use something like the following, Readability should not be invoked.

body: //div[@class="entry-content"]

Perhaps we should have a setting to override this Readability behaviour.


Yep. You’re right. If I explicitly set the body in a config, it does not invoke Readability (I tested). But I generally only create a config for special cases where the content is not correctly targeted, so a setting would be nice if one wants to override it all the time, as I do.

Thanks for responding!