Parsed vs Raw HTML - please clarify

drpudding · April 9, 2019, 3:54pm

When I extract an article (not feed) that has this raw html markup:
<div class="foo">I'm a div</div>

It is converted to this parsed html:
<p>I'm a div</p>

Is that something that can be turned off?

I have tried both makefulltextfeed.php and extract.php, with these parameters:

    $params = array(
        'format'  => 'json',
        'max'     => 1,
        'accept'  => 'html',
        'summary' => 1,
        'url'     => $article_url,
    );

http://ftr.fivefilters.org/makefulltextfeed.php?max=1&url=http://www.cloozle.com/testing/ckeditor-tester.html
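
For reference, here is a minimal sketch of how a parameter array like the one above might be turned into a request URL. The helper name `build_ftr_url` is hypothetical. The endpoint and parameter names are just the ones quoted in this post. The sketch only builds the URL and does not send the request.

```php
<?php
// Hypothetical helper (not part of Full-Text RSS): joins an endpoint and a
// parameter array into a single GET URL.
function build_ftr_url(string $endpoint, array $params): string
{
    // http_build_query() URL-encodes every value, including the article URL.
    return $endpoint . '?' . http_build_query($params);
}

$article_url = 'http://www.cloozle.com/testing/ckeditor-tester.html';

$params = array(
    'format'  => 'json',
    'max'     => 1,
    'accept'  => 'html',
    'summary' => 1,
    'url'     => $article_url,
);

$request_url = build_ftr_url('http://ftr.fivefilters.org/makefulltextfeed.php', $params);
echo $request_url, PHP_EOL;
```

Encoding the article URL this way keeps any query string inside it from being mistaken for parameters of the outer request.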

UPDATE: I see that Readability does this:
"Turn all divs that don't have children block level elements into p's"

That explains it. Can I disable it? I don’t see any pref for this setting.
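
To illustrate the behaviour described in that comment, here is a rough sketch using PHP's DOM extension. It is not the actual Readability code. The function names and the list of block-level tags are assumptions made for the example. It only shows why a div with no block-level children ends up as a p without its class attribute.

```php
<?php
// Tags treated as block-level here; this list is an assumption for the sketch.
const BLOCK_TAGS = ['blockquote', 'div', 'dl', 'img', 'ol', 'p', 'pre', 'table', 'ul'];

// True if any descendant element of $el is in BLOCK_TAGS.
function has_block_descendant(DOMElement $el): bool
{
    foreach ($el->getElementsByTagName('*') as $child) {
        if (in_array(strtolower($child->nodeName), BLOCK_TAGS, true)) {
            return true;
        }
    }
    return false;
}

// Replaces every div that has no block-level descendants with a plain <p>.
function divs_to_paragraphs(string $html): string
{
    libxml_use_internal_errors(true); // silence warnings about loose markup
    $doc = new DOMDocument();
    $doc->loadHTML($html);
    libxml_clear_errors();

    // Copy the live node list first, because the tree is modified below.
    $divs = iterator_to_array($doc->getElementsByTagName('div'), false);
    foreach ($divs as $div) {
        if (has_block_descendant($div)) {
            continue;
        }
        $p = $doc->createElement('p'); // new element, so attributes are dropped
        while ($div->firstChild) {
            $p->appendChild($div->firstChild); // appendChild moves the node
        }
        $div->parentNode->replaceChild($p, $div);
    }

    // Serialise only the children of <body>, not the whole document.
    $out = '';
    foreach ($doc->getElementsByTagName('body')->item(0)->childNodes as $node) {
        $out .= $doc->saveHTML($node);
    }
    return $out;
}

echo divs_to_paragraphs('<div class="foo">I\'m a div</div>'), PHP_EOL;
```

A div that wraps another block element, such as a nested p, is left alone in this sketch, and only the "leaf" divs are rewritten.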

fivefilters · April 10, 2019, 6:51pm

Hi there,

Glad you found what was causing it. I haven’t tested this, but I think if you create a site config file for the site you’re interested in and use something like the following, Readability should not be invoked.

body: //div[@class="entry-content"]
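
As an illustration of what an XPath expression like that selects, here is a small sketch using PHP's DOMXPath. It does not reproduce how Full-Text RSS processes site config files. The helper name `select_body` and the sample page are made up for the example.

```php
<?php
// Hypothetical helper: returns the outer HTML of the first node matched by
// $xpath, or null when nothing matches.
function select_body(string $html, string $xpath): ?string
{
    libxml_use_internal_errors(true); // silence warnings about loose markup
    $doc = new DOMDocument();
    $doc->loadHTML($html);
    libxml_clear_errors();

    $nodes = (new DOMXPath($doc))->query($xpath);
    if ($nodes === false || $nodes->length === 0) {
        return null;
    }
    return $doc->saveHTML($nodes->item(0));
}

// Made-up sample page: the inner div sits inside the selected body element.
$page = '<html><body><div class="entry-content">'
      . '<div class="foo">I\'m a div</div>'
      . '</div></body></html>';

$body = select_body($page, '//div[@class="entry-content"]');
echo $body, PHP_EOL;
```

Because the matched node is serialised as-is in this sketch, the inner div keeps its tag name and class attribute.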

Perhaps we should have a setting to override this Readability behaviour.

drpudding · April 10, 2019, 9:32pm

Yep. You’re right. If I explicitly set the body in a config, it does not invoke Readability (I tested). But I generally only create a config for special cases where the content is not correctly targeted, so a setting would be nice if one wants to override it all the time, as I do.

Thanks for responding!