<p> not converted into line breaks

ws420 · October 13, 2019, 8:58pm

Texts containing <p> tags are not converted into line breaks, so the paragraph spacing is lost.
How can I fix this?

example page to extract: http://www.globenewswire.com/news-release/2019/10/11/1928642/0/en/Cronos-Group-Inc-to-Hold-Third-Quarter-2019-Earnings-Conference-Call.html

Original code:
<a href="http://www.lordjones.com/" rel="nofollow" target="_blank" title="www.lordjones.com">www.lordjones.com</a> Forward-Looking Statements This news release contains “forward-looking information” and

Code after passing it through full-text-rss:
<a href="http://www.lordjones.com/" rel="nofollow" target="_blank" title="www.lordjones.com">www.lordjones.com</a>  Forward-Looking Statements This news release contains “forward-looking information” and

fivefilters · October 14, 2019, 11:32am

Looks okay here: http://ftr.fivefilters.org/makefulltextfeed.php?url=http%3A%2F%2Fwww.globenewswire.com%2Fnews-release%2F2019%2F10%2F11%2F1928642%2F0%2Fen%2FCronos-Group-Inc-to-Hold-Third-Quarter-2019-Earnings-Conference-Call.html

Are you using the latest version of Full-Text RSS?

ws420 · October 19, 2019, 6:41am

Yes, I am using the latest version.

I just tested with makefulltextfeed.php and I get the correct result. But I am using extract.php to get the JSON result instead, and I still have the same problem. It seems that the issue happens only when using extract.php.

fivefilters · October 19, 2019, 1:18pm

Thanks for the update. I have been able to reproduce the problem now.

When you use extract.php, we run htmLawed by default to clean up the HTML with its XSS filtering. We do not do this by default for the makefulltextfeed.php output because we assume that output will be handled by a feed reader which will apply its own filtering.

In both cases you can enable/disable that feature by passing the xss query string parameter: &xss=1 to enable or &xss=0 to disable. If you apply this to the makefulltextfeed.php output I linked above you’ll see it produces the problem you described: http://ftr.fivefilters.org/makefulltextfeed.php?xss=1&url=http%3A%2F%2Fwww.globenewswire.com%2Fnews-release%2F2019%2F10%2F11%2F1928642%2F0%2Fen%2FCronos-Group-Inc-to-Hold-Third-Quarter-2019-Earnings-Conference-Call.html

You can disable this in the extract.php output by passing &xss=0.
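For anyone calling extract.php from a script, here is a minimal Python sketch of building the request URL with the xss parameter set explicitly. The base URL is an assumption (extract.php on the same host as the makefulltextfeed.php link above), and build_extract_url is a hypothetical helper name, not part of Full-Text RSS.

```python
from urllib.parse import urlencode

# Assumption: extract.php is served from the same host as makefulltextfeed.php.
# Point this at your own Full-Text RSS installation.
BASE = "http://ftr.fivefilters.org/extract.php"


def build_extract_url(article_url, xss=False):
    """Return an extract.php URL with the xss parameter set explicitly."""
    # urlencode percent-encodes the article URL so it survives as a query value.
    query = urlencode({"xss": 1 if xss else 0, "url": article_url})
    return f"{BASE}?{query}"


article = (
    "http://www.globenewswire.com/news-release/2019/10/11/1928642/0/en/"
    "Cronos-Group-Inc-to-Hold-Third-Quarter-2019-Earnings-Conference-Call.html"
)
print(build_extract_url(article))  # xss=0 is sent unless xss=True is passed
```

Passing xss=0 explicitly, rather than leaving the parameter off, keeps the behaviour the same even though the two endpoints have different defaults.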

I’m not yet sure what’s triggering this kind of cleanup in htmLawed, but my guess is that the document you provided uses an inline <span> element as its main content block, which then contains block-level elements. Usually block-level elements contain inline elements but not the other way around, so htmLawed is probably removing block-level elements which it sees inside inline elements.

Notice that the main article element here is an inline <span> element rather than a block-level <div> or <article>, which is more common.

<span class="article-body" itemprop="articleBody">
   <p align="left">TORONTO, Oct.  11, 2019  (GLOBE NEWSWIRE)...</p>
</span>
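To check whether a page has this structure before sending it through the filter, something like the following Python sketch can flag block-level elements nested inside inline ones. This is not how htmLawed works internally; it is only an illustration using the standard library's html.parser, and the tag sets are small, non-exhaustive lists picked for the example.

```python
from html.parser import HTMLParser

# Illustrative, non-exhaustive tag sets (assumptions, not htmLawed's own lists).
INLINE = {"span", "a", "em", "strong", "b", "i"}
BLOCK = {"p", "div", "ul", "ol", "table", "blockquote", "h1", "h2", "h3"}
VOID = {"br", "img", "hr", "input", "meta", "link"}  # never have a closing tag


class NestingChecker(HTMLParser):
    """Collect (inline_ancestor, block_element) pairs found in the markup."""

    def __init__(self):
        super().__init__()
        self.stack = []     # currently open tags, outermost first
        self.problems = []  # (inline ancestor, block element) pairs

    def handle_starttag(self, tag, attrs):
        if tag in BLOCK:
            # Look for the nearest open inline ancestor, if there is one.
            for open_tag in reversed(self.stack):
                if open_tag in INLINE:
                    self.problems.append((open_tag, tag))
                    break
        if tag not in VOID:
            self.stack.append(tag)

    def handle_endtag(self, tag):
        # Pop back to the matching open tag; ignore stray closing tags.
        if tag in self.stack:
            while self.stack.pop() != tag:
                pass


def find_block_in_inline(html):
    checker = NestingChecker()
    checker.feed(html)
    checker.close()
    return checker.problems


sample = (
    '<span class="article-body" itemprop="articleBody">'
    '<p align="left">TORONTO, Oct.  11, 2019  (GLOBE NEWSWIRE)...</p>'
    "</span>"
)
print(find_block_in_inline(sample))
```

A non-empty result for a page suggests its paragraphs may be affected when xss=1 is in effect.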

I’ve marked this as an issue to investigate: whether what I’ve described above is really what’s causing the removals and, if so, whether we can still apply XSS filtering without this type of cleanup. For the time being, using &xss=0 will make the extract.php output similar to our makefulltextfeed.php output.

fivefilters · October 19, 2019, 1:44pm

I’ve looked into this some more and I can confirm it’s the element structure that’s causing the removal in the extract.php output. We’ll fix this in the next release, but if you’d like to do it yourself, you can open makefulltextfeed.php and look for the following line:

$html = htmLawed::hl($html, array('safe'=>1, 'deny_attribute'=>'style', 'comment'=>1, 'cdata'=>1));

Replace it with the following:

$html = htmLawed::hl($html, array('safe'=>1, 'balance'=>0, 'deny_attribute'=>'style', 'comment'=>1, 'cdata'=>1));

We’ve added 'balance'=>0, which disables htmLawed’s tag balancing, the step that was removing block-level elements appearing inside inline elements.
