I just tested with makefulltextfeed.php and I get the correct result. But I am using extract.php to get the JSON result instead, and I still have the same problem. It seems that the issue happens only when using extract.php.
Thanks for the update. I have been able to reproduce the problem now.
When you use extract.php, by default we run the HTML through htmLawed to clean it up using its XSS filtering. We do not do this by default for the makefulltextfeed.php output because we assume that output will be handled by a feed reader which will apply its own filtering.
You can disable this in the extract.php output by passing &xss=0.
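As a rough illustration of what such a request might look like, here is a minimal Python sketch that builds an extract.php URL with xss=0 appended. The base address, the example article URL, the helper name build_extract_url and the use of a url query parameter for the article address are all assumptions made for the example; only extract.php and xss=0 come from the discussion above.

```python
from urllib.parse import urlencode


def build_extract_url(base, article_url, xss=False):
    """Build a request URL for extract.php (hypothetical helper).

    With xss=False the query string carries xss=0, which asks
    extract.php to skip the XSS filtering step.
    """
    # "url" as the article parameter name is an assumption here.
    params = {"url": article_url}
    if not xss:
        params["xss"] = "0"
    return f"{base.rstrip('/')}/extract.php?{urlencode(params)}"


# Placeholder addresses; substitute your own installation and article.
request_url = build_extract_url(
    "http://localhost/full-text-rss",
    "https://example.com/article",
)
print(request_url)
```

Only disable the filter if whatever consumes the JSON applies its own sanitising before the HTML is displayed.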
I’m not yet sure what’s triggering this kind of cleanup in htmLawed, but my guess is that the document you provided uses an inline <span> element as its main content block, which then contains block-level elements. Usually block-level elements contain inline elements but not the other way around, so htmLawed is probably removing block-level elements which it sees inside inline elements.
Notice that the main article element here is an inline <span> element rather than a block-level <div> or <article>, which is more common.
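To check whether a page has this kind of structure, a small script along these lines can help. It is only a sketch using Python's standard html.parser: it is not htmLawed's actual logic, and the INLINE, BLOCK and VOID tag sets, the NestingChecker class and the find_block_in_inline function are simplified, hypothetical names chosen for the example.

```python
from html.parser import HTMLParser

# Deliberately small, illustrative tag sets (not the full HTML lists).
INLINE = {"span", "a", "em", "strong", "b", "i"}
BLOCK = {"div", "p", "ul", "ol", "blockquote", "h1", "h2", "h3", "table"}
VOID = {"br", "img", "hr", "input", "meta", "link"}


class NestingChecker(HTMLParser):
    """Record block-level tags opened inside an inline ancestor."""

    def __init__(self):
        super().__init__()
        self.stack = []        # currently open tags, outermost first
        self.violations = []   # (inline ancestor, block tag) pairs

    def handle_starttag(self, tag, attrs):
        if tag in BLOCK:
            # Look for the nearest inline ancestor, if any.
            for ancestor in reversed(self.stack):
                if ancestor in INLINE:
                    self.violations.append((ancestor, tag))
                    break
        if tag not in VOID:
            self.stack.append(tag)

    def handle_endtag(self, tag):
        if tag in self.stack:
            # Pop up to and including the matching open tag.
            while self.stack.pop() != tag:
                pass


def find_block_in_inline(html):
    """Return (inline, block) pairs where a block tag sits inside an inline tag."""
    checker = NestingChecker()
    checker.feed(html)
    checker.close()
    return checker.violations


print(find_block_in_inline('<span class="article"><p>Body text</p></span>'))
```

An empty list means no block-level element was found inside an inline one, at least for the tags listed in the sets above.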
I’ve marked this as an issue to look into: first to confirm that what I’ve described above is really what’s causing the removals and, if it is, to see whether we can still apply XSS filtering without this type of cleanup. For the time being, using &xss=0 will make the output similar to our makefulltextfeed.php output.
I’ve looked into this some more and I can confirm it’s the element structure that’s causing the removal in the extract.php output. We’ll fix this in the next release, but if you’d like to do it yourself, you can open makefulltextfeed.php and look for the following line: