By default, the extract.php endpoint enables XSS filtering, which means we run the extracted content through htmLawed with a number of options enabled. I don’t know why that results in <p> elements being removed.
If you use makefulltextfeed.php, we don’t do this additional step, but if you pass &xss=1 as a parameter, you’ll see the same result:
xss=0 seems to have fixed the issue. I did try it previously and I don’t understand why it didn’t work… maybe it was just a server-side caching issue.
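For anyone comparing the two behaviours, here is a minimal Python sketch of building the request URL with the xss parameter switched on or off. The host and path in BASE are placeholders, and the helper name build_request_url is made up for illustration; only the url and xss query parameters come from the discussion above.

```python
from urllib.parse import urlencode

# Placeholder install location: replace with your own Full-Text RSS host.
BASE = "https://example.org/full-text-rss/extract.php"

def build_request_url(base, article_url, xss=None):
    """Build a request URL, optionally forcing xss=1 or xss=0.

    When xss is None the parameter is left out, so the endpoint's
    own default applies.
    """
    params = {"url": article_url}
    if xss is not None:
        params["xss"] = "1" if xss else "0"
    return f"{base}?{urlencode(params)}"

# Same article, with filtering explicitly disabled and enabled.
print(build_request_url(BASE, "https://example.com/article", xss=False))
print(build_request_url(BASE, "https://example.com/article", xss=True))
```

Requesting both variants one after the other is a quick way to check whether the filter is really what changes the output, or whether an older cached response is being served.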
I have another question unrelated to this topic. Is there a way to extract the author name from the press releases posted on prnewswire.com, globenewswire.com and newswire.ca?
The name of the company can be found in the source code:
<meta name="author" content="Green Thumb Industries">
<meta property="og:article:author" content="Green Thumb Industries">
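To illustrate what reading those tags could look like outside of Full-Text RSS, here is a rough Python sketch using only the standard library. It is not how Full-Text RSS itself extracts authors; the class and function names (AuthorMetaParser, extract_authors) are made up for this example, and the sample HTML is just the two tags quoted above.

```python
from html.parser import HTMLParser

# Meta keys treated as author hints (taken from the tags quoted above).
AUTHOR_KEYS = {"author", "og:article:author"}

class AuthorMetaParser(HTMLParser):
    """Collects the content of <meta> tags whose name/property is an author key."""

    def __init__(self):
        super().__init__()
        self.authors = []

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attr = {k: (v or "") for k, v in attrs}
        # Some pages carry stray whitespace in the attribute value, so strip it.
        key = (attr.get("name") or attr.get("property") or "").strip().lower()
        content = attr.get("content", "").strip()
        if key in AUTHOR_KEYS and content and content not in self.authors:
            self.authors.append(content)

def extract_authors(html):
    parser = AuthorMetaParser()
    parser.feed(html)
    return parser.authors

sample = """
<head>
  <meta name="author" content="Green Thumb Industries">
  <meta property="og:article:author " content="Green Thumb Industries">
</head>
"""
print(extract_authors(sample))  # ['Green Thumb Industries']
```

Both tags carry the same value here, so the result is de-duplicated to a single name.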
As for the xss parameter sometimes causing failure, I couldn’t reproduce that, I’m afraid. The URL you supplied worked fine for me with and without the parameter enabled.