<p> tag is lost in extract.php but not makefulltextfeed.php


#1

Hello, I recently bought a license, but I am facing some issues with line breaks during extraction.

You can reproduce the issue with this post:
http://ftr.fivefilters.org/makefulltextfeed.php?url=https%3A%2F%2Fwww.globenewswire.com%2Fnews-release%2F2019%2F04%2F09%2F1801637%2F0%2Fen%2FGreen-Thumb-Industries-GTI-Announces-Full-Year-2018-Revenue-of-62-5-Million-278-Year-Over-Year-Growth.html&max=3

As you can see, the <p> line breaks are preserved when using makefulltextfeed.php.

But with the same post in extract.php, the line breaks are lost. Here is the output from extract.php: https://pastebin.com/CZPydA22

I tested the solution from the topic "Different results in extract.php vs makefulltextfeed.php", but it didn’t work for me.

How can I fix this issue?

Thanks


#2

Hi there,

Thanks for reporting this.

By default, the extract.php endpoint enables XSS filtering, which basically means we run the extracted content through htmLawed with a number of options enabled. I don’t know why that results in <p> elements being removed.

If you use makefulltextfeed.php, we don’t do this additional step, but if you pass &xss=1 as a parameter, you’ll see the same result:

http://ftr.fivefilters.org/makefulltextfeed.php?xss=1&url=https%3A%2F%2Fwww.globenewswire.com%2Fnews-release%2F2019%2F04%2F09%2F1801637%2F0%2Fen%2FGreen-Thumb-Industries-GTI-Announces-Full-Year-2018-Revenue-of-62-5-Million-278-Year-Over-Year-Growth.html&max=3

So it’s something we’ll have to look at.

If you want to disable that in extract.php you’ll have to explicitly set xss to 0:

...extract.php?xss=0&url=...

I’ll update again once we’ve had a chance to figure out what’s going on.
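For anyone scripting these requests, here is a minimal Python sketch of building an extract.php URL with xss set explicitly to 0, as described above. The endpoint address and the helper name `build_extract_url` are placeholders for illustration, not part of Full-Text RSS; only the `xss` and `url` parameters come from this thread.

```python
from urllib.parse import urlencode

# Hypothetical address: point this at your own copy of extract.php.
EXTRACT_ENDPOINT = "http://example.com/full-text-rss/extract.php"


def build_extract_url(article_url, xss=False, endpoint=EXTRACT_ENDPOINT):
    """Return an extract.php request URL with the xss parameter always spelled out."""
    params = {
        # extract.php filters by default, so send 0 explicitly to turn it off.
        "xss": "1" if xss else "0",
        # urlencode percent-encodes the article URL for use in a query string.
        "url": article_url,
    }
    return endpoint + "?" + urlencode(params)


if __name__ == "__main__":
    article = (
        "https://www.globenewswire.com/news-release/2019/04/09/1801637/0/en/"
        "Green-Thumb-Industries-GTI-Announces-Full-Year-2018-Revenue-of-62-5-"
        "Million-278-Year-Over-Year-Growth.html"
    )
    print(build_extract_url(article))
```

Passing `xss=True` instead should reproduce the filtered output, which makes it easy to compare the two responses side by side.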


#3

Thanks for the reply,

xss=0 seems to have fixed the issue. I did try it previously and I don’t understand why it didn’t work… maybe it was just a server-side caching issue.

I have another question unrelated to this topic. Is there a way to extract the author name from the press releases posted on prnewswire.com, globenewswire.com and newswire.ca?

Example using the same press release: https://www.globenewswire.com/news-release/2019/04/09/1801637/0/en/Green-Thumb-Industries-GTI-Announces-Full-Year-2018-Revenue-of-62-5-Million-278-Year-Over-Year-Growth.html

The name of the company can be found in the source code:
<meta name="author" content="Green Thumb Industries">
<meta property="og:article:author" content="Green Thumb Industries">

It’s also in the RSS feed contributor field: https://www.globenewswire.com/RssFeed/orgclass/1/feedTitle/GlobeNewswire%20-%20News%20about%20Public%20Companies

The other two sites (newswire.ca and prnewswire.com) also have the meta author field with the correct author.


#4

I’ve found that xss=0 sometimes causes content extraction to fail.

Example with this post: https://www.newswire.ca/news-releases/khiron-appointments-larry-holifield-former-u-s-dea-regional-director-mexico-and-central-america-as-khiron-security-and-compliance-director-mexico-joins-chief-compliance-officer-matt-murphy-to-expand-company-s-leadership-in-compliance-and-se-845959066.html

I get the content without xss=0, but if I use xss=0 the content is empty.

Oh, and I ended up writing my own PHP function to get the author meta tags, so I found a workaround.


#5

Hi, if you have something like this in your HTML…

<meta name="author" content="Green Thumb Industries">
<meta property="og:article:author" content="Green Thumb Industries">

You can create a site config file to target the content attribute as follows:

author: //meta[@name="author"]/@content

or

author: //meta[@property="og:article:author"]/@content
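If you want to check which values those expressions would pick up from a page before writing the config, a rough Python sketch like the one below can help. It does not evaluate XPath; it approximates the two expressions with the standard library’s `html.parser`, and it is slightly more lenient than the XPath (it ignores case and stray whitespace in the attribute value). The names `AuthorMetaParser` and `extract_authors` are made up for illustration.

```python
from html.parser import HTMLParser


class AuthorMetaParser(HTMLParser):
    """Collect the content attribute of <meta> tags that identify an author."""

    def __init__(self):
        super().__init__()
        self.authors = []

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attributes = dict(attrs)
        # Normalise so that a value like "og:article:author " still matches.
        name = (attributes.get("name") or "").strip().lower()
        prop = (attributes.get("property") or "").strip().lower()
        if name == "author" or prop == "og:article:author":
            content = (attributes.get("content") or "").strip()
            # Skip empty values and duplicates across the two tags.
            if content and content not in self.authors:
                self.authors.append(content)


def extract_authors(html):
    """Return the distinct author names found in the page's meta tags."""
    parser = AuthorMetaParser()
    parser.feed(html)
    return parser.authors


if __name__ == "__main__":
    sample = (
        '<html><head>'
        '<meta name="author" content="Green Thumb Industries">'
        '<meta property="og:article:author" content="Green Thumb Industries">'
        '</head><body><p>Press release text</p></body></html>'
    )
    print(extract_authors(sample))
```

Feeding it the saved HTML of a press release should show whether the author tags are present at all, which is worth knowing before debugging the site config itself.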

As for the xss parameter sometimes causing failure, I couldn’t reproduce that, I’m afraid. The URL you supplied worked fine for me both with and without the parameter enabled.