Hello, I recently bought a license, but I am facing some issues with line breaks during extraction.
You can reproduce the issue with this post:
As you can see, the <p> line breaks are preserved using makefulltextfeed.php.
But using the same post with extract.php, the line breaks are lost. Here is the output from extract.php: https://pastebin.com/CZPydA22
I tested the solution from the topic “Different results in extract.php vs makefulltextfeed.php”, but it didn’t work for me.
How can I fix this issue?
Thanks for reporting this.
By default, the extract.php endpoint enables XSS filtering, which basically means we run the extracted content through htmLawed with a number of options enabled. I don’t know why that results in <p> elements being removed.
If you use makefulltextfeed.php, we don’t do this additional step, but if you pass &xss=1 as a parameter, you’ll see the same result:
So it’s something we’ll have to look at.
If you want to disable that in extract.php you’ll have to explicitly set xss to 0:
I’ll update again once we’ve had a chance to figure out what’s going on.
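As a rough illustration of the workaround above, here is a minimal Python sketch that builds request URLs with and without the xss parameter. The base URL and the helper name build_request are hypothetical, and the url query parameter is an assumption about how the article address is passed; only the endpoint names and the xss values come from this thread.

```python
from typing import Optional
from urllib.parse import urlencode

# Hypothetical base URL of a self-hosted installation; replace with your own.
BASE = "https://example.com/full-text-rss"


def build_request(endpoint: str, article_url: str, xss: Optional[int] = None) -> str:
    """Build a request URL for an endpoint, optionally setting xss explicitly.

    Assumption: the article address is passed in a `url` query parameter.
    """
    params = {"url": article_url}
    if xss is not None:
        # Only add xss when the caller sets it, so the endpoint default applies otherwise.
        params["xss"] = str(xss)
    return f"{BASE}/{endpoint}?{urlencode(params)}"


# Placeholder article address, used only for illustration.
article = "https://example.org/post?id=1"

# extract.php with XSS filtering explicitly disabled.
print(build_request("extract.php", article, xss=0))

# makefulltextfeed.php with XSS filtering explicitly enabled, to compare results.
print(build_request("makefulltextfeed.php", article, xss=1))
```

Setting xss explicitly on both endpoints makes the two outputs directly comparable, since the defaults differ between them.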
Thanks for the reply,
xss=0 seems to have fixed the issue. I did try it previously and I don’t understand why it didn’t work… maybe just a server-side caching issue.
I have another question unrelated to this topic. Is there a way to extract the author name from the press releases posted on prnewswire.com, globenewswire.com and newswire.ca?
Example using the same press release: https://www.globenewswire.com/news-release/2019/04/09/1801637/0/en/Green-Thumb-Industries-GTI-Announces-Full-Year-2018-Revenue-of-62-5-Million-278-Year-Over-Year-Growth.html
The name of the company can be found in the source code:
<meta name="author" content="Green Thumb Industries">
<meta property="og:article:author" content="Green Thumb Industries">
It’s also in the RSS feed contributor field: https://www.globenewswire.com/RssFeed/orgclass/1/feedTitle/GlobeNewswire%20-%20News%20about%20Public%20Companies
The two other sites (newswire.ca and prnewswire.com) also have the meta author field with the correct author.
Hi, if you have something like this in your HTML…
<meta name="author" content="Green Thumb Industries">
<meta property="og:article:author" content="Green Thumb Industries">
You can create a site config file to target the content attribute as follows:
As for the xss parameter sometimes causing failure, I couldn’t reproduce that, I’m afraid. The URL you supplied worked fine for me with and without the parameter enabled.
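To illustrate what targeting the content attribute of those meta tags means, here is a small Python sketch using only the standard library. It is not the site config itself, and the class and function names are hypothetical; it only shows the selection being described. In XPath terms the same selection would be roughly //meta[@name="author"]/@content, though the exact expression used in the site config is not shown in this thread.

```python
from html.parser import HTMLParser


class MetaAuthorParser(HTMLParser):
    """Collect the content attribute of author-related <meta> tags."""

    # Attribute/value pairs taken from the tags quoted in this thread.
    TARGETS = {("name", "author"), ("property", "og:article:author")}

    def __init__(self):
        super().__init__()
        self.authors = []

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attr_map = dict(attrs)
        for key, value in self.TARGETS:
            # Strip whitespace so a value with a stray trailing space still matches.
            if (attr_map.get(key) or "").strip() == value and attr_map.get("content"):
                self.authors.append(attr_map["content"])


def extract_author(html):
    """Return the first author found in the meta tags, or None."""
    parser = MetaAuthorParser()
    parser.feed(html)
    return parser.authors[0] if parser.authors else None


sample = """<head>
<meta name="author" content="Green Thumb Industries">
<meta property="og:article:author" content="Green Thumb Industries">
</head>"""

print(extract_author(sample))  # prints: Green Thumb Industries
```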