Only half of the article was successfully extracted

Hi Team,

Is there any word/character limit for Full Text Extraction?
I tested this link, and only half of the article is extracted.

https://ftr.fivefilters.net/makefulltextfeed.php?url=https%3A%2F%2Fwww.trendmicro.com%2Fen_us%2Fresearch%2F24%2Fc%2Funveiling-earth-kapre-aka-redcurls-cyberespionage-tactics-with-t.html&max=3

Thank you @harboot for reporting this. It seems they (Trendmicro) changed their layout and are no longer using the subdomain blog.trendmicro.com

I fixed that with a new config.
You may need to wait about an hour until ftr.fivefilters.net has picked up the update.

Hi… I need help again. Same issue.

https://ftr.fivefilters.net/makefulltextfeed.php?step=3&fulltext=1&url=https%3A%2F%2Fwww.cleafy.com%2Fcleafy-labs%2Fon-device-fraud-on-the-rise-exposing-a-recent-copybara-fraud-campaign&max=3&links=preserve&exc=&submit=Create+Feed

I am still trying to understand the config syntax. I don't know how to remove certain parts. For example, in the article above I want to remove the “meet the authors” block, but it's not working:

body: //div[contains(concat(' ',normalize-space(@class),' '),' full ') and (contains(concat(' ',normalize-space(@class),' '),' blog '))]
strip: //div[contains(concat(' ',normalize-space(@class),' '),' author-block ')]

test_url: https://www.cleafy.com/cleafy-labs/on-device-fraud-on-the-rise-exposing-a-recent-copybara-fraud-campaign

@harboot: try to keep the config simple at first. If that doesn't work, you can then make the expressions more specific.

I just uploaded a new config:

body: //div[contains(@class, 'full blog')]

strip_id_or_class: author-block

test_url: https://www.cleafy.com/cleafy-labs/on-device-fraud-on-the-rise-exposing-a-recent-copybara-fraud-campaign

Very often a simple
prune: no
does most of the trick.

But your strip example works too.
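
To illustrate how the short and the long class tests differ, here is a rough Python sketch. It is not part of Full-Text RSS; the helper names and the sample class values are made up for illustration. It only mimics the two XPath expressions as plain string checks: contains(@class, '...') is a substring test on the raw attribute, while the concat/normalize-space form matches one whole class name.

```python
def xpath_contains(class_attr: str, needle: str) -> bool:
    """Mimic XPath contains(@class, needle): a plain substring test."""
    return needle in class_attr


def has_class_token(class_attr: str, token: str) -> bool:
    """Mimic contains(concat(' ', normalize-space(@class), ' '), ' token ')."""
    # Collapse whitespace runs and trim, roughly what normalize-space() does.
    normalized = " ".join(class_attr.split())
    return f" {token} " in f" {normalized} "


# Hypothetical class attribute values, not taken from the Cleafy page.
samples = ["full blog", "blog  full", "author-block", "author-block-wide"]

for value in samples:
    print(
        repr(value),
        "| substring 'full blog':", xpath_contains(value, "full blog"),
        "| tokens full+blog:",
        has_class_token(value, "full") and has_class_token(value, "blog"),
        "| token author-block:", has_class_token(value, "author-block"),
    )
```

The substring form is shorter but depends on the class names appearing in exactly that order and spacing, and it also matches longer names such as author-block-wide. The token form is stricter, which is why it is worth reaching for only when the simple version matches too much or too little.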

If you post code or config examples here in the forum, you should use the ‘preformatted text’ button from the toolbar: </>
