Both Gizmodo and Lifehacker extractions are failing

naveenjn · August 6, 2021, 2:52am

As you know, both are sister sites and use identical HTML. This is the site pattern FTR defines for Gizmodo:

title: //head/title
author: //meta[@name="author"]/@content
body: //div[contains(@class, 'post-content')]
strip: //div[contains(@class, 'content-summary')]

strip: //aside

test_url: https://gizmodo.com/the-new-bad-tick-is-going-to-take-over-half-the-united-1831079855

But it fails to extract the full content from the page. It also fails on the test URL listed above. Could you please update the site pattern for both Gizmodo and Lifehacker?
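For anyone debugging a pattern like this, the following is a minimal Python sketch of what the `title`, `author`, `body` and `strip` rules above are meant to do. It is not FTR's code: the sample markup, `has_class_fragment` and `extract` are hypothetical, and the XPath `contains(@class, ...)` test is imitated with a plain substring check because the standard library's ElementTree does not support it.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed sample page; real Gizmodo markup is more complex.
SAMPLE = """<html><head><title>Tick story</title>
<meta name="author" content="Jane Doe"/></head>
<body>
<div class="js_post-content post-content">
<div class="content-summary">Summary blurb</div>
<p>First paragraph.</p>
<aside>Related links</aside>
<p>Second paragraph.</p>
</div>
</body></html>"""


def has_class_fragment(elem, fragment):
    """Imitate XPath contains(@class, fragment) with a substring test."""
    return fragment in elem.get("class", "")


def extract(page):
    root = ET.fromstring(page)
    # title: //head/title
    title = root.findtext("./head/title")
    # author: //meta[@name="author"]/@content
    author = root.find("./head/meta[@name='author']").get("content")
    # body: //div[contains(@class, 'post-content')]
    body = next(d for d in root.iter("div")
                if has_class_fragment(d, "post-content"))
    # strip: //div[contains(@class, 'content-summary')] and strip: //aside
    for parent in list(body.iter()):
        for child in list(parent):
            is_summary = (child.tag == "div"
                          and has_class_fragment(child, "content-summary"))
            if child.tag == "aside" or is_summary:
                parent.remove(child)
    paragraphs = [p.text for p in body.iter("p")]
    return title, author, paragraphs


print(extract(SAMPLE))
```

If the `body` rule matches nothing on the live page, for example because the class name changed or the server returned different HTML, there is no element to strip from and extraction falls short, which is consistent with the symptom described above.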

fivefilters · August 7, 2021, 11:55pm

Thanks for letting us know. We’ll take a look. It might be a server-side retrieval issue. We’ll update here when we know more.

fivefilters · August 9, 2021, 3:31pm

Can you please try updating your site config files and see if you have any luck? We’ve updated both gizmodo.com.txt and lifehacker.com.txt.

github.com/fivefilters/ftr-site-config/blob/master/gizmodo.com.txt

title: //head/title
author: //meta[@name="author"]/@content
body: //div[contains(@class, 'post-content')]
strip: //div[contains(@class, 'content-summary')]

strip_id_or_class: magnifier
strip: //svg
strip_id_or_class: js_commerce-inset-permalink
strip_id_or_class: ad-commerce

parser: libxml

prune: no

strip: //aside

test_url: https://gizmodo.com/the-new-bad-tick-is-going-to-take-over-half-the-united-1831079855

github.com/fivefilters/ftr-site-config/blob/master/lifehacker.com.txt

# Changes might need to be made to .lifehacker.com.txt, gizmodo.com.txt too

title: //head/title
author: //meta[@name="author"]/@content
body: //div[contains(@class, 'post-content')]
strip: //div[contains(@class, 'content-summary')]

strip_id_or_class: magnifier
strip: //svg
strip: //aside
strip_id_or_class: js_commerce-inset-permalink
strip_id_or_class: ad-commerce

parser: libxml

prune: no

test_url: https://lifehacker.com/find-the-cheapest-destinations-for-that-last-minute-tri-1831047448
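To compare a local custom pattern against the updated files, it can help to load each one into a simple structure first. Below is a rough Python sketch that groups the `directive: value` lines by directive name. The function name `parse_site_config` is hypothetical, and the parsing rules (skip blank lines and `#` comments, split on the first colon) are an assumption based on the files shown in this thread, not a description of FTR's own parser.

```python
# Copy of the updated gizmodo.com.txt as posted in this thread.
CONFIG = """\
title: //head/title
author: //meta[@name="author"]/@content
body: //div[contains(@class, 'post-content')]
strip: //div[contains(@class, 'content-summary')]

strip_id_or_class: magnifier
strip: //svg
strip_id_or_class: js_commerce-inset-permalink
strip_id_or_class: ad-commerce

parser: libxml

prune: no

strip: //aside

test_url: https://gizmodo.com/the-new-bad-tick-is-going-to-take-over-half-the-united-1831079855
"""


def parse_site_config(text):
    """Group 'directive: value' lines by directive name.

    Assumed rules: blank lines and lines starting with '#' are ignored,
    and each remaining line is split on its first colon only, so URLs
    and XPath expressions in the value are left intact.
    """
    directives = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue
        name, sep, value = line.partition(":")
        if not sep:
            continue  # not a 'directive: value' line
        directives.setdefault(name.strip(), []).append(value.strip())
    return directives


config = parse_site_config(CONFIG)
for name, values in config.items():
    print(name, values)
```

Running the same function over an older local copy and comparing the two dictionaries makes the additions visible at a glance: the `strip_id_or_class` rules, `strip: //svg`, `parser: libxml` and `prune: no`.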

naveenjn · August 10, 2021, 2:50am

Thank you. I deleted the custom patterns I had created and updated the standard patterns from the server. Both are working fine now.

BTW, aren’t standard patterns included for .gizmodo.com.txt and .lifehacker.com.txt (starting with a dot, for subdomains)?

fivefilters · August 10, 2021, 9:58am

Good to hear!

We used to have .gizmodo.com.txt and .lifehacker.com.txt to match subdomains like io9.gizmodo.com. These now all appear to redirect to the main gizmodo.com / lifehacker.com sites, so we removed those site config files as they don’t appear to be needed any more.
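As a rough illustration of the naming convention discussed here, the Python sketch below lists the config file names that could apply to a given host: the exact host name first, then dot-prefixed names for each parent domain. The function `candidate_config_files` is hypothetical, and the lookup order is an assumption made for illustration; FTR's actual matching logic may differ.

```python
def candidate_config_files(host):
    """Return possible site config file names for a host.

    Assumed order: exact match first, then dot-prefixed (subdomain)
    files for each parent domain, stopping before the bare TLD.
    """
    host = host.lower()
    names = [host + ".txt"]
    parts = host.split(".")
    for i in range(1, len(parts) - 1):
        names.append("." + ".".join(parts[i:]) + ".txt")
    return names


print(candidate_config_files("io9.gizmodo.com"))
print(candidate_config_files("gizmodo.com"))
```

Under this assumption, a request to io9.gizmodo.com would only need .gizmodo.com.txt if that subdomain still served articles itself. Once it redirects to gizmodo.com, the plain gizmodo.com.txt file is the only one consulted.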