As you know, both are sister sites and use identical HTML markup. This is the site pattern FTR defines for Gizmodo:
title: //head/title
author: //meta[@name="author"]/@content
body: //div[contains(@class, 'post-content')]
strip: //div[contains(@class, 'content-summary')]
strip: //aside
test_url: https://gizmodo.com/the-new-bad-tick-is-going-to-take-over-half-the-united-1831079855
But it fails to extract the full content from the page, and it also fails on the test URL listed in the pattern. Could you please update the site pattern for both Gizmodo and Lifehacker?
Thanks for letting us know. We’ll take a look. It might be a server-retrieval issue. We’ll update here when we know more.
Can you please try updating your site config files and see if you have any luck? We’ve updated both gizmodo.com.txt and lifehacker.com.txt.
title: //head/title
author: //meta[@name="author"]/@content
body: //div[contains(@class, 'post-content')]
strip: //div[contains(@class, 'content-summary')]
strip_id_or_class: magnifier
strip: //svg
strip_id_or_class: js_commerce-inset-permalink
strip_id_or_class: ad-commerce
parser: libxml
prune: no
strip: //aside
test_url: https://gizmodo.com/the-new-bad-tick-is-going-to-take-over-half-the-united-1831079855
# Changes might need to be made to .lifehacker.com.txt, gizmodo.com.txt too
title: //head/title
author: //meta[@name="author"]/@content
body: //div[contains(@class, 'post-content')]
strip: //div[contains(@class, 'content-summary')]
strip_id_or_class: magnifier
strip: //svg
strip: //aside
strip_id_or_class: js_commerce-inset-permalink
strip_id_or_class: ad-commerce
parser: libxml
prune: no
test_url: https://lifehacker.com/find-the-cheapest-destinations-for-that-last-minute-tri-1831047448
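For anyone comparing a local custom pattern against the updated ones, here is a minimal Python sketch of reading such a file into its directives. It is only an illustration based on the `key: value` layout of the configs quoted above, not FTR's actual parser; the function name `parse_site_config` and the `SAMPLE` text are made up for this example.

```python
from collections import defaultdict


def parse_site_config(text):
    """Parse 'key: value' lines into a dict mapping each key to a list of values.

    Assumptions (inferred from the configs quoted in this thread):
    - lines starting with '#' are comments,
    - a directive such as 'strip' may repeat, so values are kept in a list,
    - only the first ':' separates key from value (URLs and XPath stay intact).
    """
    directives = defaultdict(list)
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        key, sep, value = line.partition(":")
        if not sep:
            continue  # not a 'key: value' line
        directives[key.strip()].append(value.strip())
    return dict(directives)


# Hypothetical sample, shortened from the Gizmodo config above.
SAMPLE = """\
title: //head/title
body: //div[contains(@class, 'post-content')]
strip: //div[contains(@class, 'content-summary')]
strip: //svg
strip_id_or_class: magnifier
parser: libxml
prune: no
# a comment line
test_url: https://gizmodo.com/the-new-bad-tick-is-going-to-take-over-half-the-united-1831079855
"""

config = parse_site_config(SAMPLE)
for key, values in config.items():
    print(key, values)
```

Keeping every value in a list makes it easy to diff two configs, for example to see which `strip` rules the updated file adds.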
Thank you. I deleted the custom patterns I created and updated the standard patterns from the server. Both are working fine now.
BTW, aren’t standard patterns included for .gizmodo.com.txt and .lifehacker.com.txt (starting with a dot for subdomains)?
Good to hear!
We used to have .gizmodo.com.txt and .lifehacker.com.txt to match subdomains like io9.gizmodo.com. These now all appear to redirect to the main gizmodo.com / lifehacker.com sites, so we removed those site config files as they don’t appear to be needed any more.
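To illustrate the naming convention, here is a rough Python sketch of how a hostname could map to candidate config file names: the exact host first, then dot-prefixed parent domains. The lookup order and the function name `candidate_config_files` are assumptions made for this example, not FTR's actual code.

```python
def candidate_config_files(host):
    """Return config file names to try for a host, most specific first.

    Assumption: an exact-host file is preferred, and a file starting
    with '.' acts as a wildcard for subdomains of that domain.
    """
    host = host.lower().rstrip(".")
    labels = host.split(".")
    candidates = [host + ".txt"]
    # Walk up through parent domains, stopping before the bare TLD.
    for i in range(1, len(labels) - 1):
        candidates.append("." + ".".join(labels[i:]) + ".txt")
    return candidates


print(candidate_config_files("io9.gizmodo.com"))
print(candidate_config_files("gizmodo.com"))
```

Under these assumptions, a subdomain such as io9.gizmodo.com would fall back to .gizmodo.com.txt, while gizmodo.com itself only needs gizmodo.com.txt, which fits the reason given for dropping the dot-prefixed files.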