Fingerprint sometimes fails to match

I tried some URLs that are based on medium.com but use their own unique domains. The first time I request one, it fails, and even a reload doesn’t work at first (FTR’s cache is deactivated). Then I copied .medium.com.txt to fingerprint.medium.com.txt and it worked. Then I removed this second config and reloaded the FTR result, and the content was still there. Awkward!

Meanwhile I managed to reproduce it with debug active. After several reloads I got a result with no fingerprint match in it. So either the upstream response does not contain the fingerprint, or FTR sometimes fails to locate it.
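To tell those two cases apart, one could fetch the page several times outside FTR and count how often the raw HTML contains the fingerprint fragment. Below is a minimal sketch of that idea. The fragment in `FINGERPRINT` is a hypothetical placeholder (use whatever string your FTR setup maps to fingerprint.medium.com), the function names are my own, and the plain substring test is an assumption rather than FTR's actual matching code. The request headers are the ones shown in the debug log.

```python
import urllib.request
from typing import Iterable, Tuple

# Hypothetical fragment, NOT taken from FTR: replace it with the string
# your own FTR configuration uses for fingerprint.medium.com.
FINGERPRINT = '<meta property="al:ios:app_name" content="Medium"'

# Headers as reported in the FTR debug log.
HEADERS = {
    "User-Agent": "PHP/7.4",
    "Referer": "http://www.google.co.uk/url?sa=t&source=web&cd=1",
}


def fetch_html(url: str, timeout: float = 15.0) -> str:
    """Fetch a URL (following redirects) and return the body as text."""
    request = urllib.request.Request(url, headers=HEADERS)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.read().decode("utf-8", errors="replace")


def fingerprint_hits(pages: Iterable[str],
                     fingerprint: str = FINGERPRINT) -> Tuple[int, int]:
    """Return (hits, total): how many HTML bodies contain the fragment.

    Assumption: a plain, case-sensitive substring test is good enough
    for this check.
    """
    hits = 0
    total = 0
    for html in pages:
        total += 1
        if fingerprint in html:
            hits += 1
    return hits, total
```

Calling `fingerprint_hits(fetch_html(url) for _ in range(10))` would give a rough hit rate. If the fragment is missing from some responses, the upstream is the likely cause. If it is always present, the problem is more likely on the FTR side.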

FTR 3.9.13
with this config: .medium.com.txt

FTR Debug Log
* APC is disabled or not available on server
* Supplied URL: https://fanfare.pub/a-former-tax-accountant-with-severe-adhd-breaks-down-everything-everywhere-all-at-once-b9b99f1e3b6f?gi=031f3a4f2a8b
* ** Loading class HumbleHttpAgent (humble-http-agent/HumbleHttpAgent.php)
* ** Loading class ContentExtractor (content-extractor/ContentExtractor.php)
* ** Loading class SiteConfig (content-extractor/SiteConfig.php)
* --------
* Attempting to process URL as feed
* ** Loading class SimplePie_HumbleHttpAgent (humble-http-agent/SimplePie_HumbleHttpAgent.php)
* ** Loading class DisableSimplePieSanitize (DisableSimplePieSanitize.php)
* Fetching URL (https://fanfare.pub/a-former-tax-accountant-with-severe-adhd-breaks-down-everything-everywhere-all-at-once-b9b99f1e3b6f?gi=031f3a4f2a8b)
* Starting parallel fetch (curl_multi_*)
* Processing set of 1
* ...https://fanfare.pub/a-former-tax-accountant-with-severe-adhd-breaks-down-everything-everywhere-all-at-once-b9b99f1e3b6f?gi=031f3a4f2a8b
* ......adding to pool
* . looking for site config for fanfare.pub in custom folder
* . looking for site config for fanfare.pub in standard folder
* ... no site config match for fanfare.pub
* Cached site config with key fanfare.pub.merged
* Checking fingerprints...
* No fingerprint matches
* . looking for site config for global in custom folder
* . looking for site config for global in standard folder
* ... found site config in standard folder (global.txt)
* Cached site config with key global
* Cached site config with key global.merged.ex
* Appending site config settings from global.txt
* ......user-agent set to: PHP/7.4
* ......referer set to: http://www.google.co.uk/url?sa=t&source=web&cd=1
* Sending request...
* Received responses
* Redirect detected. Valid URL: https://medium.com/m/global-identity-3?redirectUrl=https%3A%2F%2Ffanfare.pub%2Fa-former-tax-accountant-with-severe-adhd-breaks-down-everything-everywhere-all-at-once-b9b99f1e3b6f
* ** Loading class CookieJar (humble-http-agent/CookieJar.php)
* Following redirects #1...
* Starting parallel fetch (curl_multi_*)
* Processing set of 1
* ...https://medium.com/m/global-identity-3?redirectUrl=https%3A%2F%2Ffanfare.pub%2Fa-former-tax-accountant-with-severe-adhd-breaks-down-everything-everywhere-all-at-once-b9b99f1e3b6f
* ......adding to pool
* . looking for site config for medium.com in custom folder
* . looking for site config for medium.com in standard folder
* ... no site config match for medium.com
* Cached site config with key medium.com.merged
* Checking fingerprints...
* No fingerprint matches
* ... site config for global.merged.ex already loaded in this request
* Appending site config settings from global.txt
* ......user-agent set to: PHP/7.4
* ......referer set to: http://www.google.co.uk/url?sa=t&source=web&cd=1
* Sending request...
* Received responses
* ... site config for medium.com.merged already loaded in this request
* Checking fingerprints...
* No fingerprint matches
* ... site config for global.merged.ex already loaded in this request
* Appending site config settings from global.txt
* --------
* Constructing a single-item feed from URL
* ** Loading class FeedWriter (feedwriter/FeedWriter.php)
* --------
* Fetching feed items
* Starting parallel fetch (curl_multi_*)
* Processing set of 1
* ...https://fanfare.pub/a-former-tax-accountant-with-severe-adhd-breaks-down-everything-everywhere-all-at-once-b9b99f1e3b6f?gi=031f3a4f2a8b
* ......in memory
* --------
* Processing feed item 1
* Item URL: https://fanfare.pub/a-former-tax-accountant-with-severe-adhd-breaks-down-everything-everywhere-all-at-once-b9b99f1e3b6f?gi=031f3a4f2a8b
* ** Loading class FeedItem (feedwriter/FeedItem.php)
* URL already fetched - in memory (https://fanfare.pub/a-former-tax-accountant-with-severe-adhd-breaks-down-everything-everywhere-all-at-once-b9b99f1e3b6f?gi=031f3a4f2a8b, effective: https://medium.com/m/global-identity-3?redirectUrl=https%3A%2F%2Ffanfare.pub%2Fa-former-tax-accountant-with-severe-adhd-breaks-down-everything-everywhere-all-at-once-b9b99f1e3b6f)
* Done!
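
When comparing several runs, it can help to reduce each debug log to a few facts: how often fingerprints were checked, how often the check reported no match, and where redirects led. The sketch below only relies on message strings that appear in the log above. The function name and the keys of the returned dict are my own invention, and treating "checks minus misses" as the number of matches is an assumption, since the log above does not show what a successful match looks like.

```python
import re
from typing import Dict, List, Optional, Union

Summary = Dict[str, Union[int, List[str], Optional[str]]]


def summarize_debug_log(log: str) -> Summary:
    """Extract fingerprint and redirect facts from an FTR debug log."""
    checks = log.count("Checking fingerprints...")
    misses = log.count("No fingerprint matches")
    redirects = re.findall(r"Redirect detected\. Valid URL: (\S+)", log)
    # "effective: <url>)" appears on the "URL already fetched" line.
    effective = re.search(r"effective: ([^\s)]+)", log)
    return {
        "fingerprint_checks": checks,
        "fingerprint_misses": misses,
        # Assumption: every check without a "No fingerprint matches"
        # line counts as a match.
        "assumed_matches": checks - misses,
        "redirects": redirects,
        "effective_url": effective.group(1) if effective else None,
    }
```

For the log above this gives three checks and three misses, and the effective URL is the medium.com/m/global-identity-3 redirect page rather than the article itself, which might be worth comparing against a run where the fingerprint does match.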

Hm, maybe there is something special about this site. I just realized that FTR only gets a small part of the article, while wallabag gets a lot more with the same config.

Of course there is a paywall (membership-only), and when viewing the HTML source, the text is incomplete there as well. I don’t know how wallabag manages to get the full content here.

I don’t know if this is somehow related to the fingerprint issue, but I wanted to share my discovery.

Thanks @HolgerAusB, I’ll take a look at this when I have a bit more time. Might be an FTR issue, or maybe the server is responding differently between requests. Will try to investigate as it would be good to fix Medium parsing.

Quick update to say that I can’t see anything unusual with the handling. The full content isn’t in the source, as you say. And Full-Text RSS retrieves what’s available (I tested with your new proposed config). Maybe the mystery is how Wallabag is retrieving more :laughing:

The part about full content vs. half content was clear; that’s on Wallabag. And as it is a kind of paywall/payveil, that is OK. My problem was that my FTR fails to find the fingerprint in about 40-50% of tries. There was, however, also a very small chance that the two problems were related, so I mentioned it.
