Date XPath matches, but feed does not incorporate the matched date

DASKAjA · August 19, 2020, 2:30pm

I’ve just written a definition for a local newspaper.

https://www.giessener-anzeiger.de/rss/lokales

and the definition is as follows:

title: //article/h1
body: //p[contains(@class, 'articleDetail__caption')] | //article/p[contains(@class, 'articleDetail__teaser')] | //article/div/section/div[contains(@class, 'articleDetail__text')]
date: //meta[@name = 'og:updated_time']/@content

test_url: https://www.giessener-anzeiger.de/lokales/stadt-giessen/nachrichten-giessen/wie-glucksspiel-schuler-aus-giessen-haben-bedenken-vor-ruckkehr-in-unterricht_22094811

From the debug output I can see that it is able to match the date (see: Date matched: 2020-08-14 10:30:00):

* Attempting to extract content
* ... site config for giessener-anzeiger.de.merged found in APCu
* Checking fingerprints...
* No fingerprint matches
* ... site config for global.merged.ex found in APCu
* Appending site config settings from global.txt
* Strings replaced: 0 (find_string and/or replace_string)
* Attempting to parse HTML with html5php
* ** Loading class Readability (readability/Readability.php)
* ** Loading class Masterminds\HTML5 (html5php/HTML5.php)
* Title matched: 
        Wie "Glücksspiel": Schüler aus Gießen haben Bedenken vor Rückkehr in Unterricht
    
* ...XPath match: //article/h1
* Language matched: de
* Extracting Open Graph elements
* Extracting Twiter Card elements
* Date matched: 2020-08-14 10:30:00
* ...XPath match: //meta[@name = 'og:updated_time']/@content
* Body matched
* ...XPath match: //p[contains(@class, 'articleDetail__caption')] | //article/p[contains(@class,     'articleDetail__teaser')] | //article/div/section/div[contains(@class, 'articleDetail__text')]
* 11 body elems found
* Pruning content
* ...element added to body
* Pruning content
* ...element added to body
* Pruning content
* ...element added to body
* Pruning content
* ...element added to body
* Pruning content
* ...element added to body
* Pruning content
* ...element added to body
* Pruning content
* ...element added to body
* Pruning content
* ...element added to body
* Pruning content
* ...element added to body
* Pruning content
* ...element added to body
* Pruning content
* ...element added to body
* Done!

But the pubDate in the resulting feed looks like this:

<pubDate>Tue, 18 Aug 2020 22:00:00 +0000</pubDate>

It looks like it has copied the date from the originating feed, where the date is corrupt. Do I have to specify the date format somewhere? If so, how?

fivefilters · August 20, 2020, 3:17pm

Thanks for the question.

At the moment Full-Text RSS will prioritise the date found in the feed. If you process an article by itself from that domain, you should see that the date you specified gets used in the single-item feed that’s produced.

We do the same for the item title. If we’re processing a feed which has a title, we’ll keep that rather than use the extracted title. In this case, however, we let users override the behaviour with a ‘use_extracted_title’ parameter. When that’s present, the extracted titles replace the original feed item titles.
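
As a rough sketch, a request that switches on this override could be built as shown here. The host name is a placeholder, and both the `url` query parameter and the value `1` are assumptions; all that is stated above is that `use_extracted_title` has to be present.

```php
<?php
// Placeholder base URL of a self-hosted Full-Text RSS install (hypothetical host).
$endpoint = 'https://example.org/full-text-rss/makefulltextfeed.php';

// Feed from the original post.
$feedUrl = 'https://www.giessener-anzeiger.de/rss/lokales';

// Assumption: the feed is passed as `url`; `use_extracted_title` only needs
// to be present, so the value 1 is arbitrary.
$query = http_build_query([
    'url' => $feedUrl,
    'use_extracted_title' => 1,
]);

$requestUrl = $endpoint . '?' . $query;
echo $requestUrl, PHP_EOL;
```

`http_build_query` URL-encodes the feed address, so it can be passed safely as a single query value.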

We haven’t got such a parameter yet for the date, so perhaps it’s something we should add.

If you’re running our self-hosted copy of Full-Text RSS, you can change this behaviour if you open up makefulltextfeed.php and look for the following lines:

if ((int)$item->get_date('U') > 0) {
    $newitem->setDate((int)$item->get_date('U'));
} elseif ($extractor->getDate()) {
    $newitem->setDate($extractor->getDate());
}

If you re-order these, the extracted date will take precedence:

if ($extractor->getDate()) {
    $newitem->setDate($extractor->getDate());
} elseif ((int)$item->get_date('U') > 0) {
    $newitem->setDate((int)$item->get_date('U'));
}
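
If you would rather keep both behaviours available, the same precedence logic could be gated by a switch. The following is only a sketch: `chooseItemDate` and the idea of a ‘use_extracted_date’ parameter are hypothetical and not part of Full-Text RSS, and the dates are passed as plain Unix timestamps for illustration.

```php
<?php
// Hypothetical helper, not part of Full-Text RSS: picks the date for an output item.
// $feedTimestamp:   Unix timestamp from the feed item (0 or less when missing/invalid).
// $extractedDate:   date found via the site config's date XPath, or null if none matched.
// $preferExtracted: stands in for an imagined 'use_extracted_date' request parameter.
function chooseItemDate(int $feedTimestamp, $extractedDate, bool $preferExtracted)
{
    $hasFeedDate = $feedTimestamp > 0;
    $hasExtracted = !empty($extractedDate);

    // With the switch on, the extracted date wins whenever one was found.
    if ($preferExtracted && $hasExtracted) {
        return $extractedDate;
    }
    // Default behaviour: the feed date takes precedence.
    if ($hasFeedDate) {
        return $feedTimestamp;
    }
    // Fall back to the extracted date, or null if neither is available.
    return $hasExtracted ? $extractedDate : null;
}

// Values from this thread: the feed's pubDate and the date matched by the XPath.
$feed = strtotime('Tue, 18 Aug 2020 22:00:00 +0000');
$extracted = strtotime('2020-08-14 10:30:00 UTC');

echo date('c', chooseItemDate($feed, $extracted, false)), PHP_EOL; // feed date
echo date('c', chooseItemDate($feed, $extracted, true)), PHP_EOL;  // extracted date
```

With the switch off this mirrors the original ordering, and with it on it mirrors the re-ordered version.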

Hope that’s some help.

DASKAjA · August 21, 2020, 1:51pm

Thanks, that indeed has helped.