I’ve just written a definition for a local newspaper.
https://www.giessener-anzeiger.de/rss/lokales
The definition is as follows:
title: //article/h1
body: //p[contains(@class, 'articleDetail__caption')] | //article/p[contains(@class, 'articleDetail__teaser')] | //article/div/section/div[contains(@class, 'articleDetail__text')]
date: //meta[@name = 'og:updated_time']/@content
test_url: https://www.giessener-anzeiger.de/lokales/stadt-giessen/nachrichten-giessen/wie-glucksspiel-schuler-aus-giessen-haben-bedenken-vor-ruckkehr-in-unterricht_22094811
From the debug output I can see that it is able to match the date (see: Date matched: 2020-08-14 10:30:00):
* Attempting to extract content
* ... site config for giessener-anzeiger.de.merged found in APCu
* Checking fingerprints...
* No fingerprint matches
* ... site config for global.merged.ex found in APCu
* Appending site config settings from global.txt
* Strings replaced: 0 (find_string and/or replace_string)
* Attempting to parse HTML with html5php
* ** Loading class Readability (readability/Readability.php)
* ** Loading class Masterminds\HTML5 (html5php/HTML5.php)
* Title matched:
Wie "Glücksspiel": Schüler aus Gießen haben Bedenken vor Rückkehr in Unterricht
* ...XPath match: //article/h1
* Language matched: de
* Extracting Open Graph elements
* Extracting Twiter Card elements
* Date matched: 2020-08-14 10:30:00
* ...XPath match: //meta[@name = 'og:updated_time']/@content
* Body matched
* ...XPath match: //p[contains(@class, 'articleDetail__caption')] | //article/p[contains(@class, 'articleDetail__teaser')] | //article/div/section/div[contains(@class, 'articleDetail__text')]
* 11 body elems found
* Pruning content
* ...element added to body
* Pruning content
* ...element added to body
* Pruning content
* ...element added to body
* Pruning content
* ...element added to body
* Pruning content
* ...element added to body
* Pruning content
* ...element added to body
* Pruning content
* ...element added to body
* Pruning content
* ...element added to body
* Pruning content
* ...element added to body
* Pruning content
* ...element added to body
* Pruning content
* ...element added to body
* Done!
But the pubDate in the resulting feed looks similar to this:
<pubDate>Tue, 18 Aug 2020 22:00:00 +0000</pubDate>
It looks like it has copied the date from the originating feed, where the date is corrupt. Do I have to specify the date format somewhere? If so, how?
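
To make the mismatch concrete, here is a minimal Python sketch of the pubDate I would expect if the matched date were used. This is only my own illustration, not how Full-Text RSS works internally: the helper name to_pub_date is made up, and I am assuming the matched date is in UTC, which the debug output does not confirm.

```python
from datetime import datetime, timezone
from email.utils import format_datetime, parsedate_to_datetime


def to_pub_date(matched: str) -> str:
    """Format a 'YYYY-MM-DD HH:MM:SS' string as an RFC 822 date.

    Assumption: the matched date is in UTC (not confirmed by the debug log).
    """
    parsed = datetime.strptime(matched, "%Y-%m-%d %H:%M:%S")
    return format_datetime(parsed.replace(tzinfo=timezone.utc))


# Date reported as matched in the debug output.
expected = to_pub_date("2020-08-14 10:30:00")
# pubDate that actually appears in the resulting feed.
actual = "Tue, 18 Aug 2020 22:00:00 +0000"

print(expected)  # Fri, 14 Aug 2020 10:30:00 +0000
# Gap between the feed's pubDate and the matched date.
print(parsedate_to_datetime(actual) - parsedate_to_datetime(expected))
```

The two values are several days apart, so this does not look like a mere time zone shift of the matched date.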