[solved] Feed is no longer created with v2.3

Since version 2.3 of Feed Creator (self-hosted AND fivefilters.org), the following no longer works. It worked before v2.3:

index.php?url=https%3A%2F%2Fwww.nordmainische-s-bahn.de%2Fhome.html&item=div.pgSlide&item_title=h3&item_date=time&feed_title=Nordmainische+S-Bahn&max=5&order=document&guid=0

and I don’t know why. Alternative item selectors do not work either:
.layout_latest
.pgSlide

Same problem with:
index.php?url=https%3A%2F%2Fwww.knoten-stadion.de%2Fbaustellen-journal.html&item=.layout_latest&item_title=h3&item_desc=.teaser&item_date=time&feed_title=Knoten+Stadion+Journal&max=3&order=document&guid=0

Other German railway construction sites with nearly the same HTML code are working FINE:

index.php?url=https%3A%2F%2Fwww.fernbahntunnel-frankfurt.de%2Fmeldungen.html&item=.layout_latest&item_title=h3&item_desc=.teaser&item_date=time&feed_title=Fernbahntunnel+Frankfurt&max=3&order=document&guid=0

index.php?url=https%3A%2F%2Fwww.frmplus.de%2Faktuelles.html&item=.layout_latest&item_title=h2&item_desc=.teaser&item_date=time&feed_title=FRM+Plus&max=3&order=document&guid=0

index.php?url=https%3A%2F%2Fwww.mein-hbf-ffm.de%2Fhome.html&item=.pgSlide&item_title=h3&item_desc=.teaser&item_date=time&feed_title=Masterplan+Ffm+Hbf&max=3&order=document&guid=0

index.php?url=https%3A%2F%2Fwww.riedbahn.de%2Fhome.html&item=.layout_latest&item_title=h3&item_desc=.teaser&item_date=time&feed_title=Riedbahn&max=3

index.php?url=https%3A%2F%2Fhanau-wuerzburg-fulda.de%2Fhome.html&item=.pgSlide&item_title=h3&item_desc=.teaser&item_date=time&feed_title=Hanau+-+W%C3%BCrzburg+-+Fulda&max=3&order=document&guid=0

index.php?url=https%3A%2F%2Fwww.s6-frankfurt-friedberg.de%2Fhome.html&item=.pgSlide&item_title=h3&item_desc=.teaser&item_date=time&feed_title=S6+Frankfurt+-+Friedberg&max=3&order=document&guid=0

The last one, below, is the main news page of knoten-stadion, which works. The faulty subpage baustellen-journal.html is listed further above.

index.php?url=https%3A%2F%2Fwww.knoten-stadion.de%2Fmeldungen.html&item=.layout_latest&item_title=h3&item_desc=.teaser&item_date=time&feed_title=Knoten+Stadion&max=3&order=document&guid=0&strip_if_url%5B%5D=baustellen-journal.html

Thanks for the report, Holger. Looking into it…

I may have found a clue. An image seems to be embedded incorrectly on the source page, inside the div.layout_latest element. Maybe this is the reason:

<div
  class="backpicDiv"
  style="background: url(files/page/02_Aktuelles/01_AI_Meldung/2024/20240611_Infomobil_Maintal/20240611_NMS_Infomobil_Maintal_1.jpg) 50% 50% no-repeat; background-size: cover; height: 250px")">
&nbsp;
</div>

The style attribute ends with an IMHO illegal ")"> instead of just ">
So the parser can’t detect the end of the div’s opening tag.

I sent a bug report to the source site and they reacted quickly and fixed it. But FC is still not creating the feed, so that wasn’t the problem.

Now I found out that curl (Debian Trixie) throws an error and the HTML is incomplete for both pages:

curl -o test.txt https://www.nordmainische-s-bahn.de/home.html -v

result:
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
  0     0    0     0    0     0      0      0 --:--:-- --:--:-- --:--:--     0* Host www.nordmainische-s-bahn.de:443 was resolved.
* IPv6: (none)
* IPv4: 84.38.79.51
*   Trying 84.38.79.51:443...
* Connected to www.nordmainische-s-bahn.de (84.38.79.51) port 443
* ALPN: curl offers h2,http/1.1
} [5 bytes data]
* TLSv1.3 (OUT), TLS handshake, Client hello (1):
} [512 bytes data]
*  CAfile: /etc/ssl/certs/ca-certificates.crt
*  CApath: /etc/ssl/certs
{ [5 bytes data]
* TLSv1.3 (IN), TLS handshake, Server hello (2):
{ [122 bytes data]
* TLSv1.3 (IN), TLS handshake, Encrypted Extensions (8):
{ [19 bytes data]
* TLSv1.3 (IN), TLS handshake, Certificate (11):
{ [2867 bytes data]
* TLSv1.3 (IN), TLS handshake, CERT verify (15):
{ [520 bytes data]
* TLSv1.3 (IN), TLS handshake, Finished (20):
{ [52 bytes data]
* TLSv1.3 (OUT), TLS change cipher, Change cipher spec (1):
} [1 bytes data]
* TLSv1.3 (OUT), TLS handshake, Finished (20):
} [52 bytes data]
* SSL connection using TLSv1.3 / TLS_AES_256_GCM_SHA384 / x25519 / RSASSA-PSS
* ALPN: server accepted h2
* Server certificate:
*  subject: CN=www.nordmainische-s-bahn.de
*  start date: May 16 00:12:25 2024 GMT
*  expire date: Aug 14 00:12:24 2024 GMT
*  subjectAltName: host "www.nordmainische-s-bahn.de" matched cert's "www.nordmainische-s-bahn.de"
*  issuer: C=US; O=Let's Encrypt; CN=R3
*  SSL certificate verify ok.
*   Certificate level 0: Public key type RSA (4096/152 Bits/secBits), signed using sha256WithRSAEncryption
*   Certificate level 1: Public key type RSA (2048/112 Bits/secBits), signed using sha256WithRSAEncryption
*   Certificate level 2: Public key type RSA (4096/152 Bits/secBits), signed using sha256WithRSAEncryption
} [5 bytes data]
* using HTTP/2
* [HTTP/2] [1] OPENED stream for https://www.nordmainische-s-bahn.de/home.html
* [HTTP/2] [1] [:method: GET]
* [HTTP/2] [1] [:scheme: https]
* [HTTP/2] [1] [:authority: www.nordmainische-s-bahn.de]
* [HTTP/2] [1] [:path: /home.html]
* [HTTP/2] [1] [user-agent: curl/8.8.0]
* [HTTP/2] [1] [accept: */*]
} [5 bytes data]
> GET /home.html HTTP/2
> Host: www.nordmainische-s-bahn.de
> User-Agent: curl/8.8.0
> Accept: */*
>
* Request completely sent off
{ [5 bytes data]
* TLSv1.3 (IN), TLS handshake, Newsession Ticket (4):
{ [297 bytes data]
* TLSv1.3 (IN), TLS handshake, Newsession Ticket (4):
{ [297 bytes data]
* old SSL session ID is stale, removing
{ [5 bytes data]
  0     0    0     0    0     0      0      0 --:--:--  0:00:01 --:--:--     0< HTTP/2 200
< cache-control: must-revalidate, no-cache, no-store, private
< date: Thu, 27 Jun 2024 16:48:44 GMT
< x-content-type-options: nosniff
< referrer-policy: no-referrer-when-downgrade, strict-origin-when-cross-origin
< permissions-policy: interest-cohort=()
< x-frame-options: SAMEORIGIN
< x-xss-protection: 1; mode=block
< contao-cache: miss
< age: 0
< set-cookie: csrf_https-contao_csrf_token=6C8ZBOrpuJ1pA7s9e4p90X-rDflWp8Ckae_0tGRhV7M; path=/; secure; httponly; samesite=lax
< content-length: 34660
< vary: Accept-Encoding
< content-type: text/html; charset=UTF-8
< server: Apache
<
{ [15997 bytes data]
* HTTP/2 stream 1 was not closed cleanly: PROTOCOL_ERROR (err 1)
 94 34660   94 32750    0     0  24835      0  0:00:01  0:00:01 --:--:-- 24848
* Connection #0 to host www.nordmainische-s-bahn.de left intact
curl: (92) HTTP/2 stream 1 was not closed cleanly: PROTOCOL_ERROR (err 1)

The HTML ends abruptly in the middle of an unclosed <element>, at the same point in the code on every try, and I can’t find a reason in the HTML itself. I even tried curl on my paid vserver (Ubuntu noble) and the output stops at the very same point. So it is not a timing problem.

So the problem might be the missing </body> and </html> tags, right?

curl (8.7.1) on the Windows 11 (23H2) console delivers the complete HTML with no errors. I’m going crazy!

OK, error 92 is an HTTP/2 protocol error, and I was able to get rid of it by adding --http1.1 to the curl command.

That does not help me with FC yet, though, because I don’t know whether and how this can be fixed in the PHP code.
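
To compare what each protocol version actually delivers, curl can report the HTTP version it used and the number of bytes it received. The following is only a minimal sketch: the probe helper and the CURL override variable are made up for illustration, while --http1.1, --write-out, %{http_version} and %{size_download} are standard curl features.

```shell
#!/bin/sh
# Hypothetical helper (not part of curl or Feed Creator).
# CURL can be overridden, e.g. to point at a different curl build.
CURL="${CURL:-curl}"

# probe URL [extra curl options...]
# Prints the HTTP version used, the bytes downloaded and curl's exit status.
probe() {
    url=$1
    shift
    status=0
    summary=$("$CURL" --silent --output /dev/null \
        --write-out 'http=%{http_version} bytes=%{size_download}' \
        "$@" "$url") || status=$?
    echo "$summary exit=$status"
}

# Usage (needs network access):
#   probe https://www.nordmainische-s-bahn.de/home.html             # curl's default negotiation
#   probe https://www.nordmainische-s-bahn.de/home.html --http1.1   # force HTTP/1.1
```

If the default request stops short of the content-length from the log above (34660 bytes) while the --http1.1 request reaches it, the truncation is tied to HTTP/2 rather than to the page itself.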

Is SimplePie able to handle HTTP/1.1 requests? If yes, could we get an option for this, please?

EDIT:
…or some kind of automatic fallback: if there is a protocol error with HTTP/2 or autodetect, try HTTP/1.1; if this fails too, try HTTP/1.0.
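
On the command line, the fallback order could be sketched like this. It only illustrates the proposed sequence of attempts and is not how Feed Creator or SimplePie fetch pages; the function name fetch_with_fallback and the CURL override variable are hypothetical, while --http2, --http1.1 and --http1.0 are real curl options.

```shell
#!/bin/sh
# Hypothetical sketch of the fallback idea, not Feed Creator code.
CURL="${CURL:-curl}"

# fetch_with_fallback URL OUTFILE
# Tries HTTP/2 first, then HTTP/1.1, then HTTP/1.0.
# Prints the flag that worked; returns 1 if every attempt failed.
fetch_with_fallback() {
    url=$1
    out=$2
    for flag in --http2 --http1.1 --http1.0; do
        if "$CURL" --silent --show-error --fail "$flag" -o "$out" "$url"; then
            echo "fetched with $flag"
            return 0
        fi
        echo "failed with $flag, trying next" >&2
    done
    return 1
}

# Usage (needs network access):
#   fetch_with_fallback https://www.nordmainische-s-bahn.de/home.html test.txt
```

Because curl exits with a non-zero status on error 92, the loop moves on to the next protocol version without having to inspect the error message.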

I still haven’t had a chance to investigate, but will do so, hopefully tomorrow.

Seems to be fixed with the hosted Feed Creator 2.4 beta.

@fivefilters
If this was an easy patch, can you tell me how to fix it in 2.3? If it’s not that easy, I will wait for the next self-hosted release.

Hi Holger, this is an experimental thing we’re testing. It’s not actually a result of any code change in the Feed Creator application. I’ll message you with more info. If we go ahead with it, it will be in the next release.

Actually, @HolgerAusB, I don’t think this is related to the experimental change I mentioned above. Can you please try replacing your Dockerfile with the following:

New Dockerfile

FROM php:8.3-apache AS base

# Install dependencies
RUN apt-get update && \
    apt-get install -y \
    unzip \
    curl \
    brotli \
    libicu-dev \
    libzip-dev \
    zlib1g-dev \
    libssl-dev \
    libcurl4-openssl-dev \
    libonig-dev \
    libnss3 nss-plugin-pem ca-certificates \
    && apt-get clean \
    && rm -rf /var/lib/apt/lists/*

RUN docker-php-ext-install intl zip && \
    pecl install apcu && \
    docker-php-ext-enable apcu && \
    pecl install raphf && \
    docker-php-ext-enable raphf && \
    pecl install pecl_http && \
    docker-php-ext-enable http && \
    pecl install brotli && \
    docker-php-ext-enable brotli

COPY php.ini /usr/local/etc/php/php.ini

And then run:

docker-compose build --no-cache
docker-compose up

After the service is running again, please try your URL again and let me know if it works.

Thank you. That will have to wait until at least Sunday; I won’t find the time before then. I’ll report back.

@fivefilters: The build took about 10 minutes on a Raspi 5 with 8GB. As far as I understand Docker (and I don’t fully understand it), that Dockerfile doesn’t change anything in FC’s code itself, only some dependencies or base software. Correct?

I ask because I had previously made some changes to the PHP code to add my own site icon and favicon, which, BTW, should be configurable without code changes. Not essential, but nice to have.

My problem from above is now solved with the new Dockerfile. The feed is generated again. Thank you for your support.

Hi Holger, thanks for the feedback, and glad to hear the site is now working with the changes.

We’ll try to create a base image in the future to make it faster. Most of the time goes into installing and setting up the dependencies. But Docker usually caches these, so if you were to run the commands I gave you again but without the --no-cache flag, it should not take so long next time round.

That’s right: the new Dockerfile doesn’t touch FC’s code. FC prefers PHP’s HTTP extension, which the new Dockerfile now installs. So you’re essentially updating the environment FC runs in and providing its preferred HTTP client, which it will use when available. When FC can’t detect the HTTP extension, it will try to use other methods. These usually work okay, but there are subtle differences in the way an HTTP connection is established between the methods, and some sites may refuse a request based on this. I haven’t had time to figure out exactly what causes the difference for the particular site you tried, but I noticed that it worked fine when the request was sent via FC using the HTTP extension.
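
Whether the HTTP extension is actually available to PHP can be checked from the command line. This is a minimal sketch, assuming the php binary of the environment FC runs in is on the PATH (in a Docker setup it would have to be run inside the container); the helper name has_http_extension and the PHP override variable are illustrative only.

```shell
#!/bin/sh
# Hypothetical helper, not part of Feed Creator.
# PHP can be overridden to point at a specific php binary.
PHP="${PHP:-php}"

# Succeeds if `php -m` lists a module named "http",
# the name under which the pecl_http extension is enabled in the Dockerfile.
has_http_extension() {
    modules=$("$PHP" -m) || return 1
    printf '%s\n' "$modules" | grep -qix 'http'
}

# Usage:
#   if has_http_extension; then
#       echo "HTTP extension loaded"
#   else
#       echo "HTTP extension missing"
#   fi
```

The -x option makes grep match whole lines only, so other module names that merely contain "http" are not counted.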

But even with the HTTP extension, you might come across sites that reject a request if the TLS handshake (lower level than the HTTP protocol) doesn’t look like one expected from a system running Chrome or Firefox. This is actually becoming more and more common as services like Cloudflare become the first point of contact between a website and its visitors. These services try to differentiate between requests from a real browser and those from programming languages like PHP, Python, etc.

In the past this happened by examining the User-Agent HTTP header. Today you will find that for some URLs, you can send two identical, valid HTTP requests (same URL, same headers, sent from the same IP), and one will result in a normal response while the other will be refused (or result in a captcha of some form), solely based on the TLS handshake or some other indicator at a lower level of the communication. cURL, for example, manages the TLS handshake itself, so some services will detect that and refuse those requests because they’re associated with non-browser clients.

This has become a bit of a cat and mouse game. You’ll find many ways of getting requests accepted by making them look like they’re coming from a real browser.

You should be fine continuing to use any of the changes you made. As you said, no application code needs to be changed to fix the issue.
