html5 pictures

As a follow-up to #6711, I have to say that the problem is on the FTR side. While Wallabag shows the image without any replacement of <source srcset=, FTR does not.

Just found another site, with this structure:

<picture class="embedded-image__ratio">
  <source data-srcset=";strip=all&amp;w=564&amp;type=webp&amp;sig=mwLTUf2FDWfZhs6j1dy69w,;strip=all&amp;w=1128&amp;type=webp&amp;sig=WW5Xa-g4DirLMwPxeO_Drg 2x" media="(min-width: 1200px)" srcset=";strip=all&amp;w=564&amp;type=webp&amp;sig=mwLTUf2FDWfZhs6j1dy69w,;strip=all&amp;w=1128&amp;type=webp&amp;sig=WW5Xa-g4DirLMwPxeO_Drg 2x" type="image/webp">
  <source data-srcset=";strip=all&amp;w=564&amp;type=jpg&amp;sig=zfg9iVtqzTvm773ZVytm2Q,;strip=all&amp;w=1128&amp;type=jpg&amp;sig=pesUxtidbofpQGwZmBvwrw 2x" media="(min-width: 1200px)" srcset=";strip=all&amp;w=564&amp;type=jpg&amp;sig=zfg9iVtqzTvm773ZVytm2Q,;strip=all&amp;w=1128&amp;type=jpg&amp;sig=pesUxtidbofpQGwZmBvwrw 2x" type="image/jpeg">
  <source data-srcset=";strip=all&amp;w=472&amp;type=webp&amp;sig=mnYYe4A_Xv2m9NLn7Xj7gQ,;strip=all&amp;w=944&amp;type=webp&amp;sig=akNCdECoI7FlREqaGPDhbg 2x" media="(min-width: 768px)" srcset=";strip=all&amp;w=472&amp;type=webp&amp;sig=mnYYe4A_Xv2m9NLn7Xj7gQ,;strip=all&amp;w=944&amp;type=webp&amp;sig=akNCdECoI7FlREqaGPDhbg 2x" type="image/webp">
  <source data-srcset=";strip=all&amp;w=472&amp;type=jpg&amp;sig=otI2C57J2redDzB9mPM43g,;strip=all&amp;w=944&amp;type=jpg&amp;sig=i6pBMdNE_oWDQfVVSLHbFQ 2x" media="(min-width: 768px)" srcset=";strip=all&amp;w=472&amp;type=jpg&amp;sig=otI2C57J2redDzB9mPM43g,;strip=all&amp;w=944&amp;type=jpg&amp;sig=i6pBMdNE_oWDQfVVSLHbFQ 2x" type="image/jpeg">
  <source data-srcset=";strip=all&amp;w=288&amp;type=webp&amp;sig=2Zqzok6Od8jY6pkYM0jN-Q,;strip=all&amp;w=576&amp;type=webp&amp;sig=kd8qJNrqEI1iJgP3XMxRGg 2x" media="(max-width: 767px)" srcset=";strip=all&amp;w=288&amp;type=webp&amp;sig=2Zqzok6Od8jY6pkYM0jN-Q,;strip=all&amp;w=576&amp;type=webp&amp;sig=kd8qJNrqEI1iJgP3XMxRGg 2x" type="image/webp">
  <img alt="Lisa Martino-Taylor reveals the details behind secret Cold War era experiments in Behind The Fog" class="embedded-image__image lazyloaded" data-src=";strip=all&amp;w=288&amp;sig=Q50DFA0gW6Sepl0O-iVSJA" data-srcset=";strip=all&amp;w=288&amp;sig=Q50DFA0gW6Sepl0O-iVSJA,;strip=all&amp;w=576&amp;sig=kPDHZBGOR1fGtdkhsheKtA 2x" height="402" loading="lazy" src=";strip=all&amp;w=288&amp;sig=Q50DFA0gW6Sepl0O-iVSJA" width="265" srcset=";strip=all&amp;w=288&amp;sig=Q50DFA0gW6Sepl0O-iVSJA,;strip=all&amp;w=576&amp;sig=kPDHZBGOR1fGtdkhsheKtA 2x">
</picture>

The following config works in Wallabag, image included. On FTR the image is missing, so I need additional replace_string rules for the picture here.

strip_id_or_class: ad__section-border
strip_id_or_class: article-meta
strip_id_or_class: visually-hidden
strip_id_or_class: article-comment

strip: //header[contains(@class, 'identity-intro')]/parent::*
strip: //section[@class='more-topic']

strip: //nav

replace_string(<section): <div
replace_string(</section>): </div>


(not finished yet)

@fivefilters, are you working on this, too?

Thanks for looking into this @HolgerAusB. I had a quick look at the article you linked and I think the reason the one image doesn’t load is related to lazy loading and JavaScript.

I’ve highlighted the srcset attributes that are causing the problem (empty image):

You’ll see that the same image doesn’t load when JS is disabled on the source site. In Full-Text RSS we should probably try and add this as another instance of lazy loading to handle. Or perhaps strip all <source> elements when a suitable <img> exists (although in this case the URL in the img element is very low quality).

You could probably fix this for FTR with:

strip_attr: //source[contains(@srcset, 'data:')]/@srcset

That should remove just the attribute on those <source> elements.
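To make the effect of that rule concrete, here is a minimal Python sketch of the same idea: drop srcset from every <source> whose srcset contains "data:", and leave everything else alone. This is only an illustration, not how Full-Text RSS implements strip_attr. The sample markup, the file names in it and the function name strip_placeholder_srcset are made up for the example, and it uses the standard library XML parser, so the snippet has to be well-formed.

```python
import xml.etree.ElementTree as ET


def strip_placeholder_srcset(root):
    """Remove srcset from each <source> whose srcset contains 'data:'.

    Rough equivalent of: strip_attr: //source[contains(@srcset, 'data:')]/@srcset
    Returns the number of attributes removed.
    """
    removed = 0
    for source in root.iter("source"):
        if "data:" in source.get("srcset", ""):
            del source.attrib["srcset"]
            removed += 1
    return removed


# Hypothetical, well-formed sample: the first <source> carries a lazy-load placeholder.
SAMPLE = (
    '<picture>'
    '<source data-srcset="photo.webp 1x" srcset="data:image/gif;base64,R0lGOD" type="image/webp"/>'
    '<source srcset="photo.jpg 1x" type="image/jpeg"/>'
    '<img src="photo-small.jpg" alt="example"/>'
    '</picture>'
)

picture = ET.fromstring(SAMPLE)
removed_count = strip_placeholder_srcset(picture)
print(removed_count, ET.tostring(picture, encoding="unicode"))
```

Only the attribute is removed; the <source> element and its data-srcset stay in place, so anything downstream that understands data-srcset can still use it.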

I didn’t know about strip_attr yet, but I probably would have found some way to display the images again.

But that’s not my point at all. The problem is that Wallabag and FTR behave differently.

While Wallabag, with the little config above, selects the best image for the particular platform, FTR doesn’t find any image. If I now rewrite the first image link via strip or replace_string, Wallabag is then bound to that link as well, which causes problems there in turn.

That makes it very hard to find a config that works well for both.

Wasn’t there a plan to merge the projects on this part?

When it comes to how we interpret the site config files, I do want there to be consistency. There’s been a plan (at least on my part) to document the site config files better and expand them to allow a few more rules (e.g. wrapping and unwrapping elements).

But the site config files are intended for other applications too, so I don’t want to assume that the only projects that rely on them are Full-Text RSS and Wallabag. (Many of the rules, and the format itself, originally come from Instapaper.)

When they’re integrated into other applications, there will likely be differences in how the HTML is interpreted before or after the site config files are applied. I don’t really want to enforce too much rigidity here. So it’s to be expected that there will be differences in output. And our priority with site config files has always been first to preserve the actual text content as correctly as possible, with images and other media being a secondary consideration.

Regarding this particular issue, I don’t think it’s necessary that you update the site config file to fix the image for Full-Text RSS too. I think generally those kinds of elements should be handled better by the underlying Full-Text RSS code, and not dealt with in site config files. This will probably be taken care of in the next Full-Text RSS update (which I’ll be sure to send you) when we switch to the new Readability.php code. See for example the _fixLazyImages() method and its handling of srcset:

convert images and figures that have properties like data-src into images that can be loaded without JS
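As a rough picture of what that kind of handling involves, the sketch below copies data-src and data-srcset into src and srcset whenever the real attribute is missing or only holds an inline data: placeholder. It is a simplified Python illustration under my own assumptions, not the _fixLazyImages() code from Readability.php. The names is_placeholder and fix_lazy_images and the sample markup are hypothetical.

```python
import xml.etree.ElementTree as ET


def is_placeholder(value):
    """Treat a missing, empty or inline data: value as a lazy-load placeholder."""
    return value is None or value.strip() == "" or value.strip().startswith("data:")


def fix_lazy_images(root):
    """Promote data-src / data-srcset to src / srcset on <img> and <source>.

    Assumption for this sketch: a real URL in src/srcset is never overwritten.
    """
    for el in root.iter():
        if el.tag not in ("img", "source"):
            continue
        for lazy_name, real_name in (("data-src", "src"), ("data-srcset", "srcset")):
            lazy_value = el.get(lazy_name)
            if lazy_value and is_placeholder(el.get(real_name)):
                el.set(real_name, lazy_value)


# Hypothetical sample: both elements only become loadable once JS has run.
SAMPLE = (
    '<picture>'
    '<source data-srcset="photo.webp 1x, photo-2x.webp 2x" '
    'srcset="data:image/gif;base64,R0lGOD" type="image/webp"/>'
    '<img data-src="photo.jpg" alt="example"/>'
    '</picture>'
)

lazy_picture = ET.fromstring(SAMPLE)
fix_lazy_images(lazy_picture)
print(ET.tostring(lazy_picture, encoding="unicode"))
```

Doing this in the extraction code rather than in each site config means the same config can serve several applications, which is the point made above.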
