Image src is getting converted to data

drpudding · June 22, 2022, 4:52pm

I am scraping an article with an image derived via WordPress using the nitro plug-in, with these img attributes:

nitro-lazy-srcset nitro-lazy-src srcset src

The src attribute in the original article IS correct. Just a normal, accessible .jpg URL. However, when the article is scraped the src appears to be converted to data/base64, like so:
<img src="data:image/svg+xml;nitro-empty-id=MTAxNTo1MjM=-1;base64,PHN2ZyB2a...

Is there some way to prevent this and retain the original src?

Marc

fivefilters · June 22, 2022, 6:28pm

This looks like a lazy image loading technique. The more modern way of doing this is to use the standard HTML loading="lazy" attribute, in which case the src attribute doesn’t change. Older techniques usually point the src attribute at a spacer or a small embedded image, and then use JavaScript to swap in the actual image URL when you scroll to that part of the page.

Do you have the URL to the page? You mention that the src attribute in the original article is correct. In our experience that is often not the case, for the reason explained above: most people examine the HTML using their browser’s developer tools (e.g. the inspector), and what they see there is the resulting DOM after the browser has executed JavaScript on the page, which often rewrites HTML elements and attributes.

If you right-click on the page and use ‘View page source’ you’ll likely see different attribute values for this element than you’d see using the browser’s developer tools. That’s what Full-Text RSS sees too.

If you’d like to use the developer tools, there’s an option to disable JavaScript. In Firefox, after you open Web Developer Tools, you can press F1 and look for the ‘Disable JavaScript’ checkbox.

Checking it will reload the tab with JavaScript disabled. Inspecting the HTML element should then give you the attribute value as it was sent by the server.
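
If you would rather check this outside the browser, here is a minimal Python sketch that lists the attributes of every img tag in a string of unrendered HTML. It is only an illustration, not part of Full-Text RSS: the class and function names and the sample markup (including the example.com URL) are made up for this example, modelled on the attributes mentioned in this thread.

```python
from html.parser import HTMLParser


class ImgAttributeCollector(HTMLParser):
    """Collects the attributes of every <img> tag found in raw HTML."""

    def __init__(self):
        super().__init__()
        self.images = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; no JavaScript is ever run.
        if tag == "img":
            self.images.append(dict(attrs))


def img_attributes(html):
    """Return one dict of attributes per <img> tag in the given HTML string."""
    collector = ImgAttributeCollector()
    collector.feed(html)
    collector.close()
    return collector.images


# Hypothetical markup, modelled on the attributes named in this thread.
sample = (
    '<img nitro-lazy-src="https://example.com/photo.jpg" '
    'src="data:image/svg+xml;nitro-empty-id=abc;base64,PHN2Zz4=">'
)

for attrs in img_attributes(sample):
    print("src:", attrs.get("src"))
    print("nitro-lazy-src:", attrs.get("nitro-lazy-src"))
```

To try it on a real page, you could download the HTML first (for example with urllib.request.urlopen from the standard library) and pass the decoded text to img_attributes. Since nothing executes JavaScript here, the values should match what ‘View page source’ shows.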

We do handle some common lazy loading techniques in Full-Text RSS, but this doesn’t cover all cases. So sometimes a custom site config file is needed to fix things up.

If you have the page URL, we’ll be happy to look into it.

drpudding · June 22, 2022, 8:17pm

Thanks for the prompt, thorough response. Yes – the image is using JS to render, since turning off JS produces the same data img src I am seeing when I scrape. However, with JS allowed, the page source does show the correct src. Here is the article in question:

The image is about halfway down (floated right) and has text starting with “Courts today…”

I currently have a custom config file in place for this domain and could easily experiment with any config suggestions you have. It is currently just using:

title: //h1[@class='fl-post-title']
body: //div[contains(@class, 'fl-post-content')]

fivefilters · June 22, 2022, 8:50pm

Hopefully we’ll improve our automatic handling of lazy-loaded images to take care of this in future versions. But for the time being, with JS disabled, here’s what I see:

The img element has a nitro-lazy-src attribute, which contains the real image URL, and a src attribute, which contains the blank placeholder image (beginning with “data:image/svg…”).

In your site config for this site, you can add the following lines to first rename the src attribute to disabled-src and then rename nitro-lazy-src to src:

# change src="data:image..." to disabled-src="data:image..."
find_string: src="data:image/svg+xml;nitro-empty
replace_string: disabled-src="data:image/svg+xml;nitro-empty

# change nitro-lazy-src="https..." to src="https..."
find_string: nitro-lazy-src="http
replace_string: src="http
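
To see what these two rules do to the markup, here is a rough Python sketch. It assumes each find_string/replace_string pair behaves like a plain substring replacement applied to the raw HTML in the order listed; the apply_rules helper and the sample img tag (with its example.com URL) are hypothetical and only there for illustration.

```python
# Each pair mirrors one find_string / replace_string rule from the config above.
RULES = [
    ('src="data:image/svg+xml;nitro-empty',
     'disabled-src="data:image/svg+xml;nitro-empty'),
    ('nitro-lazy-src="http', 'src="http'),
]


def apply_rules(html, rules=RULES):
    """Apply each (find, replace) pair in order as a plain substring replacement."""
    for find, replace in rules:
        html = html.replace(find, replace)
    return html


# Hypothetical markup, modelled on the attributes named in this thread.
sample = (
    '<img nitro-lazy-src="https://example.com/photo.jpg" '
    'src="data:image/svg+xml;nitro-empty-id=abc;base64,PHN2Zz4=">'
)

print(apply_rules(sample))
```

After both replacements the placeholder ends up in disabled-src and the real URL ends up in src. The nitro-lazy-srcset attribute is not touched by the second rule, because the text being searched for requires the opening quote directly after nitro-lazy-src.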

Please try that and let me know how you get on.

drpudding · June 22, 2022, 9:10pm

Works! Very nice. Thanks much for the assist on this. Another tool in my tool belt.

fivefilters · June 22, 2022, 10:22pm

No problem! Glad to hear it worked.