I am trying to better understand how Readability (?) is handling content extraction. Here is a test link:
If you look at the source of the url param, you will notice there is a div #2 that is missing from the parsedhtml, though it is present in the rawhtml. It’s identical to div #1, except that it has less text. It’s also identical to #3, except that #3 uses a p tag rather than a div.
Can someone shed some light on what is going on here, and whether the page can be rendered without losing that div? I suspect I might be able to make a custom config file, but I’d prefer a solution that does not require one.