JSON encoding issue

It looks like I'm getting Unicode escape sequences back in my JSON. For example, \u2019 for apostrophes (\u2019 is converted into ’). When I run with &debug=rawhtml, the headers are specified as UTF-8. When I run with &debug=parsedhtml, I'm getting “Disallowed Key Characters.”

Any idea what’s going on? Thanks!

Hi Marc, can you give us the URL to your hosted copy, and the URL of the article/feed you’re trying to process? We can have a look once we have that. Thanks.

I sent you the requested information but have not heard back. Did you receive it? Thanks!

Marc

Hi Marc, yes, we received it.

The UTF-8 encoded characters in the JSON look fine to me. If you decode the JSON, they should be handled correctly. If you’re getting strange characters, my guess is that the decoding has worked, but you’re not outputting your own results with the appropriate HTTP/HTML headers to indicate that the content is UTF-8 encoded.
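For illustration, here's a minimal sketch of what I mean (the JSON string below is just a stand-in for what Full-Text RSS returns):

```php
<?php
// A \u2019 escape in JSON decodes to the actual character ’ (UTF-8 bytes E2 80 99)
$json = '{"title": "It\u2019s a test"}';
$data = json_decode($json, true);

// Tell the browser your output is UTF-8 before printing anything
header('Content-Type: text/html; charset=utf-8');
echo $data['title']; // prints: It’s a test
```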

As for using &debug=parsedhtml: I can’t reproduce the ‘Disallowed Key Characters’ message you’re seeing. I tested in Chrome. Are you using a different browser? Is the output coming from Full-Text RSS itself, or from some other application that’s trying to read its content? If you can provide a screenshot, we can try to help. If there doesn’t appear to be a problem with the Full-Text RSS output itself, it could be the way it’s being handled by the application reading it. In such cases we can’t really help, but the developers of the feed-reading application might be able to.

Best, Keyvan from FiveFilters.org

It looks like when I get the JSON $result back, whether from a URL submit or from my service’s cURL exec, the response header is text/html. It looks fine. But after I json_decode($result), the response header is still text/html and the Unicode characters are wrong. If I force the response header character set to UTF-8, everything looks correct. I’ve checked my pages and database connection, which are UTF-8, so I’m not 100% clear on why this is happening, but I’m fine with explicitly forcing UTF-8.
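For reference, here's roughly what my fetch-and-decode flow looks like (the URLs are placeholders for my hosted copy and the source feed):

```php
<?php
// Placeholder URL for my hosted copy of Full-Text RSS
$endpoint = 'http://example.org/full-text-rss/makefulltextfeed.php'
          . '?url=' . urlencode('http://example.org/feed') . '&format=json';

$ch = curl_init($endpoint);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
$result = curl_exec($ch);
curl_close($ch);

$feed = json_decode($result, true);

// Without this header the characters display mangled; with it, everything looks correct
header('Content-Type: text/html; charset=utf-8');
print_r($feed); // the array layout depends on the Full-Text RSS version
```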

I have one other question, if you can offer any advice:
I am writing a routine that loops through several feeds more than once a day, checking for newly published articles, and, if it finds an article I have not yet captured (based on the pubDate timestamp), it writes it to the database. The process of doing full RSS extraction seems processing-intensive, and most of the articles will not be new each time I check. So I've thought about splitting this up by first getting the articles back without full extraction (content=0), just so I can check whether they are in fact new, then only doing the full extraction on that subset.

Do you know if this would actually be a wise move, given that in most cases all of the articles I pull back will already have been captured? I'm just looking to limit processing as much as possible.

Thanks!

Marc

Thought about this a little more and think my idea really isn't saving anything. I should just run the extraction and simply break out of the loop when I hit the first article that is not newer than the newest one captured in the previous run.
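Something like this, assuming the feed items come back newest-first (the date, URL, and save routine are placeholders):

```php
<?php
// Timestamp of the newest article stored during the previous run (placeholder value)
$lastCaptured = strtotime('2013-01-01 00:00:00');

// Full-text feed from my hosted copy (placeholder URL)
$xml = simplexml_load_file('http://example.org/full-text-rss/makefulltextfeed.php?url=...');

foreach ($xml->channel->item as $item) {
    $pubDate = strtotime((string) $item->pubDate);
    // Items are newest-first, so the first already-captured item ends the run
    if ($pubDate <= $lastCaptured) {
        break;
    }
    saveToDatabase($item); // placeholder for my insert routine
}
```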

Marc

Hi Marc, when you say the response header is text/html, that in itself is not the issue. The UTF-8 charset is what’s important. You should always include this in the output you produce (your own HTTP response); otherwise your user’s browser will have to guess the character encoding, which is not something you should leave up to the browser to determine, as it’s a very tricky issue and browsers don’t always guess correctly. So when you say ‘If I force the response header character set as UTF-8, everything looks correct,’ to me that’s just good practice, nothing forced about it. Leaving it out is where you’re likely to experience trouble.

And just so you know, regardless of the character encoding of the source article given to Full-Text RSS, once Full-Text RSS has dealt with it, it will have been converted to UTF-8. So as far as your own code is concerned, if you’re outputting content produced by Full-Text RSS, you should always set the charset to UTF-8, as that’s what Full-Text RSS will always return.

About your second question: setting content=0 does not prevent content extraction; it just omits the extracted content from the output. So you won’t save much processing time. My suggestion here is to do one of two things:

  • If the site does not get updated a lot, limit the number of items Full-Text RSS processes (e.g. via the max request parameter, if your copy supports it), so each time it gets the feed, it’s not extracting 5 or more items.
    Or
  • Monitor the original feed yourself, without the use of Full-Text RSS, and when you notice a new item, pass the item URL to Full-Text RSS so it extracts just that particular item (a rough sketch follows below).
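To illustrate the second approach (the feed URL, endpoint path, and helper functions here are placeholders, not part of Full-Text RSS itself):

```php
<?php
// Check the original feed cheaply, without involving Full-Text RSS
$feed = simplexml_load_file('http://example.org/feed'); // placeholder URL

foreach ($feed->channel->item as $item) {
    if (alreadyCaptured((string) $item->guid)) { // placeholder DB lookup
        continue;
    }
    // Only now ask Full-Text RSS to extract this single item
    $extractUrl = 'http://example.org/full-text-rss/makefulltextfeed.php'
                . '?url=' . urlencode((string) $item->link) . '&format=json';
    $article = json_decode(file_get_contents($extractUrl), true);
    storeArticle($article); // placeholder insert routine
}
```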

Hope that’s some help.

Best, Keyvan from FiveFilters.org

Keyvan – thanks for your thorough support. Glad you guys are so responsive.

As for UTF-8: thanks for the explanation. I have not worked much with receiving RSS as JSON and have never had to explicitly set UTF-8 in this manner. I just assumed that with the db, scripts, etc. set to UTF-8, the encoding would be maintained. Apparently not. What is still confusing is that the feed was received in UTF-8, but once I ran json_decode the Unicode characters were mangled. I guess I’m wondering if this means that my browser was converting \u2019 into ’, but that it would otherwise be stored as an apostrophe were I to write it to a db. I’ll need to educate myself on this a bit more.
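A quick test I ran to check (this assumes the PHP file itself is saved as UTF-8):

```php
<?php
// json_decode turns the \u2019 escape into real UTF-8 bytes, not escape text
$decoded = json_decode('"\u2019"');

var_dump($decoded === '’'); // bool(true): it's U+2019, the curly right quote
echo bin2hex($decoded);     // e28099: the bytes a UTF-8 database would store

// Note: it is not converted to a plain ASCII apostrophe (0x27); the curly
// quote character itself is what gets stored.
```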

As to the second question – yes, I need to set it to 10 items per my client’s instructions. And the system was originally monitored more “manually,” as you suggest, but they want to automate it more. I think my strategy of getting one article at a time and checking the pubDate to see whether it has been captured yet will be good. In most cases, extraction will never extend past the first article.

Thanks!

Marc