description tag should be in UTF-8 encoding but it is in ASCII-8BIT #15

Open
emaillenin opened this issue Jan 20, 2014 · 9 comments

@emaillenin

Tried this also:

l.description.force_encoding('UTF-8').encode!('UTF-8',:invalid => :replace,:replace => '')

But still ending up with:
Uncaught exception: invalid byte sequence in UTF-8

@emaillenin
Author

Using

l.description.to_s.encode('UTF-8', {:invalid => :replace, :undef => :replace, :replace => ''})

solves the issue, but we lose the original Unicode character that was in the source.
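
The loss described above can be reproduced in isolation. The following is a minimal sketch, assuming the bytes coming out of the feed really are UTF-8 and are merely tagged ASCII-8BIT; the sample text and variable names are made up for illustration. Transcoding from ASCII-8BIT treats every byte above 0x7F as undefined and drops it, while relabeling a copy keeps the bytes:

```ruby
# Hypothetical sample: the UTF-8 bytes of "It’s", tagged as ASCII-8BIT.
raw = "It\xE2\x80\x99s".b

# Transcoding from ASCII-8BIT: bytes above 0x7F have no defined mapping,
# so :undef => :replace with an empty replacement removes them.
lossy = raw.encode('UTF-8', invalid: :replace, undef: :replace, replace: '')

# Relabeling a copy: the bytes are untouched, only the encoding tag changes.
relabeled = raw.dup.force_encoding('UTF-8')

puts lossy
puts relabeled
puts relabeled.valid_encoding?
```

Relabeling is only safe when the source bytes are known to be UTF-8, which is an assumption here, not something simple-rss guarantees.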

@eugene-nikolaev

I got the same issue.

@eugene-nikolaev

There is content.force_encoding('binary') in the if condition:

  def unescape(content)
    if content.respond_to?(:force_encoding) && content.force_encoding("binary") =~ /([^-_.!~*'()a-zA-Z\d;\/?:@&=+$,\[\]]%)/n then
        CGI.unescape(content).gsub(/(<!\[CDATA\[|\]\]>)/,'').strip
    else
        content.gsub(/(<!\[CDATA\[|\]\]>)/,'').strip
    end
  end

The force_encoding method changes the string's encoding in place, so every string returned by simple-rss ends up tagged as ASCII-8BIT...
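
The in-place behaviour is easy to confirm outside the gem. This is a small sketch with made-up variable names; it also shows String#b (Ruby 2.0 and later), which returns a binary copy and leaves the receiver alone:

```ruby
s = "caf\u00E9"
before = s.encoding                  # encoding tag before the call
result = s.force_encoding('binary')  # returns the receiver, not a copy
same_object = result.equal?(s)
after = s.encoding                   # the original string is now retagged

t = "caf\u00E9"
copy = t.b                           # binary copy; t keeps its own encoding

puts before
puts after
puts same_object
puts t.encoding
puts copy.encoding
```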

I'd rewrite it the following way, but I'm also unsure about this 'if' condition, so I'm not making a pull request.

  def unescape(content)
    if content.respond_to?(:force_encoding) && encode_binary(content) =~ /([^-_.!~*'()a-zA-Z\d;\/?:@&=+$,\[\]]%)/n then
        CGI.unescape(content).gsub(/(<!\[CDATA\[|\]\]>)/,'').strip
    else
        content.gsub(/(<!\[CDATA\[|\]\]>)/,'').strip
    end
  end

  def encode_binary(content)
    content.encode('binary', {:invalid => :replace, :undef => :replace, :replace => ''})
  end

@emaillenin
Author

Hi @evgeniynickolaev, can you please test it with a feed that has non-Latin characters? Meanwhile, I will try to post a sample where it failed for me.

@eugene-nikolaev

Yes, I've tested it with a feed containing the following Unicode symbols: \xE2\x80\x99.
But I'm not sure it is 100% correct, as I don't fully understand the logic of this unescaping.

@terotil

terotil commented Oct 2, 2014

Just as @evgeniynickolaev pointed out, the immediate source of the problem is force_encoding("binary"), which mutates the string object in place (even though its name does not end in a bang). However, apparently the reason for adding the force_encoding was the "n" flag on the regexp within the conditional introduced in ac95fb4. That flag says that the regex should be interpreted as binary (ASCII-8BIT) no matter what the source encoding is (see http://www.ruby-doc.org/core-2.1.3/Regexp.html#class-Regexp-label-Encoding).

I'll throw in a fix which simply removes all the fiddling with encodings. I can't figure out any reason why there would be any need for that.
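
One way the method could look with the encoding handling removed is sketched below. This is an illustration only, not the committed fix: the constant names and the sample input are hypothetical, and the pattern is the one quoted earlier in this thread with the "n" flag dropped.

```ruby
require 'cgi'

# Pattern from the original method, without the /n flag (hypothetical names).
ESCAPED_PATTERN = /([^-_.!~*'()a-zA-Z\d;\/?:@&=+$,\[\]]%)/
CDATA_MARKERS   = /(<!\[CDATA\[|\]\]>)/

def unescape(content)
  # No force_encoding here, so the caller's string is never retagged.
  text = (content =~ ESCAPED_PATTERN) ? CGI.unescape(content) : content
  text.gsub(CDATA_MARKERS, '').strip
end

feed_text = "<![CDATA[caf\u00E9 %26 bar]]>"  # made-up sample input
result = unescape(feed_text)

puts result
puts result.encoding
puts feed_text.encoding
```

This sketch assumes the input is valid in its tagged encoding; input with broken byte sequences would need separate handling before the regex match.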

@chengguangnan

I ran into the same problem. This gem is not well maintained. I'm going with other gems.

@jeremyhaile

@chengguangnan what other gem have you found that is well maintained?

@chengguangnan

Hi @jeremyhaile, I switched to feedjira.

mpalmer added a commit to mpalmer/discourse that referenced this issue Aug 25, 2016
Scrubbing an ASCII-8BIT string isn't ever going to remove anything, because
there's no code point that isn't valid 8-bit ASCII.  Since we'd really
prefer it if everything were UTF-8 anyway, we'll just assume, for now, that
whatever comes out of SimpleRSS is probably UTF-8, and just nuke anything
that isn't a valid UTF-8 codepoint.

Of course, the *real* bug here is that SimpleRSS [unilaterally converts
everything to
ASCII-8BIT](cardmagic/simple-rss#15).  It's
presumably *far* too much to ask that it detects the encoding of the source
RSS feed and marks the parsed strings with the correct encoding...
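
The point in the commit message about scrubbing can be checked with String#scrub (Ruby 2.1 and later). This is a rough sketch with a made-up input string, and it is not the code from the referenced commit: scrubbing a binary string removes nothing, while assuming UTF-8 first lets scrub drop the invalid byte.

```ruby
# Hypothetical input: ASCII text with one byte that is invalid in UTF-8,
# tagged ASCII-8BIT as described in this issue.
raw = "ok \xFF bad".b

# Every byte is valid in ASCII-8BIT, so scrub has nothing to remove.
scrubbed_binary = raw.scrub('')

# Assume the content is UTF-8, then remove whatever is not valid UTF-8.
cleaned = raw.dup.force_encoding('UTF-8').scrub('')

puts scrubbed_binary == raw
puts cleaned
puts cleaned.valid_encoding?
```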