
response.body is ASCII-8BIT when Content-Type is text/xml; charset=utf-8 #139

Closed
auxbuss opened this issue Apr 16, 2012 · 10 comments
auxbuss commented Apr 16, 2012

This is my first time using Faraday, so I might be doing something wrong, but the response.body encoding in the following is ASCII-8BIT:

  def self.search(term)
    connection = Faraday.new(url: 'https://en.wikipedia.org')
    response = connection.get do |req|
      req.options = { timeout: 5, open_timeout: 3 }
      req.url '/w/api.php', action: 'opensearch', format: 'xml', search: term
    end
    puts response.body.encoding
  end

Under Ruby 1.9.2 this causes REXML to raise an Encoding::CompatibilityError.

I couldn't find a way to force Faraday to provide response.body in UTF-8.

What is the preferred solution to this?

ghost commented Apr 17, 2012

Just encountered the same issue. Any ideas?

auxbuss commented Apr 17, 2012

The workaround I used is:

response.body.force_encoding('utf-8')

Yehuda Katz has a dissertation about the problem here.

technoweenie (Member) commented
I'm pretty sure Faraday just passes on the response body from the underlying adapter. I'm not sure I want to raise errors or perform lossy conversions of the data in Faraday. That can be done in a custom middleware if you really need it.
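A custom response middleware along those lines might look like the following sketch. The `ForceCharset` class name and the stubbed adapter are hypothetical, and the class is written in plain Ruby (same `#call` contract) so the example runs standalone; a real version would subclass `Faraday::Middleware` and be registered on the connection with `builder.use`:

```ruby
# Hypothetical middleware: after the adapter completes, retag the body
# with the charset declared in the Content-Type response header.
class ForceCharset
  def initialize(app)
    @app = app
  end

  def call(env)
    @app.call(env).tap do |response_env|
      content_type = response_env[:response_headers].to_h['content-type'].to_s
      charset = content_type[/charset=([^;\s]+)/i, 1]
      begin
        response_env[:body].force_encoding(charset) if charset
      rescue ArgumentError
        # Unknown charset name; leave the body untouched.
      end
    end
  end
end

# Stub adapter simulating a backend that returns UTF-8 bytes tagged
# as ASCII-8BIT, as described in this issue.
fake_adapter = lambda do |env|
  env.merge(response_headers: { 'content-type' => 'text/xml; charset=utf-8' },
            body: "r\xC3\xA9sum\xC3\xA9".b)
end

response = ForceCharset.new(fake_adapter).call({})
puts response[:body].encoding  # => UTF-8
```

Note this only retags the bytes; if the server lies about the charset, the string may still contain invalid sequences.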

auxbuss commented May 1, 2012

Fair enough. If the problem is elsewhere, as it appears, I guess it will be cleaned up in time. It's not a show stopper for me.

mislav commented May 29, 2012

Closing because it's not a bug with Faraday.

mislav closed this as completed May 29, 2012
chrismo commented Sep 19, 2014

I'm not sure the underlying adapter (at least net/http) does any encoding transformation. You can set Ruby's Encoding.default_external to something like 'US-ASCII' and then hit an endpoint whose Content-Type is '...; charset=utf-8': net/http will parse the charset string and make it available, but does nothing to the encoding of the body string. Maybe net/http should be responsible for that, but since it isn't, the ParseJson middleware (for example) can blow up.
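The distinction matters because a Ruby string's encoding is only a tag on the bytes; nothing in the transport layer retags or transcodes them. A plain-Ruby illustration (no HTTP involved) of what the adapter hands back versus what `force_encoding` does:

```ruby
# A body often arrives as raw bytes tagged ASCII-8BIT (a.k.a. BINARY),
# even when the bytes themselves are valid UTF-8.
body = "caf\xC3\xA9".b          # .b retags a copy as ASCII-8BIT

body.encoding.name              # => "ASCII-8BIT"

# force_encoding retags the same bytes in place; nothing is transcoded.
body.force_encoding(Encoding::UTF_8)
body.encoding.name              # => "UTF-8"
body == "café"                  # => true

# Concatenating a binary-tagged body containing non-ASCII bytes with
# UTF-8 text is what raises Encoding::CompatibilityError downstream.
```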

chrismo commented Sep 20, 2014

Did some more research on this: some of the underlying adapters handle the Content-Type charset, some don't:

- EM-HTTP-Request does [commit].
- Patron does [commit].
- HTTPClient does [commit] [issue].
- Typhoeus and Excon (and net/http) don't appear to.

I guess the nicest thing would be to offer an optional middleware for the adapters that don't, but I'd agree this probably shouldn't be Faraday's responsibility.

knu added a commit to huginn/huginn that referenced this issue Aug 1, 2015
- The `force_encoding` option in WebsiteAgent is moved to
  WebRequestConcern so other users of the concern such as RssAgent can
  benefit from it.

- WebRequestConcern detects a charset specified in the Content-Type
  header to decode the content properly, and if it is missing the
  content is assumed to be encoded in UTF-8 unless it has a binary MIME
  type.  Not all Faraday adapters handle character encodings, and
  Faraday passes through what is returned from the backend, so we need
  to do this on our own. (cf. lostisland/faraday#139)

- WebRequestConcern now converts text contents to UTF-8, so agents can
  handle non-UTF-8 data without having to deal with encodings
  themselves.  Previously, WebsiteAgent in "json"/"text" modes and
  RssAgent would suffer from encoding errors when dealing with non-UTF-8
  contents.  WebsiteAgent in "html"/"xml" modes did not have this
  problem because Nokogiri would always return results in UTF-8
  independent of the input encoding.

This should fix #608.
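The decoding rule that commit message describes can be sketched as a small helper. The `decode_body` name and the `BINARY_TYPES` pattern are hypothetical illustrations, not the actual WebRequestConcern code:

```ruby
# Rough MIME types to treat as binary; anything else without an
# explicit charset is assumed to be UTF-8, per the rule above.
BINARY_TYPES = %r{\A(?:image|audio|video)/|\Aapplication/octet-stream}

def decode_body(body, content_type)
  charset = content_type.to_s[/charset=([^;\s]+)/i, 1]
  if charset
    body.force_encoding(charset)       # honour the declared charset
  elsif content_type.to_s !~ BINARY_TYPES
    body.force_encoding(Encoding::UTF_8)
  else
    body                               # binary: leave as ASCII-8BIT
  end
end

decode_body('caf\xC3\xA9'.b, 'text/xml; charset=utf-8').encoding
# => #<Encoding:UTF-8>
decode_body(''.b, 'image/png').encoding
# => #<Encoding:ASCII-8BIT>
```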
cyzgbw commented Mar 2, 2016

try https://github.com/qhwa/string_utf8

etipton commented Jul 4, 2016

@chrismo you're my hero. Thanks for doing that research!

semaperepelitsa (Contributor) commented
This has been implemented for Net HTTP adapter: lostisland/faraday-net_http#6
