
response.body is ASCII-8BIT when Content-Type is text/xml; charset=utf-8 #139

Closed
auxbuss opened this issue Apr 16, 2012 · 10 comments
auxbuss commented Apr 16, 2012

This is my first time using Faraday, so I might be doing something wrong, but the response.body encoding in the following is ASCII-8BIT:

  def self.search(term)
    connection = Faraday.new(url: 'https://en.wikipedia.org')
    response = connection.get do |req|
      req.options = { timeout: 5, open_timeout: 3 }
      req.url '/w/api.php', action: 'opensearch', format: 'xml', search: term
    end
    puts response.body.encoding
  end

Under Ruby 1.9.2 this causes REXML to raise an Encoding::CompatibilityError.

I couldn't find a way to force Faraday to provide response.body in UTF-8.

What is the preferred solution to this?

ghost commented Apr 17, 2012

Just encountered the same issue. Any ideas?

auxbuss commented Apr 17, 2012

The workaround I used is:

response.body.force_encoding('utf-8')

Yehuda Katz has a dissertation about the problem here.

technoweenie (Member) commented
I'm pretty sure Faraday just passes on the response body from the underlying adapter. I'm not sure I want to raise errors or perform lossy conversions of the data in Faraday. That can be done in a custom middleware if you really need it.
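A custom response middleware along those lines might look like the following sketch. The `ForceCharset` class name and the stubbed adapter are hypothetical, and the class is written in plain Ruby (same `#call` contract) so the example runs standalone; a real version would subclass `Faraday::Middleware` and be registered on the connection with `builder.use`:

```ruby
# Hypothetical middleware: after the adapter completes, retag the body
# with the charset declared in the Content-Type response header.
class ForceCharset
  def initialize(app)
    @app = app
  end

  def call(env)
    @app.call(env).tap do |response_env|
      content_type = response_env[:response_headers].to_h['content-type'].to_s
      charset = content_type[/charset=([^;\s]+)/i, 1]
      begin
        response_env[:body].force_encoding(charset) if charset
      rescue ArgumentError
        # Unknown charset name; leave the body untouched.
      end
    end
  end
end

# Stub adapter simulating a backend that returns UTF-8 bytes tagged
# as ASCII-8BIT, as described in this issue.
fake_adapter = lambda do |env|
  env.merge(response_headers: { 'content-type' => 'text/xml; charset=utf-8' },
            body: "r\xC3\xA9sum\xC3\xA9".b)
end

response = ForceCharset.new(fake_adapter).call({})
puts response[:body].encoding  # => UTF-8
```

Note this only retags the bytes; if the server lies about the charset, the string may still contain invalid sequences.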

auxbuss commented May 1, 2012

Fair enough. If the problem is elsewhere, as it appears, I guess it will be cleaned up in time. It's not a show stopper for me.

mislav commented May 29, 2012

Closing because it's not a bug with Faraday.

mislav closed this as completed May 29, 2012
chrismo commented Sep 19, 2014

I'm not sure the underlying adapter (at least net/http) does any encoding transformation. You can set Ruby's Encoding.default_external to something like 'US-ASCII' and then hit an endpoint whose Content-Type is '...; charset=utf-8': net/http will parse the charset string and make it available, but does nothing to the encoding of the body string. Maybe net/http should be responsible for that, but since it isn't, the ParseJson middleware (for example) can blow up.
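The distinction matters because a Ruby string's encoding is only a tag on the bytes; nothing in the transport layer retags or transcodes them. A plain-Ruby illustration (no HTTP involved) of what the adapter hands back versus what `force_encoding` does:

```ruby
# A body often arrives as raw bytes tagged ASCII-8BIT (a.k.a. BINARY),
# even when the bytes themselves are valid UTF-8.
body = "caf\xC3\xA9".b          # .b retags a copy as ASCII-8BIT

body.encoding.name              # => "ASCII-8BIT"

# force_encoding retags the same bytes in place; nothing is transcoded.
body.force_encoding(Encoding::UTF_8)
body.encoding.name              # => "UTF-8"
body == "café"                  # => true

# Concatenating a binary-tagged body containing non-ASCII bytes with
# UTF-8 text is what raises Encoding::CompatibilityError downstream.
```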

chrismo commented Sep 20, 2014

Did some more research on this: some of the underlying adapters handle the Content-Type charset, some don't:

- EM-HTTP-Request does [commit].
- Patron does [commit].
- HTTPClient does [commit] [issue].
- Typhoeus and Excon (and net/http) don't appear to.

I guess the nicest thing would be to offer an optional middleware for the adapters that don't, but I'd agree this probably shouldn't be Faraday's responsibility.

knu added a commit to huginn/huginn that referenced this issue Aug 1, 2015
- The `force_encoding` option in WebsiteAgent is moved to
  WebRequestConcern so other users of the concern such as RssAgent can
  benefit from it.

- WebRequestConcern detects a charset specified in the Content-Type
  header to decode the content properly, and if it is missing the
  content is assumed to be encoded in UTF-8 unless it has a binary MIME
  type.  Not all Faraday adapters handle character encodings, and
  Faraday passes through what is returned from the backend, so we need
  to do this on our own. (cf. lostisland/faraday#139)

- WebRequestConcern now converts text contents to UTF-8, so agents can
  handle non-UTF-8 data without having to deal with encodings
  themselves.  Previously, WebsiteAgent in "json"/"text" modes and
  RssAgent would suffer from encoding errors when dealing with non-UTF-8
  contents.  WebsiteAgent in "html"/"xml" modes did not have this
  problem because Nokogiri would always return results in UTF-8
  independent of the input encoding.

This should fix #608.
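The decoding rule that commit message describes can be sketched as a small helper. The `decode_body` name and the `BINARY_TYPES` pattern are hypothetical illustrations, not the actual WebRequestConcern code:

```ruby
# Rough MIME types to treat as binary; anything else without an
# explicit charset is assumed to be UTF-8, per the rule above.
BINARY_TYPES = %r{\A(?:image|audio|video)/|\Aapplication/octet-stream}

def decode_body(body, content_type)
  charset = content_type.to_s[/charset=([^;\s]+)/i, 1]
  if charset
    body.force_encoding(charset)       # honour the declared charset
  elsif content_type.to_s !~ BINARY_TYPES
    body.force_encoding(Encoding::UTF_8)
  else
    body                               # binary: leave as ASCII-8BIT
  end
end

decode_body('caf\xC3\xA9'.b, 'text/xml; charset=utf-8').encoding
# => #<Encoding:UTF-8>
decode_body(''.b, 'image/png').encoding
# => #<Encoding:ASCII-8BIT>
```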
cyzgbw commented Mar 2, 2016

try https://github.com/qhwa/string_utf8

etipton commented Jul 4, 2016

@chrismo you're my hero. Thanks for doing that research!

semaperepelitsa (Contributor) commented
This has been implemented for Net HTTP adapter: lostisland/faraday-net_http#6
