
Implement charset handling in WebRequestConcern #950

Merged (1 commit into master from web_content_charset, Aug 3, 2015)
Conversation

@knu (Member) commented Aug 1, 2015

- The `force_encoding` and `unzip` options in WebsiteAgent are moved to
  WebRequestConcern so other users of the concern, such as RssAgent, can
  benefit from them.

- WebRequestConcern detects a charset specified in the Content-Type
  header and decodes the content accordingly; if the charset is missing,
  the content is assumed to be encoded in UTF-8 unless it has a binary
  MIME type.  Not all Faraday adapters handle character encodings, and
  Faraday passes through whatever the backend returns, so we need to do
  this on our own. (cf. "response.body is ASCII-8BIT when Content-Type
  is text/xml; charset=utf-8", lostisland/faraday#139)

- WebRequestConcern now converts text content to UTF-8, so agents can
  handle non-UTF-8 data without having to deal with encodings
  themselves.  Previously, WebsiteAgent in "json"/"text" modes and
  RssAgent would suffer from encoding errors when dealing with non-UTF-8
  content.  WebsiteAgent in "html"/"xml" modes did not have this problem
  because Nokogiri always returns results in UTF-8 regardless of the
  input encoding.
This should fix #608.
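The decoding strategy described above can be sketched as follows. This is a minimal illustration, not the actual WebRequestConcern code: `decode_body` is a hypothetical helper, and real code would also need to handle unknown or invalid charset names.

```ruby
# Sketch of the charset-handling strategy (hypothetical helper, not
# the actual WebRequestConcern implementation):
#   1. Use the charset from the Content-Type header if present.
#   2. Otherwise assume UTF-8 for text-like MIME types.
#   3. Leave binary content untouched.
def decode_body(body, content_type)
  charset =
    if content_type =~ /;\s*charset\s*=\s*([^;\s]+)/i
      $1
    elsif content_type =~ %r{\A(?:text/|application/(?:json|xml))}
      'UTF-8'  # no charset given: assume UTF-8 for text types
    end
  return body if charset.nil?  # binary MIME type: pass through as-is
  # Tag the raw bytes with the detected charset, then transcode to
  # UTF-8 so downstream agents never see non-UTF-8 strings.
  body.force_encoding(charset).encode('UTF-8')
end
```

With this, a Latin-1 response such as `decode_body("caf\xE9".b, 'text/html; charset=iso-8859-1')` comes back as the UTF-8 string `"café"`, while an `application/octet-stream` body is returned unchanged.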

An inline review thread was attached to this charset-detection code in the diff:

```ruby
# Not all Faraday adapters support automatic charset
# detection, so we do that.
case env[:response_headers][:content_type]
when /;\s*charset\s*=\s*([^()<>@,;:\\\"\/\[\]?={}\s]+)/i
```
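The character class in that pattern excludes the HTTP separator characters, so it captures a bare charset token from the `charset=` parameter. A quick check of how it behaves (the constant name below is just for illustration; the regex itself is taken from the diff):

```ruby
# Same regex as in the diff above, bound to a name for testing.
CHARSET_RE = /;\s*charset\s*=\s*([^()<>@,;:\\\"\/\[\]?={}\s]+)/i

CHARSET_RE.match('text/xml; charset=UTF-8')[1]        # captures "UTF-8"
CHARSET_RE.match('application/json;charset=iso-8859-1')[1]
                                                      # captures "iso-8859-1"
CHARSET_RE.match('text/html')                          # no charset: nil
```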
@knu (Member Author) replied:
I think so. More detection logic can be added later: BOM, XML declaration, HTML `<meta>` elements, etc.
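One of those suggested extensions, BOM sniffing, could look roughly like this. This is a hypothetical sketch of future work, not code from this PR, and it ignores UTF-32 BOMs (whose `FF FE 00 00` prefix would be misread as UTF-16LE here):

```ruby
# Hypothetical BOM-based charset detection (not part of this PR):
# inspect the first bytes of the raw body for a byte order mark.
def charset_from_bom(bytes)
  case bytes
  when /\A\xEF\xBB\xBF/n then 'UTF-8'
  when /\A\xFF\xFE/n     then 'UTF-16LE'
  when /\A\xFE\xFF/n     then 'UTF-16BE'
  end  # nil when no BOM is found
end
```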

@cantino (Member) commented Aug 1, 2015

This looks really good!

knu added a commit that referenced this pull request Aug 3, 2015
Implement charset handling in WebRequestConcern
@knu knu merged commit d14027c into master Aug 3, 2015
@knu knu deleted the web_content_charset branch August 3, 2015 13:27
@knu (Member Author) commented Aug 3, 2015

@cantino Please feel free to improve the charset detection part!
