Skip to content

Commit

Permalink
Implement charset handling in WebRequestConcern
Browse files Browse the repository at this point in the history
- The `force_encoding` and `unzip` options in WebsiteAgent is moved to
  WebRequestConcern so other users of the concern such as RssAgent can
  benefit from them.

- WebRequestConcern detects a charset specified in the Content-Type
  header to decode the content properly, and if it is missing the
  content is assumed to be encoded in UTF-8 unless it has a binary MIME
  type.  Not all Faraday adopters handle character encodings, and
  Faraday passes through what is returned from the backend, so we need
  to do this on our own. (cf. lostisland/faraday#139)

- WebRequestConcern now converts text contents to UTF-8, so agents can
  handle non-UTF-8 data without having to deal with encodings
  themselves.  Previously, WebsiteAgent in "json"/"text" modes and
  RssAgent would suffer from encoding errors when dealing with non-UTF-8
  contents.  WebsiteAgent in "html"/"xml" modes did not have this
  problem because Nokogiri would always return results in UTF-8
  independent of the input encoding.

This should fix #608.
  • Loading branch information
knu committed Aug 1, 2015
1 parent 77ec022 commit c047ed8
Show file tree
Hide file tree
Showing 3 changed files with 65 additions and 21 deletions.
64 changes: 63 additions & 1 deletion app/concerns/web_request_concern.rb
Original file line number Diff line number Diff line change
Expand Up @@ -14,6 +14,46 @@ def self.decode(val)
end
end

class CharacterEncoding < Faraday::Middleware
def initialize(app, force_encoding: nil, default_encoding: nil, unzip: nil)
super(app)
@force_encoding = force_encoding
@default_encoding = default_encoding
@unzip = unzip
end

def call(env)
@app.call(env).on_complete do |env|
body = env[:body]

case @unzip
when 'gzip'.freeze
body.replace(ActiveSupport::Gzip.decompress(body))
end

case
when @force_encoding
encoding = @force_encoding
when body.encoding == Encoding::ASCII_8BIT
# Not all Faraday adapters support automatic charset
# detection, so we do that.
case env[:response_headers][:content_type]
when /;\s*charset\s*=\s*([^()<>@,;:\\\"\/\[\]?={}\s]+)/i
encoding = Encoding.find($1) rescue nil
when /\A\s*(?:text\/[^\s;]+|application\/(?:[^\s;]+\+)?(?:xml|json))\s*(?:;|\z)/i
encoding = @default_encoding
else
# Never try to transcode a binary content
return
end
end
body.encode!(Encoding::UTF_8, encoding) unless body.encoding == Encoding::UTF_8
end
end
end

Faraday::Response.register_middleware character_encoding: CharacterEncoding

extend ActiveSupport::Concern

def validate_web_request_options!
Expand All @@ -34,6 +74,23 @@ def validate_web_request_options!
rescue ArgumentError => e
errors.add(:base, e.message)
end

if (encoding = options['force_encoding']).present?
case encoding
when String
begin
Encoding.find(encoding)
rescue ArgumentError
errors.add(:base, "Unknown encoding: #{encoding.inspect}")
end
else
errors.add(:base, "force_encoding must be a string")
end
end
end

def default_encoding
Encoding::UTF_8
end

def faraday
Expand All @@ -44,14 +101,19 @@ def faraday
}

@faraday ||= Faraday.new(faraday_options) { |builder|
builder.response :character_encoding,
force_encoding: interpolated['force_encoding'].presence,
default_encoding: default_encoding,
unzip: interpolated['unzip'].presence

builder.headers = headers if headers.length > 0

builder.headers[:user_agent] = user_agent

builder.use FaradayMiddleware::FollowRedirects
builder.request :url_encoded

if boolify(options['disable_url_encoding'])
if boolify(interpolated['disable_url_encoding'])
builder.options.params_encoder = DoNotEncoder
end

Expand Down
1 change: 1 addition & 0 deletions app/models/agents/rss_agent.rb
Original file line number Diff line number Diff line change
Expand Up @@ -29,6 +29,7 @@ class RssAgent < Agent
* `basic_auth` - Specify HTTP basic auth parameters: `"username:password"`, or `["username", "password"]`.
* `disable_ssl_verification` - Set to `true` to disable ssl verification.
* `disable_url_encoding` - Set to `true` to disable url encoding.
* `force_encoding` - Set `force_encoding` to an encoding name if the website is known to respond with a missing, invalid or wrong charset in the Content-Type header. Note that a text content without a charset is taken as encoded in UTF-8 (not ISO-8859-1).
* `user_agent` - A custom User-Agent name (default: "Faraday v#{Faraday::VERSION}").
* `max_events_per_run` - Limit number of events created (items parsed) per run for feed.
Expand Down
21 changes: 1 addition & 20 deletions app/models/agents/website_agent.rb
Original file line number Diff line number Diff line change
Expand Up @@ -87,7 +87,7 @@ class WebsiteAgent < Agent
Set `uniqueness_look_back` to limit the number of events checked for uniqueness (typically for performance). This defaults to the larger of #{UNIQUENESS_LOOK_BACK} or #{UNIQUENESS_FACTOR}x the number of detected received results.
Set `force_encoding` to an encoding name if the website does not return a Content-Type header with a proper charset.
Set `force_encoding` to an encoding name if the website is known to respond with a missing, invalid or wrong charset in the Content-Type header. Note that a text content without a charset is taken as encoded in UTF-8 (not ISO-8859-1).
Set `user_agent` to a custom User-Agent name if the website does not like the default value (`#{default_user_agent}`).
Expand Down Expand Up @@ -157,19 +157,6 @@ def validate_options
errors.add(:base, "Invalid uniqueness_look_back format") unless is_positive_integer?(options['uniqueness_look_back'])
end

if (encoding = options['force_encoding']).present?
case encoding
when String
begin
Encoding.find(encoding)
rescue ArgumentError
errors.add(:base, "Unknown encoding: #{encoding.inspect}")
end
else
errors.add(:base, "force_encoding must be a string")
end
end

validate_web_request_options!
end

Expand Down Expand Up @@ -284,12 +271,6 @@ def check_url(url, payload = {})
interpolation_context.stack {
interpolation_context['_response_'] = ResponseDrop.new(response)
body = response.body
if (encoding = interpolated['force_encoding']).present?
body = body.encode(Encoding::UTF_8, encoding)
end
if interpolated['unzip'] == "gzip"
body = ActiveSupport::Gzip.decompress(body)
end
doc = parse(body)

if extract_full_json?
Expand Down

0 comments on commit c047ed8

Please sign in to comment.