Maremma does not encode utf8 string properly #17

Open
orangewolf opened this issue Nov 23, 2021 · 0 comments
Maremma seems to be suffering from this UTF-8 bug: lostisland/faraday#139
Basically, Excon does not mark the response string as UTF-8. The string is therefore treated as ASCII, and its non-ASCII characters are replaced with "?" in the parse_response method.

Example:

url = "https://api.crossref.org/works/10.1038/nature14474/transform/application/vnd.crossref.unixsd+xml"

response = Maremma.get(url, accept: "text/xml;charset=utf-8", raw: true)
<?xml version="1.0" encoding="UTF-8"?>
<crossref_result xmlns="http://www.crossref.org/qrschema/3.0" version="3.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://www.crossref.org/qrschema/3.0 http://www.crossref.org/schemas/crossref_query_output3.0.xsd">
  <query_result>
    <head>
      <doi_batch_id>none</doi_batch_id>
    </head>
    <body>
      <query status="resolved">
        <doi type="journal_article">10.1038/nature14474</doi>
        <crm-item name="publisher-name" type="string">Springer Science and Business Media LLC</crm-item>
        <crm-item name="prefix-name" type="string">Springer Science and Business Media LLC</crm-item>
        <crm-item name="member-id" type="number">297</crm-item>
        <crm-item name="citation-id" type="number">75327788</crm-item>
        <crm-item name="journal-id" type="number">3415</crm-item>
        <crm-item name="deposit-timestamp" type="number">20191101103854578</crm-item>
        <crm-item name="owner-prefix" type="string">10.1038</crm-item>
        <crm-item name="last-update" type="date">2019-11-01T11:11:21Z</crm-item>
        <crm-item name="created" type="date">2015-05-12T15:48:08Z</crm-item>
        <crm-item name="citedby-count" type="number">290</crm-item>
        <doi_record>
          <crossref xmlns="http://www.crossref.org/xschema/1.1" xsi:schemaLocation="http://www.crossref.org/xschema/1.1 http://doi.crossref.org/schemas/unixref1.1.xsd">
            <journal>
              <journal_metadata language="en">
                <full_title>Nature</full_title>
                <abbrev_title>Nature</abbrev_title>
                <issn media_type="print">0028-0836</issn>
                <issn media_type="electronic">1476-4687</issn>
              </journal_metadata>
              <journal_issue>
                <publication_date media_type="print">
                  <month>6</month>
                  <year>2015</year>
                </publication_date>
                <journal_volume>
                  <volume>522</volume>
                </journal_volume>
                <issue>7554</issue>
              </journal_issue>
              <journal_article publication_type="full_text">
                <titles>
                  <title>Observation of the rare Bs0 ?????+????? decay from the combined analysis of CMS and LHCb data</title>
                </titles>
...

I think there are lots of ways to solve this, but here are two suggestions:

Force the encoding

Maremma.class_eval do
  def self.parse_response(string, options = {})
    string = string.dup
    string =
      if options[:skip_encoding]
        string
      else
        # Label the bytes as UTF-8 first, then replace any invalid sequences.
        string.force_encoding('utf-8').encode(
          Encoding.find("UTF-8"),
          invalid: :replace,
          undef: :replace,
          replace: "?"
        )
      end
    return string if options[:raw]

    from_json(string) || from_xml(string) || from_string(string)
  end
end

Note the addition of force_encoding('utf-8').
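
To show why that one call matters, here is a minimal standalone sketch that does not use Maremma at all. It assumes the body arrives as valid UTF-8 bytes that are labelled US-ASCII (that label is my assumption for illustration, I have not confirmed exactly what Excon returns), and the sample string is made up.

```ruby
# Sample body: valid UTF-8 bytes, but labelled US-ASCII (assumed for illustration).
body = "B\u2192\u03BC\u03BC".b.force_encoding("US-ASCII")

replace_opts = { invalid: :replace, undef: :replace, replace: "?" }

# Without force_encoding: every non-ASCII byte is invalid in US-ASCII,
# so each one is replaced during the conversion to UTF-8.
without_force = body.dup.encode(Encoding::UTF_8, **replace_opts)

# With force_encoding: the bytes are relabelled as UTF-8 before encode runs,
# so valid sequences pass through untouched.
with_force = body.dup.force_encoding("UTF-8").encode(Encoding::UTF_8, **replace_opts)

puts without_force
puts with_force
```

The replacement happens per byte, which would explain why a single multi-byte character turns into several question marks in the title above.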

faraday-encoding middleware

Another option would be to use the faraday-encoding middleware. That's probably a less blunt solution, but I didn't try implementing it. https://github.com/ma2gedev/faraday-encoding
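
As a rough idea of what that kind of middleware does, here is a plain-Ruby sketch that labels a body with the charset declared in its Content-Type header. The helper name apply_declared_charset and the UTF-8 fallback are hypothetical, not the gem's actual code, and I have not wired this into Maremma.

```ruby
# Hypothetical helper (not part of Maremma or faraday-encoding):
# label the body with the charset named in the Content-Type header,
# falling back to a default when the charset is missing or unknown.
def apply_declared_charset(body, content_type, default: Encoding::UTF_8)
  charset = content_type.to_s[/;\s*charset=\s*"?([^";\s]+)/i, 1]
  encoding = begin
    charset ? Encoding.find(charset) : default
  rescue ArgumentError
    default # unknown charset name
  end
  # Only the label changes; the bytes themselves are left as they are.
  body.dup.force_encoding(encoding)
end

raw = "\u03BC".b # binary-labelled bytes, as an adapter might hand them over
fixed = apply_declared_charset(raw, "text/xml;charset=utf-8")
puts fixed.encoding
puts fixed.valid_encoding?
```

Compared with forcing UTF-8 everywhere, this respects whatever the server declares, which is why it seems like the less blunt option.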
