Iptc.Envelope.CharacterSet not reported correctly #1003

Closed
tester0077 opened this issue Sep 19, 2019 · 11 comments
Labels
notReproducible Reported bugs not confirmed

@tester0077
Collaborator

exiv2-0.27.2 does not decode the data of the IPTC tag 0x005a correctly

For an image which contains the tag, the output from Exiv2 is
0x005a Envelope Iptc.Envelope.CharacterSet CharacterSet Character Set �%G <<<<<< should be translated to UTF-8

I am attaching an image which shows the output, but there was a discussion some time ago @ https://dev.exiv2.org/boards/3/topics/1288 where Robin showed the output from one of his images
C:\Users\rmills\Desktop>exiv2 -pi robin.jpg
Iptc.Envelope.ModelVersion Short 1 4
Iptc.Envelope.CharacterSet String 3 ←%G

In my case the output was from a command line such as:
D:>D:\pkg\C++\MSVC2017\exiv2-master-0.27.2\exiv2-0.27.2-Source\build32ReleaseStatic\bin\exiv2.exe -PIXxgklnt D:\wxIctest\Media\headst\indi\PerdueAnnElizabeth-GraveMarker-FAG-62151533_132727369452.jpg

Running under Windows 10.

With some searching, I have been able to track down the following bits and pieces
The output comes from Exiv2's actions.cpp, somewhere between lines 662 and 758 or thereabouts.
The above output was taken from a UTF-8 enabled Win 10 DOS console window; without the UTF-8 facility the string �%G does not show.
Initially I found the issue in my test app, which is able to handle UTF-8 strings.

It seems this is a rather obscure part of the standards, but I have been able to find a reference in the ISO/IEC 10646 standard of 2017/12 on page 19
obtainable from: https://standards.iso.org/ittf/PubliclyAvailableStandards/c069119_ISO_IEC_10646_2017.zip

12.2 Identification of a UCS encoding scheme
When the escape sequences from ISO/IEC 2022 are used, the identification of a UCS encoding scheme (see
Clause 10) specified by this International Standard shall be by a designation sequence chosen from the following
list:
ESC 02/05 02/15 04/09
UTF-8 encoding form; UTF-8 encoding scheme
ESC 02/05 02/15 04/12
UTF-16 encoding form; UTF-16BE encoding scheme
ESC 02/05 02/15 04/06
UTF-32 encoding form; UTF-32BE encoding scheme
NOTE – The following designation sequences: ESC 02/05 02/15 04/00, ESC 02/05 02/15 04/01, ESC 02/05 02/15 04/03, ESC 02/05
02/15 04/04, ESC 02/05 02/15 04/07, ESC 02/05 02/15 04/08, ESC 02/05 02/15 04/10, ESC 02/05 02/15 04/11 used in previous
versions of this standard to identify implementation levels 1 and 2 are deprecated. The remaining designation sequences correspond
to the former level 3 which is now the only supported content definition for code unit sequences.
ESC 02/05 04/07
UTF-8 encoding form; UTF-8 encoding scheme
If such an escape sequence appears within a code unit sequence conforming to ISO/IEC 2022, it shall consist
only of the sequences of bit combinations as shown above.
If such an escape sequence appears within a code unit sequence conforming to this International Standard, it
shall be padded in accordance with Clause 11 when the identified encoding form is either UTF-16 or UTF-32.
No padding is necessary when the identified encoding form is UTF-8. See also 12.5.
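
For illustration, the designation sequences listed above could be mapped to names along the following lines. This is a minimal sketch, not Exiv2 code: `charsetFromDesignation` is a hypothetical helper, and it covers only the four sequences quoted from the standard (02/05 is `%`, 02/15 is `/`, 04/09 is `I`, 04/12 is `L`, 04/06 is `F`, 04/07 is `G`).

```cpp
#include <string>

// Hypothetical helper (not part of Exiv2): maps the raw bytes of an
// ISO/IEC 2022 designation sequence to the encoding scheme named in the
// ISO/IEC 10646 excerpt quoted above. Returns "" for anything else.
std::string charsetFromDesignation(const std::string& raw)
{
    if (raw == "\x1b%/I") return "UTF-8";     // ESC 02/05 02/15 04/09
    if (raw == "\x1b%/L") return "UTF-16BE";  // ESC 02/05 02/15 04/12
    if (raw == "\x1b%/F") return "UTF-32BE";  // ESC 02/05 02/15 04/06
    if (raw == "\x1b%G")  return "UTF-8";     // ESC 02/05 04/07
    return "";                                // not in the quoted list
}
```

Calling `charsetFromDesignation("\x1b%G")` yields `"UTF-8"`; an unlisted sequence yields an empty string, so the caller can fall back to showing the raw value.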

[attached image: PerdueAnnElizabeth-GraveMarker-FAG-62151533_132727369452]

There are references to this 'conversion/translation' in convert.cpp at about line 1174, though in my limited testing this code was not reached.
Still, I am not at all familiar with the workings of the Exiv2 code that deep in the bowels of the libraries and so I have shelved this issue at my end for now. FWIW, Exiftool does identify this 'code' correctly.

In the image I am attaching, there is no text which would require knowledge of the character set, but some other images in the same series definitely contain UTF-8 encoded strings, so that knowing the character set becomes important.

AFAIK, all this data was added by XnViewMP, which seems to stick with this character set by default. I would expect that there will be images about with different character sets, though I have not come across any others.

@tester0077 tester0077 added the bug label Sep 19, 2019
@clanmills clanmills self-assigned this Sep 19, 2019
@clanmills clanmills added this to the v0.27.4 milestone Sep 19, 2019
@clanmills
Collaborator

@tester0077 Arnold. I'd like to help here, however I simply don't know what this is about. And that's about what I said 7 years ago on: https://dev.exiv2.org/boards/3/topics/1288

Here's what I see. There are 3 bytes of metadata for Iptc.Envelope.CharacterSet and they are 'ESC' '%' 'G'

656 rmills@rmillsmbp:~/Pictures $ exiv2 -K Iptc.Envelope.CharacterSet  ~/Downloads/65270223-8f5f8c80-dacf-11e9-8863-feb8fe6b779f.jpg | od -a
0000000    I   p   t   c   .   E   n   v   e   l   o   p   e   .   C   h
0000020    a   r   a   c   t   e   r   S   e   t  sp  sp  sp  sp  sp  sp
0000040   sp  sp  sp  sp  sp  sp  sp  sp  sp  sp  sp  sp  sp   S   t   r
0000060    i   n   g  sp  sp  sp  sp  sp  sp   3  sp  sp esc   %   G  nl
0000100
657 rmills@rmillsmbp:~/Pictures $ 

There's no question that those bytes are in the file:

659 rmills@rmillsmbp:~/Pictures $ exiv2 -pR ~/Downloads/65270223-8f5f8c80-dacf-11e9-8863-feb8fe6b779f.jpg  | grep -A 2 -B 4 Record
       178 | 0x9c9e XPKeywords                   |      BYTE |       20 |       342 | H.e.a.d.s.t.o.n.e..
  END MemIo
     884 | 0xffe1 APP1  |    5634 | http://ns.adobe.com/xap/1.0/.<?x
    6520 | 0xffed APP13 |     414 | Photoshop 3.0.8BIM..........Z...
  Record | DataSet | Name                     | Length | Data
       1 |      90 | CharacterSet             |      3 | .%G
       2 |       5 | ObjectName               |     20 | Perdue Ann Elizabeth
660 rmills@rmillsmbp:~/Pictures $ 

I can't remember what a UTF-8 Enabled DOS console is, however I think you're saying that the DOS console does not present the data correctly. Isn't that an issue with the DOS Console?

I don't understand the standard you have referenced. As with the previous discussion of this matter, I don't know what is being discussed or what is expected. I believe Exiv2 is copying the data correctly from the file.

Could you provide a suitable test file and relevant output from ExifTool?

@clanmills clanmills modified the milestones: v0.27.4, v0.27.3 Sep 20, 2019
@clanmills clanmills added notReproducible Reported bugs not confirmed and removed bug labels Sep 20, 2019
@tester0077
Collaborator Author

DOS Console: the default console uses a code page which does not handle UTF-8 characters. This is important because without UTF-8 support, running exiv2 from the command line shows no output at all for the command I quoted run against the image in question. One can set up a UTF-8 enabled console via a batch file to handle the output from exiv2, which then does show the output as shown in your example. Unfortunately, my example output is mangled in the Github window and I did not catch that the first time around.
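
One console-independent way to inspect such a value is to render control bytes visibly before printing. The following is a minimal sketch under that assumption; `showControlBytes` is a hypothetical name, not something exiv2 provides.

```cpp
#include <cstdio>
#include <string>

// Hypothetical helper: returns a printable copy of raw metadata bytes.
// ESC (0x1b) becomes "<ESC>", other non-printable or non-ASCII bytes
// become "<XX>" in hex, and printable ASCII is copied unchanged.
std::string showControlBytes(const std::string& raw)
{
    std::string out;
    for (unsigned char c : raw) {
        if (c == 0x1b) {
            out += "<ESC>";
        } else if (c < 0x20 || c >= 0x7f) {
            char buf[8];
            std::snprintf(buf, sizeof buf, "<%02X>", static_cast<unsigned>(c));
            out += buf;
        } else {
            out += static_cast<char>(c);
        }
    }
    return out;
}
```

With this, the three bytes of the CharacterSet value would print as `<ESC>%G` whatever code page the console uses.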

Output for Iptc.Envelope.CharacterSet is 'correct' as far as it goes, but it ought to be translated to identify the intended char set as per the ISO standard I referenced. It is a very strange mapping of a very unusual sequence of characters to a specific character set, and for many images in my case it does matter, although I can get away with ignoring it and defaulting to UTF-8.

The Exiv2 IPTC page @ https://www.exiv2.org/iptc.html identifies it as a string of 'control functions'

Below is the relevant section of the Exiftool output. Exiftool seems to output the data in an order I don't understand, but it does translate the control sequence to what seems to be the correct equivalent. FWIW, it (and utilities which use it as a base) is the only tool I have found to do so, with the exception of the dumpfile utility which is/was part of the Adobe XMP SDK distribution.
Exiftool output
Description : Perdue Ann Elizabeth wife of Abram G. Weaver
Title : Perdue Ann Elizabeth
Subject : People, Perdue Weaver, Elizabeth, Categories, Genealogy, Place, Headstones, USA, Alabama, Brewton, Union Cemetery, Place
Last Keyword IPTC : Headstone
Last Keyword XMP : Headstone
Person In Image : Perdue Ann Elizabeth
Hierarchical Subject : People, People|Perdue Weaver, People|Perdue Weaver|Elizabeth, Categories, Categories|Genealogy, Place, Categories|Genealogy|Headstones, Place|USA, Place|USA|Alabama, Place|USA|Alabama|Brewton, Place|USA|Alabama|Brewton|Union Cemetery, Categories|Genealogy|Photos|Place
Current IPTC Digest : 7e7da8efd318594777e68967b7ea7303
Coded Character Set : UTF8 <<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<
Object Name : Perdue Ann Elizabeth
Supplemental Categories : Genealogy|Headstones

Dumpfile output:
D:\pkg\C++\MSVC2017\XMP-Toolkit-SDK-CC201607-old\samples\target\windows\Debug>DumpFile.exe -tree D:\wxIctest\Media\headst\indi\PerdueAnnElizabeth-GraveMarker-FAG-62151533_132727369452.jpg
Dumping JPEG file
[size 31384 (0x7a98)]
JPEG:FFD8
[offset 0 (0x0),SOI]
JPEG:FFE0
[offset 2 (0x2),size 16, APP0, "JFIF"]
.....
Photoshop Image Resources 1
[from JPEG Photoshop APP13, offset 7148 (0x1BEC), size 452]
PSIR:IPTC
IPTC data
[from PSIR #1028, offset 7160 (0x1BF8), size 440]
IPTC:1:90
[offset 7160 (0x1BF8), size 3,encoding = 0x1B2547, (UTF-8)] <<<<<<<<<<<<<<<<
IPTC:2:5 = 'Perdue Ann Elizabeth'
[offset 7168 (0x1C00), size 20,Title]
IPTC:2:20 = 'Genealogy|Headstones'
[offset 7193 (0x1C19), size 20,Supplemental Category]
IPTC:2:25 = 'People'
[offset 7218 (0x1C32), size 6,Keyword]
IPTC:2:25 = 'Perdue Weaver'

It also took me a good bit of time to understand what I was looking for, where I might find a reference and even how to read this reference.
The ISO standard I quoted needs some interpretation, using other references as to how to 'read' its way of specifying octets, which is defined in earlier sections of that standard.
Quoting from the standard:

ESC 02/05 04/07
UTF-8 encoding form; UTF-8 encoding scheme

and applying the necessary translation, we get
ESC 02/05 04/07
ESC 0x25 0x47
ESC % G => UTF-8

The same page identifies a number of other control sequences, but labels several of these as 'deprecated'.
As I said, I have no images with any of the alternatives, though I expect they are out there. Still those are of no concern to me - as yet ;-)
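
The column/row translation above can be sketched in code. This is only an illustration under the assumption that each `cc/rr` token denotes the byte `cc * 16 + rr`, which matches the worked example (02/05 gives 0x25, 04/07 gives 0x47); `bytesFromColumnRow` is a hypothetical name.

```cpp
#include <cstddef>
#include <sstream>
#include <string>

// Hypothetical helper: converts ISO/IEC 2022 column/row notation such as
// "ESC 02/05 04/07" into raw bytes. Each "cc/rr" token becomes the byte
// cc * 16 + rr and the token "ESC" becomes 0x1b. Returns "" when a token
// has no '/'. std::stoi throws if a field is not a decimal number.
std::string bytesFromColumnRow(const std::string& notation)
{
    std::istringstream in(notation);
    std::string token;
    std::string out;
    while (in >> token) {
        if (token == "ESC") {
            out += '\x1b';
            continue;
        }
        const std::size_t slash = token.find('/');
        if (slash == std::string::npos) return "";
        const int col = std::stoi(token.substr(0, slash));
        const int row = std::stoi(token.substr(slash + 1));
        out += static_cast<char>(col * 16 + row);
    }
    return out;
}
```

For example, `bytesFromColumnRow("ESC 02/05 04/07")` produces the three bytes ESC, `%`, `G`.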

And finally:

From the Exiftool pages @ https://sno.phy.queensu.ca/~phil/exiftool/TagNames/IPTC.html#EnvelopeRecord

90 | CodedCharacterSet | string[0,32]! | (values are entered in the form "ESC X Y[, ...]". The escape sequence for UTF-8 character coding is "ESC % G", but this is displayed as "UTF8" for convenience. Either string may be used when writing. The value of this tag affects the decoding of string values in the Application and NewsPhoto records. This tag is marked as "unsafe" to prevent it from being copied by default in a group operation because existing tags in the destination image may use a different encoding. When creating a new IPTC record from scratch, it is suggested that this be set to "UTF8" if special characters are a possibility)

@clanmills
Collaborator

Arnold. There's lots of detail here. However I don't know what you want me to change.

Am I to change:

720 rmills@rmillsmbp:~/gnu/github/exiv2/0.27-maintenance/build $ exiv2 -K Iptc.Envelope.CharacterSet ~/Downloads/65270223-8f5f8c80-dacf-11e9-8863-feb8fe6b779f.jpg 
Iptc.Envelope.CharacterSet                   String      3  G
721 rmills@rmillsmbp:~/gnu/github/exiv2/0.27-maintenance/build $ 

to output UTF-8? Why change it? Are we to filter every IPTC field in some way?

Exiv2 is a library: we read the data from the file and give it to the caller. If it needs special treatment to display correctly in a DOS box, a GUI or any presentation layer, I'm not convinced that is our responsibility.

For that matter, I'm not sure that Exiv2 can even determine that it is outputting to a DOS box. For sure, the library does not know. I suppose the exiv2(.exe) command-line program might be able to detect this. However the purpose of exiv2.exe is to act as a test harness for the library.

@tester0077
Collaborator Author

The short answer: yes, I would expect Exiv2 to translate the 'control sequence' to what it is meant to convey to the user of Exiv2 or the library, just as it 'translates' the labels, camera model, lens type etc, etc to something meaningful (to an English speaker :-) )

The reference to the DOS box was necessary - albeit (potentially) distracting - because Exiv2, Exiftool and even dumpfile are command line tools and as such depend to some degree on the idiosyncrasies of the 'terminal emulator' they are having to report to.
Under Linux, users who only deal with the ASCII equivalent of the various character sets are isolated from much of this because Linux has adopted UTF as a standard, good ol' M$ has not - yet

@clanmills
Collaborator

@tester0077 I'm going to close this. You're requesting something for which we have no specification. You may find somebody else willing to undertake this, however I will not.

@tester0077
Collaborator Author

Well ,,,, shrug ... your call.
I did provide the specification. c/w references and examples from other utilities .....
And knowing the encoding used for all of the IPTC strings is/was/will be important to someone - they went to considerable trouble to set up the standard.

FWIW, this issue must have been considered in the past by Exiv2 developers.
From the Exiv2 0.27.2 code

D:\pkg\C++\MSVC2017\exiv2-master-0.27.2\exiv2-0.27.2-Source\src\convert.cpp(1174): (*iptcData_)["Iptc.Envelope.CharacterSet"] = "\033%G"; // indicate UTF-8 encoding
D:\pkg\C++\MSVC2017\exiv2-master-0.27.2\exiv2-0.27.2-Source\src\convert.cpp(1195): if (added) (*iptcData_)["Iptc.Envelope.CharacterSet"] = "\033%G"; // indicate UTF-8 encoding
D:\pkg\C++\MSVC2017\exiv2-master-0.27.2\exiv2-0.27.2-Source\src\iptc.cpp(408): if (value == "\033%G") return "UTF-8";

It just doesn't seem to be used and I cannot afford to dig in deep enough to become conversant enough with the Exiv2 code just to figure out why this bit isn't used where and how it should have been.

@clanmills
Collaborator

I apologise for saying "No" in such an abrupt manner.

It's possible that the code exists. Andreas (the founder of Exiv2) included the iconv library.

Team Exiv2 currently have two objectives:

  1. "the dots" (v0.27.3 ....)
  2. v0.28 (modernisation to C++11).

I know you intend well by raising these issues. However, please remember we are a small team of volunteers. For v0.27.3, I'm working on matters which you have raised such as taglist, and README-SAMPLES.md.

You are respected. However we cannot fix every concern you raise.

@tester0077
Collaborator Author

Understood & no problem, Robin.
My apologies if I come across as 'pushy'?
From my perspective, once I understand the problem, I will find a way to handle it.
In bringing it to your attention, my main goal is/was to record the issue. Whether, when or how you and your team decide to address the issue is your call.
I did & do not expect to have any of the issues I raised fixed 'stat' - I am a 'team' of one 'volunteer' myself.
In this case, I was as much interested in recording some of the relevant background information for your team as for anyone else who might run into the issue. Sometimes knowing why is as good as a 'fix'

@tester0077
Collaborator Author

After getting the IPTC data sorted out, I would like to record here my implementation.
```cpp
try {
  Exiv2::Image::AutoPtr image = Exiv2::ImageFactory::open( CUtils::wx2std(a_wsFileName) );

  wxASSERT( image.get() != 0 );
  image->readMetadata();

  Exiv2::IptcData &iptcData = image->iptcData();

  // -----------------------

  int i = 0;  // int, so that it matches the %d format below
  // it seems that the char set definition ought to precede any string definition
  wxString wsCharSet = _T("unknown");
  wxString wsIptcLabel( _T("IptcLabel"));
  wxString wsValue;
  for (Exiv2::IptcData::const_iterator itr = iptcData.begin(); itr != iptcData.end(); ++itr)
  {
    wxString wsLabel;
    // necessary to give each entry a different wxPG 'label'
    // to avoid complaints from code about entries with same label
    wsLabel.Printf( _T("%s-%d"), wsIptcLabel, i++ );
    std::string key = itr->tagLabel();
    std::string value = itr->toString();
    if (itr->value().ok())
    {
      // key and value are std::string, so compare against narrow literals
      if( (key == "Iptc.Envelope.CharacterSet") ||
        ( key == "Character Set") )
      {
        if( !value.empty() )
        {
          if( (value == "\x1b%G") || (value == "UTF-8") )  // ESC % G
          {
            value = "UTF-8";
            wsCharSet = _T("UTF-8");
          }
          else if( value == "ASCII" )
          {
            value = "ASCII";
            wsCharSet = _T("ASCII");
          }
          else if( value == "\x1b.A" ) // ESC . A
          {
            value = "ISO 8859-1";
          }
          else
          {
            value = "unknown IPTC Char set";
          }
        }
      }
      // see the enum types in exiv2 types.hpp
      // also note that IPTC has special types for string, date & time
      wxString wsType = itr->typeName();
      int iTypeId = itr->typeId();
      wsValue = value;
      switch ( iTypeId )
      {
      case Exiv2::TypeId::string:   // IPTC string
        // need to convert depending on the char set determined above;
        // decode from the raw bytes, not from the already converted wxString
        if( wsCharSet.IsSameAs( _T("UTF-8") ))
          wsValue = wxString::FromUTF8( value.c_str() );
        break;
      case Exiv2::TypeId::date:   // todo!!
      case Exiv2::TypeId::time:   // todo!!
      default:
        break;
      }
      a_pPropPage->AppendIn( pImIptc, new wxStringProperty(
        key, wsLabel, wsValue ) );
      if ( g_iniPrefs.data[IE_LOG_VERBOSITY].dataCurrent.lVal > 4 )
      {
        wsT.Printf( _( "IPTC key: %s val:  %s" ), key, wsValue );
        wxLogMessage( wsT );
      }
    }
  }
}
catch (const std::exception &e) {
  // the original snippet ended before its catch clause; Exiv2 errors derive from std::exception
  wxLogError( _T("%s"), e.what() );
}
```

Some of this code depends on wxWidgets - but can easily be adapted to other frameworks
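
As a rough idea of what a framework-independent version might look like, the sketch below uses only the C++ standard library. It assumes, as the code above does, that `ESC . A` marks ISO 8859-1 and that anything else can be passed through unchanged; `latin1ToUtf8` and `iptcStringToUtf8` are hypothetical names, not Exiv2 or wxWidgets functions.

```cpp
#include <string>

// Hypothetical helper: re-encodes ISO 8859-1 bytes as UTF-8. Every Latin-1
// byte >= 0x80 maps to the code point of the same value, which UTF-8
// encodes as two bytes.
std::string latin1ToUtf8(const std::string& in)
{
    std::string out;
    for (unsigned char c : in) {
        if (c < 0x80) {
            out += static_cast<char>(c);
        } else {
            out += static_cast<char>(0xC0 | (c >> 6));
            out += static_cast<char>(0x80 | (c & 0x3F));
        }
    }
    return out;
}

// Hypothetical helper: returns an IPTC string value as UTF-8, given the raw
// CharacterSet value read from the same file. Assumption: "ESC . A" means
// ISO 8859-1; UTF-8, ASCII and unknown values are returned unchanged.
std::string iptcStringToUtf8(const std::string& value, const std::string& charSet)
{
    if (charSet == "\x1b.A") return latin1ToUtf8(value);
    return value;
}
```

A GUI or console front end could then hand the result to whatever UTF-8 aware display routine it already has.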

@clanmills
Collaborator

Thanks for this, Arnold. As Exiv2 usually builds/links iconv, I believe we already have a character set conversion alternative to wx, which avoids a build dependency between Exiv2 and the (huge) wx library. I think wstr can deal with this on MSVC/Windows.

I don't want to get involved with this. Another member of the team may undertake this challenge, however my priorities are the 0.27 "dots" and working on the book/work-shop next year in Rennes.

I feel that the centre of exiv2 is reading/writing/modifying metadata and I wish the library had never become involved in lens recognition, data convertors and other "data presentation/interpretation" matters.

@clanmills
Collaborator

@tester0077

  1. Approving Open PRs

I’d like to get Exiv2 v0.27.3 released (it was due on 2019-09-30). I’ve assigned you to review a couple of PRs.
https://github.com/Exiv2/exiv2/pulls?utf8=✓&q=is%3Apr+is%3Aopen+milestone%3Av0.27.3+review-requested%3Atester0077

Dan’s busy moving house. When I last spoke to Luis (about 2 weeks ago), he said “hope to have more time for open-source later this year”.

Can you “review and approve” your PRs? I’d really like Dan and/or Luis to do this as they always think of something smart and clever. However, it’s more important to get this stuff released.

  2. Team Exiv2 Riot/Matrix Chat Server

Can you send me your email address so that I can invite you to join the Team Exiv2 Chat Server on Riot/Matrix? That’s how the team discusses stuff “off-line”. It will take up very little of your time, however you can speak directly to the Team or to a team member. You’ll find this useful. You won’t find it intrusive.
