IMAGING-168 installing package with Swedish characters adds junk char… #18

khemkasudeep · 2015-08-03T07:33:19Z

…acters to dc:title property

britter · 2015-09-29T16:14:26Z

src/main/java/org/apache/commons/imaging/formats/jpeg/iptc/IptcParser.java

@@ -125,6 +127,9 @@ public PhotoshopApp13Data parsePhotoshopSegment(final byte[] bytes,
    protected List<IptcRecord> parseIPTCBlock(final byte[] bytes, final boolean verbose)
            throws IOException {
        final List<IptcRecord> elements = new ArrayList<IptcRecord>();
+        final String DEFAULT_ENCODING = "ISO-8859-1";
+        final int ENV_TAG_CODED_CHARACTER_SET = 90;


Should both better be private static finalconstants on this class

Maybe ENV_TAG_CODED_CHARACTER_SET could even be added to IptcConstants.

… encoding of the IPTC tag values

britter · 2015-10-02T08:01:13Z

Hello @khemkasudeep,

after applying your changes I get:

Tests in error:
  ByteSourceImageTest.test:84->checkGuessFormat:139 » ImageRead Couldn't read ma...
  BmpReadTest.data:44->BmpBaseTest.getBmpImages:45->ImagingTest.getTestImages:81->ImagingTest.getTestImages:120->BmpBaseTest.access$000:30->BmpBaseTest.isBmp:34 » ImageRead
  DcxReadTest.data:40->DcxBaseTest.getDcxImages:45->ImagingTest.getTestImages:81->ImagingTest.getTestImages:120->DcxBaseTest.access$000:30->DcxBaseTest.isDcx:34 » ImageRead
  GifReadTest.data:40->GifBaseTest.getGifImages:45->ImagingTest.getTestImages:81->ImagingTest.getTestImages:120->GifBaseTest.access$000:30->GifBaseTest.isGif:34 » ImageRead
  IcnsReadTest.data:42->IcnsBaseTest.getIcnsImages:45->ImagingTest.getTestImages:81->ImagingTest.getTestImages:120->IcnsBaseTest.access$000:30->IcnsBaseTest.isIcns:34 » ImageRead
  JpegReadTest.data:46->JpegBaseTest.getJpegImages:45->ImagingTest.getTestImages:81->ImagingTest.getTestImages:120->JpegBaseTest.isJpeg:34 » ImageRead
  IptcCodedCharacterSetTest.testCodedCharacterSet:61 » ImageRead Unexpected EOF.
  PamReadTest.test:41->PamBaseTest.getPamImages:45->ImagingTest.getTestImages:81->ImagingTest.getTestImages:120->PamBaseTest.access$000:30->PamBaseTest.isPam:34 » ImageRead
  ConvertPngToGifTest.test:38->PngBaseTest.getPngImages:45->ImagingTest.getTestImages:81->ImagingTest.getTestImages:120->PngBaseTest.access$000:30->PngBaseTest.isPng:34 » ImageRead
  PngReadTest.test:40->PngBaseTest.getPngImages:45->ImagingTest.getTestImages:81->ImagingTest.getTestImages:120->PngBaseTest.access$000:30->PngBaseTest.isPng:34 » ImageRead
  PsdReadTest.test:40->PsdBaseTest.getPsdImages:43->ImagingTest.getTestImages:81->ImagingTest.getTestImages:120->PsdBaseTest.access$000:29->PsdBaseTest.isPsd:32 » ImageRead
  RgbeReadTest.test:39->RgbeBaseTest.getRgbeImages:43->ImagingTest.getTestImages:81->ImagingTest.getTestImages:120->RgbeBaseTest.access$000:29->RgbeBaseTest.isRgbe:32 » ImageRead
  TiffReadTest.test:36->TiffBaseTest.getTiffImages:45->ImagingTest.getTestImages:81->ImagingTest.getTestImages:120->TiffBaseTest.access$000:30->TiffBaseTest.isTiff:34 » ImageRead
  TiffRoundtripTest.test:41->TiffBaseTest.getTiffImages:45->ImagingTest.getTestImages:81->ImagingTest.getTestImages:120->TiffBaseTest.access$000:30->TiffBaseTest.isTiff:34 » ImageRead
  XmpDumpTest.test:45 » ImageRead Couldn't read magic numbers to guess format.
  XmpUpdateTest.test:52 » ImageRead Couldn't read magic numbers to guess format.

Tests run: 394, Failures: 0, Errors: 16, Skipped: 3

Without your changes, all tests pass. Can you have a look? Please make sure to run mvn clean verify before submitting PRs. Thank you!

khemkasudeep · 2015-10-02T13:46:08Z

@britter Tests are being passed in windows machine. What OS are you using?

britter · 2015-10-02T17:19:21Z

@khemkasudeep I'm on

Apache Maven 3.3.3 (7994120775791599e205a5524ec3e0dfe41d4a06; 2015-04-22T13:57:37+02:00)
Maven home: /usr/local/Cellar/maven/3.3.3/libexec
Java version: 1.7.0_79, vendor: Oracle Corporation
Java home: /Library/Java/JavaVirtualMachines/jdk1.7.0_79.jdk/Contents/Home/jre
Default locale: de_DE, platform encoding: UTF-8
OS name: "mac os x", version: "10.10.5", arch: "x86_64", family: "mac"

Maybe you can try in a Linux VM?

khemkasudeep · 2015-10-05T06:54:29Z

Hi @britter

I tried on
Apache Maven 3.3.3 (7994120775791599e205a5524ec3e0dfe41d4a06; 2015-04-22T17:27:37+05:30)
Maven home: /usr/local/Cellar/maven/3.3.3/libexec
Java version: 1.7.0_67, vendor: Oracle Corporation
Java home: /Library/Java/JavaVirtualMachines/jdk1.7.0_67.jdk/Contents/Home/jre
Default locale: en_US, platform encoding: UTF-8
OS name: "mac os x", version: "10.10.5", arch: "x86_64", family: "mac"

But it again passed. Can it depend on locale?
Could you provide complete log?

britter · 2015-10-07T17:27:27Z

Hello @khemkasudeep

here is the build log: https://gist.github.com/britter/ec0b7b15128bf8931e6c

I'm currently looking for a way to configure surefire to use english locale on my system...

khemkasudeep · 2015-10-08T16:24:29Z

@britter I tried below. Even then the test passed.

org.apache.maven.plugins
maven-surefire-plugin

-Duser.language=de -Duser.region=DE

britter · 2015-10-08T16:30:03Z

@khemkasudeep I also found that one and tried to set my build to your locale. Build still failing. I assume the configuration doesn't work or this problem is unrelated to locale.

ashokpanghal · 2016-03-02T07:10:07Z

Hi @britter , @khemkasudeep
I tested this patch on my mac with below maven setup and it passed without any flaws. @britter, could you please start afresh with conning the repo again and merging this PR there .
Apache Maven 3.3.9 (bb52d8502b132ec0a5a3f4c09453c07478323dc5; 2015-11-10T22:11:47+05:30) Maven home: /usr/local/Cellar/maven/3.3.9/libexec Java version: 1.7.0_80, vendor: Oracle Corporation Java home: /Library/Java/JavaVirtualMachines/jdk1.7.0_80.jdk/Contents/Home/jre Default locale: de_DE, platform encoding: UTF-8 OS name: "mac os x", version: "10.11.3", arch: "x86_64", family: "mac"

kinow · 2017-08-06T11:02:33Z

src/main/java/org/apache/commons/imaging/formats/jpeg/iptc/IptcParser.java

+                return codedCharacterSetString;
+            }
+        }catch (IllegalCharsetNameException e){
+


Might be worth to add at least one comment saying why these exceptions are not important, and that can be safely ignored.

janArb · 2019-08-01T12:09:15Z

I patched this library with this merge request myself and can verify that under windows and linux all tests are passing successfully and the bugfix is solved successfully, too. Please merge this to fix this bug!

Apache Maven 3.3.9
Maven home: /usr/share/maven3
Java version: 1.8.0_171, vendor: Oracle Corporation
Java home: /usr/lib/jvm/java-8-oracle/jre
Default locale: en_US, platform encoding: UTF-8
OS name: "linux", version: "3.19.0-25-generic", arch: "amd64", family: "unix"
Tests run: 433, Failures: 0, Errors: 0, Skipped: 15


Apache Maven 3.6.0
Maven home: C:\dev\tools\apache-maven-3.6.0
Java version: 1.8.0_191, vendor: Oracle Corporation, runtime: C:\dev\java\jdk1.8.0_191\jre
Default locale: de_DE, platform encoding: Cp1252
OS name: "windows 10", version: "10.0", arch: "amd64", family: "windows"
Java HotSpot(TM) 64-Bit Server VM warning: ignoring option MaxPermSize=1024m; support was removed in 8.0
Tests run: 433, Failures: 0, Errors: 0, Skipped: 15

PS: If you want to do this, too:

git clone https://github.com/apache/commons-imaging.git
cd commons-imaging
git fetch origin pull/18/head:bugfix-iptc-encoding
mvn test

kinow · 2019-08-01T12:12:36Z

Hi @janArb ! Thanks for testing and reporting back here.

Unfortunately there is some feedback not addressed in the pull request. I can sort out the conflict files, but would be better if @khemkasudeep could reply/address the pending review comments.

Otherwise we would need a new clean pull request, but that would be bad as @khemkasudeep appears to have a working solution that only needs a bit of polishing. I'll see if I can have another look with more calm in the next days.

Bruno

michael-o · 2019-08-01T12:39:58Z

src/main/java/org/apache/commons/imaging/formats/jpeg/iptc/IptcParser.java

-                final byte[] recordData = element.value.getBytes("ISO-8859-1");
-                if (!new String(recordData, "ISO-8859-1").equals(element.value)) {
+                final byte[] recordData = element.value.getBytes(DEFAULT_ENCODING);
+                if (!new String(recordData, DEFAULT_ENCODING).equals(element.value)) {
                    throw new ImageWriteException(
                            "Invalid record value, not ISO-8859-1");


This should also use `DEFAULT_ENCODING

Good catch, this way we won't forget changing the exception text 👍

ecki · 2019-08-01T13:46:42Z

BTW I think Latin1 is one of the 8bit transparent encodings, so the old code would never be unequal. This is one of the reasons why Latin1 is used if a encoding is not well defined. I do not understand iptc enough here to say if it is good or bad to use a different encoding, but it might need to be configurable...

janArb · 2019-08-01T13:52:07Z

@ecki It was not configurable before and latin is still the fallback if no encoding is defined. Why should this bugfix here solve more features? I think an encoding configuration should be treated as feature.
PS: It definitely is good to support UTF8 if that is defined in the official IPTC encoding tag (https://de.wikipedia.org/wiki/IPTC-IIM-Standard -> Coded Character Set)

khemkasudeep · 2019-08-02T10:17:28Z

Hi @janArb
I raised this PR in 2015. It has been a long time since I touched this code base. But AFAIR, the only blocker at that time for this patch was @britter was getting build issues on his machine while patch was building fine on my machine.

kinow · 2019-08-02T10:22:31Z

src/main/java/org/apache/commons/imaging/formats/jpeg/iptc/IptcRecord.java

        this.iptcType = iptcType;
        byte[] tempBytes;
        try {
-            tempBytes = value.getBytes("ISO-8859-1");
+            tempBytes = value.getBytes(charsetName);


Overall I think the code looks good to be merged. I would need to find some spec/docs for reference to clear up a few more questions about the issue.

But the main issue right now are the conflicts. The tempBytes was removed from IptcRecord, and now we have only the String value. We need to look at the git history and past issues, and try to understand why it was changed this way.

Then we can resolve the conflicts and leave it ready to be merged 👍

kinow · 2019-08-02T10:24:20Z

@ecki

BTW I think Latin1 is one of the 8bit transparent encodings, so the old code would never be unequal. This is one of the reasons why Latin1 is used if a encoding is not well defined. I do not understand iptc enough here to say if it is good or bad to use a different encoding, but it might need to be configurable...

I don't know much about IPTC, and this issue with transparent encodings. But looks to me like after this PR we would have ISO-8859-1 as default/fallback, but if within the metadata another value was defined, then the parser would now respect and use it... so in a certain way it would be configurable I think?

kinow · 2019-08-02T10:26:32Z

@janArb

Why should this bugfix here solve more features?

I think @ecki was merely pointing out something to watch out while reviewing it, and as you explained, after this PR it should be configurable I think.

kinow · 2019-08-02T10:29:11Z

@khemkasudeep

Hi! And thanks again for your contribution. I just fetched the code from your branch as was in the process or rebasing against master to try and solve the conflicts, when I noticed that the bytes were removed in IptcRecord 😕

I will need to look at the commit log and understand better what changed why, and how to massage your code to get it in a state ready to be merged again (unless you have spare time to fix it?).

Once that's done, I think it's ready to be merged - and @janArb kindly tested it as well, so looks like a good issue to be included in 1.0-alpha2 👍

Bruno

ecki · 2019-08-02T11:13:42Z

I think @ecki was merely pointing out something to watch out while reviewing it, and as you explained, after this PR it should be configurable I think.

Yes, exactly, the code I commented did not look like it is configurable. If it is, it might be fine (however configurable is not very useful in scenarios where you work on collections of files where some have different encoding, a binary transparent encoding instead of encoding exceptions is preferred in that case (at least to me).

janArb · 2019-08-14T15:14:20Z

Just realized that even after the patch for reading UTF8 the writing of IPTC tags is also still limited to Latin (

commons-imaging/src/main/java/org/apache/commons/imaging/formats/jpeg/iptc/IptcParser.java

Line 489 in 2752dd7

if (!new String(recordData, "ISO-8859-1").equals(element.value)) {

) in org.apache.commons.imaging.formats.jpeg.iptc.IptcParser#writeIPTCBlock

                final byte[] recordData = element.value.getBytes("ISO-8859-1");
                if (!new String(recordData, "ISO-8859-1").equals(element.value)) {
                    throw new ImageWriteException(
                            "Invalid record value, not ISO-8859-1");

IMAGING-168 installing package with Swedish characters adds junk char…

2752dd7

…acters to dc:title property

britter reviewed Sep 29, 2015
View reviewed changes

IMAGING-168 IPTC parser should use CodedCharacterSet tag to determine…

c451871

… encoding of the IPTC tag values

kinow reviewed Aug 6, 2017

View reviewed changes

michael-o reviewed Aug 1, 2019

View reviewed changes

kinow reviewed Aug 2, 2019

View reviewed changes

kinow added the someday label Jul 5, 2021

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

IMAGING-168 installing package with Swedish characters adds junk char… #18

IMAGING-168 installing package with Swedish characters adds junk char… #18

khemkasudeep commented Aug 3, 2015

britter Sep 29, 2015

britter Sep 29, 2015

britter commented Oct 2, 2015

khemkasudeep commented Oct 2, 2015

britter commented Oct 2, 2015

khemkasudeep commented Oct 5, 2015

britter commented Oct 7, 2015

khemkasudeep commented Oct 8, 2015

britter commented Oct 8, 2015

ashokpanghal commented Mar 2, 2016

kinow Aug 6, 2017

janArb commented Aug 1, 2019 •

edited

kinow commented Aug 1, 2019

michael-o Aug 1, 2019

kinow Aug 2, 2019

ecki commented Aug 1, 2019

janArb commented Aug 1, 2019 •

edited

khemkasudeep commented Aug 2, 2019

kinow Aug 2, 2019

kinow commented Aug 2, 2019

kinow commented Aug 2, 2019

kinow commented Aug 2, 2019

ecki commented Aug 2, 2019

janArb commented Aug 14, 2019

IMAGING-168 installing package with Swedish characters adds junk char… #18

Are you sure you want to change the base?

IMAGING-168 installing package with Swedish characters adds junk char… #18

Conversation

khemkasudeep commented Aug 3, 2015

britter Sep 29, 2015

Choose a reason for hiding this comment

britter Sep 29, 2015

Choose a reason for hiding this comment

britter commented Oct 2, 2015

khemkasudeep commented Oct 2, 2015

britter commented Oct 2, 2015

khemkasudeep commented Oct 5, 2015

britter commented Oct 7, 2015

khemkasudeep commented Oct 8, 2015

britter commented Oct 8, 2015

ashokpanghal commented Mar 2, 2016

kinow Aug 6, 2017

Choose a reason for hiding this comment

janArb commented Aug 1, 2019 • edited

kinow commented Aug 1, 2019

michael-o Aug 1, 2019

Choose a reason for hiding this comment

kinow Aug 2, 2019

Choose a reason for hiding this comment

ecki commented Aug 1, 2019

janArb commented Aug 1, 2019 • edited

khemkasudeep commented Aug 2, 2019

kinow Aug 2, 2019

Choose a reason for hiding this comment

kinow commented Aug 2, 2019

kinow commented Aug 2, 2019

kinow commented Aug 2, 2019

ecki commented Aug 2, 2019

janArb commented Aug 14, 2019

janArb commented Aug 1, 2019 •

edited

janArb commented Aug 1, 2019 •

edited