Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

IMAGING-168 installing package with Swedish characters adds junk char… #18

Open
wants to merge 2 commits into
base: trunk
Choose a base branch
from

Conversation

khemkasudeep
Copy link

…acters to dc:title property

@@ -125,6 +127,9 @@ public PhotoshopApp13Data parsePhotoshopSegment(final byte[] bytes,
protected List<IptcRecord> parseIPTCBlock(final byte[] bytes, final boolean verbose)
throws IOException {
final List<IptcRecord> elements = new ArrayList<IptcRecord>();
final String DEFAULT_ENCODING = "ISO-8859-1";
final int ENV_TAG_CODED_CHARACTER_SET = 90;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should both better be private static finalconstants on this class

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe ENV_TAG_CODED_CHARACTER_SET could even be added to IptcConstants.

@britter
Copy link
Member

britter commented Oct 2, 2015

Hello @khemkasudeep,

after applying your changes I get:

Tests in error:
  ByteSourceImageTest.test:84->checkGuessFormat:139 » ImageRead Couldn't read ma...
  BmpReadTest.data:44->BmpBaseTest.getBmpImages:45->ImagingTest.getTestImages:81->ImagingTest.getTestImages:120->BmpBaseTest.access$000:30->BmpBaseTest.isBmp:34 » ImageRead
  DcxReadTest.data:40->DcxBaseTest.getDcxImages:45->ImagingTest.getTestImages:81->ImagingTest.getTestImages:120->DcxBaseTest.access$000:30->DcxBaseTest.isDcx:34 » ImageRead
  GifReadTest.data:40->GifBaseTest.getGifImages:45->ImagingTest.getTestImages:81->ImagingTest.getTestImages:120->GifBaseTest.access$000:30->GifBaseTest.isGif:34 » ImageRead
  IcnsReadTest.data:42->IcnsBaseTest.getIcnsImages:45->ImagingTest.getTestImages:81->ImagingTest.getTestImages:120->IcnsBaseTest.access$000:30->IcnsBaseTest.isIcns:34 » ImageRead
  JpegReadTest.data:46->JpegBaseTest.getJpegImages:45->ImagingTest.getTestImages:81->ImagingTest.getTestImages:120->JpegBaseTest.isJpeg:34 » ImageRead
  IptcCodedCharacterSetTest.testCodedCharacterSet:61 » ImageRead Unexpected EOF.
  PamReadTest.test:41->PamBaseTest.getPamImages:45->ImagingTest.getTestImages:81->ImagingTest.getTestImages:120->PamBaseTest.access$000:30->PamBaseTest.isPam:34 » ImageRead
  ConvertPngToGifTest.test:38->PngBaseTest.getPngImages:45->ImagingTest.getTestImages:81->ImagingTest.getTestImages:120->PngBaseTest.access$000:30->PngBaseTest.isPng:34 » ImageRead
  PngReadTest.test:40->PngBaseTest.getPngImages:45->ImagingTest.getTestImages:81->ImagingTest.getTestImages:120->PngBaseTest.access$000:30->PngBaseTest.isPng:34 » ImageRead
  PsdReadTest.test:40->PsdBaseTest.getPsdImages:43->ImagingTest.getTestImages:81->ImagingTest.getTestImages:120->PsdBaseTest.access$000:29->PsdBaseTest.isPsd:32 » ImageRead
  RgbeReadTest.test:39->RgbeBaseTest.getRgbeImages:43->ImagingTest.getTestImages:81->ImagingTest.getTestImages:120->RgbeBaseTest.access$000:29->RgbeBaseTest.isRgbe:32 » ImageRead
  TiffReadTest.test:36->TiffBaseTest.getTiffImages:45->ImagingTest.getTestImages:81->ImagingTest.getTestImages:120->TiffBaseTest.access$000:30->TiffBaseTest.isTiff:34 » ImageRead
  TiffRoundtripTest.test:41->TiffBaseTest.getTiffImages:45->ImagingTest.getTestImages:81->ImagingTest.getTestImages:120->TiffBaseTest.access$000:30->TiffBaseTest.isTiff:34 » ImageRead
  XmpDumpTest.test:45 » ImageRead Couldn't read magic numbers to guess format.
  XmpUpdateTest.test:52 » ImageRead Couldn't read magic numbers to guess format.

Tests run: 394, Failures: 0, Errors: 16, Skipped: 3

Without your changes, all tests pass. Can you have a look? Please make sure to run mvn clean verify before submitting PRs. Thank you!

@khemkasudeep
Copy link
Author

@britter Tests are being passed in windows machine. What OS are you using?

@britter
Copy link
Member

britter commented Oct 2, 2015

@khemkasudeep I'm on

Apache Maven 3.3.3 (7994120775791599e205a5524ec3e0dfe41d4a06; 2015-04-22T13:57:37+02:00)
Maven home: /usr/local/Cellar/maven/3.3.3/libexec
Java version: 1.7.0_79, vendor: Oracle Corporation
Java home: /Library/Java/JavaVirtualMachines/jdk1.7.0_79.jdk/Contents/Home/jre
Default locale: de_DE, platform encoding: UTF-8
OS name: "mac os x", version: "10.10.5", arch: "x86_64", family: "mac"

Maybe you can try in a Linux VM?

@khemkasudeep
Copy link
Author

Hi @britter

I tried on
Apache Maven 3.3.3 (7994120775791599e205a5524ec3e0dfe41d4a06; 2015-04-22T17:27:37+05:30)
Maven home: /usr/local/Cellar/maven/3.3.3/libexec
Java version: 1.7.0_67, vendor: Oracle Corporation
Java home: /Library/Java/JavaVirtualMachines/jdk1.7.0_67.jdk/Contents/Home/jre
Default locale: en_US, platform encoding: UTF-8
OS name: "mac os x", version: "10.10.5", arch: "x86_64", family: "mac"

But it again passed. Can it depend on locale?
Could you provide complete log?

@britter
Copy link
Member

britter commented Oct 7, 2015

Hello @khemkasudeep

here is the build log: https://gist.github.com/britter/ec0b7b15128bf8931e6c

I'm currently looking for a way to configure surefire to use english locale on my system...

@khemkasudeep
Copy link
Author

@britter I tried below. Even then the test passed.

org.apache.maven.plugins
maven-surefire-plugin

-Duser.language=de -Duser.region=DE

@britter
Copy link
Member

britter commented Oct 8, 2015

@khemkasudeep I also found that one and tried to set my build to your locale. Build still failing. I assume the configuration doesn't work or this problem is unrelated to locale.

@ashokpanghal
Copy link

Hi @britter , @khemkasudeep
I tested this patch on my mac with below maven setup and it passed without any flaws. @britter, could you please start afresh with conning the repo again and merging this PR there .
Apache Maven 3.3.9 (bb52d8502b132ec0a5a3f4c09453c07478323dc5; 2015-11-10T22:11:47+05:30) Maven home: /usr/local/Cellar/maven/3.3.9/libexec Java version: 1.7.0_80, vendor: Oracle Corporation Java home: /Library/Java/JavaVirtualMachines/jdk1.7.0_80.jdk/Contents/Home/jre Default locale: de_DE, platform encoding: UTF-8 OS name: "mac os x", version: "10.11.3", arch: "x86_64", family: "mac"

return codedCharacterSetString;
}
}catch (IllegalCharsetNameException e){

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Might be worth to add at least one comment saying why these exceptions are not important, and that can be safely ignored.

@janArb
Copy link

janArb commented Aug 1, 2019

I patched this library with this merge request myself and can verify that under windows and linux all tests are passing successfully and the bugfix is solved successfully, too. Please merge this to fix this bug!

Apache Maven 3.3.9
Maven home: /usr/share/maven3
Java version: 1.8.0_171, vendor: Oracle Corporation
Java home: /usr/lib/jvm/java-8-oracle/jre
Default locale: en_US, platform encoding: UTF-8
OS name: "linux", version: "3.19.0-25-generic", arch: "amd64", family: "unix"
Tests run: 433, Failures: 0, Errors: 0, Skipped: 15


Apache Maven 3.6.0
Maven home: C:\dev\tools\apache-maven-3.6.0
Java version: 1.8.0_191, vendor: Oracle Corporation, runtime: C:\dev\java\jdk1.8.0_191\jre
Default locale: de_DE, platform encoding: Cp1252
OS name: "windows 10", version: "10.0", arch: "amd64", family: "windows"
Java HotSpot(TM) 64-Bit Server VM warning: ignoring option MaxPermSize=1024m; support was removed in 8.0
Tests run: 433, Failures: 0, Errors: 0, Skipped: 15

PS: If you want to do this, too:

git clone https://github.com/apache/commons-imaging.git
cd commons-imaging
git fetch origin pull/18/head:bugfix-iptc-encoding
mvn test

@kinow
Copy link
Member

kinow commented Aug 1, 2019

Hi @janArb ! Thanks for testing and reporting back here.

Unfortunately there is some feedback not addressed in the pull request. I can sort out the conflict files, but would be better if @khemkasudeep could reply/address the pending review comments.

Otherwise we would need a new clean pull request, but that would be bad as @khemkasudeep appears to have a working solution that only needs a bit of polishing. I'll see if I can have another look with more calm in the next days.

Bruno

final byte[] recordData = element.value.getBytes("ISO-8859-1");
if (!new String(recordData, "ISO-8859-1").equals(element.value)) {
final byte[] recordData = element.value.getBytes(DEFAULT_ENCODING);
if (!new String(recordData, DEFAULT_ENCODING).equals(element.value)) {
throw new ImageWriteException(
"Invalid record value, not ISO-8859-1");
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This should also use `DEFAULT_ENCODING

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good catch, this way we won't forget changing the exception text 👍

@ecki
Copy link
Contributor

ecki commented Aug 1, 2019

BTW I think Latin1 is one of the 8bit transparent encodings, so the old code would never be unequal. This is one of the reasons why Latin1 is used if a encoding is not well defined. I do not understand iptc enough here to say if it is good or bad to use a different encoding, but it might need to be configurable...

@janArb
Copy link

janArb commented Aug 1, 2019

@ecki It was not configurable before and latin is still the fallback if no encoding is defined. Why should this bugfix here solve more features? I think an encoding configuration should be treated as feature.
PS: It definitely is good to support UTF8 if that is defined in the official IPTC encoding tag (https://de.wikipedia.org/wiki/IPTC-IIM-Standard -> Coded Character Set)

@khemkasudeep
Copy link
Author

Hi @janArb
I raised this PR in 2015. It has been a long time since I touched this code base. But AFAIR, the only blocker at that time for this patch was @britter was getting build issues on his machine while patch was building fine on my machine.

this.iptcType = iptcType;
byte[] tempBytes;
try {
tempBytes = value.getBytes("ISO-8859-1");
tempBytes = value.getBytes(charsetName);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Overall I think the code looks good to be merged. I would need to find some spec/docs for reference to clear up a few more questions about the issue.

But the main issue right now are the conflicts. The tempBytes was removed from IptcRecord, and now we have only the String value. We need to look at the git history and past issues, and try to understand why it was changed this way.

Then we can resolve the conflicts and leave it ready to be merged 👍

@kinow
Copy link
Member

kinow commented Aug 2, 2019

@ecki

BTW I think Latin1 is one of the 8bit transparent encodings, so the old code would never be unequal. This is one of the reasons why Latin1 is used if a encoding is not well defined. I do not understand iptc enough here to say if it is good or bad to use a different encoding, but it might need to be configurable...

I don't know much about IPTC, and this issue with transparent encodings. But looks to me like after this PR we would have ISO-8859-1 as default/fallback, but if within the metadata another value was defined, then the parser would now respect and use it... so in a certain way it would be configurable I think?

@kinow
Copy link
Member

kinow commented Aug 2, 2019

@janArb

Why should this bugfix here solve more features?

I think @ecki was merely pointing out something to watch out while reviewing it, and as you explained, after this PR it should be configurable I think.

@kinow
Copy link
Member

kinow commented Aug 2, 2019

@khemkasudeep

Hi! And thanks again for your contribution. I just fetched the code from your branch as was in the process or rebasing against master to try and solve the conflicts, when I noticed that the bytes were removed in IptcRecord 😕

I will need to look at the commit log and understand better what changed why, and how to massage your code to get it in a state ready to be merged again (unless you have spare time to fix it?).

Once that's done, I think it's ready to be merged - and @janArb kindly tested it as well, so looks like a good issue to be included in 1.0-alpha2 👍

Bruno

@ecki
Copy link
Contributor

ecki commented Aug 2, 2019

I think @ecki was merely pointing out something to watch out while reviewing it, and as you explained, after this PR it should be configurable I think.

Yes, exactly, the code I commented did not look like it is configurable. If it is, it might be fine (however configurable is not very useful in scenarios where you work on collections of files where some have different encoding, a binary transparent encoding instead of encoding exceptions is preferred in that case (at least to me).

@janArb
Copy link

janArb commented Aug 14, 2019

Just realized that even after the patch for reading UTF8 the writing of IPTC tags is also still limited to Latin (

if (!new String(recordData, "ISO-8859-1").equals(element.value)) {
) in org.apache.commons.imaging.formats.jpeg.iptc.IptcParser#writeIPTCBlock

                final byte[] recordData = element.value.getBytes("ISO-8859-1");
                if (!new String(recordData, "ISO-8859-1").equals(element.value)) {
                    throw new ImageWriteException(
                            "Invalid record value, not ISO-8859-1");

@kinow kinow added the someday label Jul 5, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
7 participants