
Fixed length unmarshaller - Make it possible to customize calculating of record length #159

Open
davsclaus opened this issue Mar 5, 2024 · 7 comments


@davsclaus
Contributor

When BeanIO unmarshals data (such as a line of text), the mapping to fields is based on string length. However, in Asian locales single-byte and double-byte characters can be mixed in a String, which causes the length to be miscalculated.

In the BeanIO source code, the setRecordValue method:
https://github.com/beanio/beanio/blob/main/src/org/beanio/internal/parser/format/fixedlength/FixedLengthUnmarshallingContext.java#L34

hardcodes the record length as follows:

    @Override
    public void setRecordValue(Object value) {
        this.record = (String) value;
        this.recordLength = value == null ? 0 : record.length();
    }

And recordLength is private, so you cannot override it or compute it via a custom implementation of the unmarshalling context.

I wonder if BeanIO could make this pluggable so end users can provide their own implementation and calculate the record length the way they need.

@bjansen
Collaborator

bjansen commented Mar 5, 2024

Are you saying you'd like to compute the number of bytes instead of the number of characters?

import java.nio.charset.StandardCharsets;

class UtfTest {
    public static void main(String[] args) {
        String str = "hello 世界";
        System.out.println(str.length()); // 8
        System.out.println(str.getBytes(StandardCharsets.UTF_8).length); // 12
    }
}

Why do you think the current behavior is incorrect?

@hfuruich

hfuruich commented Mar 5, 2024

Hi @bjansen. In Japan (and Asia generally) it's common to treat a "double byte character" as 2 characters long; a "single byte character" of course counts as 1.
"Ａa" is a combination of a "double byte character" (the full-width "Ａ") and a "single byte character".
The expected behavior for BeanIO is to count this String's length as 3 instead of 2.
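The counting described above can be demonstrated with a legacy double-byte encoding such as MS932 (windows-31j, mentioned later in this thread), where a full-width character occupies 2 bytes and an ASCII character 1 byte. This is a minimal sketch, assuming an MS932-capable JDK:

```java
import java.nio.charset.Charset;

class WidthDemo {
    public static void main(String[] args) {
        // U+FF21 FULLWIDTH LATIN CAPITAL LETTER A, followed by ASCII 'a'
        String str = "Ａa";
        System.out.println(str.length());                                  // 2 (UTF-16 code units)
        System.out.println(str.getBytes(Charset.forName("MS932")).length); // 3 (2 + 1 bytes)
    }
}
```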

I found a similar discussion in the Google group (from about 8 years ago):
https://groups.google.com/g/beanio/c/00lSwPI2U6Y

I hope this request is understandable.

@davsclaus
Contributor Author

Thanks @bjansen for jumping in here. To understand this more clearly, we are in the process of putting together a real-world example, to make sure that what we discuss and potentially improve in BeanIO is on the right track.

@bjansen
Collaborator

bjansen commented Mar 6, 2024

@hfuruich thanks for the explanation, I think I understand the problem. I'm intrigued though, how do you know how many bytes a given character takes? Do you need a UTF-8 table on hand, or do you assume that characters in the Hiragana block for example are always 3 bytes long?

For an actual solution, I can think of several possibilities:

  • a new attribute on records: <record count="chars|bytes">
  • a new attribute on fields: <field count="chars|bytes">
  • a new property in FixedLengthParserConfiguration that could be configured like this:
<parser class="org.beanio.stream.fixedlength.FixedLengthRecordParserFactory">
    <property name="countMode" value="chars"/>
</parser>
  • a global configuration in beanio.properties that would cover all the fixed length parsers

What kind of granularity would be needed? Would the last suggestion be enough?
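To make the third suggestion concrete, here is a rough sketch of how a countMode-aware context might compute the record length. The names CountMode and recordLength here are illustrative only, not BeanIO API:

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class CountModeSketch {
    // Hypothetical setting: count fixed-length positions in characters or bytes
    enum CountMode { CHARS, BYTES }

    static int recordLength(String record, CountMode mode, Charset charset) {
        if (record == null) {
            return 0;
        }
        return mode == CountMode.CHARS
                ? record.length()                  // current behavior
                : record.getBytes(charset).length; // proposed byte-based behavior
    }

    public static void main(String[] args) {
        String record = "hello 世界";
        System.out.println(recordLength(record, CountMode.CHARS, StandardCharsets.UTF_8)); // 8
        System.out.println(recordLength(record, CountMode.BYTES, StandardCharsets.UTF_8)); // 12
    }
}
```

The charset could presumably come from the stream's existing encoding attribute, as discussed below.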

@davsclaus
Contributor Author

Those are some really good suggestions.

I like all of them (sorry about that), but having it in the mapping file makes it easy for non-developers to specify. Java code is needed when you must do it via Java, or when some special code has to control this.

And the global option makes it easy to set, instead of having to change a lot of mapping files.

@hfuruich

hfuruich commented Mar 7, 2024

@bjansen thank you so much for your excellent idea.

How many bytes a character takes depends on the encoding the user uses, so the following code will count the expected length.
(We might need some exception handling for whether the requested encoding is supported or not.)

str.getBytes(Charset.forName("encoding name which user specified")).length

It looks like <stream> has an attribute named encoding. Would it be possible to use this encoding attribute to calculate the character byte counts?

If that's possible, your idea covers these user scenarios:

  • User A who only uses single byte characters
    Set <field count="chars"/>.

  • User B who uses single byte characters and multi byte characters
    Set <stream encoding="MS932"> (or another encoding name) and <field count="bytes"/>.

As @davsclaus noted, together these cover most use cases in the world. I like all of your ideas too.

@hfuruich

Hello.
If possible, please also consider providing an annotation or interface to customize how bytes are counted for characters.
Here is an example: UTF-8 treats the full-width "Ａ" as 3 bytes, but some users in Asia want to count it as 2. This may sound very strange to non-Asian users, but it is the real world: some Asian users want to treat multi-byte characters as double-byte characters.
To handle this use case, providing an interface or annotation could be a solution. Users could simply implement this interface as they want:

int countBytes(char[] chars)

This is a confusing use case, but please consider it too.
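One way such an interface could look is sketched below. The names RecordLengthCalculator and EastAsianWidthCalculator are illustrative, not BeanIO API, and the isWide check is a simplified heuristic (a real implementation would follow Unicode UAX #11, East Asian Width). It counts "wide" East Asian code points as 2 regardless of how many bytes the encoding needs for them, which matches the count-full-width-as-2 behavior described above:

```java
// Hypothetical pluggable strategy for computing a record's fixed length
interface RecordLengthCalculator {
    int countLength(String record);
}

class EastAsianWidthCalculator implements RecordLengthCalculator {
    @Override
    public int countLength(String record) {
        int length = 0;
        for (int i = 0; i < record.length(); ) {
            int cp = record.codePointAt(i);
            length += isWide(cp) ? 2 : 1; // wide characters count as 2
            i += Character.charCount(cp);
        }
        return length;
    }

    // Simplified check covering a few common East Asian blocks
    private static boolean isWide(int cp) {
        return (cp >= 0x1100 && cp <= 0x115F)   // Hangul Jamo
            || (cp >= 0x3000 && cp <= 0x30FF)   // CJK symbols, Hiragana, Katakana
            || (cp >= 0x4E00 && cp <= 0x9FFF)   // CJK Unified Ideographs
            || (cp >= 0xAC00 && cp <= 0xD7A3)   // Hangul syllables
            || (cp >= 0xFF01 && cp <= 0xFF60);  // Fullwidth forms
    }
}
```

With this, "Ａa" counts as 3 and "hello 世界" as 10, independent of the stream's encoding.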
