Update Character.UnicodeBlock and add better tests #1887
base: main
We are currently tracking Unicode for Java 8; we used Unicode 6.3.0. This version was chosen since it was a minor update from JDK 8's 6.2.0, whereas 7.0.0 added many new codes.
If you would like to avoid formatting with scalafmt, you could probably use its comment directive for ignoring a section.
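A minimal sketch of the directive being suggested: scalafmt honors `// format: off` and `// format: on` markers, leaving the region between them untouched. The object and tuple values here are placeholders, not the PR's real declarations.

```scala
object UnicodeBlocks {
  // format: off
  // scalafmt leaves this hand-aligned, generated region exactly as-is.
  val LATIN_EXTENDED_B = ("LATIN_EXTENDED_B", 0x180, 0x24f)
  val IPA_EXTENSIONS   = ("IPA_EXTENSIONS",   0x250, 0x2af)
  // format: on
}
```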
@ekrich thanks for the initial feedback. I'm updating the PR; we'll see what breaks, but I'm happy to hopefully get it resolved. I'm duplicating the work; I'll see if I can get a deprecation list from the Java 8 Unicode data and make sure those entries are included.
What version of the JVM is Scala.js tracking? I created a project here, https://github.com/ekrich/scala-unicode, but I like your script approach. Initially I was thinking that we could put all the scripts in one place and maybe put it under
I seem to remember that, to be bug-for-bug compatible with Java 8, there were a few codes in Java 8 which

@er1c Now, what was it that was written over the gate leading into the Inferno?
@LeeTibbert I didn't remove any tests, so if there were any edge cases previously documented, they should be found.
er1c Thank you for starting this conversation. I mean to be supportive of your energy & efforts. I hope my investment of time & plain

Let me start at the highest level, the level of strategy. While it should not get in the way of the

scala-java-time is a good candidate for java.time. The former compiles & passes its Demo project on SN. I believe that Eric P is a contributor to scala-java-locale (sp?). Third-party libraries are likely to have more domain expertise and a quicker turnaround. Unicode is now

As we probably all agree, the Unicode and Locale issues are Large Problems, and it will take a Beast to
So, let me now focus on the PR at hand. I have a few high-level concerns. If those bear out, I will do another pass, line by line. Again, if a comment is useful, great. If it is wrong or just not useful, no harm done.
@@ -0,0 +1,415 @@
// Run with $ amm scripts/GenerateUnicodeBlock.sc |
It is good to get the generating script into Git. Please consider giving a few more clues to the poor sufferer who comes after you and tries to run this.

For example: what is amm? Yes, I know it is Ammonite, but clues such as a URL to Ammonite and Ammonite scripts would improve, IMO, the software engineering.
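A sketch of the kind of header the comment is asking for. The Ammonite URLs are the project's real documentation; the Unicode input path is the Blocks.txt file for the version the PR says it tracks.

```scala
// GenerateUnicodeBlock.sc
//
// Generates the Character.UnicodeBlock declarations for
// javalanglib/src/main/scala/java/lang/Character.scala.
//
// Run with Ammonite (https://ammonite.io; script docs at
// https://ammonite.io/#ScalaScripts):
//
//   $ amm scripts/GenerateUnicodeBlock.sc
//
// Input: Blocks.txt from the Unicode Character Database,
// https://www.unicode.org/Public/6.3.0/ucd/Blocks.txt
```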
I believe the software engineering would also be improved by a short, one-paragraph description of how this script fits into the general build process: that it is not automagically executed from build.sbt, but rather is run manually, with the output manually edited into j.l.Character.scala.
That is, could a bright middle-school pupil, or the less capable engineer-in-a-hurry of English common law, follow in your footsteps?
@@ -0,0 +1,415 @@
// Run with $ amm scripts/GenerateUnicodeBlock.sc
// Replace code in ./javalanglib/src/main/scala/java/lang/Character.scala
Please see my comments in Character.scala, which I will review next. IMO, the software engineering would be improved if this script printed out some, actually quite a bit of, audit trail information: who ran it, when, and which version of Unicode was used as input.
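A hypothetical sketch of emitting such an audit trail at the top of the generated output. The names (`unicodeVersion`, `blocks`) and the sample data are assumptions, not the script's real variables.

```scala
// Illustrative only: in the real script these would come from the
// parsed Blocks.txt input rather than being hard-coded.
val unicodeVersion = "6.3.0"
val blocks = Seq(("BASIC_LATIN", 0x0000, 0x007f))

val header =
  s"""|// Generated by scripts/GenerateUnicodeBlock.sc -- DO NOT EDIT
      |// Unicode version: $unicodeVersion
      |// Generated by ${System.getProperty("user.name")} on ${java.time.Instant.now()}
      |// Entry count: ${blocks.size}
      |""".stripMargin

println(header)
```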
val LATIN_EXTENDED_B = new UnicodeBlock("LATIN_EXTENDED_B", 0x180, 0x24f)
val IPA_EXTENSIONS = new UnicodeBlock("IPA_EXTENSIONS", 0x250, 0x2af)
//////////////////
// Begin Generated
I believe the software engineering would be improved by at least giving the reference Unicode version, the name and Git repository location of the script used to generate the output, and the fact that it was manually folded into the current file (not via build.sbt).

I do not see the usual "Do not edit" marker.

I also like to see who generated the output, and the date. Some would say that this data can be more reliably figured out from git blame, but that is a recursive distraction when trying to read the code as-is, on-the-fly.
//////////////////

// scalastyle:off line.size.limit
val SURROGATES_AREA = new UnicodeBlock("SURROGATES_AREA", 0x0, 0x0)
I think this needs a close look from an SN compiler expert.

I believe that the next whack of UnicodeBlock declarations should be "lazy val". Of course, that means that any tables using them should be carefully constructed to be lazy on every path except for a small set of exceedingly common UnicodeBlocks (such as English-only). That is a fast & small chain to a slow & complete design.

My concern is the latency/startup time when the binary begins to execute. Since these vals are in an object within an object, I believe they will currently all be allocated and initialized at binary startup. That is quite a few allocations for data that will almost certainly never be used.

I am pretty certain that changing to "lazy" will reduce startup time & heap requirements. I am not certain whether it will slightly increase code size.

I am aware that the code being replaced used "val" rather than "lazy val"; still, I am reluctant to see hundreds more items slowing down binary startup.
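A minimal sketch of the eager-vs-lazy distinction being proposed, assuming the usual Scala semantics that an object's plain vals are all initialized when the object is first touched, while a `lazy val` defers its allocation to first access (at the cost of a per-access initialization check). The `Block` class here is a stand-in for the real `UnicodeBlock`.

```scala
// Stand-in for java.lang.Character.UnicodeBlock.
final class Block(val name: String, val start: Int, val end: Int)

object UnicodeBlock {
  // Eager: allocated as soon as the UnicodeBlock object is initialized.
  val BASIC_LATIN = new Block("BASIC_LATIN", 0x0000, 0x007f)

  // Lazy: allocated only if some code path actually asks for it.
  lazy val SUPPLEMENTARY_PRIVATE_USE_AREA_B =
    new Block("SUPPLEMENTARY_PRIVATE_USE_AREA_B", 0x100000, 0x10ffff)
}
```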
I seem to remember being advised, when I was doing Unicode work for the re2 regex code, to use one-dimensional arrays (of structures), and to be careful to make maps lazy.
I think it prudent to ask what the recommendation from current SN leadership is.
And yes, the re2s and Character.scala Unicode data should not be duplicated. I think there is at least one pending PR to reduce, though not eliminate, some of that duplication. IIRC, the idea was to let that PR merge & settle, then do additional passes. I seem to remember that total elimination was not feasible, but that a lot of error-prone duplication could be eliminated.

I believe that known flaw should not keep progress from happening here. I am also open to & welcome better suggestions. Sometimes one simply runs out of time before the next distraction/crisis hits, and the technical debt remains in the code base for far too long.
private val BLOCKS = Array(
private val allBlocks: Array[UnicodeBlock] = Array(
Now this is the type of thing that needs to be lazy, if possible. I did not trace how it gets used, but this table will, I believe, cause all of the UnicodeBlocks it references to be instantiated.
This may need to be a method which returns the "known & fast" UnicodeBlocks first, and then iterates over a lazy array of the "complete & slow" UnicodeBlocks.
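A hypothetical two-tier lookup along those lines: a small eager array of common blocks is searched first, and only a miss forces the lazy, complete table into existence. All names here (`Block`, `Blocks.of`, the sample entries) are illustrative, not the PR's API.

```scala
final class Block(val name: String, val start: Int, val end: Int)

object Blocks {
  // Tier 1: tiny, eagerly built, covers the overwhelmingly common cases.
  private val common: Array[Block] =
    Array(new Block("BASIC_LATIN", 0x0000, 0x007f))

  // Tier 2: the complete table, built only on the first cache miss.
  private lazy val complete: Array[Block] =
    Array(
      new Block("BASIC_LATIN", 0x0000, 0x007f),
      new Block("ARABIC", 0x0600, 0x06ff)
    )

  def of(codePoint: Int): Option[Block] =
    common
      .find(b => codePoint >= b.start && codePoint <= b.end)
      .orElse(complete.find(b => codePoint >= b.start && codePoint <= b.end))
}
```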
SUPPLEMENTARY_PRIVATE_USE_AREA_B)
import scala.Predef.ArrowAssoc
private val blocksByNormalizedName
    : scala.collection.Map[String, UnicodeBlock] = scala.collection.Map(
Again, the baseline code uses a Scala map. There was an effort in Scala.js to reduce/eliminate the use of scala.collection, at least in javalib, in favor of Java collections (which are a Royal Pain!).

I think there has been no organized effort to do that yet in SN. Perhaps changing this can be deferred, given the prior use in the baseline code. I am not the one to make that call, but thought I would point the effort out.
val SUPPLEMENTARY_PRIVATE_USE_AREA_B =
  new UnicodeBlock("SUPPLEMENTARY_PRIVATE_USE_AREA_B", 0x100000, 0x10ffff)
  new UnicodeBlock("SUPPLEMENTARY_PRIVATE_USE_AREA_B", 0x100000, 0x10FFFF)
This seems to be the last generated UnicodeBlock. With generated code, my very early computer science training comes to mind: print out the number of generated entries. That makes it possible to compare the count from the generated output against the count in the Unicode spec reference files. I seem to remember that either they give the expected count, or it was very easy to get (unix wc -l minus a few).
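A sketch of that sanity check inside the generator: count the data lines of a Blocks.txt-style input (the `start..end; Block Name` format, with `#` comment lines and blank lines, per UAX #44) and report the count. The sample text is illustrative.

```scala
// Illustrative Blocks.txt excerpt; the real script would read the file.
val sample =
  """|# Blocks-6.3.0.txt
     |0000..007F; Basic Latin
     |0080..00FF; Latin-1 Supplement
     |""".stripMargin

// Data lines are those that are non-blank and not '#' comments.
val entryCount = sample
  .split("\n")
  .map(_.trim)
  .count(l => l.nonEmpty && !l.startsWith("#"))

println(s"Generated $entryCount UnicodeBlock entries")
```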
// For example, "Latin Extended-A" and "latin extended a" are equivalent.
// For more information on the comparison of property values,
// see UAX #44: http://www.unicode.org/reports/tr44/
private def toNormalizedName(name: String): String = {
Thank you for the comment: just-in-time delivery of hard-to-find, pertinent wizardry. It allows me to ask my next question, just below.
// For more information on the comparison of property values,
// see UAX #44: http://www.unicode.org/reports/tr44/
private def toNormalizedName(name: String): String = {
  name.toLowerCase.replaceAll(raw"[\s\-_]", "")
I believe that the "\s" is incorrect/incomplete here. That is, I am pretty sure that the text in the comment is referring to Unicode whitespace (category Zs) while "\s" refers to the much smaller set of Java whitespace.

As a thought experiment, imagine some person set out to break/test the code by giving an input string which contains an Ogham (Old Irish) space mark (one of my favorite test cases). I submit that, according to the comment, the Ogham character should be removed, and that the presented code will not do that.

We are in the land of regex here. The known & tedious way is to loop through each of the input characters and retain a character only if it is not in the Unicode Zs table (in scalanative.re2s). Since replaceAll brings us well into the land of re2s, I could look to see if re2s has a way to specify replacing Unicode whitespace.
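The Ogham thought experiment can be demonstrated directly. On the JVM, the default `\s` matches only `[ \t\n\x0B\f\r]`, so it misses U+1680 OGHAM SPACE MARK, while the Unicode category `\p{Zs}` (or the `(?U)` flag) catches it; whether SN's re2s-backed regex behaves identically is exactly the open question above, so treat this as JVM-semantics only. The Zs-aware `toNormalizedName` variant is a sketch, not the PR's code (note that `\p{Zs}` does not cover control whitespace such as `\t`, so both classes would be needed if that also matters).

```scala
// U+1680 OGHAM SPACE MARK: Unicode whitespace (category Zs),
// but not in Java's default \s class.
val ogham = "\u1680"

assert(!ogham.matches("""\s"""))     // default \s misses it
assert(ogham.matches("""\p{Zs}"""))  // space-separator category matches
assert(ogham.matches("""(?U)\s"""))  // UNICODE_CHARACTER_CLASS flag matches

// Sketch of a Zs-aware normalization (JVM semantics assumed):
def toNormalizedName(name: String): String =
  name.replaceAll("""[\p{Zs}\-_]""", "").toLowerCase

assert(toNormalizedName("Latin\u1680Extended-A") == "latinextendeda")
```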
At the nano-level, I think doing the replaceAll first, then the toLowerCase, has lower amortized cost over the expected set of inputs. The cost for text not containing Unicode whitespace, etc., is probably the same (but one always has to look at implementations). Most names will contain whitespace, so doing replaceAll first will be a few cycles faster. Too many hours and exams in Algorithms classes.
}

////////////////
// End Generated
Having an end-generated marker is good, but generated by which script, and on what date? Where there is one generator, there may well be more than one.
// Begin Generated
//////////////////

// scalastyle:off line.size.limit
Does the scalafmt used in Scala Native respect scalastyle:off (and other such directives)? I suspect these are holdovers from the Scala.js port, perhaps left in the baseline code to reduce churn, but they probably should not be in new SN code (especially generated code).

What does scalafmt do to these large generated tables? I would expect it to mangle them, but the code in this PR looks OK.
@er1c re: "I didn't remove any tests, so if there were any edge cases previously documented, they should be found." Now that I have reviewed the code on a full screen and not on my phone, I see that you are mainly declaring UnicodeBlocks.
.fold {
  throw new IllegalArgumentException()
} { value => value }
blocksByNormalizedName
Why remove a perfectly good & useful null check?

There is the off chance that such was recommended by someone in the know at Scala.js.

If not, my long training in expecting explicit argument checks for anything & everything passed in from application code leads me to miss the null check here. True, some code some-J-Random-where, perhaps five layers down, will throw the exception at some hard-to-find, hard-to-debug place. With line numbers in the NIR for SN in the offing, there is a chance to get an explicit line number at an understandable place early in execution, when there is less shrapnel.

Do I sound opinionated here? I was merely trying for experienced. I am also aware that I am a contributor, and that more senior reviewers set the standard & practice.
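A sketch of the fail-fast boundary check being argued for, using `java.util.Objects.requireNonNull` so the NPE surfaces at the API entry point with a clear message rather than layers down. The `Block`, `byName`, and `forName` names are illustrative stand-ins, not the PR's code.

```scala
import java.util.Objects

final case class Block(name: String)

val byName: Map[String, Block] = Map("BASIC_LATIN" -> Block("BASIC_LATIN"))

def forName(blockName: String): Block = {
  // Explicit argument check: throws NullPointerException("blockName")
  // here, at the boundary, instead of somewhere inside the lookup.
  Objects.requireNonNull(blockName, "blockName")
  byName.getOrElse(blockName, throw new IllegalArgumentException(blockName))
}
```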
TODO:

Findings:
All of the JDK 8 constants are defined among the JDK 14 constants (i.e., JDK 14 is a superset).