WIP Port Lucene.Net.Analysis.Ko #645

Open
wants to merge 9 commits into master

Conversation

krinsang

The .NET implementation of the Lucene library has yet to release a version containing the analyzer for Korean.
To mirror the Java releases, this PR implements the Korean analyzer and its respective dependencies.

r.hwang added 9 commits September 28, 2022 14:51
The current released version of Lucene.NET does not implement the Korean
Analyzer as it does in Java. This commit serves to port over the logic
from the Java repo to C#; however, it only contains logic for the Analyzer
class.
The current released version of Lucene.NET does not implement the Korean
Analyzer as it does in Java. This commit serves to port over the logic
from the Java repo to C#; however, it only contains logic for the Analyzer
class.
…rean-analyzer

# Conflicts:
#	src/Lucene.Net.Analysis.Common/Analysis/Ko/DecompoundToken.cs
#	src/Lucene.Net.Analysis.Common/Analysis/Ko/Dict/BinaryDictionary.cs
#	src/Lucene.Net.Analysis.Common/Analysis/Ko/Dict/CharacterDefinition.cs
#	src/Lucene.Net.Analysis.Common/Analysis/Ko/Dict/ConnectionCosts.cs
#	src/Lucene.Net.Analysis.Common/Analysis/Ko/Dict/Dictionary.cs
#	src/Lucene.Net.Analysis.Common/Analysis/Ko/Dict/TokenInfoDictionary.cs
#	src/Lucene.Net.Analysis.Common/Analysis/Ko/Dict/TokenInfoFST.cs
#	src/Lucene.Net.Analysis.Common/Analysis/Ko/Dict/UnknownDictionary.cs
#	src/Lucene.Net.Analysis.Common/Analysis/Ko/DictionaryToken.cs
#	src/Lucene.Net.Analysis.Common/Analysis/Ko/KoreanAnalyzer.cs
#	src/Lucene.Net.Analysis.Common/Analysis/Ko/TokenAttributes/PartOfSpeechAttributes.cs
#	src/Lucene.Net.Analysis.Common/Analysis/Ko/TokenAttributes/ReadingAttributes.cs
#	src/Lucene.Net.Analysis.Common/Analysis/Ko/TokenAttributes/ReadingAttributesImpl.cs
…rean-analyzer

# Conflicts:
#	src/Lucene.Net.Analysis.Common/Analysis/Ko/DecompoundToken.cs
#	src/Lucene.Net.Analysis.Common/Analysis/Ko/Dict/BinaryDictionary.cs
#	src/Lucene.Net.Analysis.Common/Analysis/Ko/Dict/CharacterDefinition.cs
#	src/Lucene.Net.Analysis.Common/Analysis/Ko/Dict/ConnectionCosts.cs
#	src/Lucene.Net.Analysis.Common/Analysis/Ko/Dict/Dictionary.cs
#	src/Lucene.Net.Analysis.Common/Analysis/Ko/Dict/TokenInfoDictionary.cs
#	src/Lucene.Net.Analysis.Common/Analysis/Ko/Dict/TokenInfoFST.cs
#	src/Lucene.Net.Analysis.Common/Analysis/Ko/Dict/UnknownDictionary.cs
#	src/Lucene.Net.Analysis.Common/Analysis/Ko/DictionaryToken.cs
#	src/Lucene.Net.Analysis.Common/Analysis/Ko/KoreanAnalyzer.cs
#	src/Lucene.Net.Analysis.Common/Analysis/Ko/TokenAttributes/PartOfSpeechAttributes.cs
#	src/Lucene.Net.Analysis.Common/Analysis/Ko/TokenAttributes/ReadingAttributes.cs
#	src/Lucene.Net.Analysis.Common/Analysis/Ko/TokenAttributes/ReadingAttributesImpl.cs
implementation of the number filter and the respective filter factories
@rclabo
Contributor

rclabo commented Sep 29, 2022

Wow! Very interesting contribution. It does not look like Java Lucene 4.8.0 or 4.8.1 contains the KoreanAnalyzer; however, they do contain a CJKAnalyzer, which is intended to cover Chinese, Japanese, and Korean.

Which Java Lucene version is this contribution a port of?

@NightOwl888
Contributor

NightOwl888 commented Sep 30, 2022

Thanks for the contribution.

There was a prior attempt at porting the nori analyzer on the feature/analysis-nori branch. However, there were 2 issues preventing it from functioning on Lucene.Net 4.8.0:

  1. The FST implementation was redesigned between 4.8.0 and the Lucene version in which nori was added. This incompatibility made it impossible to debug the tests.
  2. BigDecimal doesn't exist in .NET.

We are more concerned with getting past the first issue than the second, since it would be trivial to exclude KoreanNumberFilter and KoreanNumberFilterFactory from the project (commented out, with notes at the top of each file explaining that we are missing the BigDecimal dependency), but I will provide some info about how to proceed if you are interested in porting it.

Issues to Fix

  1. Move this into a new project named Lucene.Net.Analysis.Nori, using Lucene.Net.Analysis.Kuromoji as a project template. We don't want to add large dependencies to Lucene.Net.Analysis.Common, since most people are using it.
  2. Add a comment at the top of each code file indicating the exact Lucene version it is ported from. Example: // Lucene version compatibility level 9.3.0
  3. Bring over the tests from the feature/analysis-nori branch and add them to a new project named Lucene.Net.Tests.Analysis.Nori (or alternatively create a fresh port of the tests). Either way, please ensure the tests are up to date with the exact Lucene version that you are porting the nori analyzer from.
  4. The nori project is very similar to the Lucene.Net.Analysis.Kuromoji and Lucene.Net.Analysis.SmartCn projects. Please use similar coding styles (constants instead of static fields, properties instead of methods where appropriate, naming conventions, folder detection for custom dictionaries, etc.). A short illustration follows this list.
  5. Please include the changes and documentation from the feature/analysis-nori branch to the lucene-cli tool that is used to build custom nori dictionaries.
  6. After addressing the porting issues below, please ensure all of the tests are passing. The best way to do this is to set up Java debugging and step through the code. Note that you will need to use the version of Java Lucene you are porting from in order to get the same results. Lucene has made some changes to their setup procedure, so it may take some research/experimentation to do. Do note that we have these versions successfully running. Alternatively, you may compare the code files line by line, but it may not be possible to figure out how to make the tests pass if you use this approach.
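
As a rough illustration of the coding-style conventions in item 4, here is a sketch of how a Java-style class might be shaped in the port; the type and members below are invented for illustration and are not taken from nori:

// Hypothetical example of the conventions from item 4; the type and members
// are made up for illustration and are not part of the nori port.
namespace Lucene.Net.Analysis.Nori
{
    public class ExampleToken
    {
        // Java "static final int DEFAULT_COST = 0;" becomes a constant.
        public const int DEFAULT_COST = 0;

        private readonly int leftId;

        public ExampleToken(int leftId)
        {
            this.leftId = leftId;
        }

        // Java "public int getLeftId() { return leftId; }" becomes a property.
        public int LeftId => leftId;
    }
}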

FST

At the time I attempted the feature/analysis-nori branch, the FST API seemed to fit; however, due to some design changes, it produced completely different results than the version I had ported it from (unfortunately, I don't recall what version it is based upon). At the time I thought that FST was tied deeply into other Lucene components and that having multiple incompatible versions in the project wouldn't work. However, I have since learned that FST is only used in specific scenarios that end users won't need to plug together, so having a copy of a later version of FST in the Lucene.Net.Analysis.Nori project should work fine (I think).

To be able to debug, we need to be able to step through the code and get FST to return the exact results from the Lucene version this is based upon. So, we need a fresh port of FST from the same version of Lucene that nori is based upon. The convention we are following is to put "extra" components such as this into a folder named Support, followed by the regular folder convention in Lucene.

src
  |
  -- Lucene.Net.Analysis.Nori
    |
    -- (All nori code files/folders)
    |
    -- Support
      |
      -- Util
        |
        -- Fst
          |
          -- (All FST code files)
  |
  -- Lucene.Net.Tests.Analysis.Nori
    |
    -- (All nori test code files/folders)
    |
    -- Support
      |
      -- Util
        |
        -- Fst
          |
          -- (All FST test code files)

Please be mindful that we will be using namespace conventions similar to our current ones (the namespace may not necessarily match the name of the project it belongs to). For now, please put the new FST port into the Lucene.Net.Support.Util.Fst namespace.
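
To make that concrete, a ported FST source file might start like the following sketch (the version number in the header is a placeholder; use the version you actually port from):

// Lucene version compatibility level 8.11.0  (header per item 2 above; version is a placeholder)
namespace Lucene.Net.Support.Util.Fst
{
    // The freshly ported FST classes (FST, Builder, Outputs, etc.) would live here,
    // mirroring Lucene's util/fst package layout under the Support folder.
}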

BigDecimal

We definitely don't want to take on a dependency on IKVM, both because it is large and because it doesn't support most of the .NET runtimes that we do. Please try one of the following (in order of preference):

  1. I suspect the primary reason BigDecimal was the go-to class in Java is that Java doesn't have a decimal data type like .NET does. decimal has limited precision compared to BigDecimal, but it would work for most use cases. As in the feature/analysis-nori branch, we can use a nullable decimal? to account for gaps between Java and .NET reference vs value types. (A minimal sketch follows this list.)
  2. We could attempt to wrap BigInteger into a BigDecimal. There is an implementation here, which was found on StackOverflow. Please include it in the Support/Numerics folder (Lucene.Net.Numerics namespace) and be mindful of comments where people have suggested improvements to the implementation. We can use a nullable BigDecimal? to account for gaps between Java and .NET reference vs value types.
  3. If neither of the above options works, I suggest including the code files for KoreanNumberFilter and KoreanNumberFilterFactory (and their tests) in the project, but commenting them out. We will simply not include an implementation in Lucene.Net.
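
As a rough illustration of option 1, a port could hold values that Java keeps in a (possibly null) BigDecimal in a nullable decimal instead. The helper below is hypothetical and only shows the shape of that substitution:

// Minimal sketch of option 1: C#'s decimal standing in for Java's BigDecimal.
// This helper is hypothetical and not part of the nori port.
using System.Globalization;

public static class KoreanNumberParsingSketch
{
    // Returns null where the Java code would return a null BigDecimal.
    public static decimal? TryParseNumber(string text)
    {
        if (string.IsNullOrEmpty(text))
        {
            return null;
        }

        return decimal.TryParse(text, NumberStyles.Number, CultureInfo.InvariantCulture, out decimal value)
            ? value
            : (decimal?)null;
    }
}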

@krinsang
Author

krinsang commented Oct 3, 2022

Wow! Very interesting contribution. It does not look like Java Lucene 4.8.0 or 4.8.1 contains the KoreanAnalyzer; however, they do contain a CJKAnalyzer, which is intended to cover Chinese, Japanese, and Korean.

Which Java Lucene version is this contribution a port of?

Nice to meet you. This is a port of Lucene 8.11.0. The problem I ran into with the CJKAnalyzer is that its TokenStreamComponents tokenizes using a bigram strategy instead of removing non-root components/keywords. In the Java implementation of the KoreanAnalyzer, I noticed that the TokenStreamComponents method exhibits a stemming behavior. I am using the Java library to perform offline jobs via Scala, and C# for online analysis of keywords.

@krinsang
Author

krinsang commented Oct 3, 2022

Thank you for such a detailed response. Are these unresolved issues directly related to the output of the methods used by KoreanAnalyzer being different between Java and C#?

@NightOwl888
Contributor

NightOwl888 commented Oct 4, 2022

I am using the Java library to perform offline jobs via Scala, and C# for online analysis of keywords.

If by this you mean that you are using the same index files in both C# and Scala, this won't work because of differences in the binary format between the 4.x and 8.x codecs. 8.x doesn't have support for reading 4.x indexes; that support ends at 5.x. Backward compatibility of codecs is read-only, but if that works for your use case (i.e. no docs added from Scala), porting the 4.6 codec from Lucene 5.5.5 to plug into Lucene 8.11.0 is likely possible, given that the codec interface is pluggable; only the binary format changes between implementations.

Another option is to use Lucene 8.11.0 and use IKVM's <IkvmReference> (or, as in the example below, IKVM.Maven.Sdk's <MavenReference>) to compile Lucene.

<Project Sdk="Microsoft.NET.Sdk">

  <PropertyGroup>
    <TargetFramework>netcoreapp3.1</TargetFramework>
  </PropertyGroup>

  <ItemGroup>
    <PackageReference Include="IKVM.Maven.Sdk" Version="1.0.1" />
  </ItemGroup>
  
  <ItemGroup>
    <MavenReference Include="org.apache.lucene:lucene-analyzers-nori" Version="8.11.0" Debug="true" />
  </ItemGroup>

</Project>

Alternatively, you could use ikvmc to build the libraries so you can use the -exclude parameter to leave out the .class files in the META-INF/versions folder, because they will produce warnings under Java SE 8. This will involve using Maven to download the dependencies and then figuring out how to wire them up with ikvmc.

Use the Maven copy-dependencies plugin to bring all of the .jar files into a specific local directory.

pom.xml

Use this pom.xml file in the directory you run the following commands from.

<?xml version="1.0" encoding="UTF-8"?>

<project xmlns="http://maven.apache.org/POM/4.0.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
  xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
  <modelVersion>4.0.0</modelVersion>

  <groupId>com.mycompany.app</groupId>
  <artifactId>my-app</artifactId>
  <version>1.0-SNAPSHOT</version>

  <dependencies>
    <dependency>
      <groupId>org.apache.lucene</groupId>
      <artifactId>lucene-analyzers-nori</artifactId>
      <version>8.11.0</version>
    </dependency>

  </dependencies>
</project>
mvn org.apache.maven.plugins:maven-dependency-plugin:3.3.0:copy-dependencies -DoutputDirectory="F:\deps" -DincludeScope=compile

To figure out what depends on what, you can use the depgraph-maven-plugin

mvn com.github.ferstl:depgraph-maven-plugin:3.3.0:for-artifact -DoutputFileName=".\depgraph.txt" -DgraphFormat=text -DclasspathScope=compile -Dscopes=compile -Dversion=1.0-SNAPSHOT -DgroupId=com.mycompany.app -DartifactId=my-app -DshowVersions=true -DshowDuplicates -DshowGroupIds -DshowConflicts

Are these unresolved issues directly related to the output of the methods used by KoreanAnalyzer being different between Java and C#?

Not all of them. We need to mirror the organization of the Lucene project. Nori is a separate module from analyzers-common for good reason: to allow most users to exclude it if all they need is StandardAnalyzer. Roughly 95% of users rely on analyzers-common, and we don't want them to depend on big dependencies like IKVM, ICU4N, or the nori resource files just to use Lucene.NET.

Unfortunately, KoreanAnalyzer wasn't added until Lucene 8.0 and was backported to 7.4. Although most of it seems to work (only 3 of the tests that apply to it fail), the UserDictionary relies on FST, which had incompatible binary format updates in 6.0.

I have since attempted updating the feature/analysis-nori branch (which I have determined must be based on 8.2.0 by eliminating the other possibilities) and porting FST from 8.2.0, but it is set up not to function with any version below 6, and that is by design. FST is used directly by the codecs, so although there are no references, I am pretty sure that they must be equivalent to read/write the same binary format from/to the index.

I also made an attempt to re-export the mecab-ko-dic using the 4.6 codec, but for some reason that also doesn't seem to fix the problems. It may be due to a bug in the nori port, but without anything to run to directly compare it to, it is difficult to determine what the problem actually is. There are instructions on how to build mecab-ko-dic, which I am providing here for reference, but I have also added a test to build the dictionary with the same settings as the nori build. The mecab dictionary can be downloaded here.

In theory, there is no real reason why KoreanAnalyzer cannot be ported to 4.8.0, but it will require cooperation with someone on the Lucene team to help us backport it, since we have absolutely nothing in the real world to compare it with. It might be doable just by studying the binary formats of the codecs and the FST specifications that Lucene is based on, but I suspect it will also require someone who is familiar enough with the history of the Lucene binary formats and the changes to the test framework/codecs for us to have passing tests in 4.8.0. There are only 3 test failures that would need to be addressed (see option 3 below). There are more failures due to the KoreanNumberFilter, but these are fixable without any assistance.

Options

  1. Use IKVM and the exact version of Lucene in both Java and .NET as I mentioned above.
  2. Use the ICUTokenizer (found in the Lucene.Net.ICU NuGet package) to build your own Analyzer for Korean. ICU uses a dictionary-based BreakIterator to tokenize CJK text. See the tests. Lucene.NET 4.8.0 uses ICU 60.1 (Lucene 4.8.0 used ICU 52.1) and Lucene 8.11.0 uses ICU 62.1, so the differences, if any, will be minor. (A rough sketch follows this list.)
  3. Use the implementation from the feature/analysis-nori-2 branch, which has now been upgraded to build in VS/on the command line. You can host the NuGet package yourself or use the DLL from the NuGet package by including the binary in your project. TestRandomHugeStringsMockGraphAfter, TestUserDict, and TestLookup are the only failing tests other than the KoreanNumberFilter tests. Using a UserDictionary is an edge case that probably doesn't apply to you, and the other failure appears to be something of an edge case as well. It will most likely work as-is for your purposes, but since we have failing tests, this isn't something we can add to the project until Lucene.NET is upgraded to a more recent version of Lucene.
  4. Contact the Lucene team on the developer mailing list to see if there is someone who can help us backport the nori analyzer to 4.8.0. It would probably be easiest for them to do the backport in Java Lucene 4.8.0 on a new branch, and then we could use that branch as a basis for our port to .NET. This would give us something in Java to run to be able to debug any issues with the .NET port. We recently had some assistance from @mocobeto on the brand new PersianStemmer contribution that didn't exist in Lucene, and she was quite helpful. You are not the first person to look for Korean support, so it would be great to have the nori module, but we don't currently have a path forward without a full upgrade of Lucene.NET.
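
For option 2, here is a minimal sketch of a custom analyzer built on ICUTokenizer. It assumes the Lucene.Net and Lucene.Net.ICU 4.8.0 packages; the filter chain shown is illustrative only and is not equivalent to the nori KoreanAnalyzer:

// Rough sketch of option 2: a Korean analyzer built on ICUTokenizer.
// Assumes Lucene.Net 4.8.0 and the Lucene.Net.ICU package; the filter chain
// is illustrative only and not equivalent to nori's KoreanAnalyzer.
using Lucene.Net.Analysis;
using Lucene.Net.Analysis.Core;
using Lucene.Net.Analysis.Icu.Segmentation;
using Lucene.Net.Util;

public static class KoreanIcuAnalyzerSketch
{
    public static Analyzer Create()
    {
        return Analyzer.NewAnonymous((fieldName, reader) =>
        {
            // ICUTokenizer uses a dictionary-based BreakIterator for CJK text.
            Tokenizer tokenizer = new ICUTokenizer(reader);
            TokenStream stream = new LowerCaseFilter(LuceneVersion.LUCENE_48, tokenizer);
            return new TokenStreamComponents(tokenizer, stream);
        });
    }
}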

NOTE: The last 3 options only apply if you are not using the same index files in both Java and .NET because, as I previously mentioned, there are no common codec formats between 4.x and 8.x. If you are sharing the same index files, the only options for doing so are:

  1. Build a web service to put the index behind so it can be read/written using a specific version (or use Solr or Elasticsearch).
  2. Use IKVM.
  3. Port the 4.6 codec over to Java Lucene 8.11.0 from Lucene 5.5.5. You may need to look at other codecs in the last version of 7.x and 8.x to see how they differ (7.x went back to 5.x, 8.x only went back to 7.x).

Assuming the index is not behind a web service and your app updates the index, you can use the Lucene.Net.Replicator module to copy the files of the primary index to another location where they can be read by another process. If your index never changes or only changes on deployment, you don't need to use it - it is just for safely copying out index files atomically, without taking down the application that is updating them.
