[jvm-packages] Publishing xgboost4j and others to Maven Central #1807

Closed
alexeygrigorev opened this issue Nov 23, 2016 · 42 comments

Comments

@alexeygrigorev
Contributor

alexeygrigorev commented Nov 23, 2016

Many users would like to see xgboost4j published to maven central (see #935)

I think we can follow the approach similar to MTJ (https://github.com/fommil/matrix-toolkits-java), which depends on netlib binaries - and this is, probably, what @Javelinjs suggested in his comment about mxnet

In essence, the idea is to have separate JAR files for each platform and publish them all to Maven Central. Then we add all of them as dependencies of xgboost4j and decide at run time which one to load.
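A minimal sketch of the run-time decision described above, assuming per-platform artifacts named after the `xgboost4j-native-<os>` pattern (the windows and osx module names appear later in this thread; the linux name and the `NativeArtifactSelector` class are hypothetical and only illustrate the idea):

```java
import java.util.Locale;

/** Hypothetical helper: maps the running OS to a per-platform native artifact name. */
public final class NativeArtifactSelector {
    private NativeArtifactSelector() {}

    /** Returns "linux", "windows" or "osx" for a given os.name value. */
    public static String osFamily(String osName) {
        String os = osName.toLowerCase(Locale.ROOT);
        if (os.contains("linux")) {
            return "linux";
        }
        // Match "windows" rather than "win", because "darwin" also contains "win".
        if (os.contains("windows")) {
            return "windows";
        }
        if (os.contains("mac") || os.contains("darwin")) {
            return "osx";
        }
        throw new UnsupportedOperationException("Unsupported OS: " + osName);
    }

    /** Assumed naming scheme: xgboost4j-native-<os family>. */
    public static String artifactIdFor(String osName) {
        return "xgboost4j-native-" + osFamily(osName);
    }

    public static void main(String[] args) {
        // Prints the artifact that would be picked on the current machine.
        System.out.println(artifactIdFor(System.getProperty("os.name")));
    }
}
```

Only 64-bit builds are in scope for this discussion, so the sketch does not inspect `os.arch`.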

We can also have a look at jni-loader (https://github.com/mrburrito/jni-loader)

This is how it looks for MTJ:

[image: mtj-dep]

We could start from selecting one platform, e.g. 64bit linux, and see how it goes.

@CodingCat
Member

As for platforms: since XGBoost does not work on 32-bit systems, we only need to care about 64-bit linux/win/osx.

@CodingCat
Member

My personally preferred way to publish to maven is to contain everything in a single jar.

http://central.maven.org/maven2/org/xerial/snappy/snappy-java/1.1.2.6/

you can download snappy-java-1.1.2.6.jar and look at the structure of their native libs

@alexeygrigorev
Contributor Author

I'll have a look, thanks. What I don't get is how the build process is organized in this case: it may mean that they have some internal repository with binaries, then they pull them from there during the building process, and only after that publish the jar.

Having several jars may be an advantage because there will be no need for that: we can use the maven central as such repository.

But I'll need to have a closer look.

@alexeygrigorev
Contributor Author

alexeygrigorev commented Nov 24, 2016

I am trying the multiple-modules approach - it seems more natural to me and, unlike the one-module-has-them-all approach, I have ideas how to implement it.

The way I think it could work is the following. Suppose there are 3 people: A with a linux machine, B with windows, and C with a mac.

When the next version is ready to be released to maven, A takes the current version of xgboost4j (e.g. 0.7-SNAPSHOT), and using the maven-release plugin does this:

  • updates the version to 0.7
  • releases the linux native lib along with other java modules to maven
  • commits the change in version to git
  • updates the version to 0.8-SNAPSHOT, commits the change again

After this is done, B and C can checkout the 0.7 version from git, and then build and publish only the native modules.

Of course, it is possible that B or C do the main release and others just publish the binaries.

I'm experimenting in my fork here: https://github.com/alexeygrigorev/xgboost

What do you think?

@CodingCat
Member

> I'll have a look, thanks. What I don't get is how the build process is organized in this case: it may mean that they have some internal repository with binaries, then they pull them from there during the building process, and only after that publish the jar.
>
> Having several jars may be an advantage because there will be no need for that: we can use the maven central as such repository.

They have pre-built native libraries https://github.com/xerial/snappy-java/tree/7650aa29fb52c3ba467e9c906cf22a3dab536861/src/main/resources/org/xerial/snappy/native

and

load them with https://github.com/xerial/snappy-java/blob/7650aa29fb52c3ba467e9c906cf22a3dab536861/src/main/java/org/xerial/snappy/SnappyLoader.java

there will be only one library in central maven
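The single-jar approach relies on a loader that copies the bundled native library out of the jar before loading it. A minimal sketch of that technique is shown here; it is not the actual SnappyLoader or XGBoost4J code, and the `BundledNativeLoader` class and any resource path passed to it are hypothetical:

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

/** Hypothetical loader for a native library bundled as a classpath resource. */
public final class BundledNativeLoader {
    private BundledNativeLoader() {}

    /** Copies a classpath resource to a temporary file and returns its path. */
    public static Path extract(String resourcePath) throws IOException {
        try (InputStream in = BundledNativeLoader.class.getResourceAsStream(resourcePath)) {
            if (in == null) {
                throw new IOException("Native library not bundled: " + resourcePath);
            }
            String fileName = resourcePath.substring(resourcePath.lastIndexOf('/') + 1);
            Path tmp = Files.createTempFile("native-", "-" + fileName);
            tmp.toFile().deleteOnExit();
            Files.copy(in, tmp, StandardCopyOption.REPLACE_EXISTING);
            return tmp;
        }
    }

    /** Extracts the bundled library and loads it into the JVM. */
    public static void load(String resourcePath) throws IOException {
        // System.load needs an absolute file system path, hence the extraction step.
        System.load(extract(resourcePath).toAbsolutePath().toString());
    }
}
```

A call such as `BundledNativeLoader.load("/lib/" + System.mapLibraryName("xgboost4j"))` would then pick the file name for the current platform, assuming the libraries sit under a `/lib/` resource directory (an assumption, not a path confirmed in this thread).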

@alexeygrigorev
Contributor Author

OK so it means they store the binaries in git? I am not sure it's a good idea.

Anyways, my experiments with multi-module build seem to have worked: I managed to deploy the binaries and the jars to sonatype's snapshot nexus. Here it is: https://oss.sonatype.org/content/repositories/snapshots/ml/dmlc/xgboost/

I only have linux and windows machines, so I tried only these two.

Right now using the snapshot versions should be possible this way:

<project>
...
  <repositories>
    <repository>
      <id>sonatype-snapshot</id>
      <name>Sonatype Snapshot Repository</name>
      <url>https://oss.sonatype.org/content/repositories/snapshots/</url>
    </repository>
  </repositories>
  <dependencies>
    <dependency>
      <groupId>ml.dmlc.xgboost</groupId>
      <artifactId>xgboost4j</artifactId>
      <version>0.7-SNAPSHOT</version>
    </dependency>
    ...
  </dependencies>
</project>

This should automatically download the appropriate native version depending on the platform.

For linux it seems to work well, but for windows it needs extra libraries - so I may need to try this with a clean virtual machine with only java and maven installed and see if it works.

Also, I needed to turn off building the jar-with-dependencies - sonatype's nexus doesn't allow uploading large files. These jars can be built with a special profile.

Once we agree on everything, then I can create a pull request and we can publish XGBoost to Sonatype Release repository, which synchronizes with maven central.

@CodingCat
Member

It does not say that we need to store binaries in git. The reason they have prebuilt native libraries saved there is that they plan to support many platforms, including those with hard-to-use toolchains.

Our goal is only to support 64-bit linux/mac/win. We only need to do what we are already doing: compile native libs -> copy to resource dir -> build jar.

I still don't see why uploading many jars to the central maven repo is necessary.

@alexeygrigorev
Contributor Author

It may not be necessary but I don't know how to organize the build process without it.

As I wrote earlier, in my opinion the limitation of one-jar-rules-them-all approach is that we first need to build the code for each target platform, store the binaries somewhere, and then during the publication to maven pull the binaries from there and include in the final jar. I don't know how to do it.

When it comes to multiple modules, it is still not ideal, but solves this problem, and the build process is organized as I wrote earlier.

So I would suggest following the approach I propose and getting the binaries into central sooner rather than later; then maybe someone with better knowledge of maven can modify it and do it better.

@CodingCat
Member

> As I wrote earlier, in my opinion the limitation of one-jar-rules-them-all approach is that we first need to build the code for each target platform, store the binaries somewhere, and then during the publication to maven pull the binaries from there and include in the final jar. I don't know how to do it.

Why store binaries somewhere? How about putting all native libraries on local disk (in the resource directory), including them in the jar when building, and finally publishing the jar to maven?

@alexeygrigorev
Contributor Author

Ok, so how would you do this? Somebody builds binaries for windows and then sends them over email to the person with linux?

@CodingCat
Member

That's another thing I do not understand.

Why do we have to involve more than one person for cross building? It's hard to imagine a release process that requires two people.

In rocksdb, they use vagrant to cross build for ubuntu and mac. xgboost does not have those system calls or anything like that, so these two platforms can share the same native lib file in most cases.

For windows, I am not an expert in win programming. Even if vagrant does not work, a manual build within a VM will achieve the same goal.

@CodingCat
Member

The next question to discuss is: can we skip windows when releasing to maven? The main reason is that we do not have enough (zero?) tests for xgboost4j under windows.

@alexeygrigorev
Contributor Author

alexeygrigorev commented Dec 5, 2016

Well, we probably don't need to involve more than one user, but I'm not an expert in vagrant either, sorry.

But what I suggest does require three users:

  • user with linux builds xgb and runs mvn deploy. This publishes only the linux version.
  • users with windows and mac build xgb and run mvn --projects xgboost4j-native-windows deploy and mvn --projects xgboost4j-native-osx deploy respectively.

This is for publishing the snapshot version; a release build would be a bit more complicated, but I outlined it above. As I am not familiar with vagrant and other virtualization tools, I don't know how to organize it better.

Let me know if my proposal is interesting for you, otherwise I'm putting my current efforts on hold.

@CodingCat
Member

I will talk with mxnet guys to understand if there is any other reason for them to have many jars in mvn central

@Craigacp
Contributor

I successfully made a jar with a Windows dll, a Mac OSX macports dylib and a Linux so, and stored it in our artifactory, which works pretty well. The exception is when someone who uses brew tries to use the macports dylib and gets an odd library-not-found error.

@Widerstehen

@alexeygrigorev I am looking forward to getting a Windows xgboost-spark JAR from the maven central repository or elsewhere. I often code in IntelliJ IDEA on Windows and then run the project on a Linux production system, because that makes debugging convenient. In my experience it is easy to compile xgboost on Linux, but on Windows I have never been successful. So if you have done it, please tell me. Thank you very much.

@algorithmdog

I have the same problem as @frank111.

@virl

virl commented Jun 15, 2017

Please publish xgboost to Maven with bundled native libraries for all architectures.

@CodingCat
Member

Even for non-x86 architectures?

@virl

virl commented Jun 16, 2017

@CodingCat Yes, for all architectures that XGBoost4J supports.

Please bundle the native libraries into the Maven package and load them at runtime depending on which architecture the app is running on.

Or at least allow selecting the native architecture by linking against different Maven packages at the app's build time (not your library's build time!), like DeepLearning4J does.

Anyway, building from source just to select the multithreading backend should not be required. The Maven packages should be enough to use the library.

@mjakobus

I would also appreciate it very much if at least the major releases of XGBoost4J were available via Maven.

I'm also using DeepLearning4J, which is very comfortable to use compared to XGBoost4J. Meanwhile, dl4j even offers nightly builds on maven.

In my opinion the lack of reliable builds of XGBoost4J is a major bummer for more serious use cases of this great library. Especially on Windows, building XGBoost4J is a heavy adventure ;)

@virl

virl commented Jun 16, 2017

@mjakobus Yes, I feel the same: XGBoost4J is missing regular major releases, and especially Maven-released packages with the native backend selected at runtime.

@anshbansal

How do people use this in production if it is not in maven central? Manually create the JAR files?

@superbobry
Contributor

superbobry commented Nov 7, 2017

At Criteo we build XGBoost JARs on Travis/Appveyor. In theory, the same scripts can be reused to publish the official JARs for XGBoost, but I didn't have the time to do that.

@alexeygrigorev
Contributor Author

alexeygrigorev commented Nov 7, 2017

We just manually put them to our nexus
(By "manually" I mean via maven, but not in a CI-configured way)

@anshbansal

so the pom works to generate the artifact via standard maven jar building commands? And has this been tested in a linux environment?

@alexeygrigorev
Contributor Author

In our case - yes, and we do it only for linux machines

@Craigacp
Contributor

Craigacp commented Nov 7, 2017

I've built a multi-jar with Linux, Windows and Mac libraries, and put it in an artifactory. Works fine from there.

@edumucelli

At BlaBlaCar we build it and then publish it to an internal nexus. Apps then fetch it from the nexus. It is not a multi-jar, so we have Linux and Mac libraries separately. Apps then get the right dependency, e.g., using Os.isFamily(Os.FAMILY_MAC). It would be great to have a multi-jar out-of-the-box, though. @Craigacp is your multi-jar available somewhere?

@Craigacp
Contributor

Craigacp commented Nov 7, 2017

Unfortunately my version isn't available, but the logic in XGBoost4J causes it to load the correct binary based on the platform, so all you need to do is unzip each jar, copy the dll, so and dylib into the same resources directory and rejar it. If you require multiple linux versions, this approach won't work, as the loading logic isn't complicated enough (similarly it fails if you have multiple so files for different platforms e.g. Linux & Solaris).
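The unzip, copy and rejar steps described above can also be done in one pass. The following is only a sketch of that idea, assuming each per-platform jar keeps its native library under the same resource directory; the `MultiJarAssembler` class and the jar names in the usage note are hypothetical:

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Enumeration;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.zip.ZipEntry;
import java.util.zip.ZipFile;
import java.util.zip.ZipOutputStream;

/** Hypothetical helper that merges per-platform jars into a single multi-platform jar. */
public final class MultiJarAssembler {
    private MultiJarAssembler() {}

    /** Merges the inputs into output; the first occurrence of each entry name wins. */
    public static void merge(List<Path> inputs, Path output) throws IOException {
        Set<String> seen = new HashSet<>();
        byte[] buffer = new byte[8192];
        try (OutputStream fileOut = Files.newOutputStream(output);
             ZipOutputStream zipOut = new ZipOutputStream(fileOut)) {
            for (Path input : inputs) {
                try (ZipFile zip = new ZipFile(input.toFile())) {
                    Enumeration<? extends ZipEntry> entries = zip.entries();
                    while (entries.hasMoreElements()) {
                        ZipEntry entry = entries.nextElement();
                        // Classes, the manifest and directories are shared by all
                        // inputs; only the native libraries differ between them.
                        if (!seen.add(entry.getName())) {
                            continue;
                        }
                        zipOut.putNextEntry(new ZipEntry(entry.getName()));
                        if (!entry.isDirectory()) {
                            try (InputStream in = zip.getInputStream(entry)) {
                                int read;
                                while ((read = in.read(buffer)) != -1) {
                                    zipOut.write(buffer, 0, read);
                                }
                            }
                        }
                        zipOut.closeEntry();
                    }
                }
            }
        }
    }
}
```

For example, `MultiJarAssembler.merge(List.of(linuxJar, macJar, windowsJar), multiJar)` (Java 9+ for `List.of`) would produce one jar containing the so, dylib and dll side by side. As noted above, this still cannot hold two different so files with the same name.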

@superbobry
Contributor

superbobry commented Nov 7, 2017

@edumucelli you can assemble a multi JAR by running download_latest_release.py from here.

It is built for an admittedly ancient CentOS6, so should work on CentOS7 as well as more recent Linux distributions.

@edumucelli

@superbobry, that is great! Thank you for sharing it!

@Obarros

Obarros commented Feb 13, 2018

@alexeygrigorev @CodingCat @edumucelli what was the outcome of this?
Is there a solution in place to automatically build JARs for xgboost and publish them somewhere?

@alexeygrigorev
Contributor Author

There is, yes. Right now it is possible to run mvn deploy, and it will deploy the artifacts to your local nexus repository.

@edumucelli

edumucelli commented Jun 26, 2018

@Obarros I am using @superbobry's multi JAR on Debian-based containers in production.

@CodingCat
Member

For anyone who wants to use a pre-built version of xgboost, please check the README file in https://github.com/dmlc/xgboost/tree/master/jvm-packages: we have published artifacts to maven central.

@Obarros

Obarros commented Jun 26, 2018

@CodingCat, @edumucelli Thanks!

@bluelu

bluelu commented Jun 28, 2018

@CodingCat could you please also push the windows artifacts as well? The published artifact only contains the linux version. thanks

@edumucelli

@bluelu, it contains both Linux and MacOS.
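Which native libraries a given jar actually bundles can be checked by listing its entries. A small sketch, assuming the libraries are recognizable by their `.so`, `.dylib` and `.dll` extensions (the `NativeLibLister` class is hypothetical, and the jar path is whatever artifact you downloaded):

```java
import java.io.IOException;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.Enumeration;
import java.util.List;
import java.util.zip.ZipEntry;
import java.util.zip.ZipFile;

/** Hypothetical helper that lists the native libraries bundled in a jar. */
public final class NativeLibLister {
    private NativeLibLister() {}

    /** Returns the names of all entries that look like native libraries. */
    public static List<String> nativeEntries(Path jar) throws IOException {
        List<String> found = new ArrayList<>();
        try (ZipFile zip = new ZipFile(jar.toFile())) {
            Enumeration<? extends ZipEntry> entries = zip.entries();
            while (entries.hasMoreElements()) {
                String name = entries.nextElement().getName();
                if (name.endsWith(".so") || name.endsWith(".dylib") || name.endsWith(".dll")) {
                    found.add(name);
                }
            }
        }
        return found;
    }

    public static void main(String[] args) throws IOException {
        // Usage: java NativeLibLister path/to/some.jar
        for (String name : nativeEntries(Paths.get(args[0]))) {
            System.out.println(name);
        }
    }
}
```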

@superbobry
Contributor

@edumucelli this has been discussed in #3276. tl;dr @CodingCat decided not to support Windows for the Maven Central JARs.

We have some prebuilt JARs over at criteo-forks/xgboost-jars which come with a Windows DLL, though.

@bluelu

bluelu commented Jun 28, 2018

Hi, it's fine for me. I have built my own version; however, it would certainly help others if it were readily available without having to compile it yourself.

@edumucelli

@superbobry thanks for the link to that thread. That's not an issue, I was just complementing @bluelu's comment about the linux-only jar, which in fact has a MacOS dylib too.

lock bot locked as resolved and limited conversation to collaborators Oct 25, 2018