
Merging with JCuda and JOpenCL projects for better-quality CUDA interfaces #475

Open
archenroot opened this issue Oct 8, 2017 · 73 comments

@archenroot
Member

@saudet Hi buddy,
over the last few weeks it occurred to me: what about merging the CUDA and OpenCL work here with the work of the people from the JCuda and JOpenCL projects? I understand there are some fundamental differences, but having more good developers on a single project could improve its quality as well.

The JCuda folks opened a discussion at my request here:
https://forum.byte-welt.net/t/about-jcuda-and-javacpp/19538

So, if you think it could bring more value as well, feel free to join the discussion.

@saudet
Member

saudet commented Oct 8, 2017

That would be nice, but the problem is that people expect Oracle to come up with a better solution than JavaCPP, even though they are not working on anything at the moment. As far as I can tell, the developers of Project Panama have given up on any generic solution for C++; no one knows how to make something better than JavaCPP. Still, they hope, believe, and wait, mostly. If you could help convince them that nothing better is going to happen, explaining and re-explaining over and over again how JavaCPP could be improved, that would be the first thing that needs to be done.

@archenroot
Member Author

Just for reference: jcuda/jcuda#12 (comment)

@archenroot
Member Author

archenroot commented Oct 8, 2017

@saudet
I am reading about it; the Panama project actually started a long time ago. What is that project based on, JNI or something new?

Anyway, I registered on the project mailing list but, to be honest, when I went through some links on the project site, some repository links were broken, and the blogs of the main creators/devs haven't been updated in a long time... I will check more and read about this.

@jcuda

jcuda commented Oct 8, 2017

There is a lot happening in Panama (the project...) right now - admittedly, although I'm registered to the mailing list, too much to follow it all in detail. However, if they manage to achieve the goals stated on the project site, http://openjdk.java.net/projects/panama/ , this would certainly compete with JavaCPP.

Of course, development there happens at a different pace. We all know that a "single-developer project" can often be far more agile than a company-driven project, where specifications and sustainability play a completely different role. Panama also approaches topics that go far beyond what can be accomplished by JavaCPP or JNI in general. They are really going down to the guts, and the work there is interwoven with the topics of value types, vectorization, and other HotSpot internals.

So I agree with saudet that it does not make sense to (inactively) "wait for a better solution". JavaCPP is an existing solution for (many, but by no means all, of) the goals that are addressed in Panama.


More generally speaking, the problem of fragmentation (in terms of different JNI bindings for the same library) has occurred quite frequently. One of the first "large" cases was OpenGL, where JOGL basically competed with LWJGL. For CUDA, there were some very basic approaches, but none of them (except for JCuda) have really been maintained. When OpenCL popped up, there quickly were a handful of Java bindings (some of them listed at jocl.org and in this stackoverflow answer), but I'm not sure how actively each of them is still used and maintained.

(OT: It has been a bit quiet around OpenCL in general recently. Maybe due to Vulkan, which also supports GPU computations? When Vulkan was published, I registered jvulkan.org, but the statement "Coming soon" there is no longer true: there already is a Vulkan binding in LWJGL, and the API is too complex to create manual bindings for. There doesn't seem to be a Vulkan preset for JavaCPP, or did I overlook it?)

For me, as the maintainer of jcuda.org and jocl.org, one of the main questions about "merging" projects would be how this can be done "smoothly", without just abandoning one project in favor of the other. I have always tried to be backward compatible and "reliable" in that sense. Quite a while ago, I talked to one of the maintainers of Jogamp-JOCL about merging the Jogamp-JOCL and the jocl.org JOCL. One basic idea there was to reshape one of the libraries so that it could be some sort of "layer" placed over the other, but this idea has not been pursued any further.

I'm curious to hear other thoughts and ideas about how such a "merge" might actually be accomplished, considering that the projects are built on very different infrastructures.

@saudet
Member

saudet commented Oct 9, 2017 via email

@saudet
Member

saudet commented Oct 9, 2017

Yes, JCuda, etc. could be rebased on JavaCPP; that's the idea, IMO. There are no bindings for OpenCL or Vulkan simply because I don't have the time to do everything, that's all.

@archenroot
Member Author

archenroot commented Oct 9, 2017

@jcuda @saudet
A little off-topic, but related:
I am very interested in JNR but, to be honest, I wasn't able to find any kind of benchmark or even a detailed comparison. Previously we had JNA and JNI: JNA was slow but easy to use, while for high-performance work you need performance, so we go with JNI where possible, right? That is also the way of JavaCPP and JCuda. Could you guys post some reference document comparing JNR to JNI from a performance perspective? I would love to understand the internal architecture of JNR, especially the performance benefits over JNI. I am aware it goes far beyond performance only, but when you run a 200-node CPU/GPU cluster, performance (throughput and latency) matters. The complexity of adoption can always be handled :-)
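For reference, a minimal JNR-FFI sketch of binding a libc function without hand-written JNI (the interface and class names here are made up, not from any of the projects in this thread):

import jnr.ffi.LibraryLoader;

public class JnrExample {
    // Hypothetical binding interface; each method maps to a libc symbol.
    public interface LibC {
        int getpid();
    }

    public static void main(String[] args) {
        // JNR generates the call stubs at runtime - no JNI glue code to write.
        LibC libc = LibraryLoader.create(LibC.class).load("c");
        System.out.println("pid = " + libc.getpid());
    }
}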

@saudet
Member

saudet commented Oct 9, 2017 via email

@archenroot
Member Author

archenroot commented Oct 9, 2017

@saudet thanks buddy,

I also suggest moving the discussion about JCuda vs. JavaCPP to Marco's thread, as he requested:
https://forum.byte-welt.net/t/about-jcuda-and-javacpp/19538/3

NOTE: Beyond the theoretical discussion, since performance is the top priority, I suggest that you, @saudet, create a new GitHub project under JavaCPP where we can develop a real benchmark for JCuda- and JavaCPP-based CUDA (as Vulkan and OpenCL are not available at the moment), so we can analyze code syntax differences/similarities as well as performance in a unified way.

I also suggest deciding which benchmarking framework should be used to build this, for example one of those discussed here:
https://stackoverflow.com/questions/7146207/what-is-the-best-macro-benchmarking-tool-framework-to-measure-a-single-threade

@saudet
Member

saudet commented Oct 9, 2017 via email

@archenroot
Member Author

I will create the initial project and adapt a few basic CUDA algorithms to be implemented in JCuda and JavaCPP. I hope we can find more users from the other side (JCuda) to participate as well.

@saudet
Member

saudet commented Oct 9, 2017 via email

@archenroot
Member Author

archenroot commented Oct 9, 2017

I think it's best to make this generic, so "benchmarks" sounds good. On top of this, I would also like to later (time permitting) test JavaCPP vs. JNR with some simple dummy UUID-generation function call tests against libuuid, using these declarations as a kind of template:

#include <uuid/uuid.h>

void uuid_generate(uuid_t out);
void uuid_generate_random(uuid_t out);
void uuid_generate_time(uuid_t out);
int uuid_generate_time_safe(uuid_t out);
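A minimal sketch of what such a test could look like on the JavaCPP side (assuming libuuid is installed; this is not an existing preset, and the class name is made up):

import org.bytedeco.javacpp.*;
import org.bytedeco.javacpp.annotation.*;

@Platform(include = "<uuid/uuid.h>", link = "uuid")
public class LibUuid {
    static { Loader.load(); }

    // uuid_t is a 16-byte unsigned char array, mapped here as a BytePointer
    public static native void uuid_generate(@Cast("unsigned char*") BytePointer out);

    public static void main(String[] args) {
        BytePointer uuid = new BytePointer(16);
        uuid_generate(uuid);
        for (int i = 0; i < 16; i++) {
            System.out.printf("%02x", uuid.get(i) & 0xff); // print raw UUID bytes
        }
        System.out.println();
    }
}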

@jcuda

jcuda commented Oct 9, 2017

I am also registered to the list, but I'm not seeing anything happen. Could you point me to where, for example, they demonstrate creating an instance of a class template? I would very much like to see it. Thanks

Again, I'm not so deeply involved there, but their primary goal is (to my understanding) not something that is based on accessing libraries via their definitions in header files. My comment mainly referred to the high-level project goals (i.e. accessing native libraries, basically regardless of which language they have been written in), together with the low-level efforts in the JVM. At least, there are some interesting threads in the mailing list, and the repo at http://hg.openjdk.java.net/panama/panama/jdk/shortlog/d83170db025b seems rather active.


Regarding the benchmarks: As I also mentioned in the forum, creating a sensible benchmark may be difficult. Even more so if it is supposed to cover the point that is becoming increasingly important, namely multithreading. But setting up a basic skeleton with basic sample code could certainly help to figure out what can be measured, and how it can be measured sensibly.

(As for the topic of merging libraries, the API differences might actually be more important, but this repo would automatically serve this purpose, to some extent - namely, by showing how the same task is accomplished with the different libraries)

@archenroot
Member Author

@jcuda

Thanks for your comments. Actually, based on the presentation, it even looks like they have added more processing layers than JNI has :-))), but I will need to investigate the whole story more. Thanks for the link.

Regarding the benchmark:
that is the point, establishing a kind of skeleton. By multithreading, do you mean CPU multithreading? I think it will be good, along with the template definition, to discuss possible algorithms to be implemented and their general specification. Good point.

That is exactly the point, because at the moment I also do not know how big the differences are, and how big a breakthrough we are talking about.

@saudet
Member

saudet commented Oct 9, 2017

@archenroot I created the repository and gave you admin access:
https://github.com/bytedeco/benchmarks
Feel free to arrange it as you see fit and let me know if you need anything else! Thanks

@archenroot
Member Author

@saudet good starting point, I will try to do as discussed:
prepare a common benchmark structure/template and a list of interesting algorithms (including, of course, multi-threaded ones from the client perspective).

In some cases I am also thinking of providing an existing C/C++ implementation, if available, to compare against native performance, but I will focus on JCuda vs. JavaCPP at first.

Thanks again.

@jcuda

jcuda commented Oct 10, 2017

By multithreading you mean CPU multithreading?

Yes. CUDA offers streams and some synchronization methods that are basically orchestrated from the client side. (This may involve stream callbacks, which were only introduced in JCuda recently; an example is at https://github.com/jcuda/jcuda-samples/blob/master/JCudaSamples/src/main/java/jcuda/driver/samples/JCudaDriverStreamCallbacks.java )

As for the other "benchmarks": Some simple matrix multiplication could be one that creates a real workload. Others might be more artificial, in order to more easily tune the possible parameters. Just a rough example: One could create a kernel that just operates on a set of vector elements. Then one could create a vector with 1 million entries and try different configurations - namely, copying Y elements and launching a kernel with grid size Y, 1000000/Y times. This would mean

  • process 100-element blocks, using 10000 copies/launches
  • process 1000-element blocks, using 1000 copies/launches
  • process 10000-element blocks, using 100 copies/launches
  • process 100000-element blocks, using 10 copies/launches

(the kernel itself could then also be "trivial", or create a real workload by throwing in some useless sin(cos(tan(sin(cos(tan(x)))))) computations...)

Again, this is just a vague idea.
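To make this more concrete, a rough sketch of that copy/launch loop with the JCuda driver API might look as follows (not a finished benchmark: it assumes a current context and an already-loaded kernel taking (int *data, int n)):

import jcuda.Pointer;
import jcuda.Sizeof;
import jcuda.driver.*;
import static jcuda.driver.JCudaDriver.*;

public class CopyLaunchSketch {
    static void run(CUfunction function, int[] hostData, int blockElements) {
        CUdeviceptr deviceData = new CUdeviceptr();
        cuMemAlloc(deviceData, (long) blockElements * Sizeof.INT);

        int launches = hostData.length / blockElements;
        int blockSize = 256;
        int gridSize = (blockElements + blockSize - 1) / blockSize;
        for (int i = 0; i < launches; i++) {
            // Copy one block of elements to the device...
            Pointer src = Pointer.to(hostData)
                .withByteOffset((long) i * blockElements * Sizeof.INT);
            cuMemcpyHtoD(deviceData, src, (long) blockElements * Sizeof.INT);
            // ...and launch the (possibly trivial) kernel on that block
            Pointer kernelParams = Pointer.to(
                Pointer.to(deviceData),
                Pointer.to(new int[] { blockElements }));
            cuLaunchKernel(function,
                gridSize, 1, 1, blockSize, 1, 1, 0, null, kernelParams, null);
        }
        cuCtxSynchronize();
        cuMemFree(deviceData);
    }
}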

@saudet
Member

saudet commented Oct 22, 2017

FWIW, being able to compile CUDA kernels in Java is something we can do easily with JavaCPP as well. To get a prettier interface, we only need to finish what @cypof has started in bytedeco/javacpp#138.

@blueberry

@archenroot @jcuda May I add that the actual computation time of the GPU kernels is not that important for these benchmarks. What we need to measure here is the overhead over plain C/C++ CUDA driver calls.

So, let's say that enqueuing the "dummy" kernel costs X time, and a Java wrapper needs k * X time. We are interested in knowing k1 (JCuda) and k2 (JavaCPP CUDA), i.e. k1*X/X, k2*X/X, and/or (k1*X)/(k2*X).

In my opinion, (k1*X)/(k2*X) is the easiest of those to measure.
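A minimal JMH skeleton for that measurement could look as follows (JMH is just one candidate framework here, and the two enqueue helpers are hypothetical placeholders; the quotient of the two average times is then (k1*X)/(k2*X) = k1/k2 directly):

import java.util.concurrent.TimeUnit;
import org.openjdk.jmh.annotations.*;

@State(Scope.Thread)
@BenchmarkMode(Mode.AverageTime)
@OutputTimeUnit(TimeUnit.MICROSECONDS)
public class EnqueueOverheadBenchmark {

    @Benchmark
    public void enqueueViaJCuda() {
        // would call cuLaunchKernel(...) on a no-op kernel through JCuda
        enqueueDummyKernelViaJCuda();
    }

    @Benchmark
    public void enqueueViaJavaCpp() {
        // would launch the same no-op kernel through the JavaCPP CUDA presets
        enqueueDummyKernelViaJavaCpp();
    }

    private void enqueueDummyKernelViaJCuda() { /* hypothetical placeholder */ }
    private void enqueueDummyKernelViaJavaCpp() { /* hypothetical placeholder */ }
}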

@jcuda

jcuda commented Oct 22, 2017

Compiling CUDA kernels at runtime is already possible with the NVRTC (a runtime compiler). An example is at https://github.com/jcuda/jcuda-samples/blob/master/JCudaSamples/src/main/java/jcuda/nvrtc/samples/JNvrtcVectorAdd.java . (Of course, one could add some convenience layer around this. But regarding performance, the compilation of kernels is not relevant in most use cases.) I'll have a look at the linked PR, though.
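Condensed from that sample, the NVRTC flow in JCuda looks roughly like this (a sketch with error checking omitted, not a complete program):

import jcuda.nvrtc.nvrtcProgram;
import static jcuda.nvrtc.JNvrtc.*;

public class NvrtcSketch {
    // Compile a CUDA source string at runtime and return the resulting PTX.
    public static String compileToPtx(String sourceCode) {
        nvrtcProgram program = new nvrtcProgram();
        nvrtcCreateProgram(program, sourceCode, null, 0, null, null);
        nvrtcCompileProgram(program, 0, null);

        String[] ptx = new String[1];
        nvrtcGetPTX(program, ptx);
        nvrtcDestroyProgram(program);
        return ptx[0];
    }
}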

@saudet
Member

saudet commented Oct 23, 2017

@jcuda Oh, interesting. It's nice to be able to do this with C++ in general and not only CUDA though.

@jcuda

jcuda commented Oct 24, 2017

In fact, the other sample at https://github.com/jcuda/jcuda-samples/blob/master/JCudaSamples/src/main/java/jcuda/nvrtc/samples/JNvrtcLoweredNames.java shows that this also supports "true" C++, with namespaces, templates, etc.

(The sample does not really "do" anything; it only shows how the mangled names may be accessed afterwards.)

The NVRTC was introduced only recently, and before that, one problem indeed was the lack of proper C++ support for kernels in JCuda: It was possible to compile kernels that contained templates by using the offline CUDA compiler (which is backed by a C++ compiler like that of Visual Studio). The result was a PTX file with one function for each template instance - but of course, with oddly mangled names that had to be accessed directly via strings from Java. With the NVRTC, this problem is at least alleviated.
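Sketched after the JNvrtcLoweredNames sample, the mechanism looks roughly like this ("myKernel<float>" stands in for an arbitrary template instantiation):

import jcuda.nvrtc.nvrtcProgram;
import static jcuda.nvrtc.JNvrtc.*;

public class LoweredNameSketch {
    // Returns the mangled name for a template instantiation like "myKernel<float>",
    // which can then be passed to cuModuleGetFunction.
    public static String loweredNameFor(String sourceCode, String nameExpression) {
        nvrtcProgram program = new nvrtcProgram();
        nvrtcCreateProgram(program, sourceCode, null, 0, null, null);
        nvrtcAddNameExpression(program, nameExpression); // register before compiling
        nvrtcCompileProgram(program, 0, null);

        String[] loweredName = new String[1];
        nvrtcGetLoweredName(program, nameExpression, loweredName);
        nvrtcDestroyProgram(program);
        return loweredName[0];
    }
}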

@saudet
Member

saudet commented Oct 24, 2017 via email

@jcuda

jcuda commented Oct 25, 2017

That's right. And the question has been asked occasionally, aiming at something like "JThrust". But I think that the API of Thrust (which on some level is rather template-heavy) does not map sooo well to Java. I think that a library with functionality similar to that of Thrust, but in a more Java-idiomatic form, would make more sense.

(A while ago I considered at least creating some bindings for https://nvlabs.github.io/cub/ , as asked for in jcuda/jcuda-main#11 , but I'm hesitant to commit to another project - I'm running out of spare time....)

@saudet
Member

saudet commented Jan 15, 2018

@jcuda @archenroot @blueberry FYI, wrapper overhead might become more important since kernel launch overhead has apparently been dramatically reduced with CUDA 9.1:

  • Launch kernels up to 12x faster with new core optimizations

https://developer.nvidia.com/cuda-toolkit/whatsnew

@jcuda

jcuda commented Jan 16, 2018

They don't give any details or a baseline for what they compared. A dedicated benchmark or a comparison between CUDA 9.0 and 9.1 might be worthwhile. (I haven't updated to 9.1 yet - currently, the Maven release of 9.0 is on its way...)

@archenroot Any updates on the benchmark repo?

@saudet
Member

saudet commented Jan 17, 2018

In the meantime, I've released presets for CUDA 9.1 :)
http://search.maven.org/#search%7Cga%7C1%7Cbytedeco%20cuda

@saudet
Member

saudet commented Jan 5, 2021

@jcuda The central statistics for the CUDA presets look like this (numbers for December aren't in yet, it seems):
[chart: Maven Central download statistics for the CUDA presets]
(I don't know what's been happening since August, but it looks like something is happening there. 136,593 downloads for November are from the same IP... Maybe some rogue CI server gone wild somewhere.)

In any case, my goal with JavaCPP was never to provide clean APIs for end users, but to provide developers like you with the tools necessary to work on high-level idiomatic APIs. These are the kind of tools that nearly all Python developers take for granted, but for some reason most Java developers, even those at Oracle, prefer to write JNI manually, such as with the work that @Craigacp has recently been doing for ONNX Runtime. Another case in point: Panama has officially dropped any intention of offering something like JavaCPP as part of OpenJDK, see http://cr.openjdk.java.net/~mcimadamore/panama/jextract_distilled.html. What they are saying, essentially, is that since they haven't been able to come up with something that's perfect, something they can confidently support for the next century or so (I'm exaggerating here, but that's not far from the truth), they will leave this dirty work to others like myself and yourself! :) So, please do consider rebasing JCuda and JOCL on JavaCPP. People who really wish to use the crappy parts of the CUDA API will be able to, while you can concentrate on offering some subset of it that makes sense to most Java users. TensorFlow has done it and they even got a speed boost over manually written JNI, see tensorflow/java#18 (comment). MXNet has also dropped their manually written JNI and may choose to continue either with (slow) JNA or (faster) JavaCPP, see apache/mxnet#17783.

In any case, if you still feel strongly against using a tool like JavaCPP, please let me know why! The engineers at NVIDIA certainly haven't been very clear about why they consider tools like Cython, pybind11, setuptools, and pip to be adequate for Python, but not for Java where for some reason everything has to be redone manually with JNI for each new project, see rapidsai/cudf#1995 (comment). /cc @razajafri

@jcuda

jcuda commented Jan 5, 2021

So, what's happening since August...?

[chart: JCuda download statistics]

Maybe people (or at least, one or few "large" users) are moving from JCuda to JavaCPP...


In any case, my goal with JavaCPP was never to provide clean APIs for end users, but to provide developers like you with the tools necessary to work on high-level idiomatic APIs.

Originally, my goal of JCuda was also to address two layers:

  1. The 1:1 low-level JNI bindings. Just offering what is there, exactly as it is, regardless of whether it makes sense for Java or not. (This includes obvious things, like process(int *array, int length) that should be process(int array[]) in Java, but also many others)
  2. A somewhat object-oriented, idiomatic, easier-to-use API on top of that

I didn't really tackle the latter. It would be easy to offer some abstraction layer that covers 99% of all use cases (copy memory, run kernel - that's it). But designing, maintaining and extending this properly could be a full-time job.

The direct JNI bindings had been manageable... until recently. I have some parsing- and code generation infrastructure (which, in turn, is far away from being publishable). But the general approach of memory/Pointer handling hit some limits with the recent CUDA API extensions.


I talked with some of the Panama guys a while ago. Part of this discussion was also about ~"the right level of abstraction". I'm generally advocating for defining clear, narrow tasks. Creating a tool that does one thing, and does it right. Or, as indicated by the two steps mentioned above: defining a powerful (versatile), stable (!) API, and building the convenience layer based on that.

I didn't manage to follow the discussion on the Panama mailing list in all detail. But I can roughly imagine the difficulties that come with designing something that is supposed to be used for literally everything (i.e. each and every C++ library that somebody might write), and doing this in a form that is stable and reliable.

(And by the way: I highly appreciate the fact that Oracle puts much emphasis on long-term stability and support. Today, I can take a Java file that was written for a 32bit Linux with Java 1.2 in 1999, and drag-and-drop it into my IDE on Win10 with Java 8, and it just works. Period. No updates. No incompatibility. No hassle. No problems whatsoever. Maybe one only learns to appreciate that after being confronted with the daunting task of updating some crappy JS "web-application" from Angular 4.0.1.23 to 4.0.1.23b and noticing that this may imply a re-write. Stability and reliability are important)

I only occasionally read a few mails from the Panama mailing list, and noticed that the discussion is sometimes ... *ehrm*... a bit heated ;-) and this point seems to be very controversial. But I cannot say anything technically profound here, unless I invest some time to update and get an overview of the latest state. So ... the following does not make sense (I know that), and may sound stupid, but to roughly convey my line of thought: Could it be that, one day, Panama and JavaCPP work together? E.g. that Panama can generate JavaCPP presets, or JavaCPP presets can be used in Panama? I think that one tool addressing a certain layer, or having a narrower focus than another, does not mean that the tools cannot complement each other...


An aside:

People who really wish to use the crappy parts of the CUDA API will be able to, while you can concentrate on offering some subset of it that makes sense to most Java users.

I'd really like to do that, for some parts of the CUDA API. It lends itself to an Object-Oriented layer quite naturally.

Kernel kernel = Platform.compile("kernel.cu");
Device device = Platforms.get(0).getSomeDevice();
Memory input = device.receive(someArray);
Memory output = device.allocate(n);
device.execute(kernel, input, output);
...

And even more so for the new "Graph Execution" part of the API that my rant was about (I'm a fan of flow-based programming - that's why I created https://github.com/javagl/Flow , and having "CUDA modules" there would be neat...). But the point is: Nobody wants to use these parts of the CUDA API. People think that they have to use it, for profit, and will use it. They will hate it, but they will use it. And NVIDIA knows that, so they obviously don't give the slightest ... ... care... about many principles of API design.


So, please do consider rebasing JCuda and JOCL on JavaCPP.
+
In any case, if you still feel strongly against using a tool like JavaCPP, please let me know why!

I don't feel strongly against using a tool like JavaCPP, and as already mentioned elsewhere: if JavaCPP had been available 10 years ago, I probably wouldn't have spent countless hours on JCuda (including the parsing and code generation infrastructure). I have to admit that I haven't set up the actual JavaCPP toolchain, for the actual creation of code, because I'd have to allocate some time for https://github.com/bytedeco/javacpp-presets/wiki/Building-on-Windows , but it would certainly be (or have been) less effort in the long run...

Regarding rebasing JCuda on JavaCPP: I think we already talked about that, quickly, in the forum. It might be possible to do that to some extent. But I have some doubts. Very roughly speaking:

  • It may not be worth the effort. Introducing a layer that translates a call like JCuda.cudaMalloc(jcudaPointer, 4); into the appropriate cudart.cudaMalloc(javacppPointer, 4); may be justified by trying to change the basis of JCuda while maintaining some backward compatibility at the same time. But essentially, they are both only 1:1 JNI bindings of the CUDA API, so the differences at the surface level may not warrant the effort of adding such a translation layer for a few thousand functions. (Also, JCuda is a spare-time project. What's in it for me there? If someone has some spare $$.$$$,$$, I'll do this, no problem...)
  • It may have a slight disadvantage in terms of performance
  • Such a translation may not be possible in all cases

The last one refers to one point that I'm not sure about in JavaCPP. To my understanding, when creating an IntPointer from an int[] array, as in https://github.com/bytedeco/sample-projects/blob/master/cuda-vector-add-driverapi/src/main/java/VectorAddDrv.java#L44 , the memory will also (immediately) be allocated and filled on the native side. If this is true, then imagine code like this:

int array[] = new int[100000000]; // 100 million ints - ~400 MB
IntPointer a0 = new IntPointer(array); // This will allocate and copy 400 MB...
IntPointer a1 = new IntPointer(array); // This will allocate and copy 400 MB...
IntPointer a2 = new IntPointer(array); // This will allocate and copy 400 MB...
...

In JCuda, I deliberately tried to allow a "Pointer to an int[] array" as a "shallow object", meaning that it does not do any copies or allocations. One could call this a "more natural" integration of Java arrays (despite all the difficulties that come along with that - garbage collection, relocation...). If the creation of an IntPointer implied an allocation+copy, then one would have to be very careful to avoid patterns like the one above. (And still, even if only one copy is created, it may get in the way of people who deal with ""Big Data®""...). It could probably still be possible to handle this in a thin translation layer, but it may require some care to do it right.
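(To condense the two semantics into one sketch, assuming this reading of the JavaCPP constructor is correct:)

// jcuda.Pointer vs. org.bytedeco.javacpp.IntPointer
int array[] = new int[1000];
Pointer shallow = Pointer.to(array);       // JCuda: shallow view of the Java array, no copy
IntPointer copied = new IntPointer(array); // JavaCPP: allocates native memory and copies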

@mcimadamore

I only occasionally read a few mails from the Panama mailing list, and noticed that the discussion is sometimes ... ehrm... a bit heated ;-) and this point seems to be very controversial. But I cannot say anything technically profound here, unless I invest some time to update and get an overview of the latest state. So ... the following does not make sense (I know that), and may sound stupid, but to roughly convey my line of thought: Could it be that, one day, Panama and JavaCPP work together? E.g. that Panama can generate JavaCPP presets, or JavaCPP presets can be used in Panama? I think that one tool addressing a certain layer, or having a narrower focus than another, does not mean that the tools cannot complement each other...

Hi, I'm Maurizio and I work on Panama - I think what you suggest is not at all stupid/naive. The new Panama APIs (memory access + foreign linker) provide foundational layers to allow low-level memory access and foreign calls. This is typically enough to bypass what currently needs to be done in JNI/Unsafe - meaning that, at least for interfacing with plain C libraries, no JNI glue code/shared libraries should be required. It is totally feasible, at least on paper, to tweak JavaCPP to emit Panama-oriented bindings instead of JNI-oriented ones (even as an optional mode). While this hasn't happened yet, I don't think there's a fundamental reason as to why it cannot happen. I know of some frameworks (Netty and Lucene, to name a few) that have started experimenting a bit with the Panama API, to replace their current usages of JNI/Unsafe, so it is possible. Of course, since we're still at an incubating stage, there might be some hiccups (e.g. some API points might need tweaking, and/or performance numbers might not be there in all cases) - but we're generally trying to improve things and have managed to do so over the last year.

@jcuda

jcuda commented Jan 5, 2021

@mcimadamore We talked a bit via mail, and I gave jextract a try in https://mail.openjdk.java.net/pipermail/panama-dev/2019-February/004443.html , but that was quite a while ago, a lot has happened in the meantime, and I'm not really up to date.

(There's something paradoxical about the situation: I spend spare time on JCuda instead of Panama, while the latter could help me spend less time on JCuda... :-/ )

While this hasn't happened yet, I don't think there's a fundamental reason as to why it cannot happen.

From a bird's-eye perspective (and, not being deeply familiar with JavaCPP, I don't have another perspective... yet), my thought was that it might eventually be possible to replace Generator.java with something that emits Panama bindings. The Generator class might benefit from a refactoring, though. Right now, the pattern is that code is emitted based on certain conditions:

...
if (!functions.isEmpty() || !virtualFunctions.isEmpty()) {
    /* write lots of code */
}
for (Class c : jclasses) {  
    /* write lots of code */
}
for (Class c : deallocators) { ... }
if (declareEnums)  { ... }

In fact, there are some similarities to my code generation project. I tried to establish "sensible defaults", but still make it possible to plug in CodeWriter instances at each and every level, based on certain conditions ... roughly like that...

// Define how pointer declarations are written
functionDeclarationWriter.getDeclarationWriter().prepend(
    ParameterPredicates.parameterHasType(TypePredicates.isPointer()),
    new WriterForAllPointers());

// Define the code for initializing a certain parameter...
functionDeclarationWriter.getInitNativeWriter().prepend(
    ParameterPredicates.parameterMatches("methodRegEx*", "parameterName"),
    new SpecialInitializationWriterForThisParameter());

This may be over-engineering, but conceptually, it's the attempt to abstract what's currently done in the Generator. In general, breaking the 4200-LOC-monolith Generator into a handful of XyzGenerator-classes (and ... ~2000 lines that are replaced with something like out.print(templateCodeFromFile("adaptersTemplate.c")) ...) could allow dedicated emitters, or to "address different backends", so to speak.

But again, that's just brainstorming. I know that it's never as easy as it looks on this level...

@saudet
Member

saudet commented Jan 6, 2021

Originally, my goal of JCuda was also to address two layers:

  1. The 1:1 low-level JNI bindings. Just offering what is there, exactly as it is, regardless of whether it makes sense for Java or not. (This includes obvious things, like process(int *array, int length) that should be process(int array[]) in Java, but also many others)

  2. A somewhat object-oriented, idiomatic, easier-to-use API on top of that

I didn't really tackle the latter. It would be easy to offer some abstraction layer that covers 99% of all use cases (copy memory, run kernel - that's it). But designing, maintaining and extending this properly could be a full-time job.

Well, not necessarily. None of the current contributors of TensorFlow for Java are being paid to work on it full time, and it seems to be working out alright. I think what's important is figuring out ways to engage multiple people in a project, and then have it grow that way. I was under the impression that you were already spending most of your time on 2, but if not, indeed, maybe JavaCPP could pick up 1 and then you can move on to 2 for most of the time you can spend on this.

The direct JNI bindings had been manageable... until recently. I have some parsing- and code generation infrastructure (which, in turn, is far away from being publishable). But the general approach of memory/Pointer handling hit some limits with the recent CUDA API extensions.

I talked with some of the Panama guys a while ago. Part of this discussion was also about ~"the right level of abstraction". I'm generally advocating for defining clear, narrow tasks. Creating a tool that does one thing, and does it right. Or, as indicated by the two steps mentioned above: defining a powerful (versatile), stable (!) API, and building the convenience layer based on that.

I didn't manage to follow the discussion on the Panama mailing list in all detail. But I can roughly imagine the difficulties that come with designing something that is supposed to be used for literally everything (i.e. each and every C++ library that somebody might write), and doing this in a form that is stable and reliable.

(And by the way: I highly appreciate the fact that Oracle puts much emphasis on long-term stability and support. Today, I can take a Java file that was written for a 32bit Linux with Java 1.2 in 1999, and drag-and-drop it into my IDE on Win10 with Java 8, and it just works. Period. No updates. No incompatibility. No hassle. No problems whatsoever. Maybe one only learns to appreciate that after being confronted with the daunting task of updating some crappy JS "web-application" from Angular 4.0.1.23 to 4.0.1.23b and noticing that this may imply a re-write. Stability and reliability are important)

Oh sure, I understand very well the benefits. I guess my beef is more with the Java community that hasn't been creating and experimenting with tools for native libraries, which leaves us with very little experimental results for projects like Panama to pick and choose from.

I only occasionally read a few mails from the Panama mailing list, and noticed that the discussion is sometimes ... ehrm... a bit heated ;-) and this point seems to be very controversial. But I cannot say anything technically profound here, unless I invest some time to update and get an overview of the latest state. So ... the following does not make sense (I know that), and may sound stupid, but to roughly convey my line of thought: Could it be that, one day, Panama and JavaCPP work together? E.g. that Panama can generate JavaCPP presets, or JavaCPP presets can be used in Panama? I think that one tool addressing a certain layer, or having a narrower focus than another, does not mean that the tools cannot complement each other...

Yes, as @mcimadamore points out, that's pretty much how it's shaping up to be. These days, I consider Panama to be the "new JNI", which approaches things in a different way, but I'm not entirely convinced it's going to be substantially more usable than JNI and sun.misc.Unsafe. In theory, it should have less overhead than JNI, which would make it worth using just for that reason, but it's not currently the case. Also, it's not going to be available in Android for the foreseeable future, so we'll see.

An aside:

People who really wish to use the crappy parts of the CUDA API will be able to, while you can concentrate on offering some subset of it that makes sense to most Java users.

I'd really like to do that, for some parts of the CUDA API. It lends itself to an Object-Oriented layer quite naturally.

Kernel kernel = Platform.compile("kernel.cu");
Device device = Platforms.get(0).getSomeDevice();
Memory input = device.receive(someArray);
Memory output = device.allocate(n);
device.execute(kernel, input, output);
...

And even more so for the new "Graph Execution" part of the API that my rant was about (I'm a fan of flow-based programming - that's why I created https://github.com/javagl/Flow , and having "CUDA modules" there would be neat...). But the point is: Nobody wants to use these parts of the CUDA API. People think that they have to use it, for profit, and will use it. They will hate it, but they will use it. And NVIDIA knows that, so they obviously don't give the slightest ... ... care... about many principles of API design.

Yeah, whatever. What I'm trying to do with JavaCPP is to be able to at least expose even these APIs to Java so that they are at least as (un)usable as from C/C++, and it's been working out better than I thought it could at first.

So, please do consider rebasing JCuda and JOCL on JavaCPP.
+
In any case, if you still feel strongly against using a tool like JavaCPP, please let me know why!

I don't feel strongly against using a tool like JavaCPP, and as already mentioned elsewhere: if JavaCPP had been available 10 years ago, I probably wouldn't have spent countless hours on JCuda (including the parsing and code generation infrastructure). I have to admit that I haven't set up the actual JavaCPP toolchain, for the actual creation of code, because I'd have to allocate some time for https://github.com/bytedeco/javacpp-presets/wiki/Building-on-Windows , but it would certainly be (or have been) less effort in the long run...

It's not that hard! :) JavaCPP itself just needs a C++ compiler, like this: https://github.com/bytedeco/javacpp#getting-started

Regarding rebasing JCuda on JavaCPP: I think we already talked about that, quickly, in the forum. It might be possible to do that to some extent. But I have some doubts. Very roughly speaking:

  • It may not be worth the effort. Introducing a layer that translates a call like JCuda.cudaMalloc(jcudaPointer, 4); into the appropriate cudart.cudaMalloc(javacppPointer, 4); may be justified by trying to change the basis of JCuda while maintaining some backward compatibility at the same time. But essentially, they are both only 1:1 JNI bindings of the CUDA API, so the differences at the surface level may not warrant the effort of adding such a translation layer for a few thousand functions. (Also, JCuda is a spare-time project. What's in it for me there? If someone has some spare $$.$$$,$$, I'll do this, no problem...)
  • It may have a slight disadvantage in terms of performance
  • Such a translation may not be possible in all cases

The last one refers to one point that I'm not sure about in JavaCPP. To my understanding, when creating an IntPointer from an int[] array, as in https://github.com/bytedeco/sample-projects/blob/master/cuda-vector-add-driverapi/src/main/java/VectorAddDrv.java#L44 , the memory will also (immediately) be allocated and filled on the native side. If this is true, then imagine code like this:

int array[] = new int[100000000]; // 100 million ints - ~400 MB
IntPointer a0 = new IntPointer(array); // This will allocate and copy 400 MB...
IntPointer a1 = new IntPointer(array); // This will allocate and copy 400 MB...
IntPointer a2 = new IntPointer(array); // This will allocate and copy 400 MB...
...

In JCuda, I deliberately tried to allow a "Pointer to an int[] array" as a "shallow object", meaning that it does not do any copies or allocations. One could call this a "more natural" integration of Java arrays (despite all the difficulties that come along with that - garbage collection, relocation...). If the creation of an IntPointer implied an allocation+copy, then one would have to be very careful to avoid patterns like the one above. (And still, even if only one copy is created, it may get in the way of people who deal with ""Big Data®""...). It could probably still be possible to handle this in a thin translation layer, but it may require some care to do it right.

JavaCPP also supports arrays. We can have overloads like this:

native void someFunction(IntPointer array, int size);
native void someFunction(int[] array, int size);

It doesn't try to interpret that "size" because that leads to issues like the ones you've noticed, where it's not always possible to map things mechanically. However, it's possible to layer additional overloads on top of that, like this:

void someFunction(int[] array) { someFunction(array, array.length); }

cuMemcpyHtoD() doesn't take an int* though; it takes a void*, and there is no void[] in Java, so that's why JavaCPP doesn't do anything by default there, but we could have it generate something like this:

native int cuMemcpyHtoD(long dstDevice, byte[] srcHost, long ByteCount);
native int cuMemcpyHtoD(long dstDevice, short[] srcHost, long ByteCount);
native int cuMemcpyHtoD(long dstDevice, int[] srcHost, long ByteCount);
...
int cuMemcpyHtoD(long dstDevice, byte[] srcHost) { return cuMemcpyHtoD(dstDevice, srcHost, srcHost.length); }
...

I suppose that's the kind of thing we could do to make it more like JCuda. Anything else? FWIW, Java arrays are limited to 2^31 elements, so that's why I don't feel it's worth spending too much time supporting all the corner cases. For "big data" applications, the data is in native memory anyway. It's never going to be in Java arrays.

This may be over-engineering, but conceptually, it's the attempt to abstract what's currently done in the Generator. In general, breaking the 4200-LOC-monolith Generator into a handful of XyzGenerator-classes (and ... ~2000 lines that are replaced with something like out.print(templateCodeFromFile("adaptersTemplate.c")) ...) could allow dedicated emitters, or to "address different backends", so to speak.

But again, that's just brainstorming. I know that it's never as easy as it looks on this level...

Yup, those are all things that could be worked on for JavaCPP 2.0, along with things like using Clang to parse header files, see bytedeco/javacpp#51. (Clang is pretty big though. I'm not sure how Panama plans to justify the cost of adding that to the JDK. It would make sense if they also planned on using LLVM instead of C2 as with https://www.azul.com/products/zing/falcon-jit-compiler/, but they're not planning on doing that, so, I don't know. Panama's roadmap is still way too unclear for me. Like I said, I'm currently considering Panama to be the "new JNI" that might not bring performance improvements and may never be ported to Android...)

@jcuda

jcuda commented Jan 6, 2021

Well, not necessarily. None of the current contributors of TensorFlow for Java are being paid to work on it full time, and it seems to be working out alright. I think what's important is figuring out ways to engage multiple people in a project, and then have it grow that way.

That's true. But JCuda always has been a one-man show. There is nothing that could "grow". When there's a new function in CUDA, the JNI stuff is added, and that's it. I probably should have polished+published the code generation part. As an analogy to JavaCPP: The result is just a pile of repetitive JNI code. The process that generates this result is far more relevant.


These days, I consider Panama to be the "new JNI", which approaches things in a different way, but I'm not entirely convinced it's going to be substantially more usable than JNI and sun.misc.Unsafe.

Not being entirely up to date, I cannot say anything further about Panama. But ... let's be honest: It was good to have something like JNI, because interoperation with C libraries is crucial for all programming languages and ecosystems. And it was... "sufficient" for generating a wrapper for a function like myMainComplexComputation(float arg). But there is a reason why the JavaCPP README lists roughly thirty "libraries and ways to cope with JNI" (trying to simplify or automate things).

On a very low technical level: I wonder why this page (which is now only available via the wayback machine) was at some point removed from the JNI docs: https://web.archive.org/web/20070112113059/http://java.sun.com/docs/books/jni/html/stubs.html As far as I understand, this looks like a very generic way to handle native calls....


It's not that hard! :) JavaCPP itself just needs a C++ compiler, like this

It requires certain tools like MSYS and MinGW, and an installation procedure consisting of 20 steps. When I see something like this, I usually assume that at least 5 of these steps will ~"not work as described" (no offense, this is not specific to JavaCPP, just from my experience...) - so I'd allocate at least a weekend for something like this.


JavaCPP also supports arrays.

I'd have to take a closer look at the JNI code there. I'll just ask this now, and if you think that the answer is beyond the scope of this issue thread, just say "RTFM Code!" - but maybe you can/want to make a short statement about that:

When passing an array to something like a CUDA function, then these functions may be inherently asynchronous. There is a plethora of technical caveats, particularly for things like cuMemcpyHtoDAsync, more generally, functions that expect page-locked memory, or functions that receive structures that contain arrays, and where "a pointer to this array" is supposed to be used in the actual native function. I have seen that GetPrimitiveArrayCritical is emitted in some cases. But this apparently only refers to arrays that are method parameters, and not to the arrays that may be elements of the structures that are used as method parameters. It's hard to summarize this sensibly - but imagine a function - in pseudocode - that does something like

void processAll(int arrayOfArrays[][]) {
    for (each array x) { asynchronously(process(arrayOfArrays[x])); }
}

The point being: One passes in a structure (array) of objects that are processed asynchronously, but there is no mechanism that prevents each arrayOfArrays[x] from being moved or even garbage collected...


However, the respective IntPointer et al constructors already say that they "allocate and copy" the memory, so if someone repeatedly creates new IntPointer instances for the same array, it's their fault.

I suppose that's the kind of thing we could do to make it more like JCuda.

The goal would not necessarily be to "make it more like JCuda", but to improve things in terms of performance and usability (and we can argue about how far these goals overlap ;-)). There could be different ways of achieving this for this highly specific case. The cudaMemcpy(int*/float*/byte*...) overloads could be one option; something like IntPointer.toArrayWithoutCopyingSoThatItCanBeUsedAsVoidPointer(array) could be another. That's nothing that could/should be decided too hastily.


things like using Clang to parse header files, see bytedeco/javacpp#51.

I had seen issue 51 before, but it appears to be quiet there (2018). And admittedly, Clang and LLVM are things that I'd like to have a closer look at, but are also nothing that one could just casually get started with. For my stuff, I just used the parsing functionality from https://www.eclipse.org/cdt/ . It parses the whole C++ code, and generates an AST - that's all I needed until now. Re-implementing a full-fledged C++-parser is out of scope for any project.

@saudet
Member

saudet commented Jan 7, 2021

Well, not necessarily. None of the current contributors of TensorFlow for Java are being paid to work on it full time, and it seems to be working out alright. I think what's important is figuring out ways to engage multiple people in a project, and then have it grow that way.

That's true. But JCuda always has been a one-man show. There is nothing that could "grow". When there's a new function in CUDA, the JNI stuff is added, and that's it. I probably should have polished+published the code generation part. As an analogy to JavaCPP: The result is just a pile of repetitive JNI code. The process that generates this result is far more relevant.

JCuda is used by others, so it's not just you alone. What's mainly missing is money. You could get money to work on JCuda, for example, via the NVIDIA-funded projects at https://github.com/rapidsai. Those do not currently interoperate with JCuda, JavaCPP, or anything that gives Java developers access to CUDA functions. I think that's a big oversight on their part, but at the moment their lead engineers do not understand this seemingly simple fact! Someone like you may be able to convince them that they should make their libraries compatible with JCuda, JavaCPP, etc, at which point you may get NVIDIA engineers working for you on your projects and even get money to get things working. I know of at least @razajafri and https://www.linkedin.com/in/stevemasson/ that have tried to use JavaCPP at NVIDIA and they may be able to offer you some help, but it's going to be a hard battle to get their lead engineers to hear you. It's not an impossible task though.

These days, I consider Panama to be the "new JNI", which approaches things in a different way, but I'm not entirely convinced it's going to be substantially more usable than JNI and sun.misc.Unsafe.

Not being entirely up to date, I cannot say anything further about Panama. But ... let's be honest: It was good to have something like JNI, because interoperation with C libraries is crucial for all programming languages and ecosystems. And it was... "sufficient" for generating a wrapper for a function like myMainComplexComputation(float arg). But there is a reason why the JavaCPP README lists roughly thirty "libraries and ways to cope with JNI" (trying to simplify or automate things).

On a very low technical level: I wonder why this page (which is now only available via the wayback machine) was at some point removed from the JNI docs: https://web.archive.org/web/20070112113059/http://java.sun.com/docs/books/jni/html/stubs.html As far as I understand, this looks like a very generic way to handle native calls....

Probably just copyright issues with the publisher or something? After all, it's an old book...

It's not that hard! :) JavaCPP itself just needs a C++ compiler, like this

It requires certain tools like MSYS and MinGW, and an installation procedure consisting of 20 steps. When I see something like this, I usually assume that at least 5 of these steps will ~"not work as described" (no offense, this is not specific to JavaCPP, just from my experience...) - so I'd allocate at least a weekend for something like this.

The only reason it needs MSYS2 is to run the Bash scripts. Now that Windows has WSL, we could port all that to WSL. That's something else you could work on. :) Please put the blame where it belongs. Until recently Microsoft was very unfriendly to anything that wasn't 100% Microsoft, including Linux and Java. The community had no choice but to come up with hacks like MSYS2. These days, the smoother developer experience is on Linux.

JavaCPP also supports arrays.

I'd have to take a closer look at the JNI code there. I'll just ask this now, and if you think that the answer is beyond the scope of this issue thread, just say "RTFM Code!" - but maybe you can/want to make a short statement about that:

When passing an array to something like a CUDA function, then these functions may be inherently asynchronous. There is a plethora of technical caveats, particularly for things like cuMemcpyHtoDAsync, more generally, functions that expect page-locked memory, or functions that receive structures that contain arrays, and where "a pointer to this array" is supposed to be used in the actual native function. I have seen that GetPrimitiveArrayCritical is emitted in some cases. But this apparently only refers to arrays that are method parameters, and not to the arrays that may be elements of the structures that are used as method parameters. It's hard to summarize this sensibly - but imagine a function - in pseudocode - that does something like

void processAll(int arrayOfArrays[][]) {
    for (each array x) { asynchronously(process(arrayOfArrays[x])); }
}

The point being: One passes in a structure (array) of objects that are processed asynchronously, but there is no mechanism that prevents each arrayOfArrays[x] from being moved or even garbage collected...

However, the respective IntPointer et al constructors already say that they "allocate and copy" the memory, so if someone repeatedly creates new IntPointer instances for the same array, it's their fault.

Like I said, the data is likely in native memory anyway. It's not worth spending all that time trying to support Java arrays. Panama does not support Java arrays, at all, full stop, period. It's a dead end, let it go: Use native (aka off-heap) memory and forget about Java arrays.
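For illustration, the off-heap style with JavaCPP's IntPointer then looks like this (a small sketch; the size is arbitrary):

// Allocate ~400 MB natively; no int[] and no copy involved.
IntPointer data = new IntPointer(100_000_000L);
for (long i = 0; i < 10; i++) {
    data.put(i, (int) i); // element-wise access into native memory
}
// ... pass "data" directly to native functions; no Java array pinning needed ...
data.deallocate(); // release the native memory deterministically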

I suppose that's the kind of thing we could do to make it more like JCuda.

The goal would not necessarily be to "make it more like JCuda", but to improve things in terms of performance and usability (and we can argue about how far these goals overlap ;-)). There could be different ways of achieving this for this highly specific case. The cudaMemcpy(int*/float*/byte*...) overloads could be one option; something like IntPointer.toArrayWithoutCopyingSoThatItCanBeUsedAsVoidPointer(array) could be another. That's nothing that could/should be decided too hastily.

things like using Clang to parse header files, see bytedeco/javacpp#51.

I had seen issue 51 before, but it appears to be quiet there (2018). And admittedly, Clang and LLVM are things that I'd like to have a closer look at, but are also nothing that one could just casually get started with. For my stuff, I just used the parsing functionality from https://www.eclipse.org/cdt/ . It parses the whole C++ code, and generates an AST - that's all I needed until now. Re-implementing a full-fledged C++-parser is out of scope for any project.

JavaCPP isn't parsing the whole of C++, only the bits needed to parse most header files, but it's quite ad hoc. We talked about that before. Anyway, the only free usable C++ parser that is being actively maintained these days is Clang, so... It could make sense as part of an external library like JavaCPP, or even GraalVM, which is using LLVM as a compiler backend called via JavaCPP:
https://github.com/oracle/graal/search?q=javacpp
But I still don't see how/where Panama expects to fit that in the JDK, and they don't know either...

@junlarsen
Member

JavaCPP isn't parsing the whole of C++, only the bits needed to parse most header files, but it's quite ad hoc. We talked about that before. Anyway, the only free usable C++ parser that is being actively maintained these days is Clang, so... It could make sense as part of an external library like JavaCPP, or even GraalVM, which is using LLVM as a compiler backend called via JavaCPP:
https://github.com/oracle/graal/search?q=javacpp
But I still don't see how/where Panama expects to fit that in the JDK, and they don't know either...

I've been messing around with the generated clang bindings, experimenting with a code generator and so on. If we decide to go with clang or similar for parsing, I think we're going to have to do a bit of manual text replacement in the header files if we want to keep full compatibility with the current setups we have (not sure how we do line patterns, for example). Perhaps we could have the clang backend as an optional generator while keeping the current one?

I'm not very experienced with the libclang C API as I've only done minor experiments with it but their documentation mentions that the C API doesn't really provide that much information and that their C++ API has a lot more data available with more in-depth AST traversal. If we end up struggling with the C API, perhaps we could write a tiny C++ wrapper and generate bindings for that with JavaCPP?

I would be happy to help with work regarding a Clang based parser and/or generator. Our initial goal doesn't have to be to replace the existing parser or generator. Let me know if this is something of interest.

@saudet
Member

saudet commented Jan 7, 2021

I've been messing around with the generated clang bindings, experimenting with a code generator and so on. If we decide to go with clang or similar for parsing, I think we're going to have to do a bit of manual text replacement in the header files if we want to keep full compatibility with the current setups we have (not sure how we do line patterns, for example). Perhaps we could have the clang backend as an optional generator while keeping the current one?

Well, it'd probably be an incompatible move, towards JavaCPP 2.0. @wmeddie made me realize the InfoMap could be "upgraded" to something like a RuleMap where the values that we give it are closures, so that we don't have to do things like patterns or solidify too much in advance the code that we're supposed to generate for each identifier. Sort of like how Gradle works actually.

I'm not very experienced with the libclang C API as I've only done minor experiments with it but their documentation mentions that the C API doesn't really provide that much information and that their C++ API has a lot more data available with more in-depth AST traversal. If we end up struggling with the C API, perhaps we could write a tiny C++ wrapper and generate bindings for that with JavaCPP?

Yes, I think that would be the best approach. We can easily "extend" the C API with new functions like you and @yukoba have already done here:
https://github.com/bytedeco/javacpp-presets/tree/master/llvm/src/main/resources/org/bytedeco/llvm/include
(And then when we're satisfied with how they work, we can also try to send pull requests upstream...)

I would be happy to help with work regarding a Clang based parser and/or generator. Our initial goal doesn't have to be to replace the existing parser or generator. Let me know if this is something of interest.

It's something of interest for sure, but it is something that would probably take even a good engineer like you maybe half a year to complete! So, for the moment, it's probably going to stay on the back burner for a while still... You could still look into all that and get bits and pieces done here and there, that'd be great, I just don't want to set unrealistic goals here. I also would like to see if Panama ends up being actually useful for this purpose :) You may want to try to work with them on that, but OpenJDK has their own ideas about how to do things, and they generally do things their way regardless of the feedback they receive (unless it comes with a lot of money attached obviously)...

@saudet
Copy link
Member

saudet commented Feb 26, 2021

All the API for OpenCL 2.0 should be there, yes. If there is anything missing I'll fix it when we find it! CLBlast doesn't look too hard to support either, but first things first, let me know if there is anything missing from the presets for OpenCL itself.

HIP and friends, well, there's a whole lot of minor APIs like that. I don't have the time to do everything by myself either! I was hoping oneAPI would take care of abstracting everything, but that's obviously not going to happen. :( Something's bound to show up at some point though. I'd wait and see what happens over the next few months, and if there's still nothing available... It feels to me that something like TVM might very well start to become useful as a general computational framework.

It already has backends for CUDA, OpenCL, Vulkan, Metal, ROCm, DSPs, FPGAs, etc., and it's working pretty well, even from Java.

BTW, here's a good step in the right direction for the generalization of a framework like TVM:
https://medium.com/octoml/compiling-classical-ml-for-up-to-30x-performance-gains-and-hardware-portability-2aef760af694

@saudet
Copy link
Member

saudet commented Apr 23, 2021

@jcuda BTW, if your reluctance to use off-heap memory comes from a belief that it is slower to access from Java, you do not need to worry about that. It is just as fast to access off-heap memory as it is to access the memory of Java arrays, and the API can still be pretty enough. See, for example, what it looks like with "indexers" here: http://bytedeco.org/news/2014/12/23/third-release/
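
For example, something along these lines (a minimal sketch; I'm assuming the FloatIndexer.create(FloatPointer, long...) factory from current JavaCPP releases):

import org.bytedeco.javacpp.FloatPointer;
import org.bytedeco.javacpp.indexer.FloatIndexer;

// Sketch: a 2x3 matrix in off-heap memory, accessed like a 2D array
FloatPointer pointer = new FloatPointer(2 * 3);
FloatIndexer idx = FloatIndexer.create(pointer, 2, 3);
idx.put(1, 2, 42.0f);               // idx(row, col) = value
System.out.println(idx.get(1, 2));  // prints 42.0
pointer.deallocate();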

@jcuda
Copy link

jcuda commented Apr 24, 2021

There's not really a "reluctance" to use off-heap memory. In some way, quite the contrary: If I had to ~"design a raw-data API from scratch", I'd almost certainly design it exclusively around things like FloatBuffer. The reason is simple: Considering a method signature like

public static void processChunk(float[] array, int min, int max) { ... }

there is hardly a reason to actually use a float[] array. Designing the same as

public static void processChunk(FloatBuffer buffer) { ... }

is far more flexible in every conceivable way: It also accepts direct buffers, the min/max hassle is conveniently "hidden" in the position/limit of the buffer, and if someone has a plain float[] array, then it's possible to either just do a FloatBuffer.wrap for the call, or (depending on the goal) offer a trivial convenience wrapper like

public static void processChunk(float[] array, int min, int max) {
    // delegate to the buffer-based version
    processChunk(FloatBuffer.wrap(array).limit(max).position(min));
}

The API is just more powerful that way.

That being said:

My primary motivation for supporting plain array[] objects (specifically, to not require direct buffers) was a different one: Back then, when I started JCuda, one had to assume that most compute-heavy Java libraries operated on float[] arrays. So if someone had code like

float compute() {
    float[] huuuugeAmountOfData = create();
    float result = doComplexProcessingOn(huuuugeAmountOfData);
    return result;
}

and wanted to offload the "complex processing" to the GPU, this should be possible without the additional overhead of having to create a direct buffer and copy the data first. The memory transfer is already the bottleneck in most cases anyhow...
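
With JCuda, that kind of offload looks roughly like this (a minimal sketch of the runtime API, error checking omitted):

import jcuda.Pointer;
import jcuda.Sizeof;
import static jcuda.runtime.JCuda.*;
import static jcuda.runtime.cudaMemcpyKind.*;

// Sketch: a plain float[] goes straight to the device, no direct buffer needed
float[] data = new float[1_000_000];
Pointer devPtr = new Pointer();
cudaMalloc(devPtr, (long) data.length * Sizeof.FLOAT);
cudaMemcpy(devPtr, Pointer.to(data), (long) data.length * Sizeof.FLOAT,
        cudaMemcpyHostToDevice);
// ... do the complex processing on devPtr, copy the result back ...
cudaFree(devPtr);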

Things have changed since then. If I had to start over, I'd probably be "reluctant" to do all the contortions that are necessary to support non-direct data...


Edit: That "indexers" link looks interesting, I'll try to allocate some time to look into that. It might be related to (or a "better engineered, goal-oriented" version of) what I did with https://github.com/javagl/ND a few years ago...

@saudet
Copy link
Member

saudet commented May 1, 2021

Things have changed since then. If I had to start over, I'd probably be "reluctant" to do all the contortions that are necessary to support non-direct data...

I see, makes sense. BTW, Smile is also pretty "old" w.r.t. that aspect of using float[], double[], and friends as part of the API, but the main author did switch to using JavaCPP over the previously available alternative for accessing the necessary native libraries (mainly netlib-java). It might be worth investigating what he ended up doing there. It looks like, for the kind of BLAS-like interface of these libraries, JavaCPP is able to generate everything with float[]/FloatBuffer and double[]/DoubleBuffer. In any case, I'm not seeing any usage of FloatPointer or DoublePointer...
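
For example, with the OpenBLAS preset the generated overloads accept plain arrays directly (a small sketch; it assumes the org.bytedeco.openblas artifact is on the classpath):

import static org.bytedeco.openblas.global.openblas.cblas_saxpy;

// Sketch: y = alpha * x + y on plain Java float[] arrays
float[] x = {1f, 2f, 3f};
float[] y = {4f, 5f, 6f};
cblas_saxpy(3, 2.0f, x, 1, y, 1);   // y is now {6, 9, 12}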

Edit: That "indexers" link looks interesting, I'll try to allocate some time to look into that. It might be related to (or a "better engineered, goal-oriented" version of) what I did with https://github.com/javagl/ND a few years ago...

I haven't been trying to create something like NumPy, but merely what is needed to offer in Java what we already have in C/C++ for accessing multidimensional arrays. Not having any language feature for that creates issues when trying to use C/C++ APIs from Java. As for something like NumPy in Java, the C++ API of PyTorch maps reasonably well to Java, it's pretty cool, and it works with GPUs too, check it out: https://github.com/bytedeco/javacpp-presets/tree/master/pytorch I hope more than a handful of people find this interesting enough so that we can develop a high-level API on top of it.

@jackyh
Copy link
Contributor

jackyh commented Oct 13, 2021

I'm doing something for DL inference deployment. Since most data processing and website frameworks are still based on Java, these systems/pipelines nowadays need a DL inference phase (such as CV/NLP-based classification). There are two ways to do this: 1) cloud native (K8s clusters, where lots of pods are Java-based services and some pods are Python/C++-based to do inference, talking to each other over RPC; tools like Flask look quite popular here as Python-based inference services). This is the typical way when a company has lots of DL models to serve, so it dedicates some pods to an inference service. 2) Just use Java APIs directly, such as the TensorFlow Java API.
Most small companies, and companies with just a few DL models to serve, tend to use 2).
I think the JavaCPP presets are a fast way to generate (JNI-based) Java APIs for DL apps. But if we want companies to use DL apps built on JavaCPP, we need to demonstrate a lot of things:

  1. lots of assets/samples
  2. most of these companies are not familiar with GPUs/CUDA
  3. show them that their inference service can be accelerated
  4. the perf should be better than a "Flask-based Python service"

@agibsonccc
Copy link

agibsonccc commented Oct 13, 2021

@jackyh what you're doing around Triton is great, but it's very vendor-specific. Not all DL workloads need GPUs, as they are very expensive. Many of the companies betting on Java would generally prefer to start with stock CPUs first.

Generally though, they do help in quite a few cases, especially for training. As for JavaCPP-based tooling, a lot of tools are built on top of it, including TF Java and our very own DL4J, as well as our tensor library ND4J.

I view JavaCPP as a tool to build tools, very similar to Kubernetes. It's too low-level for most developers to be touching directly. It's still native code and exposes very raw interfaces. I think of using JavaCPP directly as like programming in C++ directly. The flexibility is amazing, and what it does for tool developers is also really nice.
However, wrappers and frameworks like TF Java or DL4J/ND4J are needed so developers don't accidentally misuse something.

@jackyh
Copy link
Contributor

jackyh commented Oct 14, 2021

@jackyh what you're doing around Triton is great, but it's very vendor-specific. Not all DL workloads need GPUs, as they are very expensive. Many of the companies betting on Java would generally prefer to start with stock CPUs first.

We've done lots of on-site surveys of these "small-to-mid-size" e-commerce companies. 90% of them still use CPUs for inference; some are using the TF Java API, some are using other tools (a pipe from Python to Java, the ONNX Runtime Java binding, etc.), and some are using Flask microservices. The reasons are (listed by importance):

  1. Their QPS typically is not very high; in most cases, a CPU is OK
  2. They are not familiar with GPUs and don't know CUDA programming
  3. GPUs are expensive compared with CPUs

Generally though, they do help in quite a few cases, especially for training. As for JavaCPP-based tooling, a lot of tools are built on top of it, including TF Java and our very own DL4J, as well as our tensor library ND4J.

Here I don't agree on the "training" topic, for these reasons:

  1. Training depends more on the ecosystem than inference does: not only the CUDA wrappers, but also lots of data processing tools. It's more of an ecosystem topic than a computing topic, and since Python already dominates that market, it's far too expensive for Java to invest in it.
  2. Why I still hold that, for inference, there's a market for GPUs:
    a. These small-to-mid-size companies have a QPS peak every year, such as "Black Friday", that CPUs cannot deal with.
    b. Even if the QPS is not that high, once people know that a GPU can accelerate CNN models by even 50x on large batches of data, some of them will choose GPUs for "real-time" response.

I view JavaCPP as a tool to build tools, very similar to Kubernetes. It's too low-level for most developers to be touching directly. It's still native code and exposes very raw interfaces. I think of using JavaCPP directly as like programming in C++ directly. The flexibility is amazing, and what it does for tool developers is also really nice. However, wrappers and frameworks like TF Java or DL4J/ND4J are needed so developers don't accidentally misuse something.

Agreed, this is an ecosystem topic. We need more Java developers to contribute high-level Java app demos. JavaCPP is just the wrapper; it's essential, but not enough.

@Craigacp
Copy link

@jackyh The ONNX Runtime Java API works just fine on GPUs as well as CPUs. The latest release supports TensorRT as well as CUDA on GPU, and OpenVINO & DNNL for CPU alongside the standard CPU backend. If there are CPU or GPU throughput problems with the Java API open an issue on github.com/microsoft/onnxruntime and we'll fix them.
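
For anyone unfamiliar with it, basic usage looks roughly like this (a sketch; the model path and input name are placeholders, and the calls can throw OrtException):

import ai.onnxruntime.*;
import java.nio.FloatBuffer;
import java.util.Map;

// Sketch: score a model on CPU or GPU with the ORT Java API
OrtEnvironment env = OrtEnvironment.getEnvironment();
OrtSession.SessionOptions opts = new OrtSession.SessionOptions();
opts.addCUDA(0);  // omit this line to stay on the default CPU backend
try (OrtSession session = env.createSession("model.onnx", opts);
     OnnxTensor input = OnnxTensor.createTensor(env,
             FloatBuffer.allocate(1 * 3 * 224 * 224), new long[]{1, 3, 224, 224});
     OrtSession.Result result = session.run(Map.of("input", input))) {
    System.out.println(result.get(0).getValue());
}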

Similarly TF SIG-JVM does care about inference throughput, though it's not had as much attention as ORT has, especially on GPUs, due to it being community led and the community being rather fractured.

One of my current interests for the Java ML ecosystem is building ONNX export support as that allows Java trained models to be served on the major cloud providers using the services they provide for autoscaling. I've talked to the ONNX steering committee about improving ONNX's support for other languages, and we've built ONNX export support into my group's Java ML library.

@jackyh
Copy link
Contributor

jackyh commented Oct 14, 2021

@jackyh The ONNX Runtime Java API works just fine on GPUs as well as CPUs. The latest release supports TensorRT as well as CUDA on GPU, and OpenVINO & DNNL for CPU alongside the standard CPU backend. If there are CPU or GPU throughput problems with the Java API open an issue on github.com/microsoft/onnxruntime and we'll fix them.

Similarly TF SIG-JVM does care about inference throughput, though it's not had as much attention as ORT has, especially on GPUs, due to it being community led and the community being rather fractured.

One of my current interests for the Java ML ecosystem is building ONNX export support as that allows Java trained models to be served on the major cloud providers using the services they provide for autoscaling. I've talked to the ONNX steering committee about improving ONNX's support for other languages, and we've built ONNX export support into my group's Java ML library.

ehh... personally, I rarely hear of people using Java for training. I will try to reach out to more customers here.

@agibsonccc
Copy link

@Craigacp how is the op coverage? We're working on import for uptraining models, supporting the Keras H5 format, ONNX, as well as TF, and have made some fairly good strides there.

@jackyh our audience tends to have folks who want uptraining for models. I can confirm what @Craigacp is saying here. There's an audience here, especially for pretrained models. It's not as big as Python's though, for sure.

What we've found in practice is that vendor-specific efforts are generally better than neutral ones.

In our version of Triton (https://github.com/KonduitAI/konduit-serving), we (using JavaCPP as well as our DL4J stack) tend to implement pipeline steps as a high-level abstraction over the various JavaCPP presets, giving one abstraction that allows for the kind of higher performance you're aiming for. A big focus is on direct in-memory interop by passing pointers around via JavaCPP.

This, plus a big focus on GraalVM integration for easily deployed binaries, is allowing us to essentially compile models into whole pipelines while selectively adding support for different vendors (whatever might be superior in performance for a particular use case).

@jackyh
Copy link
Contributor

jackyh commented Oct 14, 2021

@Craigacp @agibsonccc
The Java ecosystem for DL is not so big compared to Python and C++, I guess maybe 1/10 the size. But it looks like our efforts are still fragmented even then. :) :)

@Craigacp
Copy link

@Craigacp how is the op coverage? We're working on import for uptraining models, supporting the Keras H5 format, ONNX, as well as TF, and have made some fairly good strides there.

@jackyh our audience tends to have folks who want uptraining for models. I can confirm what @Craigacp is saying here. There's an audience here, especially for pretrained models. It's not as big as Python's though, for sure.

Our ONNX export op coverage is pretty low but we're trying to export ML models rather than DL ones, so we're only implementing the ops we need for the models we can train (excluding TF-Java models as training is broken in TF-Java at the moment anyway). It's all public, you can see where we are up to. Longer term it might be interesting to have tf2onnx in Java, but that's a much larger effort.

I agree there is an audience for fine-tuning models in Java, and so it would be pretty useful to have that support. Though personally I find maintaining Python libraries that can pre-train large transformer models to be deeply frustrating and so I'd like to do that on the JVM too. But it's much much harder, and the market is even smaller because it's only really possible at large companies.

@Craigacp @agibsonccc The Java ecosystem for DL is not so big compared to Python and C++, I guess maybe 1/10 the size. But it looks like our efforts are still fragmented even then. :) :)

I agree the DL ecosystem isn't as big, but DL is a subset of ML, and there is lower hanging fruit in ML. It would be nice to prevent people from having to use Python & scikit-learn to train logistic regressions, or tree ensembles, as those are in my experience more prevalent in terms of solving business problems.

@agibsonccc
Copy link

@jackyh I don't think either of us disputes that Python is the incumbent.

It's just nice to have alternatives out there in the ecosystem. Java is still table stakes for many deployment use cases. Strong interop with the Python ecosystem allows for easier deployment of models and also enables different use cases, like desktop tools written in Java (of which there are quite a few), as well as things like GraalVM. With JavaCPP, you actually have a nice packaging mechanism as an alternative for running native deps, while still having access to an easier-to-use programming language that's more performant than Python.

I can also confirm @Craigacp is on to something with transformers. A significant number of our users deploy big NLP models and would like to see some Hugging Face-like tooling for their enterprise deployments. I don't see the harm in bootstrapping off of that ecosystem while exposing an easier interface to these tools.

@saudet
Copy link
Member

saudet commented Oct 14, 2021

I agree there is an audience for fine-tuning models in Java, and so it would be pretty useful to have that support. Though personally I find maintaining Python libraries that can pre-train large transformer models to be deeply frustrating and so I'd like to do that on the JVM too. But it's much much harder, and the market is even smaller because it's only really possible at large companies.

@Craigacp BTW, that kind of thing is now possible with the JavaCPP Presets for PyTorch, see #1075. However, the underlying C++ API barely supports that use case as it is, so it's not only about Java: it's just not the kind of thing that's being invested in for any language other than Python. Still, as @agibsonccc points out, a tool like JavaCPP makes it very easy to manage Python packages all from Java. The only thing engineers need to worry about is the language itself, so it's not that bad of a situation IMO. I often think of Python as the "Bash for AI": it's never going to be as fast as Java or C++, but it gets the job done. :)

@archenroot
Copy link
Member Author

I like this one: Python = Bash for AI :-) A fitting statement.

@jackyh
Copy link
Contributor

jackyh commented Oct 14, 2021

@Craigacp how is the op coverage? We're working on import for uptraining models, supporting the Keras H5 format, ONNX, as well as TF, and have made some fairly good strides there.
@jackyh our audience tends to have folks who want uptraining for models. I can confirm what @Craigacp is saying here. There's an audience here, especially for pretrained models. It's not as big as Python's though, for sure.

Our ONNX export op coverage is pretty low but we're trying to export ML models rather than DL ones, so we're only implementing the ops we need for the models we can train (excluding TF-Java models as training is broken in TF-Java at the moment anyway). It's all public, you can see where we are up to. Longer term it might be interesting to have tf2onnx in Java, but that's a much larger effort.

I agree there is an audience for fine-tuning models in Java, and so it would be pretty useful to have that support. Though personally I find maintaining Python libraries that can pre-train large transformer models to be deeply frustrating and so I'd like to do that on the JVM too. But it's much much harder, and the market is even smaller because it's only really possible at large companies.

@Craigacp @agibsonccc The Java ecosystem for DL is not so big compared to Python and C++, I guess maybe 1/10 the size. But it looks like our efforts are still fragmented even then. :) :)

I agree the DL ecosystem isn't as big, but DL is a subset of ML, and there is lower hanging fruit in ML. It would be nice to prevent people from having to use Python & scikit-learn to train logistic regressions, or tree ensembles, as those are in my experience more prevalent in terms of solving business problems.

For the ML part, lots of things are already covered by the RAPIDS and Spark projects.

@KafkaProServerless
Copy link

Hello,

First of all, wanted to say thank you for your project.

As of this writing, Java is at version 22, and Java 22 includes Project Panama's Foreign Function & Memory API.

With this new way of calling native code replacing JNI, it would be great to upgrade your project to use Panama for running CUDA code.
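
For reference, a downcall to the CUDA runtime with the FFM API looks roughly like this (a sketch; the library file name is platform-specific, and invokeExact can throw Throwable):

import java.lang.foreign.*;
import java.lang.invoke.MethodHandle;

// Sketch: calling cudaGetDeviceCount through the Java 22 FFM API
// (assumes libcudart.so can be found by the dynamic linker)
Linker linker = Linker.nativeLinker();
SymbolLookup cudart = SymbolLookup.libraryLookup("libcudart.so", Arena.global());
MethodHandle cudaGetDeviceCount = linker.downcallHandle(
        cudart.find("cudaGetDeviceCount").orElseThrow(),
        FunctionDescriptor.of(ValueLayout.JAVA_INT, ValueLayout.ADDRESS));
try (Arena arena = Arena.ofConfined()) {
    MemorySegment count = arena.allocate(ValueLayout.JAVA_INT);
    int err = (int) cudaGetDeviceCount.invokeExact(count);
    System.out.println("cudaError=" + err + ", devices=" + count.get(ValueLayout.JAVA_INT, 0));
}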

Thank you

@agibsonccc
Copy link

agibsonccc commented Apr 22, 2024 via email
