GPU Support #1
This is not an area I know much about, but I would love to know more about the approach. @eduardoleao052, is this something you're thinking of using a dependency for, or building from scratch specifically for matrix operations?
@medic-code After completing the TypeScript support, I'll study the best option for adding GPU support. This means the heavy matrix multiplications will be able to run on a GPU, speeding up the ML models.
If I may be so bold, I would suggest the following approach. I'm wondering if we can outdo Python if we build from the ground up with a focus on GPU integration and performance. My understanding is that Torch and Python mostly have corresponding routines (kernels) mapped on the GPU, so in theory, if we take a ground-up performance approach, I would like to think we can outperform classical Torch.

We add decorations inside the Torch-ish objects that allow a visitor object to assemble an entire compute pipeline, a kind of compute abstract syntax tree (AST) of what is to be executed that we can map data into. The pipeline needs to be processable as a whole, or chunked along I/O- and compute-friendly boundaries, so we can optimise when running with different GPU availability. If the whole pipeline will fit in a big fat GPU, then we want it all in there; GPUs are getting bigger and fatter, so let's design for that. If we need to break it up into discrete steps for a small GPU, or multiple small GPUs, then we need to be able to do that as well.
Make use of ReadableStream, WritableStream, and TransformStream for compatibility with both the browser and Node, and make everything async. We need to stream everything to minimise memory usage: we need to be able to stream data and models from storage into the GPU, and data out, without maxing out V8's limited memory.

As a kind of proof of concept, and possibly even the solution, for version 0 we take the AST for the pipeline and build a function that can be passed into https://github.com/gpujs/gpu.js and let it do the heavy lifting for now. If we need to do something else in the future, we can replace this step with something else and keep the rest of the design and implementation.
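The stream-everything idea above can be sketched with the Web Streams API that Node 18+ and browsers share. This is a minimal illustration, not anything from the repo: `batchSource` and the summing transform are hypothetical names, and the transform just stands in for a GPU kernel.

```typescript
// Sketch: stream batches from a source, through a compute stage, into a sink.
// The transform stands in for a GPU kernel; here it sums each batch on the CPU.

function batchSource(data: number[][]): ReadableStream<number[]> {
  let i = 0;
  return new ReadableStream({
    pull(controller) {
      if (i < data.length) controller.enqueue(data[i++]);
      else controller.close();
    },
  });
}

// "Kernel" stage: reduces each incoming batch to a single number.
const sumTransform = new TransformStream<number[], number>({
  transform(batch, controller) {
    controller.enqueue(batch.reduce((a, b) => a + b, 0));
  },
});

export async function run(): Promise<number[]> {
  const results: number[] = [];
  const sink = new WritableStream<number>({
    write(v) { results.push(v); },
  });
  // Backpressure is handled by the Streams machinery itself.
  await batchSource([[1, 2], [3, 4]]).pipeThrough(sumTransform).pipeTo(sink);
  return results;
}
```

Because both ends are streams, only one batch needs to be resident at a time, which is the point of the design.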
@metawrap-dev I like the spirit, and you clearly have a great understanding of hardware manipulation on pipelines. I think we could make it work.
Thanks! To give a toy example: we really want to build an object tree that represents the promise to execute a computation. That way, instead of passing each intermediate result back to JavaScript, we can hand the whole expression to the GPU at once. So we go from

```mermaid
sequenceDiagram
    participant JavaScript
    participant GPU
    JavaScript->>GPU: w * x
    GPU->>JavaScript: r1
    JavaScript->>GPU: y * z
    GPU->>JavaScript: r2
    JavaScript->>GPU: r1 + r2
    GPU->>JavaScript: a
```

to

```mermaid
sequenceDiagram
    participant JavaScript
    participant GPU
    JavaScript->>GPU: (w * x) + (y * z)
    GPU->>JavaScript: a
```
This will work efficiently when we pass a torch-ish object into another's method as a parameter; not so much if we pass in just a scalar/primitive, unless we make a special class, e.g. a Number class, that would allow a whole tree to be linked. We would, of course, make it as granular as possible (no point in building the internal AST of a matrix operation). If the library is skewed towards being able to build the largest connected pipeline tree, then we have the ability to pass the whole intention into the GPU and optimizers and have them work their magic.
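The lazy object tree described above can be sketched in a few lines. All names here (`Expr`, `num`, `resolve`) are illustrative, not from the repo; a real backend would transpile the tree into one GPU dispatch, while this sketch interprets it on the CPU to show that nothing runs until `resolve()` is called.

```typescript
// Minimal lazy expression tree: operations build nodes instead of executing,
// and a single resolve() walks the whole tree at once.

interface Expr {
  op: "const" | "mul" | "add";
  value?: number;
  left?: Expr;
  right?: Expr;
}

const num = (value: number): Expr => ({ op: "const", value });
const mul = (left: Expr, right: Expr): Expr => ({ op: "mul", left, right });
const add = (left: Expr, right: Expr): Expr => ({ op: "add", left, right });

// CPU interpreter standing in for GPU transpilation.
export function resolve(e: Expr): number {
  switch (e.op) {
    case "const": return e.value!;
    case "mul": return resolve(e.left!) * resolve(e.right!);
    case "add": return resolve(e.left!) + resolve(e.right!);
  }
}

// (w * x) + (y * z) assembled as one tree: one trip to the "GPU".
export const tree = add(mul(num(2), num(3)), mul(num(4), num(5)));
```

The key property is that `tree` is inspectable data before resolution, which is exactly what a visitor/optimizer needs.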
Cool! I could see that working. I'm currently finishing the TypeScript support for the library and will push it to main soon. Would you by any chance be willing to work on a PR, so that we can try to implement this?
Sure! Once you complete the TypeScript support, I'll get started. I do a lot of prototyping in TypeScript, so it's my happy place. In the meantime, I will look into implementing the toy example above and try out some design ideas.
These two features in https://github.com/gpujs/gpu.js/blob/develop/README.md#combining-kernels look relevant. Some notes here for things I want to investigate.
That's awesome, looks like a great start!
@metawrap-dev Merged the TS into the main branch, but there are still some mistakes; I'll solve them today.
Thanks, I will take a look. If you have any TypeScript questions, I'm happy to help there as well. I'm going to continue in a different repository until I have the major questions I outlined above answered.

Today I'm going to tackle the streaming and batching big-data question (big as in more than can fit into memory). Goal (i) will be to have a toy version of a model load from disk, run chained execution operations, and then output to disk. I want to get to the point where execution is triggered only when we want to save the result: complete lazy loading and execution, with the ability to inspect the pipeline graph.

After that, goal (ii) will be to have a combination of granularities, so some operations are handled by a prepositioned kernel and some are woven together with permutations in code; those will be handled by less granular code generation. At that point it will be possible to create a visitor class that can analyse, chop/chunk, and transpile the pipeline graph.

I can probably only get some of (i) done today given the time I have, as just coding up the testing and plumbing is going to be the majority of the work.
Cool! If you want any help at some point, just let me know. About the TypeScript: the only issues left to solve are a couple of remaining type problems.
If you can avoid it, do. If it is the case where you don't know the type yet in the flow of execution and it will be resolved later on, then using it can be justified. If it is just a case of trying to get things to match, there are usually better options. Here is my work so far.
I'm currently removing all of those.
The DataNumber class is just a start, a minimalist toy model. Yes, higher-order objects like tensors will be modelled the same way. I'm starting very small, as there is a lot of type semantics to work out to make everything composable before I move on to something more complex.
In memory, e.g. in one test case, the pipeline resolves to the expected result.
This is, in effect, loading, computing, and saving the result. We get to see the pipeline graph before it is resolved. I'm not sure if I should strictly type these or use the "sausage machine principle" and assemble the correct type from arbitrary-length and arbitrary-format input arrays. resolve() is going to be invoking any transpilation; at the moment I am just executing everything on the CPU.

Optimal batch sizes can be discovered on the fly. At the moment, they are there to force the pipeline to execute when data limits are reached, to emulate a shadow of some kind of real-world behaviour. I'm thinking of some back-pressure signalling up from each kernel and the GPU in general.

When complete, and if it can be demonstrated to be usefully performant, the goal will be to take the learnings from my repository and transfer them, in style, to yours. If it all plays out well, we should have something that could, in theory, outperform Python, or at least be within striking distance.

The next step is creating some actual disk-I/O-driven Source and Destination implementations. After that, some matrix operations: one could be implemented as a granular kernel, and the other could be built up with primitive operations. That should allow some serious testing with code generation. I'm super busy until the end of the week, but will see what I can do in between things.
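The Source/Destination batching behaviour described here can be approximated with async generators: the source yields fixed-size batches, and execution is driven entirely by the consumer pulling, which gives the lazy, limit-triggered flow for free. `memorySource` and `drain` are hypothetical names for illustration, not the repo's API.

```typescript
// Sketch: a pull-driven in-memory source that emits data in batches.
// Nothing is produced until the consumer asks for it (lazy execution).

async function* memorySource<T>(items: T[], batchSize: number): AsyncGenerator<T[]> {
  for (let i = 0; i < items.length; i += batchSize) {
    yield items.slice(i, i + batchSize);
  }
}

// A stand-in Destination: pulls batches until the source is exhausted.
export async function drain<T>(src: AsyncGenerator<T[]>): Promise<T[][]> {
  const out: T[][] = [];
  for await (const batch of src) out.push(batch);
  return out;
}

export const pipeline = () => drain(memorySource([1, 2, 3, 4, 5], 2));
```

A disk-backed Source would only differ in where `yield` gets its data, so the consumer-side semantics stay identical.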
Gathering some more thoughts for (i): I'm close to being able to serialise and deserialise the execution state. What will that get us?
I'm also considering adding an unrolling strategy as the same kind of parameter as batch size. With this, the system trying to optimise the pipeline graph will also be able to experiment with unrolling. That implies metrics are going to be needed: the optimising system observing the execution needs to be able to get metrics at each point in the pipeline and maintain some previous data points for comparison; enough to give it the ability to converge, but not so much that it overwhelms the task at hand.
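The per-stage metrics the optimiser would observe could be as simple as a wrapper that counts items and wall-clock time per call. This is a hypothetical sketch (`instrument`, `StageMetrics` are made-up names), just to show the shape of the data an optimiser could compare across batch sizes or unroll factors.

```typescript
// Sketch: wrap a pipeline stage so each invocation records throughput metrics.

interface StageMetrics {
  calls: number;     // how many batches went through
  items: number;     // total items processed
  elapsedMs: number; // cumulative wall-clock time
}

export function instrument<T, R>(
  stage: (batch: T[]) => R,
  metrics: StageMetrics,
): (batch: T[]) => R {
  return (batch) => {
    const t0 = performance.now();
    const out = stage(batch);
    metrics.calls += 1;
    metrics.items += batch.length;
    metrics.elapsedMs += performance.now() - t0;
    return out;
  };
}

export const metrics: StageMetrics = { calls: 0, items: 0, elapsedMs: 0 };
export const sumStage = instrument((b: number[]) => b.reduce((a, v) => a + v, 0), metrics);
```

An optimiser observing `items / elapsedMs` per stage would have exactly the convergence signal the paragraph describes, without the stages themselves knowing they are being watched.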
So close to 100% :) I've added tentative flow control so that asynchronous operations will block until data becomes available. It is time to add more complex objects. The goal is to make sure that execution only occurs after resolve, but there are situations where populating early is needed. The design looks like it is holding together and will do what I want. More complex objects should now shake out any bugs or design issues with the way I want data to flow and obey batch/chunk sizes. They should also allow me to test the roll/unroll strategy ideas, as I need a Compute element with at least a few layers of execution to support that.
This is really solid groundwork. I've been looking through the repo, and it's impressive stuff. I think adding more complex objects is the next big step. If you need any help, let me know!
Thanks! Yes, now that I have the flow and semantics thought out and lazy execution working, more complex objects are now needed. Just thinking some things out loud again here.

Code-Gen/Optimizer:

Mock:

This should give enough. The plan for today is just to get the new mock data elements in; probably all I have time for. I suspect the design won't survive the first data element without changes, but here goes. :)
A lot to unpack here. Great success... but... from a flow aspect, there are several ways to compute dotN, and not all of them are obviously nonsense. This is going to be the same for all operations.
This adds a lot of fragile complexity in having generalized arguments. I'm wondering if I can create IStreamify to formally transform a non-streaming version into a streaming one. Otherwise, it's obvious I am going to end up in array hell (is it a matrix, or an array of vectors, or a streaming output of vectors, etc.?). Time to dive deep into some of the latest TypeScript features.
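The IStreamify idea of mechanically lifting non-streaming code into a streaming form can be sketched with a TransformStream wrapper. `streamify` is a hypothetical name for illustration, not an existing API in the repo or in gpu.js.

```typescript
// Sketch: lift a plain per-chunk function into a TransformStream, so any
// non-streaming compute gets a streaming version without hand-written plumbing.

export function streamify<I, O>(fn: (x: I) => O): TransformStream<I, O> {
  return new TransformStream<I, O>({
    transform(chunk, controller) {
      controller.enqueue(fn(chunk));
    },
  });
}

// Usage: double every value flowing through the pipe.
export async function demo(): Promise<number[]> {
  const out: number[] = [];
  await new ReadableStream<number>({
    start(c) {
      [1, 2, 3].forEach((v) => c.enqueue(v));
      c.close();
    },
  })
    .pipeThrough(streamify((x: number) => x * 2))
    .pipeTo(new WritableStream({ write(v) { out.push(v); } }));
  return out;
}
```

Because the wrapper fixes the chunk type at the boundary, the "is it a matrix or an array of vectors" question gets answered once, in the type parameters, rather than inside every operation.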
The plan: I can probably create some generalised versions of these and, worst case, we end up with a pair per operation; the same approach applies to the others.
I tried the above. It works. I ended up with a more sensibly named version, and the new method simplifies the code:

```typescript
const streamer = new Streamer<v4, number>(
  new SourceMemory<v4>([1, 1, 1, 1], [1, 1, 1, 1], [1, 1, 1, 1], [1, 1, 1, 1]),
  new ComputeLength4<v4>())
```

This creates a pipeline graph of

```
{Streamer{SourceMemory(4 elements, atoms 4, 0 index, 1 batch size)
  <= [[1,1,1,1],[1,1,1,1],[1,1,1,1],[1,1,1,1]]}
  => {ComputeLength4{SourceMemory(0 elements, atoms 0, 0 index, 1 batch size)
  <= []}=>unresolved}}
```

which resolves to

```
{Streamer{SourceMemory(4 elements, atoms 0, 4 index, 1 batch size)
  <= [[1,1,1,1],[1,1,1,1],[1,1,1,1],[1,1,1,1]]}
  => {ComputeLength4{SourceMemory(4 elements, atoms 0, 4 index, 1 batch size)
  <= [[1,1,1,1],[1,1,1,1],[1,1,1,1],[1,1,1,1]]}=>{DataNumber
  <= 2}}}
```

The alternative original is

```typescript
const compute = new ComputeLength4<v4>(new SourceMemory<v4>([1, 1, 1, 1], [1, 1, 1, 1], [1, 1, 1, 1], [1, 1, 1, 1]))
```

where we pass in the source just like any other parameter. For multiply, that would give something like

```
(multiply{SourceMemory(3 elements, atoms 3, 0 index, 1 batch size)
  <= [{DataNumber <= 10},{DataNumber <= 10},(multiply{SourceMemory(2 elements, atoms 2, 0 index, 1 batch size)
  <= [{DataNumber <= 10},{DataNumber <= 10}]}=>unresolved)]}
  => unresolved)
```

...which resolves to

```
(multiply{SourceMemory(3 elements, atoms 0, 3 index, 1 batch size)
  <= [{DataNumber <= 10},{DataNumber <= 10},(multiply{SourceMemory(2 elements, atoms 0, 2 index, 1 batch size)
  <= [{DataNumber <= 10},{DataNumber <= 10}]}=>{DataNumber <= 100})]}
  => {DataNumber <= 10000})
```

Either could have an advantage when it comes to code generation, so I will sit on it for a while as I try more complex cases.
TypeScript's type system is doing some heavy lifting here:

```typescript
type v4 = Vector<number, 4> | DataVectorN | (number | IData<number>)[]

const source = new SourceMemory([1, 1, 1, 1], new DataVectorN([1, 1, 1, 1]), [1, new DataNumber(1), 1, 1], [1, 1, 1, 1])
const compute = new ComputeLength4()
const streamer = new Streamer<v4, number>(source, compute)
```
Some more thoughts. Streaming in a way that engages elegantly with TypeScript's type system but remains performant: there are a few ways to go. The main issue is argument cardinality and the marshalling stage into the compute element, e.g. Source(2) => Streamer(2) => Compute(2).

This needs to be expressed in the type; the type system will flag a mismatch, so N needs to propagate properly. It will be used to drive Vector, but we also want it as a cardinal for use at runtime. The code is a bit of a mess right now because I have a few ideas in play simultaneously, and not all of them will be the winner: Streamer vs no Streamer; sausage machine vs structured. The sausage machine (or some variant) may actually be the best way forward and the most flexible.

This is all about being able to chunk up code and data into big blobs that we can blat into the GPU without a lot of back and forth. Ultimately all the design decisions are driven by that goal.
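One way N can live in the type and still be checkable at the boundary between stages is a tuple-recursive vector type. This is an illustrative sketch, not the repo's actual `Vector` definition: the recursive alias builds a tuple of exact length N, so a length mismatch between stages is a compile-time error, while `length` remains available as a runtime cardinal.

```typescript
// Sketch: a fixed-cardinality vector type. Vector<number, 4> resolves to the
// tuple [number, number, number, number], so lengths are checked statically.

type Vector<T, N extends number, R extends T[] = []> =
  R["length"] extends N ? R : Vector<T, N, [T, ...R]>;

// A stage whose signature demands exactly 4-element inputs.
function dot4(a: Vector<number, 4>, b: Vector<number, 4>): number {
  return a.reduce((acc, v, i) => acc + v * b[i], 0);
}

export const d = dot4([1, 2, 3, 4], [1, 1, 1, 1]);
// dot4([1, 2, 3], [1, 1, 1, 1]) would fail to compile: length 3 is not 4.
```

At runtime `a.length` is still an ordinary number, so the same N that drives the type checker can also drive batch-size or unroll decisions.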
@metawrap-dev It's really fascinating to see your thought process. I've just been wondering if you had any ideas on whether there's a better way to document this than an open issue: for example, an open PR with a README.md, or a wiki page, or something? It's okay if you feel this way of documenting your thoughts is fine. I just imagine that after a while you might want to collect your thoughts in a summary with the rationale so far. I realise that this is still part of discovery as you go, and you may want to postpone documenting in a more formalised way. Just some thoughts.
Yep, eventually. At the moment, I'm still experimenting. I assumed here was good because others could chip in. It just kind of happened organically as I write down most of these thoughts while mobile and then refer back to them as I work, so very convenient for me. I will gather my thoughts elsewhere. |
I'm definitely not suggesting you should gather your thoughts elsewhere outside of GitHub; as I said, it's very useful to see your thought process and it allows others to contribute. I just imagine you may find scrolling back through more than 29 posts a little challenging as you keep documenting, that is all.
29 (now 30) posts are fine. Scrolling is fine. That's not a long issue; this is a long issue. 😁 But we finally got there. Hopefully, I will be finished before this gets that long. I have the types mostly tamed, and I'm quite happy with it. For the initial toy system, all the tests and examples are just placeholders for fundamental but larger patterns, and I have a generalized way of defining them.
I have temporarily removed some pieces while I harmonise the code, and I have examples of different styles of compute element. It would be nice to be able to construct this behaviour out of the type system, as I have for most of the existing elements. Take MultiplyN (I'm using 0 = unlimited, 1 = scalar; I might change to Infinity, 0 later, but I haven't tested how nicely Infinity plays with the type system yet):

```typescript
class ComputeMultiplyVN extends ElementCompute implements ICompute<number, 1, 0, number, 1, 1>
```

Also, taking VectorAdd: I'm interested in the general solution of constructing the accumulator via generics and common base classes, separating more state out of compute, and including the optimization for the code-gen. I need some kind of lambda operator/compositor that can dip into logic, and a factory class to provide optimizations, if you get my drift :)

The other part is related: the internal code generation, where a compute element can create an internal representation of what it will be doing out of other elements. E.g. Multiply, based on input from the strategy (unroll/batch etc.), either declares a single fine-grained intention to multiply, or generates a coarse-grained sequence of mul8/mul4/mul3/mul2. That is the goal anyway, so this will be a fun bit. Like I said before, all these examples are just mini toy versions of what the system will need to do on a larger scale.
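The coarse-grained mul8/mul4/mul2 sequencing mentioned above can be sketched as a tiny planner. This is hypothetical (`planMultiply` and the kernel names are illustrative): each `mulW` kernel is assumed to fold W operands into one result, so it reduces the operand count by W - 1.

```typescript
// Sketch: plan a product over n operands as a sequence of wide multiply
// kernels plus narrower tails, instead of n - 1 scalar multiplies.

export function planMultiply(n: number): string[] {
  const plan: string[] = [];
  for (const width of [8, 4, 2]) {
    // Each mulW kernel consumes `width` operands and produces one result,
    // so the outstanding operand count drops by width - 1.
    while (n >= width) {
      plan.push(`mul${width}`);
      n -= width - 1;
    }
  }
  return plan;
}
```

For 10 operands this greedily picks one `mul8` and then finishes with `mul2`s; a real strategy object would choose widths from the metrics and unroll parameters discussed earlier rather than a fixed greedy order.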
I have a naive type signature implementation that will compose. In this form the generator's generic/template parameters can't be completely inferred from the input types, as I am trying to be flexible on the output stage. I will probably need to remove this flexibility and always return a standard form that can be derived from the source type.

```typescript
it('Generator matching 1:1 compute signature', async () => {
  // A source of pairs matching the signature of the compute element.
  const source = new SourceMemory<number, 4, 2>(
    [
      [10, 10, 10, 10],
      [10, 10, 10, 10],
    ],
    [
      [10, 10, 10, 10],
      [10, 10, 10, 10],
    ],
  )

  // The compute element that will take input from the source
  const compute = new ComputeAddV4()

  // The generator
  const generator = new Generator<number, 4, 2, number, 4, 2, number, 4, 1, number, 4, 1>(source, compute)

  // If the generator's generic/template parameters could be inferred from inputs then we could have this:
  // const generator = new Generator(source, compute)

  // input pairs, output vectors
  // <number,4,2>[] => <number,4,1>[] = <Output<number,4,1>,1,0>

  // This generator effectively becomes a new source, or can we make it appear to be a new compute element?
  // Does it matter as long as it has a type and matches the signature of IResolves<Output<number,4,2>,1,0>,
  // which is the type integral of the source's output?
  expect(generator).toBeDefined()
})
```
Generators that need to transform the data are where the required code gets complicated. I am backtracking on horizontal vectors, as they add a disruptive fog to singleton vs vector vs spread argument binding while I go through the process of harmonising all the code and moving it into base classes. I will add them back if I can make everything else work, or sacrifice them if they are too hard to keep. I would like them to contribute to optimization knowledge, as one would typically get better CPU/GPU cache performance from a contiguous, non-striding source. I may try to hold that knowledge in the object itself. In the meantime, there is a plan B.
Toy pipeline working: I can create a source, run it through a compute element, and then write the result to a destination. So, I'm going to keep things simple now and try to add a stepping stone towards code generation.
Super busy at the moment. Hopefully back into this soon.
Also relevant: https://github.com/HigherOrderCO/Bend. I was leaning towards a similar tree-based flow. I'm going to study this and see what can be gleaned.
It's not an issue - I'm just here to say I love this.
I want to see this get to the point where it will run large transformers.
For acceleration, you might want to consider something like https://github.com/gpujs/gpu.js, maybe?
Best of luck.