
GPU Support #1

Open · metawrap-dev opened this issue Mar 29, 2024 · 37 comments

Labels: enhancement (New feature or request)
@metawrap-dev

It's not an issue - I'm just here to say I love this.

I want to see this get to the point where it will run large transformers.

For acceleration, you might want to consider something like https://github.com/gpujs/gpu.js, maybe?

Best of luck.

eduardoleao052 self-assigned this Mar 29, 2024
@medic-code
Contributor

This is not an area I know much about, but I'd love to know more about the approach. @eduardoleao052, is this something you're thinking of using a dependency for, or building from scratch specifically for matrix operations?

@eduardoleao052
Owner

@medic-code after completing the TypeScript support, I'll study the best option for adding GPU support. This means that the heavy matrix multiplications can be run on a GPU, speeding up the ML models.

@metawrap-dev
Author

If I may be so bold, I would suggest the following approach. I'm wondering if we can outdo Python if we build from the ground up with a focus on GPU integration and performance. My understanding is that Torch and Python mostly have mapped corresponding routines (kernels) on the GPU, so in theory, if we take a ground-up performance approach, I would like to think we can outperform classical Torch.

We add decorations inside the Torch-ish objects that allow a visitor object to assemble an entire compute pipeline, kind of a compute abstract syntax tree of what is to be executed that we can map data into.

The pipeline needs to be able to be processed as a whole or chunked along I/O and compute-friendly boundaries to allow us to optimise when running with different GPU availability.

If the whole pipeline will fit in a big fat GPU then we want it all in there. GPUs are getting bigger and fatter. Let's design for that. If we need to break it up into discrete steps with a small or multiple small GPUs, then we need to be able to do that as well.

 ┌────────────┐    ┌────────────┐     ┌──────────┐     ┌──────────────┐     ┌─────────────┐
 │  Torchish  │    │ Pipeline   │     │ Pipeline │     │ Orchestrator │     │ GPU         │
 │   Objects  ├──► │ Assembler  ├───► │   AST    ├────►│              ├───► │ Gateway     │
 └────────────┘    └────────────┘     └──────────┘     └──────────────┘     └─────────────┘


  The executing      Can take        Representation      Takes the            Takes chunks
    object           the planned     of what needs       pipeline and         of the AST
                     execution and   to be computed      determines how       and data and
                     turn it into                        it will be executed  executes in GPU
                     Pipeline AST                        against the GPU

Make use of ReadableStream, WritableStream and TransformStream for compatibility with browser and Node, and make everything async. We need to stream everything to minimise memory usage. We need to be able to stream data and models from storage into the GPU, and then data out, without maxing out V8's limited memory.
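To make the streaming idea concrete, here is a minimal sketch assuming the standard Web Streams API (available in browsers and Node 18+); the batching stage and the pipe targets are hypothetical, not from any existing codebase:

    // Hypothetical batching stage: collects fixed-size batches of chunks on
    // their way from storage to a GPU stage, with back-pressure for free.
    function batcher(batchSize: number): TransformStream<Float32Array, Float32Array[]> {
      let batch: Float32Array[] = []
      return new TransformStream({
        transform(chunk, controller) {
          batch.push(chunk)
          if (batch.length >= batchSize) {
            controller.enqueue(batch)
            batch = []
          }
        },
        // Pass along any partial batch when the upstream closes.
        flush(controller) {
          if (batch.length > 0) controller.enqueue(batch)
        },
      })
    }

    // Usage (names hypothetical):
    // modelChunks.pipeThrough(batcher(32)).pipeTo(gpuStageSink)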

As a kind of proof of concept, and possibly even the solution, for version 0 we take the AST for the pipeline and build a function that can be passed into https://github.com/gpujs/gpu.js. Let it do the heavy lifting for now. If we need to do something else in the future, we can replace this step with something else and keep the rest of the design and implementation.
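For a flavour of what "building a function to pass into gpu.js" looks like, here is roughly the canonical matrix-multiply kernel from the gpu.js README (a sketch; sizes are arbitrary, and the AST-to-function step is not shown):

    // Sketch based on the gpu.js README: a 512x512 matrix multiply compiled
    // from a plain function into a GPU kernel.
    import { GPU } from 'gpu.js'

    const gpu = new GPU()
    const multiplyMatrix = gpu
      .createKernel(function (a, b) {
        let sum = 0
        for (let i = 0; i < 512; i++) {
          sum += a[this.thread.y][i] * b[i][this.thread.x]
        }
        return sum
      })
      .setOutput([512, 512])

    // const c = multiplyMatrix(a, b) // a, b: 512x512 number[][]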

@eduardoleao052
Owner

@metawrap-dev I like the spirit, and you clearly have a great understanding of hardware manipulation in pipelines. I think we could make it work, and adding at least gpu.js is a clear first step in my opinion. If simple GPU acceleration is put in place, it could already allow a much wider range of applications for the library.

@metawrap-dev
Author

Thanks!

To give a toy example: add(mult(w,x), mult(y,z))

We really want to build an object tree that represents the promise to execute this.

That way, instead of passing (w * x) and then (y * z) and then waiting for the result of these and then passing r1 + r2 to the GPU, we could more efficiently pass (w * x) + (y * z) in one go and get one value back.

So we go from

sequenceDiagram
    participant JavaScript
    participant GPU
    JavaScript->>GPU: w * x
    GPU->>JavaScript: r1
    JavaScript->>GPU: y * z
    GPU->>JavaScript: r2
    JavaScript->>GPU: r1 + r2
    GPU->>JavaScript: a

to

sequenceDiagram
    participant JavaScript
    participant GPU
    JavaScript->>GPU: (w * x) + (y * z)
    GPU->>JavaScript: a

This will work efficiently when we pass a torch-ish object into another's method as a parameter.

Not so much if we pass in just a scalar/primitive, unless we make a special class, e.g. a Number class that would allow a whole tree to be linked. We would, of course, make it as granular as possible (no point in building the internal AST of a matrix operation).

If the library is skewed towards being able to build the largest connected Pipeline tree, then we have the ability to pass the whole intention into the GPU and optimizers and have them work their magic.
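To make that concrete, here is a minimal sketch of the "promise to execute" tree (all names hypothetical; a real backend would transpile the tree to a GPU kernel instead of walking it on the CPU):

    // Hypothetical lazy expression tree: nothing is computed until resolve()
    // walks it, so the whole expression can be handed to a GPU backend in
    // one round trip.
    type Node =
      | { op: 'value'; value: number }
      | { op: 'add' | 'mult'; left: Node; right: Node }

    const value = (v: number): Node => ({ op: 'value', value: v })
    const add = (left: Node, right: Node): Node => ({ op: 'add', left, right })
    const mult = (left: Node, right: Node): Node => ({ op: 'mult', left, right })

    // Toy CPU resolver; the GPU path would instead compile the tree.
    function resolve(node: Node): number {
      switch (node.op) {
        case 'value':
          return node.value
        case 'add':
          return resolve(node.left) + resolve(node.right)
        case 'mult':
          return resolve(node.left) * resolve(node.right)
      }
    }

    // add(mult(w, x), mult(y, z)) built as a tree, resolved once.
    const tree = add(mult(value(2), value(3)), mult(value(4), value(5)))
    console.log(resolve(tree)) // 26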

@eduardoleao052
Owner

Cool! I could see that working, building upon the Tensor and Module objects. The Module.forward() method is always spelled out before the model is actually run. Then we could build this "graph" with all the operations, and send them in parallel to the GPUs!

I'm currently finishing adding TypeScript support to the library, and will push it to main soon (currently it's in the develop branch).

Would you by any chance be willing to work on a PR, so that we can try to implement the GPU.js support? I'd love to take on that challenge!

@metawrap-dev
Author

Sure! Once you complete the TypeScript support. I do a lot of prototyping in TypeScript, so it's my happy place.

In the meantime, I will look into implementing the toy example above and try out some design ideas.

@metawrap-dev
Author

metawrap-dev commented Apr 1, 2024

These two features in gpu.js are likely going to be load-bearing for us.

https://github.com/gpujs/gpu.js/blob/develop/README.md#combining-kernels
https://github.com/gpujs/gpu.js/blob/develop/README.md#create-kernel-map
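From those README sections, combining kernels looks roughly like this (a sketch; the output size is arbitrary). Intermediate results stay on the GPU, which is the single-round-trip behaviour discussed above:

    // Sketch based on the gpu.js README's combine-kernels example.
    import { GPU } from 'gpu.js'

    const gpu = new GPU()
    const add = gpu
      .createKernel(function (a, b) {
        return a[this.thread.x] + b[this.thread.x]
      })
      .setOutput([20])
    const multiply = gpu
      .createKernel(function (a, b) {
        return a[this.thread.x] * b[this.thread.x]
      })
      .setOutput([20])

    // (a + b) * c in one GPU pass; intermediates never return to JavaScript.
    const superKernel = gpu.combineKernels(add, multiply, function (a, b, c) {
      return multiply(add(a, b), c)
    })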

Some notes here for things I want to investigate.

  1. How to combine async/Promise and direct reference. (P,R) or (R{P})
  2. Data vs Function - how to model so we maximise "streamability"
  3. How do we stream "big" data in the pipeline? Chunk data along with execution nodes? Transform pipeline like map/reduce, repeating F with D1...DN then combining (where?)
  4. How to chunk the pipeline graph? Assuming no chunking into a big echoey GPU is the easy case. Pinch the pipe at the "thinnest" place. Define "thinnest" with the dimensions we have.
  5. Future - Is it possible to extend gpu.js to support "Persistent Kernels"?
  6. Object reference equivalence optimisation?
  7. Object hash equivalence optimisation? (overkill)
  8. It feels like being monad-like is going to play a big part in making this work, so make everything "complete".

@metawrap-dev
Author

I seem to have a proof of concept for item 1 on my list.

I can build a toy pipeline graph, display it unresolved, and resolve it. I will put this up in a repository tomorrow.

Tomorrow, let's see if I can pass some of it into gpu.js.

[screenshot]

@eduardoleao052
Owner

That's awesome, looks like a great start!
Btw I'll merge the TypeScript branch into main today, I'll let you know.

eduardoleao052 added the enhancement (New feature or request) label Apr 1, 2024
@eduardoleao052
Owner

eduardoleao052 commented Apr 1, 2024

@metawrap-dev Merged the TS into the main branch, but there are still some mistakes, I'll solve them today.

@metawrap-dev
Author

metawrap-dev commented Apr 2, 2024

Thanks. I will take a look. If you have any TypeScript questions, I'm happy to help there as well.

I'm going to continue in a different repository until I have the major questions I outlined above answered.

Today I'm going to tackle the streaming and batching big-data question. Big as in more than can fit into memory. Goal (i) will be to have a toy version of a model load from disk, with chained execution operations, and then output to disk.

I want to get to the point where execution is triggered only when we want to save the result. So, complete lazy loading and execution, with the ability to inspect the pipeline graph.

After that, goal (ii) will be to have a combination of granularities, so that some operations are handled by a prepositioned kernel and some are weaved together with permutations in code. Those will be handled by less granular code generation.

At that point it will be possible to create a visitor class that can analyse, chop/chunk and transpile the pipeline graph.

I can probably only get some of (i) done today given the time I have, as just coding up the testing and plumbing is going to be the majority of the work.

@eduardoleao052
Owner

Cool! If you want any help at some point, just let me know. About the TypeScript, the only issues left to solve are a couple of anys and unused variables. Just a matter of keeping the code cleaner.

@metawrap-dev
Author

If you can avoid any, it would be a Good Thing™️.

If it is the case that you don't know the type yet in the flow of execution and it will be resolved later on, then using unknown rather than any is preferred.

If it is just a case of trying to get things to match and any is used to get past that, then I can definitely help with that.
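For instance (a generic illustration, not code from the library):

    // With any, mistakes slip through silently and only fail at runtime.
    function parseAny(json: string): any {
      return JSON.parse(json)
    }
    parseAny('{}').definitely.not.there // compiles, crashes when run

    // With unknown, the compiler forces a narrowing check before use.
    function parseUnknown(json: string): unknown {
      return JSON.parse(json)
    }
    const value = parseUnknown('{}')
    if (typeof value === 'object' && value !== null) {
      // value is narrowed to object here and is safe to inspect
    }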

Here is my work so far.

https://github.com/metawrap-dev/streaming-compute-poc

@eduardoleao052
Owner

I'm currently removing all the anys, don't worry!
Most can be corrected simply; if there's some syntax on a specific part that I just don't know, I can let you know. Thanks for offering the help! I took a look over the code so far, really interesting stuff. I'll actually read it tomorrow, but from what I understand you're implementing a Number class.
Will this in the future compose a Tensor class for tensor operations?

@metawrap-dev
Author

The DataNumber class is just a start, and just a minimalist toy model. Yes, higher-order objects like tensors will be modelled the same way. I'm just starting very small, as there is a lot of type semantics to work out to make everything composable before I move on to something more complex.

@metawrap-dev
Author

I've updated the repository. I've also implemented some more complex composition involving sources. However, I'm still only using primitive number types and the multiply operation for now. My goal is to get the streaming working before I start doing anything more complex or more granular.

[screenshot]

@metawrap-dev
Author

metawrap-dev commented Apr 2, 2024

In-memory Source and Destination were added. I haven't yet tackled I/O and streaming, but the functionality I have so far models a lot of the problem and solution in memory.

E.g. in one test case.

{DestinationMemory(0 stored, 1 in buffer, 5 batch size) <= [( multiply {SourceMemory(4 elements, 0 index, 1 batch size) <= [10,10,10,10]} => unresolved )]=>[]}

Resolves to

{DestinationMemory(1 stored, 0 in buffer, 5 batch size) <= []=>[10000]}

This is, in effect, loading, computing, and saving the result.

We get to see the pipeline graph before it is resolved.

I'm not sure if I should strictly type these or use the "sausage machine principle" and assemble the correct type from arbitrary length and format input arrays.

resolve() is going to be invoking any transpilation. At the moment I am just executing everything on the CPU.

Optimal batch sizes can be discovered on the fly. At the moment, they are there to force the pipeline to execute when data limits are reached, to emulate a shadow of some kind of real-world behaviour. I'm thinking of some back-pressure signalling up from each kernel and from the GPU in general.

When complete and if it can be demonstrated to be usefully performant, the goal will be to take the learnings from my repository and transfer them, in style, to yours.

If it all plays out well, we should have something that could, in theory, outperform Python, or at least come within striking distance.

The next step is creating some actual disk I/O-driven Source and Destination implementations.

After that, some matrix operations. One could be implemented as a granular kernel, and the other could be built up from primitive operations. That should allow some serious testing with code generation.

I'm super busy until the end of the week, but will see what I can do in between things.

@metawrap-dev
Author

metawrap-dev commented Apr 3, 2024

Gathering some more thoughts for (i)

I'm close to being able to serialise and deserialise the execution state. Source and Destination batch themselves now. What is missing is generalising this to the Compute elements. I'm going to tease out the internal run-state of each element into something common that can be marshalled.

What will that get us?

  1. Batching and chunking of execution where possible.
  2. Execution back-pressure signalling so the running pipeline can self-optimise.
  3. The ability to freeze long-running execution to storage, and later thaw and continue execution from where it left off.

I'm also considering adding an unrolling strategy as the same kind of parameter as batch size. With this, the system that is trying to optimise the pipeline graph will also be able to experiment with unrolling.

And that implies that metrics are going to be needed.

The optimising system observing the execution needs to be able to get metrics at each point in the pipeline and maintain some previous data points to make a comparison. Enough to give it the ability to converge, but not too much that it overwhelms the task at hand.

  1. Common Strategy object? Something that can be introspected and modified even just using dumb random search. The element can take a strategy and apply it to State.
  2. Common 'State' object? Something that reflects the concrete application of Strategy or separate 'Config' for this? State = Runtime + Config?
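For illustration, the kind of shapes these could take (names and fields invented here, not settled design):

    // Hypothetical shapes for the ideas above.
    interface Strategy {
      batchSize: number // existing batching knob
      unroll: number // the proposed unrolling knob
    }

    interface Config {
      // Static choices: the concrete application of a Strategy.
      batchSize: number
    }

    interface State {
      runtime: { index: number; buffered: number } // resumable run-state
      config: Config // State = Runtime + Config
    }

    // Marshal the run-state so long-running execution can be frozen to
    // storage and later thawed to continue where it left off.
    const freeze = (state: State): string => JSON.stringify(state)
    const thaw = (json: string): State => JSON.parse(json) as State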

@metawrap-dev
Author

metawrap-dev commented Apr 3, 2024

Implemented Strategy, Config, and State across all the elements as described above. Strategy needs to be put into practice, so what is there is a placeholder for now.

[screenshot]

So close to 100% :)

Tentative flow control, so that operations will asynchronously block until data becomes available.

It is time to add more complex Data and Compute elements to refine the design, as single-number flow seems done. I've exhausted what I can do with these now.

The goal is to make sure that execution only occurs after resolve, but there are situations where populating Source or Destination will start to trigger flows.

The design looks like it is holding together and will do what I want. More complex objects should now shake out any bugs or design issues with the way I want data to flow and obey batch/chunk sizes. It should allow me to test the roll/unroll strategy ideas, as I need a Compute element with at least a few layers of execution to support that.

@eduardoleao052
Owner

This is really solid groundwork. I've been looking through the repo, and it's impressive stuff. I think adding more complex objects is the next big step. If you need any help, let me know!

@metawrap-dev
Author

metawrap-dev commented Apr 4, 2024

Thanks.

Yes, now that I have the flow and semantics thought out and lazy execution working, more complex objects are needed.

Just thinking some things out loud again here.

Code-Gen/Optimizer:

  1. Just take the pipeline graph (PG), break it into sub-graphs of the right size and transform each sub-graph into a GPU kernel.
  2. "Where" it is broken up will be affected by Strategy. At a point that maximises throughput. Treat the GPU as I/O.
  3. "How" we size? Good question.
  4. If the same PG "shape" is being built, executed and then torn down over and over, then how do we associate it with past metrics?
  5. Need a hash that bubbles up to the root that reflects the underlying execution pattern, so each identical sub-graph has the same hash (see the sketch after this list). Is it worth it? We'll see. Likely yes.
  6. Add metrics now. It's too important to leave until later.
  7. How do we cleanly associate the optimiser with a compute element without using globals?
  8. Some Compute elements will have 1:1 correspondence with inbuilt (hardware ops) and pre-positioned (software) kernels. Nothing inside. Just an Element.
  9. Some Compute elements will have control over unrolling based on Strategy. The built PG will depend on Strategy.
  10. Can we model flow-based/persistent computation in the design now? Build a pipeline that you feed data into via a Source and it feeds into a Destination? Reset vs Revisit resolve()/Resolved semantics (self-clocking on batch). We may need to mock behaviour for now, but it is important that this is designed in at ground level now. Two modes: complete reset and cumulative.
  11. Naive vs easy number packing? Is it better to store the matrix in a 1d or 2d array if the whole flow from coder to GPU is taken into account? Not obvious. Measure.
  12. How much will the optimizer handle and how much will we need to pre-empt? E.g. data types: UB,U... UV,V data types? Do we need to be aware, or can we assume the GL optimizer will handle it? Tackle this after v1.
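A possible shape for the bubbling hash in item 5 (hypothetical sketch; Node's crypto module assumed):

    // Hypothetical structural hash: a node's hash derives from its op plus
    // its children's hashes, so the root hash reflects the whole execution
    // "shape" and identical sub-graphs hash identically, regardless of the
    // data flowing through them.
    import { createHash } from 'node:crypto'

    interface PipelineNode {
      op: string
      children: PipelineNode[]
    }

    function shapeHash(node: PipelineNode): string {
      const childHashes = node.children.map(shapeHash).join(',')
      return createHash('sha256').update(`${node.op}(${childHashes})`).digest('hex')
    }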

Mock Compute and Data elements.

  1. DataVectorN
  2. DataMatrixMxN
  3. ComputeDotN (Dot Product)
  4. ComputeDot4GPU (Dot Product GPU)
  5. ComputeAverage (Average Pooling)
  6. ComputeAverage4x4GPU (Average Pooling) (Hardware Op)
  7. ComputeMultiplyMatrix
  8. ComputeMultiplyMatrix4x4GPU

This should give enough. The plan for today is just to get the new mock data elements in. Probably all I have time for.

I suspect the design won't survive the first data element without changes, but here goes. : )

@metawrap-dev
Author

DotProduct4

A lot to unpack here.

[screenshot]

Great success.. but..

From a flow aspect, there are several ways to compute dotN. Not all of them are obviously nonsense. Also, this is going to be the same for all operations.

  1. Supply two vectors. Output a single number. OK.
  2. Supply an array of two vectors, batch compute and return a vector (Ok.. makes sense.. we want that)
  3. Stream in two huge vectors of known same length. Stream compute. Output a single number. (Nice)
  4. Same as 3 but we arbitrarily decide when to cut the sausage and output a number. (Can't really think of a use case)
  5. Hold my beer. Stream in a stream of huge vectors. Stream compute. Stream out vector of numbers. (Very nice)
  6. Same as 5 but we arbitrarily decide when to cut the sausage and output a number. (Can't really think of a use case, or how, unless we signal end of vector somehow?)

This adds a lot of fragile complexity in having generalized arguments.

I'm wondering if I can create IStreamify so as to formally transform a non-streaming version into a streaming one. Otherwise, it's obvious I am going to end up in array hell. (Is it a matrix, or an array of vectors, to a streaming output of vectors, etc.)

Time to dive deep into some of the latest TypeScript features.

@metawrap-dev
Author

The plan =>

  • ISourceable - Takes a Compute element and adapts it for streaming inputs in from a source. May require TS type magic or specialised classes for each element.

  • IDestinationable - Takes a Compute element and adapts it for streaming outputs out to a destination.

I can probably create some generalised versions of these and, worst case, we end up with a pair per Compute element, but it will move the complexity of the different streaming styles into these distinct adapter objects. (A Good Thing™️)

Same approach for IPersistent for creating persistent kernels. Same as above but for the whole pipeline graph.
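For illustration, one possible shape for such an adapter (signatures invented here, not the actual repo API):

    // Hypothetical adapter: wraps a non-streaming Compute element so it
    // pulls batches from a Source, keeping the Compute element itself
    // unaware of batching.
    interface Source<T> {
      read(batch: number): Promise<T[]> // resolves [] when exhausted
    }

    interface Compute<I, O> {
      evaluate(input: I): O
    }

    class SourceAdapter<I, O> {
      constructor(
        private source: Source<I>,
        private compute: Compute<I, O>,
      ) {}

      // Drain the source batch by batch, yielding one output per input.
      async *run(batchSize = 1): AsyncGenerator<O> {
        for (;;) {
          const batch = await this.source.read(batchSize)
          if (batch.length === 0) return
          for (const input of batch) yield this.compute.evaluate(input)
        }
      }
    }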

@metawrap-dev
Author

metawrap-dev commented Apr 4, 2024

I tried the above. It works. I ended up with the more sensibly named IStreamer, as we don't seem to need one for Destination, and what it in effect does is stream the data in.

The new method simplifies the code, because Compute elements now don't need to be aware of how to batch and don't need to deal with argument overloading with the ISource.

 const streamer = new Streamer<v4, number>(
   new SourceMemory<v4>([1, 1, 1, 1], [1, 1, 1, 1], [1, 1, 1, 1], [1, 1, 1, 1]),
   new ComputeLength4<v4>())

Creates a pipeline graph of

    {Streamer{SourceMemory(4 elements, atoms 4, 0 index, 1 batch size) 
         <= [[1,1,1,1],[1,1,1,1],[1,1,1,1],[1,1,1,1]]} 
            => {ComputeLength4{SourceMemory(0 elements, atoms 0, 0 index, 1 batch size)
              <= []}=>unresolved}}

resolves to

    {Streamer{SourceMemory(4 elements, atoms 0, 4 index, 1 batch size) 
        <= [[1,1,1,1],[1,1,1,1],[1,1,1,1],[1,1,1,1]]} 
          => {ComputeLength4{SourceMemory(4 elements, atoms 0, 4 index, 1 batch size) 
           <= [[1,1,1,1],[1,1,1,1],[1,1,1,1],[1,1,1,1]]}=>{DataNumber 
             <= 2}}}

The alternative (the original approach) is

 const compute = new ComputeLength4<v4>(new SourceMemory<v4>([1, 1, 1, 1], [1, 1, 1, 1], [1, 1, 1, 1], [1, 1, 1, 1]))

Where we pass in the source just like any other parameter.

And we would get something like this for multiply.

     (multiply{SourceMemory(3 elements, atoms 3, 0 index, 1 batch size) 
       <= [{DataNumber <= 10},{DataNumber <= 10},(multiply{SourceMemory(2 elements, atoms 2, 0 index, 1 batch size) 
           <= [{DataNumber <= 10},{DataNumber <= 10}]}=>unresolved)]} 
               => unresolved)

...and resolves to.

      (multiply{SourceMemory(3 elements, atoms 0, 3 index, 1 batch size)
        <= [{DataNumber <= 10},{DataNumber <= 10},(multiply{SourceMemory(2 elements, atoms 0, 2 index, 1 batch size) 
          <= [{DataNumber <= 10},{DataNumber <= 10}]}=>{DataNumber <= 100})]} 
            => {DataNumber <= 10000})

The advantage of one over the other may come down to code generation.

So I will sit on it for a while as I try more complex Data and Compute element combinations, and then make a decision when I move to code generation.

@metawrap-dev
Author

TypeScript's type system is doing some heavy lifting here.

type v4 = Vector<number, 4> | DataVectorN | (number | IData<number>)[]

const source = new SourceMemory([1, 1, 1, 1], new DataVectorN([1, 1, 1, 1]), [1, new DataNumber(1), 1, 1], [1, 1, 1, 1])

const compute = new ComputeLength4()

const streamer = new Streamer<v4, number>(source, compute)

@metawrap-dev
Author

Some more thoughts.

Streaming in a way that engages elegantly with TypeScript's type system but remains performant: there are a few ways to go.

The main issue is argument cardinality and the marshalling stage into a Compute element with different argument sizes.

Either the Streamer needs to know how many arguments it is dealing with, or the Compute element needs to know.

Source(2)=>Streamer(2)=>Compute(2)

ICompute<I,N,O>

This needs to be expressed in the type, so the type system will flag a mismatch; N needs to propagate properly. It will be used to drive Vector, but we also want it as a cardinal for use at runtime.
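One way TypeScript can carry N in both worlds (purely illustrative, and a simpler parameterisation than the real elements would need; the Tuple helper builds a fixed-length tuple type, and the same N is materialised as a runtime value):

    // Illustrative only. Tuple<T, N> recursively builds a tuple of length N,
    // so argument cardinality is checked at compile time.
    type Tuple<T, N extends number, R extends T[] = []> = R['length'] extends N
      ? R
      : Tuple<T, N, [T, ...R]>

    interface ICompute<I, N extends number, O> {
      readonly cardinality: N // the same N, usable at runtime for marshalling
      evaluate(args: Tuple<I, N>): O
    }

    // A 2-argument multiply; passing the wrong arity is a type error.
    class Multiply2 implements ICompute<number, 2, number> {
      readonly cardinality = 2 as const
      evaluate(args: [number, number]): number {
        return args[0] * args[1]
      }
    }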

The code is a bit of a mess right now, because I have a few ideas in play simultaneously and not all of them will be the winner.

Streamer vs no Streamer: Compute input becomes Vector vs Source.

Sausage Machine vs Structured.

Sausage machine (or some variant) may actually be the best way forward and the most flexible.

This is all about being able to chunk up code and data into big blobs that we can blat into the GPU without a lot of back and forth. Ultimately all the design decisions are driven by that goal.

@medic-code
Contributor

@metawrap-dev it's really fascinating to see your thought process. I've just been wondering if you have any ideas on a better way to document this other than an open issue?

For example, maybe an open PR with a readme.md, or a wiki page or something? It's okay if you feel this way of documenting your thoughts is fine. I just imagine that after a while you might want to collect your thoughts into a summary and rationale so far. I realise that this is still a part of discovery as you go, and you may want to postpone documenting in a more formalised way. Just some thoughts.

@metawrap-dev
Author

Yep, eventually. At the moment, I'm still experimenting. I assumed here was good because others could chip in. It just kind of happened organically, as I write down most of these thoughts while mobile and then refer back to them as I work, so it's very convenient for me.

I will gather my thoughts elsewhere.

@medic-code
Contributor

I'm definitely not suggesting you should gather your thoughts elsewhere outside of GitHub; as I said, it's very useful to see your thought process and to allow others to contribute. I would imagine that you may find trying to scroll back through more than 29 posts a little more challenging as you keep documenting, that is all.

@metawrap-dev
Author

metawrap-dev commented Apr 9, 2024

29 (now 30) posts are fine. Scrolling is fine. That's not a long issue; this is a long issue. 😁 But we finally got there.

Hopefully, I will be finished before this gets that long.

I have the types mostly tamed. Quite happy with it.

For the initial toy system, all the tests and examples are just placeholders for fundamental but larger patterns. I have a generalized way of defining a Compute element. The pipeline can be constructed out of resolved (static) or unresolved elements (Compute, Source or Data with unresolved or static elements).

Compute elements can only take arguments of the same type, but I am considering generalizing that when I move out of the toy stage. If I can keep it this way, then it will have some advantages.

I have temporarily removed Streamer as I think I need something more general.

I have examples of different styles of Compute elements. I have been playing with the types system and elements to see how they mix (vector columns vs vector rows). My instinct tells me that telling the difference at this level will make optimization better, instead of throwing away information, but it makes the types stricter and safer, which is good for now.

I'm now deciding if I want to create Compute elements that are Source-aware and can "accumulate" an answer, or if I want to make another class like Streamer that handles that state and the accumulator. That would separate some of that complexity out of Compute elements, as I could render them stateless. They are half in, half out at the moment.

It would be nice to be able to construct this behaviour out of the types system as I have for most of the existing Data and Compute functionality.

Take MultiplyN. Dimension 1, Cardinality 0 ("Horizontal/Row" Unlimited Array) => Dimension 1, Cardinality 1 (Scalar)

(I'm using 0 = Unlimited, 1 = Scalar. Might change to Infinity, 0 later but not tested how nicely Infinity plays with the type system yet)

class ComputeMultiplyVN extends ElementCompute implements ICompute<number, 1, 0, number, 1, 1>

If I multiply a long Source<number,1,0> of numbers (effectively a long row vector), I get a single number as the output. The goal is that this process can be suspended and resumed, ideally not within ComputeMultiplyVN, as I want to move that state out. Hence the new Streamer/Generator/?

If I feed Streamer(Source, ComputeMultiplyVN) a Source of vectors, e.g. Source<number,4,0>, then it should output an array of numbers, each number representing the terms of each vector multiplied.

Also, taking a VectorAdd Compute element that can add two vectors: can a Streamer take that and construct something that can not only add N vectors, but is also able to finesse the code generation to take advantage of a hardware operation, or a sweet spot in performance where adding four vectors at once is better? That last bit is the issue, because something needs to supply that optimization domain knowledge. Can I generalize that? I think I can.

I'm interested in the general solution of constructing, via generics and common base classes, the accumulator, separating more state out of Compute, and including the optimization for the code-gen. I need some kind of lambda operator/compositor that can dip into logic, and a factory class to provide optimizations.

If you get my drift :)

The other is related: the internal code generation, where a Compute element can create an internal representation of what it will be doing out of other Compute elements.

E.g. Multiply, based on input from the Strategy (unroll/batch etc.), either declares a single-grained intention to multiply, or generates a coarse-grained sequence of mul8/mul4/mul3/mul2 Compute elements when it knows the length, or even creates really fine-grained code where every multiply is represented as an internal Compute element. Having the Strategy and Tactics for the Compute element control this, and integrating this with Metrics, allows the code-gen to be moderated. When we resolve(), it might result in repeated code-gen at different parts of the pipeline until some optimum is reached.

That is the goal anyway.
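As a toy illustration of that unroll decision (a hypothetical helper, not the repo's code; the grain set is configurable and here includes a mul1 fallback):

    // Hypothetical: plan a length-n multiply as a sequence of coarse-grained
    // kernels, largest grain first.
    function planMultiply(n: number, grains: number[] = [8, 4, 2, 1]): number[] {
      const plan: number[] = []
      let remaining = n
      for (const grain of grains) {
        while (remaining >= grain) {
          plan.push(grain)
          remaining -= grain
        }
      }
      return plan
    }

    // planMultiply(13) => [8, 4, 1]: one mul8, one mul4, one mul1.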

So this will be a fun bit.

Like I said before, all these examples are just mini toy versions of what the system will need to do on a larger scale.

@metawrap-dev
Author

metawrap-dev commented Apr 10, 2024

I have a naive type-signature implementation that will compose. In this form, the generator's generic/template parameters can't be completely inferred from the input types, as I am trying to be flexible on the output stage. I will probably need to remove this flexibility and always return a standard form that can be derived from the source type.

 it('Generator matching 1:1 compute signature', async () => {
    // A source of pairs matching signature of compute element.
    const source = new SourceMemory<number, 4, 2>(
      [
        [10, 10, 10, 10],
        [10, 10, 10, 10],
      ],
      [
        [10, 10, 10, 10],
        [10, 10, 10, 10],
      ],
    )

    // The compute element that will take input from the source
    const compute = new ComputeAddV4()

    // The generator
    const generator = new Generator<number, 4, 2, number, 4, 2, number, 4, 1, number, 4, 1>(source, compute)

    // If the generator's generic/template parameters can be inferred from inputs then we can have this
    // const generator = new Generator(source, compute)

    // input pairs       output vectors
    // <number,4,2>[] => <number,4,1>[] = <Output<number,4,1>,1,0>

    // This generator effectively becomes a new source or can we make it appear to be a new compute element?
    // Does it matter as long as it has a type and matches signature of IResolves<Output<number,4,2>,1,0>
    // Which is the type integral of the source's output.

    expect(generator).toBeDefined()
  })

@metawrap-dev
Author

Generator looks like it will work. In the current experiment, I am looking at still allowing Source to work for individual arguments and Generator for collated sets of arguments. I am tightening up the types with a view to simplifying the classes.

The required code in a Compute element has been slashed. A bit more type and base class algebra should get it down to something much more manageable than it was when I started.

Generators that need to transform the Source to suit the Compute element, as well as Compute elements with heterogeneous arguments, will be the next challenge. Variadic tuple arguments and partial argument consumption may work for the latter. I have an idea. Still thinking about how to finesse this with the generator such that the type system performs the maximum amount of lift. What will come out of this, at least, is a proposal with a concrete example for TypeScript to support materialising generic type arguments as value instances accessible as constants.

I am backtracking on horizontal vectors, as they add a disruptive fog to singleton vs vector vs spread argument binding while I go through the process of harmonising all the code and moving it into base classes. I will add them back if I can make everything else work, or will sacrifice them if they are too hard to keep. I would like them to contribute to optimization knowledge, as one would typically get better CPU/GPU cache performance over a contiguous, non-striding source. I may try to hold that knowledge in the object itself, in the Strategy, if I can't keep it at a type/class level.

In the meantime, plan B is Compute2, Compute3, Compute4, etc., where N is the number of arguments supported by the compute element.

@metawrap-dev
Author

The toy pipeline with Generator seems to work, just with matching inputs for now. A Generator element will take a Source, pass it through a single Compute element, and generate a new Source that contains all the output of that process.

If I create a Generator

{Generator(0 elements, 0 atoms, 0 index, 1 batch size){
  SourceMemory(3 elements, atoms 3, 0 index, 1 batch size) 
    <= [[[1,1,1,1],[2,2,2,2]],[[3,3,3,3],[5,5,5,5]],[[7,7,7,7],[11,11,11,11]]]}
        =>ComputeAddV4=>[]}

.. and then write it to a DataStore, it sucks the data out from the Generator (as a Source), triggering resolve() down the chain, and we get..

{DestinationMemory(3 stored, 0 in buffer, 1 batch size) 
   <= []
      =>[[3,3,3,3],[8,8,8,8],[18,18,18,18]]}

So, I'm going to keep things simple now and try to add a stepping stone towards code generation.

This will be like a Generator, but will take a sequence of Compute elements in a complex array; in effect, a multidimensional pipeline. It will in itself be a Compute element. Or I may be limited to just composing two at a time. I will see how far I can push the type system.
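A sketch of the two-at-a-time composition idea (invented signatures): because the composite is itself a Compute element, it nests to any depth:

    // Illustrative: compose two Compute elements into one.
    interface Compute<I, O> {
      evaluate(input: I): O
    }

    class Compose<I, M, O> implements Compute<I, O> {
      constructor(
        private first: Compute<I, M>,
        private second: Compute<M, O>,
      ) {}

      evaluate(input: I): O {
        return this.second.evaluate(this.first.evaluate(input))
      }
    }

    // Composing two at a time still builds pipelines of any depth:
    // new Compose(new Compose(a, b), c)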

@metawrap-dev
Author

Super busy at the moment. Hopefully back into this soon.

@metawrap-dev
Author

Also relevant.

https://github.com/HigherOrderCO/Bend

I was leaning towards a similar tree-based flow. I'm going to study this and see what can be gleaned.
