How are you supposed to use Idx and ThreadId in FitLoading? #120

Open
gwiesenekker opened this issue Sep 1, 2023 · 10 comments
Labels
question Further information is requested


@gwiesenekker

I guess these are used to select the pair but how? Suppose I have a CSV file with 100.000.000 rows (each row consisting of 100 inputs and 1 output). Could you provide some guidance on how to process this using FitLoading?

@joaopauloschuler
Owner

Regarding "1 output", is this an integer (such as a class id) or is it a floating point value?

@joaopauloschuler joaopauloschuler self-assigned this Sep 1, 2023
@joaopauloschuler joaopauloschuler added the question Further information is requested label Sep 1, 2023
@gwiesenekker
Author

gwiesenekker commented Sep 1, 2023

To be precise, the records consist of 192 binary inputs (0 1 1 1 0 ...) and one floating point output value.

@gwiesenekker
Author

gwiesenekker commented Sep 1, 2023

I have hard-coded the number of threads to 1 in NeuralDefaultThreadCount and printed Idx in GetTrainingPair to get an idea of what is going on. It seems that Idx refers to the position within the batch, as it always runs from 1..32. But what am I supposed to return? I have 100,000,000 records; what now?

@gwiesenekker
Author

I have been digging into the code and in TNeuralDataLoadingFit.RunNNThread it says:

BlockSize := FBatchSize div FThreadNum;
..
for I := 1 to BlockSize do
..
LocalTrainingPair := FGetTrainingPair(I, Index);
..
So that explains why Idx runs from 1..32 all the time. I guess I can ignore Idx, then, and I have to maintain my own state of where I am in these 100,000,000 records, so that the next time TTestFitLoading.GetTestPair is called I return the next record? Are the calls to TTestFitLoading.GetTestPair thread-safe/serialized?

@gwiesenekker
Author

gwiesenekker commented Sep 2, 2023

I have increased the number of threads to 2, and based on the output it seems that the calls to TTestFitLoading.GetTestPair run in parallel, as the writeln output sometimes gets garbled. So I could add a global variable nPair that I increment with InterlockedIncrement to keep track of where I am, but when should I set the variable back to 0 to start all over again? Or should I use mod 100,000,000?
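For reference, the InterlockedIncrement-plus-mod idea described above could look like the following minimal sketch. This assumes FPC's InterLockedIncrement from the System unit (which atomically increments and returns the new value); GlobalPos, RecordCount, and NextRecordIndex are hypothetical names:

```pascal
var
  GlobalPos: LongInt = -1;      // shared cursor; first increment yields 0

const
  RecordCount = 100000000;      // total rows in the CSV

function NextRecordIndex: LongInt;
begin
  // InterLockedIncrement atomically advances the shared cursor, so
  // concurrent GetTrainingPair calls each receive a distinct index.
  // The mod wraps back to 0 after the last record, so the counter
  // never needs an explicit reset between epochs.
  Result := InterLockedIncrement(GlobalPos) mod RecordCount;
end;
```

With the wrap-around handled by mod, there is no separate "start over" step to coordinate between threads.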

@joaopauloschuler
Owner

"So that explains why Idx runs from 1..32 all the time. So I guess I can ignore Idx"
This is correct.

"records somewhere so that the next time TTestFitLoading.GetTestPair is called I return the next record?"
From experience, non-deterministic methods perform better than their deterministic counterparts. But if the deterministic method is easier to prototype, you can go with it.

Idea: divide your 100M-row CSV into smaller files (one per planned thread). Then you can have one file handler per ThreadId: handlers: array[0..7] of TextFile. Then you can read rows with ReadLn(Handlers[ThreadId]). This is an idea only; I'm not saying that it is a good idea, nor is it a recommendation. One file handler per thread is thread-safe.

If you have one handler per thread, you won't need to maintain your own record pointers or any thread synchronization.

Other idea: keep the single 100M-row CSV file, but give each thread its own file handler positioned at a different point of the same file. On reflection, this is better than dividing the file.
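The one-handler-per-thread idea above could be sketched roughly as follows. MaxThreads, OpenHandlers, and ReadRow are hypothetical names, and the thread count is assumed to match NeuralDefaultThreadCount; this is an illustration, not a definitive implementation:

```pascal
const
  MaxThreads = 8;               // assumed; match your configured thread count

var
  Handlers: array[0..MaxThreads - 1] of TextFile;

procedure OpenHandlers(const FileName: string);
var
  I: Integer;
begin
  for I := 0 to MaxThreads - 1 do
  begin
    // AssignFile binds the name once; Reset can later reopen the same file.
    AssignFile(Handlers[I], FileName);
    Reset(Handlers[I]);
    // Optionally skip a thread-specific number of rows here so each
    // handler starts at a different point of the same file.
  end;
end;

function ReadRow(ThreadId: Integer): string;
begin
  // Each thread only ever touches Handlers[ThreadId], so no locking is needed.
  if EOF(Handlers[ThreadId]) then
    Reset(Handlers[ThreadId]);  // rewind to the start for the next pass
  ReadLn(Handlers[ThreadId], Result);
end;
```

Because Reset rewinds the already-assigned file, the EOF check also answers the "when do I start over" question for the file-based approach.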

@gwiesenekker
Author

gwiesenekker commented Sep 2, 2023

Thanks. Last question: I assume training requires multiple passes over the 100,000,000 records. How do I know when to start again from the beginning? Or should I just check for EOF and then close and reopen the CSV file? I could also bit-pack all records so they would fit in memory, and have GetTestPair unpack the next record. In that case as well: how do I know when to start again from the beginning?
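For the bit-packing option mentioned above: 192 binary inputs fit in 24 bytes per record, plus a Single for the output. A hedged sketch, where TPackedRecord and UnpackInputs are hypothetical names:

```pascal
type
  TPackedRecord = record
    Bits: array[0..23] of Byte; // 192 binary inputs, 8 per byte
    Target: Single;             // the floating point output value
  end;

procedure UnpackInputs(const Rec: TPackedRecord; var Inputs: array of Single);
var
  I: Integer;
begin
  // Expand each stored bit back to 0.0 or 1.0 for the network input.
  for I := 0 to 191 do
    Inputs[I] := (Rec.Bits[I div 8] shr (I mod 8)) and 1;
end;
```

At 28 bytes per record, 100,000,000 rows would occupy roughly 2.8 GB of RAM.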

@joaopauloschuler
Owner

Feel free to keep asking. Your questions are valid.

In the case that you decide to fit the dataset in memory (I know of people buying notebooks with 128 GB of RAM just to fit their data), you can load the data randomly. The randomness actually helps the learning. The other option is running your models in Google Colab using high-RAM machines. This is an example of running Pascal code in Colab:
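One nice property of the random-sampling approach suggested above: with the whole dataset in memory there is no cursor to maintain or reset between epochs. A minimal sketch (RecordCount and RandomRecordIndex are hypothetical names):

```pascal
var
  RecordCount: Integer = 100000000; // rows loaded in memory

function RandomRecordIndex: Integer;
begin
  // Each call draws an independent index, so epochs need no bookkeeping.
  // Note: FPC's Random uses a shared RandSeed and is not strictly
  // thread-safe, but for sampling purposes this is usually acceptable.
  Result := Random(RecordCount);
end;
```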

https://colab.research.google.com/github/joaopauloschuler/neural-api/blob/master/examples/SimpleImageClassifier/SimpleImageClassifierCPU.ipynb

@gwiesenekker
Author

gwiesenekker commented Sep 2, 2023

I have modified the HypotenuseFitLoading example to support my CSV file as discussed, and I already have it working, including loading the CSV completely bit-packed in memory. That means I achieved within one day with Neural-API what I could not achieve with TensorFlow in, say, two weeks of effort (although I only worked on it in my spare time over a couple of months)! I will clean up the code a bit; perhaps we can add my project to the Examples section.

@joaopauloschuler
Owner

@gwiesenekker, thank you very much for your kind comments!!!

If you add your code to github or another public repository, I'll be glad to add a link to it in my main readme.

If you have trained NNs on public datasets, I can also add a link to the trained networks.
