How are you supposed to use Idx and ThreadId in FitLoading? #120

Open
gwiesenekker opened this issue Sep 1, 2023 · 10 comments
Labels
question Further information is requested


@gwiesenekker

I guess these are used to select the pair but how? Suppose I have a CSV file with 100.000.000 rows (each row consisting of 100 inputs and 1 output). Could you provide some guidance on how to process this using FitLoading?

@joaopauloschuler
Owner

Regarding "1 output", is this an integer (such as a class id) or is it a floating point value?

@joaopauloschuler joaopauloschuler self-assigned this Sep 1, 2023
@joaopauloschuler joaopauloschuler added the question Further information is requested label Sep 1, 2023
@gwiesenekker
Author

gwiesenekker commented Sep 1, 2023

To be precise, the records consist of 192 binary inputs (0 1 1 1 0 ...) and one floating point output value.

@gwiesenekker
Author

gwiesenekker commented Sep 1, 2023

I have hard-coded the number of threads to 1 in NeuralDefaultThreadCount and printed Idx in GetTrainingPair to get an idea of what is going on. It seems that Idx refers to the position within the batch, as it always runs from 1..32. But what am I supposed to return? I have 100,000,000 records; what now?

@gwiesenekker
Author

I have been digging into the code and in TNeuralDataLoadingFit.RunNNThread it says:

BlockSize := FBatchSize div FThreadNum;
..
for I := 1 to BlockSize do
..
LocalTrainingPair := FGetTrainingPair(I, Index);
..
So that explains why Idx runs from 1..32 all the time. I guess I can ignore Idx, then, and I have to maintain my own state of where I am in these 100,000,000 records, so that the next time TTestFitLoading.GetTestPair is called I return the next record? Are the calls to TTestFitLoading.GetTestPair thread-safe/serialized?

@gwiesenekker
Author

gwiesenekker commented Sep 2, 2023

I have increased the number of threads to 2, and based on the output it seems that the calls to TTestFitLoading.GetTestPair run in parallel, as the writeln output sometimes gets garbled. So I could add a global variable nPair that I increment with InterlockedIncrement to keep track of where I am, but when should I set the variable back to 0 to start all over again? Or should I use mod 100,000,000?
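For reference, the InterlockedIncrement-plus-mod idea described above could look like the following minimal sketch. This assumes FPC's InterLockedIncrement from the System unit (which atomically increments and returns the new value); GlobalPos, RecordCount, and NextRecordIndex are hypothetical names:

```pascal
var
  GlobalPos: LongInt = -1;      // shared cursor; first increment yields 0

const
  RecordCount = 100000000;      // total rows in the CSV

function NextRecordIndex: LongInt;
begin
  // InterLockedIncrement atomically advances the shared cursor, so
  // concurrent GetTrainingPair calls each receive a distinct index.
  // The mod wraps back to 0 after the last record, so the counter
  // never needs an explicit reset between epochs.
  Result := InterLockedIncrement(GlobalPos) mod RecordCount;
end;
```

With the wrap-around handled by mod, there is no separate "start over" step to coordinate between threads.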

@joaopauloschuler
Owner

"So that explains why Idx runs from 1..32 all the time. So I guess I can ignore Idx"
This is correct.

"records somewhere so that the next time TTestFitLoading.GetTestPair is called I return the next record?"
From experience, non-deterministic methods perform better than their deterministic counterparts. But if the deterministic method is easier to prototype, you can go with it.

Idea: divide your 100M-row CSV into smaller files (one per planned thread). Then you can have one file handler per ThreadId: handlers: array[0..7] of TextFile. Then you can read rows with ReadLn(Handlers[ThreadId]). This is an idea only; I'm not saying that it is a good idea, nor is it a recommendation. One file handler per thread is thread-safe.

If you have one handler per thread, you won't need to maintain your own record pointers or any thread synchronization.

Other idea: keep the single 100M-row CSV file, but give each thread its own file handler positioned at a different point of the same file. On reflection, this is better than dividing the file.
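The one-handler-per-thread idea above could be sketched roughly as follows. MaxThreads, OpenHandlers, and ReadRow are hypothetical names, and the thread count is assumed to match NeuralDefaultThreadCount; this is an illustration, not a definitive implementation:

```pascal
const
  MaxThreads = 8;               // assumed; match your configured thread count

var
  Handlers: array[0..MaxThreads - 1] of TextFile;

procedure OpenHandlers(const FileName: string);
var
  I: Integer;
begin
  for I := 0 to MaxThreads - 1 do
  begin
    // AssignFile binds the name once; Reset can later reopen the same file.
    AssignFile(Handlers[I], FileName);
    Reset(Handlers[I]);
    // Optionally skip a thread-specific number of rows here so each
    // handler starts at a different point of the same file.
  end;
end;

function ReadRow(ThreadId: Integer): string;
begin
  // Each thread only ever touches Handlers[ThreadId], so no locking is needed.
  if EOF(Handlers[ThreadId]) then
    Reset(Handlers[ThreadId]);  // rewind to the start for the next pass
  ReadLn(Handlers[ThreadId], Result);
end;
```

Because Reset rewinds the already-assigned file, the EOF check also answers the "when do I start over" question for the file-based approach.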

@gwiesenekker
Author

gwiesenekker commented Sep 2, 2023

Thanks. Last question: I assume training requires multiple passes over the 100,000,000 records. How do I know when to start again from the beginning? Or should I just check for EOF and then close and reopen the CSV file? I could also bit-pack all records so they would fit in memory, and have GetTestPair unpack the next record. In that case as well: how do I know when to start again from the beginning?
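For the bit-packing option mentioned above: 192 binary inputs fit in 24 bytes per record, plus a Single for the output. A hedged sketch, where TPackedRecord and UnpackInputs are hypothetical names:

```pascal
type
  TPackedRecord = record
    Bits: array[0..23] of Byte; // 192 binary inputs, 8 per byte
    Target: Single;             // the floating point output value
  end;

procedure UnpackInputs(const Rec: TPackedRecord; var Inputs: array of Single);
var
  I: Integer;
begin
  // Expand each stored bit back to 0.0 or 1.0 for the network input.
  for I := 0 to 191 do
    Inputs[I] := (Rec.Bits[I div 8] shr (I mod 8)) and 1;
end;
```

At 28 bytes per record, 100,000,000 rows would occupy roughly 2.8 GB of RAM.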

@joaopauloschuler
Owner

Feel free to keep asking. Your questions are valid.

In the case that you decide to fit the dataset in memory (I know of people buying notebooks with 128 GB of RAM just to fit their data), you can load the data randomly. The randomness actually helps the learning. The other option is running your models in Google Colab using high-RAM machines. This is an example of running Pascal code in Colab:
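One nice property of the random-sampling approach suggested above: with the whole dataset in memory there is no cursor to maintain or reset between epochs. A minimal sketch (RecordCount and RandomRecordIndex are hypothetical names):

```pascal
var
  RecordCount: Integer = 100000000; // rows loaded in memory

function RandomRecordIndex: Integer;
begin
  // Each call draws an independent index, so epochs need no bookkeeping.
  // Note: FPC's Random uses a shared RandSeed and is not strictly
  // thread-safe, but for sampling purposes this is usually acceptable.
  Result := Random(RecordCount);
end;
```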

https://colab.research.google.com/github/joaopauloschuler/neural-api/blob/master/examples/SimpleImageClassifier/SimpleImageClassifierCPU.ipynb

@gwiesenekker
Author

gwiesenekker commented Sep 2, 2023

I have modified the HypotenuseFitLoading example to support my CSV file as discussed, and I already have it working, including loading the CSV completely bit-packed in memory. That means I achieved within one day with Neural-API what I could not achieve with TensorFlow in, say, two weeks of effort (although I only worked on it in my spare time over a couple of months)! I will clean up the code a bit; perhaps we can add my project to the Examples section.

@joaopauloschuler
Owner

@gwiesenekker, thank you very much for your kind comments!!!

If you add your code to github or another public repository, I'll be glad to add a link to it in my main readme.

If you have trained NNs on public datasets, I can also add a link to the trained networks.
