question on the bigdata.py #14

Open
drzhouq opened this issue May 2, 2017 · 7 comments

Comments

@drzhouq

drzhouq commented May 2, 2017

Thanks for setting this up. I am wondering about "bigdata.py". It appears to me that the code does not use all the data from the "big data" population and only samples 8 points at a time. That is no different from "tensor.py", which just uses the same 8 points over and over. Can you elaborate? Thanks.

@jostmey
Owner

jostmey commented May 2, 2017

Lines 57 through 68 randomly sample the large dataset using NumPy code. Only 8 datapoints are loaded at each step of gradient descent. However, with each pass over the for loop, another 8 datapoints are randomly sampled.
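
For readers following along, this is roughly what that per-step sampling looks like. It is a minimal NumPy sketch, not the exact code from bigdata.py; the names xs, ys, and _BATCH and the data distribution are assumptions made for illustration.

```python
import numpy as np

_BATCH = 8          # datapoints fed to TensorFlow per gradient-descent step
N = 8000000         # size of the generated "big data" population

# Stand-ins for the arrays built by bigdata.py (hypothetical names and distribution)
xs = np.random.uniform(0.0, 10.0, size=N)
ys = 0.5 * xs + 2.0 + np.random.normal(size=N)

# One pass of the training loop: draw a fresh random batch of 8 datapoints.
indices = np.random.randint(N, size=_BATCH)
x_batch, y_batch = xs[indices], ys[indices]   # only these 8 values reach TensorFlow this step
```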

@drzhouq
Author

drzhouq commented May 2, 2017

Thanks a lot, Jared. I might not have made myself clear. "bigdata.py" is supposed to demonstrate that we can handle a "large" volume of data with TensorFlow. Because we only sample 8 points at a time, TensorFlow still deals with a very small amount of data. I fail to see how TensorFlow scales up to a "large" volume of data in the "bigdata.py" script. Both "bigdata.py" and "tensor.py" loop "_EPOCH" times; the only difference is that "tensor.py" uses the same data on each pass, while "bigdata.py" samples different data from a large population.

@jostmey
Owner

jostmey commented May 2, 2017

Tensor.py shows you how to process samples in parallel. If you have a GPU, you can increase _BATCH to something like 100, 1,000, or even 10,000, and because the samples are handled as tensors, the whole batch will run in parallel. There are two reasons why you might want a bigger _BATCH: (1) you can get away with a larger step size and fewer _EPOCHS, and (2) you can handle data with more "variance", which is to say that if you are classifying between 100 outcomes, you want a batch size on the order of at least 100 (otherwise, convergence with stochastic gradient descent will be slow).

But you're right, you will still need to run through the "_EPOCH" loop many times.
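
A minimal sketch of what a bigger batch looks like, assuming the TensorFlow 1.x API these scripts were written against; the variable names, shapes, and learning rate below are placeholders for illustration, not the repository's exact code.

```python
import tensorflow as tf  # assumes the TensorFlow 1.x API (tf.placeholder, tf.Session)

_BATCH = 1000  # raised from 8; on a GPU the whole batch is evaluated in parallel

x = tf.placeholder(tf.float32, shape=[_BATCH])
y = tf.placeholder(tf.float32, shape=[_BATCH])
m = tf.Variable(0.0)
b = tf.Variable(0.0)

# A single tensor expression scores all _BATCH samples at once; growing the
# batch only changes the placeholder shape, not the model code.
loss = tf.reduce_mean(tf.square(m * x + b - y))
train_op = tf.train.GradientDescentOptimizer(0.001).minimize(loss)
```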

@drzhouq
Author

drzhouq commented May 5, 2017

Many thanks for your time and patience, Jared. I think I was stuck on the fact that "bigdata.py" generates a big dataset, but the regression only samples 8 points at a time for 10,000 iterations. Therefore, the script uses at most 80,000 data points out of 8 million and leaves 7.92M unused. Perhaps, as you suggested, the script could use a larger _BATCH, like 1,000, to simulate getting a data feed from a large dataset.

@jostmey
Owner

jostmey commented May 5, 2017

It works because the extra 7.92M datapoints are very similar to the first 80,000 datapoints. In circumstances where this is not the case, you might consider running stochastic gradient descent much, much longer to cover all the datapoints (serial) or using a larger batch size (parallel).
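
To put numbers on that serial-versus-parallel trade-off, using the figures from this thread (plain arithmetic for illustration, not code from the repository):

```python
N = 8000000            # datapoints in the generated population
batch, epochs = 8, 10000

touched = batch * epochs          # at most 80,000 distinct datapoints ever get sampled
unused = N - touched              # 7,920,000 datapoints the regression may never see

# Covering every datapoint means either running far longer (serial) ...
epochs_to_cover = N // batch      # 1,000,000 passes at a batch size of 8
# ... or enlarging the batch (parallel)
batch_to_cover = N // epochs      # a batch of 800 covers the data in the same 10,000 passes
print(touched, unused, epochs_to_cover, batch_to_cover)
```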

@drzhouq
Author

drzhouq commented May 8, 2017

That's my point. Except for generating the extra 7.92M random points, the "bigdata.py" script is identical to the "tensor.py" script. Therefore, I am not sure I get the purpose of the "bigdata.py" script. :-)

@jostmey
Owner

jostmey commented May 8, 2017

The point is to explain how to use placeholders. If you don't use placeholders, the amount of data that can be handled by TensorFlow is limited.
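
For anyone landing here later, the contrast in one sketch; this assumes the TensorFlow 1.x API and uses hypothetical names and synthetic data, not the repository's exact code.

```python
import numpy as np
import tensorflow as tf  # assumes the TensorFlow 1.x API (tf.placeholder, tf.Session)

# Without placeholders, the data is baked into the graph as constants, so
# everything TensorFlow ever sees must fit into the graph definition up front:
x_small = tf.constant([1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0])

# With a placeholder, the graph only fixes a shape; any slice of an
# arbitrarily large dataset can be streamed in at run time via feed_dict:
x = tf.placeholder(tf.float32, shape=[8])
doubled = 2.0 * x

big_data = np.random.uniform(size=8000000).astype(np.float32)
with tf.Session() as session:
    batch = big_data[np.random.randint(len(big_data), size=8)]
    print(session.run(doubled, feed_dict={x: batch}))
```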
