
Getting datasets from datavault slows down computer #417

Open
ckometter opened this issue Aug 6, 2017 · 3 comments

@ckometter

ckometter commented Aug 6, 2017

Hi,
I'm trying to get large datasets (1 GB+) from datavault 2.3.4 (csv files) using get() from a remote computer. It takes a long time, and the computer running datavault slows down to the point of almost freezing, which interferes with the measurement it is running.
I was wondering:

  • Will updating to the new version of datavault solve this issue?
  • Is it possible to open partially completed datasets saved as hdf5 files?
  • Is it possible to choose between hdf5 and csv in the new version of datavault?

edit:
I have a synchronous script that connects and writes data to datavault line by line (version 3.0.1 this time). When I synchronously get() data from a remote computer, the script pauses and stops writing data until get() returns the dataset. Is there an easy way to get data from datavault without my script pausing?

thanks

@maffoo
Contributor

maffoo commented Aug 7, 2017

For big datasets, you should not try to get the whole thing all at once, but rather load the data in chunks. When you call get, pass the maximum number of rows to fetch, and loop until no more data comes back, something like:

rows = []
while True:
    # Fetch at most 1000 rows past the current read position.
    r = cxn.data_vault.get(1000)
    if not r:
        # An empty result means we've reached the end of the dataset.
        break
    rows.extend(r)

Fetching the data in small chunks like this will ensure that the server can do other things in the meantime, like accept writes from your other measurement script.

Note that with the csv backend, the server still loads the entire dataset from disk into memory, so if you have a very large dataset there may still be a small server pause while the data is loaded, even if you then fetch the data over the network in smaller chunks.

As for your other questions, the new version of the data vault does not support choosing between hdf5 and csv format for data storage; it will continue to read csv data sets, but all new data sets will be stored with hdf5. Of course, we could add the ability to select a file format if that is something you need.

Also, it is certainly possible to open partially completed datasets that are stored as hdf5; other clients can, for example, open the dataset and then get notifications when new data is added to the dataset.
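As a rough sketch (the helper name fetch_all is just for illustration, and this assumes a synchronous pylabrad connection), a separate client could read a dataset in chunks using only the cd/open/get calls shown above:

import labrad

def fetch_all(fdir, fname, chunk=1000):
    # Open the dataset and pull it down in small chunks so the server stays responsive.
    cxn = labrad.connect()
    dv = cxn.data_vault
    dv.cd(fdir)
    dv.open(fname)
    rows = []
    while True:
        r = dv.get(chunk)  # at most `chunk` rows past the current read position
        if not r:
            break  # no more data for now; a live reader could sleep and retry here
        rows.extend(r)
    return rows

For a partially completed dataset, you could keep polling instead of breaking out of the loop, and stop once you know the dataset is finished.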

@ckometter
Author

Thank you!

I have switched to an asynchronous connection to data_vault for grabbing data. It seems to be working with smaller datasets so far. I was wondering if this is the right way to do it.

from labrad.wrappers import connectAsync
from twisted.internet import reactor
from twisted.internet.defer import inlineCallbacks, returnValue  
import scipy.io as sio
import sys

@inlineCallbacks
def get_file(host, fdir, fname):
    try:
        cxn = yield connectAsync(host=host)
        print("connected")
        dv = cxn.data_vault
        print((yield dv.cd(fdir)))
        print((yield dv.open(fname)))
        print("opened file")
        M = yield dv.get()
        print(M)
    finally:
        # Stop the reactor whether or not the fetch succeeded.
        reactor.stop()

    returnValue(M)
 
@inlineCallbacks   
def save_file():
    M = yield get_file('rashba', sys.argv[1], int(sys.argv[2]))
    path = "C:/Users/carli/Dropbox/NHMFL_AUG2017/matlab/"
    filename = path + sys.argv[1] + "/" + sys.argv[2] + ".mat"
    sio.savemat(filename, {'d'+str(sys.argv[1]):M})

if __name__ == '__main__':
    print(sys.argv)
    save_file()
    reactor.run()

A couple more things:

  • When I call get(startOver=True) I get an "unexpected keyword argument" error. Is there a proper way to tell get() to retrieve the full dataset from the start?
  • Sorry, when I said opening an hdf5 file while the dataset is active, I meant opening it from an external program such as matlab or hdf5view. Sometimes when I open it from hdf5view, I get a java error. But I suppose I've answered my own question.

@ckometter ckometter reopened this Aug 9, 2017
@maffoo
Contributor

maffoo commented Aug 9, 2017

labrad doesn't support passing keyword args when you call remote settings, at least not yet, so you have to pass positional args instead. You could, for example, call dv.get(1000, True), where the first argument is the maximum number of rows to send and the second argument tells the server to start over at the beginning of the dataset. As I said before, it's important to use a row limit and call get in a loop if you want the datavault to stay responsive while getting large datasets.
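For example (just a sketch; this assumes the second flag can also be passed as False on later calls so that they continue from the current read position):

rows = []
first = True
while True:
    # Pass True only on the first call to reset the read position to the start of the dataset.
    r = cxn.data_vault.get(1000, first)
    first = False
    if not r:
        break
    rows.extend(r)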

As for reading an hdf5 file while writing to it from a separate program, I have no idea whether that will work; it's certainly not something I would recommend if you can avoid it, because we haven't tested that sort of scenario to ensure that the data can't get corrupted. What was the exact java error you were seeing?
