Pickle error when dealing with large corpora #20

Closed
owo opened this issue Dec 11, 2014 · 3 comments

owo commented Dec 11, 2014

While this is not an issue with glove-python, it's worth noting that pickling large corpora/models causes the following error:

SystemError: error return without exception set 

According to this numpy issue (numpy/numpy#2396), it is a bug in pickle that has been fixed in Python 3.3.
I think it would be worth pointing that out in the README for future reference.
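For reference, a minimal sketch of the failure mode (the array size below is only illustrative; on an affected Python 2 interpreter, pickling a buffer of a few GiB is enough to trigger it):

import numpy as np

try:
    # Python 2 compat, mirroring glove/corpus.py
    import cPickle as pickle
except ImportError:
    import pickle

# Roughly 4 GiB of float64 data; a large co-occurrence matrix or model
# serializes to a buffer of this order of magnitude.
big = np.zeros(1 << 29, dtype=np.float64)

try:
    payload = pickle.dumps(big, protocol=pickle.HIGHEST_PROTOCOL)
except SystemError as err:
    # On affected Python 2 interpreters this surfaces as:
    #   SystemError: error return without exception set
    print('pickling failed:', err)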

maciejkula (Owner) commented

Thanks. Pickling in general is a bad idea, so this will go away as I move away from pickle.


geovedi commented Feb 9, 2015

This is my quick workaround using h5py, since I'm trying hard not to touch the Cython part. I'm sure there's a better way to solve this problem.

diff --git a/glove/corpus.py b/glove/corpus.py
index 4545ba4..9fc3c45 100644
--- a/glove/corpus.py
+++ b/glove/corpus.py
@@ -1,11 +1,13 @@
 # Cooccurrence matrix construction tools
 # for fitting the GloVe model.
 import numpy as np
+import scipy.sparse as sp
 try:
     # Python 2 compat
     import cPickle as pickle
 except ImportError:
     import pickle
+import h5py

 from .corpus_cython import construct_cooccurrence_matrix

@@ -70,10 +72,12 @@ class Corpus(object):
                                                     max_map_size)

     def save(self, filename):
-        
-        with open(filename, 'wb') as savefile:
-            pickle.dump((self.dictionary, self.matrix),
-                        savefile,
+
+        f = h5py.File('{0}.h5'.format(filename), 'w')
+        _ = f.create_dataset('matrix', data=self.matrix.todense())
+
+        with open('{0}.dict'.format(filename), 'wb') as savefile:
+            pickle.dump(self.dictionary, savefile,
                         protocol=pickle.HIGHEST_PROTOCOL)

     @classmethod
@@ -81,7 +85,10 @@ class Corpus(object):

         instance = cls()

-        with open(filename, 'rb') as savefile:
-            instance.dictionary, instance.matrix = pickle.load(savefile)
+        f = h5py.File('{0}.h5'.format(filename), 'r')
+        instance.matrix = sp.coo_matrix(np.asarray(f['matrix']))
+
+        with open('{0}.dict'.format(filename), 'rb') as savefile:
+            instance.dictionary = pickle.load(savefile)

         return instance
diff --git a/glove/glove.py b/glove/glove.py
index c93f69e..f104e8f 100644
--- a/glove/glove.py
+++ b/glove/glove.py
@@ -11,9 +11,31 @@ except ImportError:

 import numpy as np
 import scipy.sparse as sp
+import h5py

 from .glove_cython import fit_vectors, transform_paragraph

 class Glove(object):
     """
@@ -189,6 +211,19 @@ class Glove(object):
         Serialize model to filename.
         """

+        f = h5py.File('{0}.h5'.format(filename), 'w')
+        if isinstance(self.word_vectors, np.ndarray):
+            _ = f.create_dataset('word_vectors', data=self.word_vectors)
+        if isinstance(self.word_biases, np.ndarray):
+            _ = f.create_dataset('word_biases', data=self.word_biases)
+        if isinstance(self.vectors_sum_gradients, np.ndarray):
+            _ = f.create_dataset('vectors_sum_gradients', data=self.vectors_sum_gradients)
+        if isinstance(self.biases_sum_gradients, np.ndarray):
+            _ = f.create_dataset('biases_sum_gradients', data=self.biases_sum_gradients)
+
+        for field in ['word_vectors', 'word_biases', 'vectors_sum_gradients', 'biases_sum_gradients']:
+            self.__dict__.pop(field)
+
         with open(filename, 'wb') as savefile:
             pickle.dump(self.__dict__,
                         savefile,
@@ -205,6 +240,16 @@ class Glove(object):
         with open(filename, 'rb') as savefile:
             instance.__dict__ = pickle.load(savefile)

+        f = h5py.File('{0}.h5'.format(filename), 'r')
+        if 'word_vectors' in f.keys():
+            instance.word_vectors = np.asarray(f['word_vectors'])
+        if 'word_biases' in f.keys():
+            instance.word_biases = np.asarray(f['word_biases'])
+        if 'vectors_sum_gradients' in f.keys():
+            instance.vectors_sum_gradients = np.asarray(f['vectors_sum_gradients'])
+        if 'biases_sum_gradients' in f.keys():
+            instance.biases_sum_gradients = np.asarray(f['biases_sum_gradients'])
+
         return instance

     @classmethod
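
For completeness, a sketched round trip with the patched Corpus, assuming the usual glove-python API (Corpus.fit on a tokenized corpus); the filename and toy corpus below are illustrative:

from glove import Corpus

corpus = Corpus()
corpus.fit([['hello', 'world'], ['glove', 'vectors']], window=10)

# With the patch, save() writes two files: my_corpus.h5 holding the dense
# co-occurrence matrix via h5py, and my_corpus.dict holding the pickled
# dictionary.
corpus.save('my_corpus')

restored = Corpus.load('my_corpus')
print(restored.matrix.shape)
print(len(restored.dictionary))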

akkineniramesh commented

I used Python 3.4 when processing large data with glove-python; pickling works fine.
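
A quick interpreter check (standard-library sys only) to decide whether the workaround above is still needed:

import sys

# The cPickle limitation is fixed in Python 3.3+, so pickling large
# corpora/models directly should work on 3.3 and later.
if sys.version_info >= (3, 3):
    print('pickle should handle large objects directly')
else:
    print('consider an h5py-based save/load workaround')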
