Multiprocessing Error and 'generator raised StopIteration' error #1182

Open
AbhayGaur19 opened this issue Jan 19, 2024 · 1 comment

@AbhayGaur19

I have tried many solutions but am still getting this error:

StopIteration Traceback (most recent call last)
File ~/.local/lib/python3.10/site-packages/dedupe/api.py:259, in DedupeMatching.pairs(self, data)
257 self.fingerprinter.index_all(data)
--> 259 id_type = core.sqlite_id_type(data)
261 # Blocking and pair generation are typically the first memory
262 # bottlenecks, so we'll use sqlite3 to avoid doing them in memory

File ~/.local/lib/python3.10/site-packages/dedupe/core.py:335, in sqlite_id_type(data)
334 def sqlite_id_type(data: Data) -> Literal["text", "integer"]:
--> 335 example = next(iter(data.keys()))
336 python_type = type(example)

StopIteration:

The above exception was the direct cause of the following exception:

RuntimeError Traceback (most recent call last)
File ~/.local/lib/python3.10/site-packages/dedupe/api.py:125, in IntegralMatching.score(self, pairs)
124 try:
--> 125 matches = core.scoreDuplicates(
126 pairs, self.data_model.distances, self.classifier, self.num_cores
127 )
128 except RuntimeError:

File ~/.local/lib/python3.10/site-packages/dedupe/core.py:124, in scoreDuplicates(record_pairs, featurizer, classifier, num_cores)
122 from .backport import Process, Queue # type: ignore
--> 124 first, record_pairs = peek(record_pairs)
125 if first is None:

File ~/.local/lib/python3.10/site-packages/dedupe/core.py:278, in peek(seq)
277 try:
--> 278 first = next(seq)
279 except TypeError as e:

RuntimeError: generator raised StopIteration

During handling of the above exception, another exception occurred:

RuntimeError Traceback (most recent call last)
Cell In[14], line 117
114 deduper.write_settings(sf)
116 print('clustering...')
--> 117 clustered_dupes = deduper.partition(data_d, 0.7)
119 print('# duplicate sets', len(clustered_dupes))
121 cluster_membership = {}

File ~/.local/lib/python3.10/site-packages/dedupe/api.py:200, in DedupeMatching.partition(self, data, threshold)
162 """
163 Identifies records that all refer to the same entity, returns
164 tuples containing a sequence of record ids and corresponding
(...)
197 ]
198 """
199 pairs = self.pairs(data)
--> 200 pair_scores = self.score(pairs)
201 clusters = self.cluster(pair_scores, threshold)
202 clusters = self._add_singletons(data.keys(), clusters)

File ~/.local/lib/python3.10/site-packages/dedupe/api.py:129, in IntegralMatching.score(self, pairs)
125 matches = core.scoreDuplicates(
126 pairs, self.data_model.distances, self.classifier, self.num_cores
127 )
128 except RuntimeError:
--> 129 raise RuntimeError(
130 """
131 You need to either turn off multiprocessing or protect
132 the calls to the Dedupe methods with a
133 if __name__ == '__main__' in your main module, see
134 https://docs.python.org/3/library/multiprocessing.html#the-spawn-and-forkserver-start-methods"""
135 )
137 return matches

RuntimeError:
You need to either turn off multiprocessing or protect
the calls to the Dedupe methods with a
if __name__ == '__main__' in your main module, see
https://docs.python.org/3/library/multiprocessing.html#the-spawn-and-forkserver-start-methods
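
Reading the traceback from the top, the first exception is a `StopIteration` from `next(iter(data.keys()))`, which is what happens when the dict passed in has no keys. Because that call runs inside a generator, Python 3.7+ turns it into `RuntimeError: generator raised StopIteration` (PEP 479), and `score` then catches that `RuntimeError` and re-raises it with the multiprocessing message. So one possibility, not confirmed here, is that `data_d` is empty when `partition` is called. Below is a minimal sketch of that mechanism; `pairs` and `run` are hypothetical stand-ins, not dedupe's code, and only the `next(iter(data.keys()))` line is taken from the traceback.

```python
def sqlite_id_type(data):
    # Same expression as dedupe/core.py line 335 in the traceback:
    # raises StopIteration when `data` has no keys.
    example = next(iter(data.keys()))
    return type(example)


def pairs(data):
    # Hypothetical stand-in for a generator that calls sqlite_id_type
    # before yielding anything.
    sqlite_id_type(data)
    yield from ()


def run(data):
    # Hypothetical helper: report what the consumer of the generator sees.
    try:
        next(pairs(data))
    except RuntimeError as e:
        # PEP 479: StopIteration raised inside a generator body
        # surfaces to the caller as RuntimeError.
        return str(e)
    except StopIteration:
        # The generator simply finished without yielding.
        return "no pairs"


print(run({}))                      # empty input
print(run({1: {"name": "a"}}))      # non-empty input
```

If that is the cause, checking `len(data_d)` right before the `partition` call would confirm it, and the multiprocessing message would be a side effect rather than the real problem.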

@fgregg
Contributor

fgregg commented Jan 23, 2024

Can you provide a reproducible example?
