Why does "force_parallel(enable=True)" not work? #206

Open
kongbo96 opened this issue Dec 30, 2022 · 2 comments
Labels: bug (Something isn't working)

Comments


kongbo96 commented Dec 30, 2022

In this code, dask works:

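import swifter  # registers the .swifter accessor on pandas objects

def has_inter(x_cat_set, now_set):
    # Despite the name, this returns True when the two sets share NO element.
    inter = x_cat_set.intersection(now_set)
    return len(inter) == 0

def get_negs2(now_set, si_doc, num, df3):
    # Collect the s_id of every df3 row whose s_cat is disjoint from now_set.
    negs_set = set(df3[df3.loc[:, 's_cat'].swifter.progress_bar(False).apply(has_inter, args=(now_set,))].s_id)
    negs = list(negs_set)
    return negs

neg_dict = df2.loc[:, 's_cat'].swifter.force_parallel(enable=True).apply(get_negs2, args=(si_doc, n_neg, df3))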

This is the result:
(image not preserved)

In this code, dask doesn't work:


def get_negs(line, si_doc, num, df3):
    # line is one full row of df2, because apply is called with axis=1.
    now_set = line['s_cat']
    negs_set = set(df3[df3.loc[:, 's_cat'].swifter.progress_bar(False).apply(has_inter, args=(now_set,))].s_id)
    negs = list(negs_set)
    return negs

neg_dict = df2.swifter.force_parallel(enable=True).allow_dask_on_strings(enable=True).apply(get_negs, args=(si_doc, n_neg, df3), axis=1)

This is the result:
(image not preserved)

Why do the two versions give different results? I want to use the second approach because, in other cases, I need data from two columns.

jmcarpenter2 (Owner) commented

Hmmm, this is strange behavior. It must be trying to use dask and failing to validate the apply on the sample dataset. Is there any chance you could provide an example (or fake) dataset for me to run this code and try to debug the core of the issue?
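
To make the hypothesis above concrete, here is a minimal sketch of a "validate on a sample, then fall back" pattern. It is not swifter's actual implementation; `apply_with_fallback`, `safe_apply`, and `broken_fast_apply` are hypothetical names used only for illustration, and the rows are made-up data.

```python
def apply_with_fallback(rows, func, fast_apply, safe_apply, sample_size=3):
    """Use fast_apply only if it reproduces safe_apply on a small sample."""
    sample = rows[:sample_size]
    try:
        validated = fast_apply(sample, func) == safe_apply(sample, func)
    except Exception:
        # Any failure on the sample counts as a failed validation.
        validated = False
    chosen = fast_apply if validated else safe_apply
    return chosen(rows, func), validated


def safe_apply(rows, func):
    # Reference path: plain sequential evaluation.
    return [func(row) for row in rows]


def broken_fast_apply(rows, func):
    # Hypothetical stand-in for a parallel backend that cannot handle the workload.
    raise TypeError("backend cannot process these rows")


# Made-up rows, each carrying an s_cat set as in the issue.
rows = [{"s_cat": {1, 2}}, {"s_cat": {3}}, {"s_cat": {2, 4}}]
result, used_fast = apply_with_fallback(
    rows, lambda row: len(row["s_cat"]), broken_fast_apply, safe_apply
)
print(result, used_fast)  # [2, 1, 2] False
```

Under this pattern, a silent fallback would look exactly like "force_parallel did nothing": the result is still correct, but it was computed sequentially.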

jmcarpenter2 added the bug label on Mar 24, 2023
kongbo96 (Author) commented

Sorry, it's been too long and I can no longer find the data. The only difference between the two snippets is that in the second one now_set differs from row to row, while the first has only a single now_set.
The goal is to find, for each row of df2, the rows of df3 whose s_cat does not intersect that row's s_cat.
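
Since the original data is gone, the goal described above can be sketched on made-up data in plain Python. Everything here is an assumption for illustration: `disjoint_ids` is a hypothetical helper, and `df2_rows` / `df3_rows` are invented stand-ins for df2 and df3 with only the s_id and s_cat fields.

```python
def disjoint_ids(now_set, candidates):
    """Return the s_id of every candidate whose s_cat shares nothing with now_set."""
    return [s_id for s_id, s_cat in candidates if now_set.isdisjoint(s_cat)]


# Invented stand-ins for df2 (rows to process) and df3 (pool to search).
df2_rows = [{"s_id": "a", "s_cat": {1, 2}}, {"s_id": "b", "s_cat": {3}}]
df3_rows = [("x", {1}), ("y", {3, 4}), ("z", {5})]

# One list of non-intersecting df3 ids per df2 row.
neg_dict = {row["s_id"]: disjoint_ids(row["s_cat"], df3_rows) for row in df2_rows}
print(neg_dict)  # {'a': ['y', 'z'], 'b': ['x', 'z']}
```

Unlike the snippets in the issue, which build a set first, this version keeps the ids in their df3 order. A small dataset of this shape might also serve as a starting point for reproducing the behavior.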
