Recommendations for handling large datasets #83

Open
leeanapeters opened this issue Jun 15, 2023 · 1 comment

Comments

@leeanapeters

Hi, thank you for creating this great tool!

I was wondering if you could offer some guidance on handling large datasets in the unsupervised workflow. In particular, the clustering and KNN classification steps seem to be prohibitively memory-expensive.

I think that downsampling is interfering with the classification accuracy, so I would like to use all the data if possible.

Thanks so much for your help!

Leeana

@sophiachen1

Hi, I am also using this tool with large datasets (~150k sequences). The KNN classification produces an empty knn_seq.pkl and fails with the error below. Have you ever encountered this error? I suspect it may be an out-of-memory issue in the KNN step.


ValueError Traceback (most recent call last)
/tmp/ipykernel_15992/968723552.py in <module>
----> 1 DTCRU.KNN_Sequence_Classifier(metrics=['AUC'],plot_metrics=True,n_jobs=-1, Load_Prev_Data=True,by_class=True)

~/deeptcr/lib/python3.7/site-packages/DeepTCR/DeepTCR.py in KNN_Sequence_Classifier(self, folds, k_values, rep, plot_metrics, by_class, plot_type, metrics, n_jobs, Load_Prev_Data)
2429 if plot_metrics is True:
2430 if by_class is True:
-> 2431 sns.catplot(data=df_out, x='Metric', y='Value', hue='Classes', kind=plot_type)
2432 else:
2433 sns.catplot(data=df_out, x='Metric', y='Value', kind=plot_type)

~/deeptcr/lib/python3.7/site-packages/seaborn/_decorators.py in inner_f(*args, **kwargs)
44 )
45 kwargs.update({k: arg for k, arg in zip(sig.parameters, args)})
---> 46 return f(**kwargs)
47 return inner_f
48

~/deeptcr/lib/python3.7/site-packages/seaborn/categorical.py in catplot(x, y, hue, data, row, col, col_wrap, estimator, ci, n_boot, units, seed, order, hue_order, row_order, col_order, kind, height, aspect, orient, color, palette, legend, legend_out, sharex, sharey, margin_titles, facet_kws, **kwargs)
3801 # so we need to define palette to get default behavior for the
3802 # categorical functions
-> 3803 p.establish_colors(color, palette, 1)
3804 if kind != "point" or hue is not None:
3805 palette = p.colors

~/deeptcr/lib/python3.7/site-packages/seaborn/categorical.py in establish_colors(self, color, palette, saturation)
317 # Determine the gray color to use for the lines framing the plot
318 light_vals = [colorsys.rgb_to_hls(*c)[1] for c in rgb_colors]
--> 319 lum = min(light_vals) * .6
320 gray = mpl.colors.rgb2hex((lum, lum, lum))
321

ValueError: min() arg is an empty sequence

