HardBalance: Balance strategy based on hard sample mining using semantic similarity #1443

rohitgarud · 2023-05-19T03:52:24Z

Here I am presenting the HardBalance balancing strategy. Because the relevant and irrelevant classes are imbalanced, we use oversampling or undersampling to balance them. HardBalance is an undersampling strategy in which the irrelevant class is undersampled to match the size of the relevant class. The undersampling works by finding, for each relevant record, the irrelevant record that is most semantically similar to it. The name HardBalance comes from the fact that these 'hard' irrelevant records are difficult for the classifier to tell apart from relevant records because they are so similar, yet training on them lets the classifier learn more nuanced differences between the relevant and irrelevant classes. This concept is called hard sample mining.
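To make the selection step concrete, here is a minimal sketch in plain Python. It is not ASReview code and has not been run in any simulation. The names `cosine_similarity` and `hard_balance` are hypothetical, cosine similarity over precomputed embeddings stands in for "semantic similarity", and matching greedily without reusing an irrelevant record (so that both classes end up the same size) is an assumption of the sketch rather than a settled design.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def hard_balance(embeddings, labels):
    """Return indices of a balanced training set (hypothetical helper).

    embeddings: list of feature vectors, one per labelled record.
    labels: list of 1 (relevant) or 0 (irrelevant), same length.

    All relevant records are kept. For each relevant record, the most
    similar irrelevant record that has not been picked yet is added.
    """
    relevant = [i for i, label in enumerate(labels) if label == 1]
    available = {i for i, label in enumerate(labels) if label == 0}

    chosen = []
    for r in relevant:
        if not available:
            # Fewer irrelevant than relevant records: nothing left to match.
            break
        # Iterating over sorted indices makes ties resolve to the lowest
        # index, so the result is deterministic.
        best = max(
            sorted(available),
            key=lambda i: cosine_similarity(embeddings[r], embeddings[i]),
        )
        available.remove(best)
        chosen.append(best)

    return relevant + chosen
```

With two relevant and four irrelevant records, the function keeps both relevant records and the two irrelevant records closest to them. Because the greedy matching depends on the order in which relevant records are visited, a different order can yield a different selection.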

This is just an idea and has not yet been tested. I am hoping to get some comments from the ASReview community.

J535D165 (Maintainer) · 2023-05-20T10:47:42Z

Awesome idea! I'm wondering what @qubixes thinks about it. He made these balancers.

5 replies

rohitgarud May 20, 2023
Author

Thank you. I would love to know what @qubixes thinks about this. I think this balancer will work better with neural network classifiers than with classical ones, but this is just a hunch and I have not performed any simulations yet.

qubixes May 22, 2023

@rohitgarud Even though I made these balancers, I'm not sure I have that much to add. I honestly never really understood why the balancer that I implemented had a pretty large impact with the Naive Bayes classifier (and wasn't much of a factor in any of the other classifiers if I recall correctly).

It's definitely an interesting idea for a balancer. One thing you might want to consider is that, in my experience with running simulations, the classification process is already quite hard for most/all datasets that I have seen. It is quite probable that the irrelevant abstract in each matched pair came close to being a relevant abstract (the reviewer only barely decided it wasn't worth the effort). If that is the case, the classifiers are trying to find the difference between abstracts that are relevant and abstracts that are semi-relevant. This will have a big pay-off if the classifier can distinguish between them, but it can backfire if the classifier cannot really understand the difference at all.

Obviously the above is all theory crafting, and I'm afraid we'll only know whether it works well if a simulation study is done with multiple models/balancers.

rohitgarud May 22, 2023
Author

@qubixes Thank you for your response. I agree with you. The classification is indeed sometimes between relevant and semi-relevant records, which makes it hard for the classifier to find the relevant ones. The features of relevant and irrelevant records are not well separated in the higher dimensional space because the differences between relevant and irrelevant records are very subtle and subjective.

I have an idea: something like evolving features rather than constant features extracted at the start. All the feature vectors should evolve based on the already labelled records, possibly separating the relevant from the irrelevant in the higher dimensional feature space and making it increasingly easier (at least that's the hope) to classify the records as we label more of them. I have laid out an initial pipeline but need someone experienced to discuss it with. The idea is based on neural networks.

qubixes May 30, 2023

If you're changing the feature vectors depending on the labels, that might be more of a meta-classifier than it is part of the feature extraction pipeline though. If that is your goal, then it might be better/easier to implement it as a classifier, and just use the balancer that passes everything through.

rohitgarud May 30, 2023
Author

Yes @qubixes, thank you for your suggestions. I did something like this. The main issue with my current pipeline is reproducibility, which is the primary requirement for ASReview.

Rensvandeschoot (Maintainer) · 2023-05-27T08:02:39Z

This is an excellent idea! A simulation study comparing the different balancers for the synergy data might be a nice paper to write!

1 reply

rohitgarud May 27, 2023
Author

Thank you, definitely. I am also curious to know how effective this approach is compared to other balancers.
