Feature: Improve the calculation of similarity scores between answers and correct solutions #2

Open
blmage opened this issue May 15, 2020 · 7 comments
Assignees
Labels
enhancement New feature or request question Further information is requested

Comments

@blmage
Owner

blmage commented May 15, 2020

Currently, the similarity between answers and correct solutions is computed as-is, with only Unicode normalization being applied. Therefore, accented letters and their unaccented counterparts are considered completely different characters.

While this is desirable when the user enters a "perfect" answer with regard to accents, the resulting order can look fairly arbitrary when accents are missing or wrong.

One solution would be to compute two similarity scores, one with light normalization and one with more aggressive normalization, and then average them in a consistent way.
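A minimal sketch of that idea, in Python for illustration only (the extension itself is not written in Python). It assumes `difflib.SequenceMatcher` as a stand-in for whatever similarity metric is actually used; `strip_accents`, `blended_similarity`, and the `strict_weight` parameter are hypothetical names.

```python
import unicodedata
from difflib import SequenceMatcher


def strip_accents(text: str) -> str:
    """Decompose to NFD, then drop combining marks (e.g. "é" -> "e")."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def similarity(a: str, b: str) -> float:
    """Ratio in [0, 1]. Assumption: stand-in for the real metric."""
    return SequenceMatcher(None, a, b).ratio()


def blended_similarity(answer: str, solution: str, strict_weight: float = 0.5) -> float:
    """Weighted average of an accent-sensitive and an accent-insensitive score."""
    # Strict score: only Unicode normalization, accents still count.
    strict = similarity(
        unicodedata.normalize("NFC", answer),
        unicodedata.normalize("NFC", solution),
    )
    # Relaxed score: accents are removed on both sides before comparing.
    relaxed = similarity(strip_accents(answer), strip_accents(solution))
    return strict_weight * strict + (1.0 - strict_weight) * relaxed
```

With this blend, an answer that only lacks an accent scores higher than under the strict comparison alone, while an exact match still scores 1.0.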

@blmage blmage added the enhancement New feature or request label May 15, 2020
@blmage blmage self-assigned this May 15, 2020
@blmage blmage changed the title Improve the calculation of similarity scores between answers and correct solutions Feature: Improve the calculation of similarity scores between answers and correct solutions Jun 6, 2020
@blmage blmage added this to the 2.4.0 milestone Jun 14, 2020
@blmage
Owner Author

blmage commented Jun 21, 2020

It would be better to let the user choose what is significant and what is not, as an option (see #25). This could include:

  • case,
  • accents,
  • punctuation,
  • spaces,
  • word order (using an adapted version of the diff package, or probably rather the SentenceSimilarity package - benchmark this on big lists of solutions to check whether this is a no-go).
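As a rough Python illustration of how such options might drive a normalization step before comparison: the `Significance` class and `normalize` function are hypothetical, and sorting the words is only a naive stand-in for ignoring word order, not the diff or SentenceSimilarity approach mentioned above.

```python
import unicodedata
from dataclasses import dataclass


@dataclass(frozen=True)
class Significance:
    """Hypothetical user options: True means the aspect matters."""
    case: bool = True
    accents: bool = True
    punctuation: bool = True
    spaces: bool = True
    word_order: bool = True


def normalize(text: str, opts: Significance) -> str:
    """Erase every aspect the user marked as not significant."""
    text = unicodedata.normalize("NFC", text)
    if not opts.case:
        text = text.casefold()
    if not opts.accents:
        decomposed = unicodedata.normalize("NFD", text)
        text = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    if not opts.punctuation:
        # Unicode general categories starting with "P" are punctuation.
        text = "".join(ch for ch in text if not unicodedata.category(ch).startswith("P"))
    if not opts.word_order:
        # Naive stand-in: compare sorted words instead of the original order.
        text = " ".join(sorted(text.split()))
    if not opts.spaces:
        # Done last, because the word-order step needs word boundaries.
        text = "".join(text.split())
    return text
```

Both the answer and each solution would go through the same `normalize` call before the similarity score is computed, so the cost on big lists of solutions is one extra pass per string.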

@tobiornottobi

In my experience the order is completely off. There have been absurd sentences at the top (without any noticeable similarity) when the alphabetical sort gave me much more similar answers.

@blmage blmage added the question Further information is requested label Sep 5, 2020
@blmage
Owner Author

blmage commented Sep 5, 2020

@tobiornottobi Could you please send one or two screenshots with examples of such behavior?

I'm only aware of this happening with missing or different diacritics, but I'll increase the priority of this issue if this happens to be more widespread.

Thanks!

@tobiornottobi

@blmage Yes, I can. One thing I have to add: I wasn't sure whether the ".* sort ↓" button shows the option it would switch to or the one that is currently active. The results weren't sorted alphabetically, so maybe it's actually the alphabetical sort that is broken for me.
I haven't gotten absurd suggestions this time, because the accepted answers are all reasonable and similar, but I still don't understand the order.
This is neither sorted by similarity nor alphabetically. Unless only the first word is taken into account.
image
This makes sense similarity-wise:
image

I'll try to remember to take a screenshot in the future.

@blmage
Owner Author

blmage commented Sep 8, 2020

@tobiornottobi Thanks for the screenshots!

The UI reflects the current state, so when "Alphabetical sort ↓" is displayed, solutions are/should be sorted alphabetically and in descending order.

The order on the first screenshot seems correct, apart from the two solutions at the top, but I couldn't reproduce the same result in isolation (when testing the comparison algorithm, "ä" comes before "b", as expected).

Could you point me to a skill in the Norwegian tree that uses a lot of accented words? (I'll try to reproduce it from there instead)

@tobiornottobi

@blmage Thank you. :)
The screenshot was from the Swedish tree. I can't search at the moment unfortunately.

@blmage
Owner Author

blmage commented Oct 26, 2020

My bad! In the case of Swedish then, this seems to be the expected behavior:

In addition to the basic twenty-six letters, A–Z, the Swedish alphabet includes Å, Ä, and Ö at the end. They are distinct letters in Swedish, and are sorted after Z as shown above.

Wikipedia
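To make the difference concrete, here is a small Python sketch (not the extension's code) contrasting an accent-folding sort, where "ä" lands before "b", with a sort that follows the Swedish alphabet. `folded_key`, `swedish_key`, and the sample words are made up for illustration; a real implementation would more likely rely on a locale-aware collator.

```python
import unicodedata

# Swedish alphabet order: Å, Ä, Ö come after Z.
SWEDISH_ALPHABET = "abcdefghijklmnopqrstuvwxyzåäö"
_RANK = {ch: i for i, ch in enumerate(SWEDISH_ALPHABET)}


def folded_key(text: str) -> str:
    """Accent-insensitive key: "ä" is compared as "a"."""
    decomposed = unicodedata.normalize("NFD", text.lower())
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def swedish_key(text: str) -> tuple:
    """Rank each character by its position in the Swedish alphabet."""
    text = unicodedata.normalize("NFC", text).lower()
    # Simplification: anything outside the alphabet sorts before all letters.
    return tuple(_RANK.get(ch, -1) for ch in text)


words = ["öl", "zebra", "äpple", "bil", "år"]
print(sorted(words, key=folded_key))   # ['äpple', 'år', 'bil', 'öl', 'zebra']
print(sorted(words, key=swedish_key))  # ['bil', 'zebra', 'år', 'äpple', 'öl']
```

The first ordering matches the isolated test described earlier in the thread; the second matches what a Swedish speaker would expect from an alphabetical sort.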

@blmage blmage removed this from the 3.1.0 milestone Nov 9, 2020