Display remaining tasks overview #724

Open
jum-s opened this issue Feb 1, 2024 · 3 comments
Contributor

jum-s commented Feb 1, 2024

In today's codebase:

  • there is only one task type (deduplicate) stored in the db
  • each suspectUri and suggestionUri pair is unique
  • there are two kinds of deduplicate tasks, based on their entity type (human and work)
  • all tasks with human entities are generated automatically, which currently creates a lot of tasks
  • all tasks with work entities are based on user feedback; those have a reporter: userId

The objective is to give users access to a dashboard of the main tasks to do, i.e. tasks grouped by user interest (categories).

Proposed dashboard categories: [edited to integrate max comment below]

  • merge : entities could be of any type.
    • works : already developed
    • humans: should first return all user feedback tasks (aka "reporter tasks"), then autogenerated tasks
  • delete : entities could be of any type
    • works
    • humans

Proposed implementation:

  • add a delete task type, alongside deduplicate (which could be renamed to merge (?))
  • rename by-entities-type -> by-type: the endpoint would take two arguments, type and entities-type, making it possible to query action=by-type?type=deduplicate&entities-type=humans, which would return only tasks with a reporter
  • querying autogenerated tasks would be kept as is (with the by-score endpoint)
  • since an entity could have both a merge and a delete task, the by-suspect-uri endpoint would need a type argument
  • rename suspectUri -> uri (as the term "suspect" does not qualify anything useful)
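
To make the proposed by-type and by-suspect-uri behaviour concrete, here is a minimal sketch in Python (not the project's language, just a compact way to show the filtering logic). The `entitiesType` field, the `"human"`/`"work"` values, the optional `suggestionUri` on delete tasks and both function names are assumptions for illustration, not the existing implementation.

```python
from typing import Optional, TypedDict


class Task(TypedDict):
    # Illustrative shape only: the issue mentions type, reporter and the
    # two URI fields; entitiesType is a hypothetical field name.
    type: str                     # "deduplicate" (possibly renamed "merge") or "delete"
    entitiesType: str             # assumed values: "human" or "work"
    suspectUri: str
    suggestionUri: Optional[str]  # assumed to be absent (None) on delete tasks
    reporter: Optional[str]       # user id, None for autogenerated tasks


def tasks_by_type(tasks: list[Task], task_type: str, entities_type: str) -> list[Task]:
    """Return only reporter tasks matching both the task type and the entities type."""
    return [
        task for task in tasks
        if task["type"] == task_type
        and task["entitiesType"] == entities_type
        and task["reporter"] is not None
    ]


def tasks_by_suspect_uri(tasks: list[Task], uri: str, task_type: str) -> list[Task]:
    """An entity may have both a merge and a delete task, hence the type argument."""
    return [
        task for task in tasks
        if task["suspectUri"] == uri and task["type"] == task_type
    ]
```

With this shape, autogenerated tasks never show up in by-type results and would still be reached through the by-score endpoint.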
jum-s added the tasks label Feb 1, 2024
Member

maxlath commented Feb 24, 2024

What do you think of unifying user-generated and robot-generated merge tasks? The current human deduplication process would just be one provider of merge tasks among others, specialized in humans, while the rest would be entity-type agnostic and reporter (user or bot) agnostic(?). In that direction, and working from memory, I think it might make sense to have that author deduplication process create fewer tasks: it automerges what it can and creates tasks when it's not quite sure, but doesn't create a task for every homonym returned by Elasticsearch(?), as that information is of lower quality than a user reporting that A and B should be merged.

Contributor Author

jum-s commented Feb 24, 2024

I had in mind a hard split (different categories) to give priority to reporter tasks, as a real human is in pain seeing a mismatch somewhere. But yes, it could be done in a softer way, i.e. by sorting human tasks with reporter tasks first, then the others.
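
The softer option could look roughly like the sketch below, written in Python for brevity. The `score` field name and the `sort_reporter_first` helper are hypothetical, and a task without a score is assumed to count as 0.

```python
def sort_reporter_first(tasks):
    """Order tasks so that reporter tasks come first, then autogenerated ones.

    Within each group, tasks are sorted by descending score.
    """
    return sorted(
        tasks,
        # False sorts before True, so tasks that have a reporter come first
        key=lambda task: (task.get("reporter") is None, -task.get("score", 0)),
    )
```

A single sort key keeps one list per entity type while still surfacing user reports at the top.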

Reducing the number of autogenerated tasks seems like a good idea. The easiest way would be to introduce a hardcoded score threshold (don't create a task if its score is lower than 100).

Contributor Author

jum-s commented Feb 25, 2024

Here is a query to find the 10000th task sorted by descending score:

curljson "http://[couchdb]/tasks-prod/_design/tasks/_view/byScore" | jq ".rows|sort_by(.key[2])|reverse|.[9999]"

Rough idea of the results:
1 000th task score: 674.73
10 000th task score: 395.09
15 000th task score: 352.46
25 000th task score: 299.79
100 000th task score: 182.11

This could allow us to set a threshold of 350 and still create ~15k tasks (against ~700k today). The threshold could be a config setting, so that tasks can be recreated with a different value without having to push a commit.
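
As a rough sketch of how a configurable threshold might be derived and applied (Python, for illustration only): `score_at_rank`, `should_create_task` and the `min_score` setting are hypothetical names, not existing code or config keys.

```python
def score_at_rank(scores, rank):
    """Return the score of the rank-th task (1-based) by descending score."""
    ordered = sorted(scores, reverse=True)
    return ordered[rank - 1]


def should_create_task(score, min_score):
    """Create a task only when its score reaches the configured threshold."""
    return score >= min_score


# Hypothetical config value; 350 is the threshold suggested above
min_score = 350

# Sample scores taken from the rough results listed above
sample_scores = [674.73, 395.09, 352.46, 299.79, 182.11]
kept = [score for score in sample_scores if should_create_task(score, min_score)]
```

Reading `min_score` from config rather than hardcoding it means the cut-off can be tuned against the score distribution without a code change.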
