Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Flavio/issue 12 #20

Draft
wants to merge 13 commits into
base: master
Choose a base branch
from
Draft

Flavio/issue 12 #20

wants to merge 13 commits into from

Conversation

f-hafner
Copy link
Collaborator

@f-hafner f-hafner commented Jan 8, 2024

Issue

Fixes #12

Description of changes

Main change: In the actor tagger, the method postag_text is split into two: (1) make_html that generates the html for the GUI, and (2) postag_text_to_table that runs the NLP model does that POS tagging.

Details of the change:

  • postag_text_to_table creates two dataframes that are indexed by the sentence identifier: self.sentence_df with the sentence text; self.entities_df with the start&end index of the POS tagged entities and the two prominence score metrics
  • make_html queries the two dataframes created above, filters on selected prominence score and threshold, and creates the html for display.

This is an imperfect solution:

  1. it removes the re-computation for a given selected story, but still re-computes the model and entities whenever a new story is selected.
  2. some functionality is not run anymore, for instance __update_postagging_metrics. -> do we need to rerun this somewhere?

Smaller changes:

  • use int() for slider and some other things in the OWSNActorAnalysis for compatibility with python>3.9
  • use testdata from packge for the actor analysis widget

Open tasks

  • comments marked with TODO (PR) or TODO NOW
  • the timing is not better than before: for story 2, changing the selected entities takes 0.3 seconds with the new approach but only 0.15 seconds with the old approach. FIXED: avoid unnecessary steps in the for loop. It is still not faster than than the dictionary approach in the old version, but I think the pandas version will scale better than the dictionary version: https://stackoverflow.com/questions/22084338/pandas-dataframe-performance
  • the results are different: for instance, "Mounir" is currently not tagged as a subject in story 2, while it is on the master branch. I think this holds more generally for subject entities, but I need to check.
Includes
  • Code changes
  • Tests
  • Documentation

@f-hafner f-hafner marked this pull request as draft January 8, 2024 13:31
@@ -417,23 +417,34 @@ def make_html(self, text, nouns, subjs, custom, custom_dict, selected_prominence
selected_tags.append(self.pos_tags["custom"])

selected_tags = [tag for taglist in selected_tags for tag in taglist]


metric = prominence_map[selected_prominence_metric]
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this and the following lines use the vectorization in pandas and avoid the repeated subsetting of the dataframe during the for loop that was done before.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Optimisation of tagging feature
1 participant