Flavio/issue 12 #20

f-hafner · 2024-01-08T13:31:23Z

Issue

Fixes #12

Description of changes

Main change: In the actor tagger, the method postag_text is split into two: (1) make_html, which generates the html for the GUI, and (2) postag_text_to_table, which runs the NLP model that does the POS tagging.

Details of the change:

postag_text_to_table creates two dataframes that are indexed by the sentence identifier: self.sentence_df with the sentence text, and self.entities_df with the start and end index of the POS-tagged entities and the two prominence score metrics.
make_html queries the two dataframes created above, filters on the selected prominence score and threshold, and creates the html for display.
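
A rough sketch of this layout and of the filtering step, assuming made-up column names (`start`, `end`, `tag`, `prominence_a`, `prominence_b`) and toy data; the actual names in actoranalysis.py may differ:

```python
import pandas as pd

# Hypothetical sentence table, indexed by sentence identifier.
sentence_df = pd.DataFrame(
    {"sentence": ["Mounir walked home.", "He met a friend."]},
    index=pd.Index([0, 1], name="sentence_id"),
)

# Hypothetical entity table: one row per tagged entity, sharing the same index.
# prominence_a / prominence_b stand in for the two prominence score metrics.
entities_df = pd.DataFrame(
    {
        "start": [0, 0, 9],
        "end": [6, 2, 15],
        "tag": ["SUBJ", "SUBJ", "NOUN"],
        "prominence_a": [0.8, 0.8, 0.2],
        "prominence_b": [0.5, 0.5, 0.1],
    },
    index=pd.Index([0, 1, 1], name="sentence_id"),
)


def select_entities(entities, metric, threshold, selected_tags):
    """Keep entities with a selected tag whose chosen metric reaches the threshold."""
    keep = (entities[metric] >= threshold) & entities["tag"].isin(selected_tags)
    return entities[keep]


selected = select_entities(entities_df, "prominence_a", 0.5, ["SUBJ", "NOUN"])

# Attach the sentence text via the shared sentence_id index.
merged = selected.join(sentence_df, how="left")
```

Because tagging and display are separate, changing the metric or threshold only re-runs `select_entities` and the join, not the NLP model.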

This is an imperfect solution:

it removes the re-computation for a given selected story, but still re-computes the model and entities whenever a new story is selected.
some functionality is not run anymore, for instance __update_postagging_metrics. -> do we need to rerun this somewhere?

Smaller changes:

use int() for the slider and some other things in OWSNActorAnalysis for compatibility with Python > 3.9
use test data from the package for the actor analysis widget

Open tasks

comments marked with TODO (PR) or TODO NOW
the timing is not better than before: for story 2, changing the selected entities takes 0.3 seconds with the new approach but only 0.15 seconds with the old approach. FIXED: avoid unnecessary steps in the for loop. It is still not faster than the dictionary approach in the old version, but I think the pandas version will scale better than the dictionary version: https://stackoverflow.com/questions/22084338/pandas-dataframe-performance
the results are different: for instance, "Mounir" is currently not tagged as a subject in story 2, while it is on the master branch. I think this holds more generally for subject entities, but I need to check.

Includes

Code changes
Tests
Documentation

f-hafner · 2024-01-08T16:21:59Z

orangecontrib/storynavigation/modules/actoranalysis.py

@@ -417,23 +417,34 @@ def make_html(self, text, nouns, subjs, custom, custom_dict, selected_prominence
            selected_tags.append(self.pos_tags["custom"])

        selected_tags = [tag for taglist in selected_tags for tag in taglist]
-
+
+        metric = prominence_map[selected_prominence_metric]


This and the following lines use pandas vectorization and avoid the repeated subsetting of the dataframe inside the for loop that was done before.
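
To illustrate the difference, here is a minimal sketch with toy data; the column names (`sentence_id`, `text`, `score`) and the threshold are placeholders, not the ones used in make_html:

```python
import pandas as pd

# Toy entity table; names and values are illustrative only.
entities_df = pd.DataFrame(
    {
        "sentence_id": [0, 0, 1, 2],
        "text": ["Mounir", "home", "friend", "street"],
        "score": [0.9, 0.1, 0.6, 0.3],
    }
)
threshold = 0.5

# Repeated subsetting: the whole frame is scanned again for every sentence.
slow = {}
for sid in entities_df["sentence_id"].unique():
    rows = entities_df[
        (entities_df["sentence_id"] == sid) & (entities_df["score"] >= threshold)
    ]
    if not rows.empty:
        slow[int(sid)] = rows["text"].tolist()

# Vectorized filter applied once, then a single pass over the groups.
filtered = entities_df[entities_df["score"] >= threshold]
fast = {
    int(sid): group["text"].tolist()
    for sid, group in filtered.groupby("sentence_id")
}
```

Both variants select the same entities per sentence; the second one applies the threshold filter only once, outside the loop.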

f-hafner added 9 commits January 4, 2024 13:48

do not download spacy model for each run

bde8a18

expose make_html instead of postag_text to the widget

f0791af

start collecting POS data in a table

d7697f1

add prominence scores to POS tag dfs

76d640b

finish separating making html and POS tagging

47a0b6d

slider: only ints; change data for testing widget

44be6be

add logging and another breakpoint

8d421f0

add docstrings to new functions

8b44579

remove old function

4f2f4a9

f-hafner marked this pull request as draft January 8, 2024 13:31

f-hafner added 4 commits January 8, 2024 14:36

re-initialize sentences_df and entities_df in selection_changed

db700f9

fix bug with non-actor entities

5fbd321

move filter out of for loop

01d3c78

iterate over grouped df

76f0175
