
Improved quote detection and attribution #382

Open · wants to merge 8 commits into main

Conversation

afriedman412

Description

  • Pairwise, incremental quote detection looks for specific pairs of opening/closing characters and no longer requires an even number of quotation marks to work (see the sketch after this list).
  • Attribution window expanded and adjusted to improve accuracy and prevent some false positives.
  • Code added to prep/standardize text for quote detection.
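
A minimal sketch of the pairing idea (names and pair values are stand-ins for textacy's constants.QUOTATION_MARK_PAIRS, not the PR's exact code):

```python
# Sketch of pairwise, incremental quote-mark matching.
# QUOTE_PAIRS stands in for constants.QUOTATION_MARK_PAIRS: a set of
# (ord(open_char), ord(close_char)) tuples of acceptable combos.
QUOTE_PAIRS = {
    (34, 34),      # " ... "
    (8220, 8221),  # left/right curly double quotes
    (8216, 8217),  # left/right curly single quotes
}

def pair_quote_marks(marks: list[tuple[int, str]]) -> list[tuple[int, int]]:
    """Pair up (position, char) quote marks left to right.

    A mark that never finds an acceptable partner is simply skipped,
    so an odd number of quotation marks no longer breaks detection.
    """
    pairs: list[tuple[int, int]] = []
    last_close = -1
    for i, (pos, char) in enumerate(marks):
        if pos <= last_close:
            continue  # already consumed as the closer of an earlier pair
        for next_pos, next_char in marks[i + 1:]:
            if (ord(char), ord(next_char)) in QUOTE_PAIRS:
                pairs.append((pos, next_pos))
                last_close = next_pos
                break
    return pairs

# e.g. three straight quotes: the stray third mark is simply ignored
assert pair_quote_marks([(0, '"'), (6, '"'), (9, '"')]) == [(0, 6)]
```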

Motivation and Context

This is part of a larger project to create a package combining quote detection and attribution with coreference resolution, which will be used to analyze several thousand newspaper articles.

How Has This Been Tested?

A/B testing with random samples of said articles, plus test creation after major changes.

(New tests added as well.)

Collaborator

@bdewilde bdewilde left a comment

hey @afriedman412 , thanks for your patience on this. i've left comments requesting a handful of minor changes. there's also a consistent formatting issue that makes diffing a bit hard; could you run black over the changed modules, so that the code formatting is standard / consistent with the rest of textacy?
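
For example (the path here is assumed from the files touched in this PR; adjust to whatever actually changed):

```
black src/textacy/extract/triples.py
```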

)
from spacy.tokens import Doc, Span, Token
import regex as re
Collaborator

textacy doesn't currently have regex as a dependency. is it possible to use the stdlib re module here instead?

Author

probably? I've had some issues using re when working with finicky regular expressions, so I kind of use it by default now, but I can test it.
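
For reference, the only regex feature used in this diff is a plain linebreak match, which stdlib re handles identically (a quick sanity check, not a full compatibility claim):

```python
import re  # stdlib; no extra dependency needed

# the quote-token filter below only needs to match a literal linebreak
assert re.match(r"\n", "\n") is not None
assert re.match(r"\n", "no leading linebreak") is None
```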

@@ -9,12 +9,12 @@

import collections
from operator import attrgetter
from typing import Iterable, Mapping, Optional, Pattern
from typing import Iterable, Mapping, Optional, Pattern, Literal
Collaborator

Suggested change
from typing import Iterable, Mapping, Optional, Pattern, Literal
from typing import Iterable, Literal, Mapping, Optional, Pattern

content = doc[qtok_start_idx : qtok_end_idx + 1]
# pairs up quotation-like characters based on acceptable start/end combos
# see constants for more info
qtoks = [tok for tok in doc if tok.is_quote or (re.match(r"\n", tok.text))]
Collaborator

why do we consider tokens with "\n" in them to be quotation-like?

Author

some formatting conventions dictate that when a quote spans paragraphs, each new paragraph opens with a quotation mark even though the previous paragraph's quote is never explicitly closed. in that case the linebreak functions as the closing quotation mark.

i added a test for it, but it's not a great example -- I'll find a better one.
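
for illustration, a made-up example of that convention (not the test case itself):

```python
# Hypothetical multi-paragraph quote in the style described above.
# The first paragraph's quote has no explicit closing mark; the
# linebreak stands in for it, and the next paragraph re-opens.
text = (
    '"We saw strong growth this quarter,\n'
    '"and we expect it to continue," the analyst said.'
)
# Treating the "\n" token as quotation-like lets the pairwise matcher
# close the first span at the linebreak instead of mispairing marks
# across paragraphs.
```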

@@ -27,9 +27,10 @@
nsubjpass,
obj,
pobj,
xcomp,
xcomp
Collaborator

this comma was here for a reason -- black put it there automatically :)

Suggested change
xcomp
xcomp,

@@ -21,6 +21,51 @@
OBJ_DEPS: set[str] = {"attr", "dobj", "dative", "oprd"}
AUX_DEPS: set[str] = {"aux", "auxpass", "neg"}

MIN_QUOTE_LENGTH: int=4
Collaborator

i'm not sure this is a "constant" value, seems more like a parameter with a default value that should go in the direct_quotations extraction function. what do you think?

Author

yeah that makes sense
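
sketched out, the suggested change might look like this (signature details are illustrative, not a final API):

```python
from spacy.tokens import Doc

def direct_quotations(doc: Doc, min_quote_length: int = 4):
    """Extract direct quotations from ``doc``, ignoring candidate
    quote spans shorter than ``min_quote_length`` tokens."""
    ...
```

so callers can tune the threshold per corpus instead of editing a module-level constant.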

src/textacy/extract/triples.py
and q.i > qtok_idx_pairs[-1][1]
):
for q_ in qtoks[n+1:]:
if (ord(q.text), ord(q_.text)) in constants.QUOTATION_MARK_PAIRS:
Collaborator

why do we store -- and compare against -- the ord values instead of just the "raw" text quotation marks?

Author

less room for error

Collaborator

could you elaborate on that?

Author

there are lots of ways something that's supposed to be a raw-text quotation mark gets tokenized incorrectly once you start dealing with encoding/decoding, pulling text from HTML, escape-character issues, the whitespace_ issue above, etc. as the edge cases piled up, I realized the ord value was consistent no matter what, so I decided to use that instead.
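
for illustration (the pair values here are assumed, based on the (ord, ord) check in the diff):

```python
# Code points are unambiguous where raw characters are easy to confuse
# (straight vs. curly quotes, lookalike glyphs, escaping artifacts).
open_mark, close_mark = "\u201c", "\u201d"  # curly double quotes
assert (ord(open_mark), ord(close_mark)) == (8220, 8221)

# acceptable open/close combos then reduce to plain integer pairs,
# e.g. (contents assumed for illustration):
QUOTATION_MARK_PAIRS = {(34, 34), (8216, 8217), (8220, 8221)}
assert (ord('"'), ord('"')) in QUOTATION_MARK_PAIRS
```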

Collaborator

got it. i think at a later point i may revisit some of this logic, under the assumption that the user has dealt with bad text encodings, etc. before attempting quote detection. but it's probably fine for now :)

@@ -305,15 +304,105 @@ def expand_noun(tok: Token) -> list[Token]:
child
for tc in tok_and_conjuncts
for child in tc.children
# TODO: why doesn't compound import from spacy.symbols?
Collaborator

just wondering why this line was deleted? it's a comment, for me! :)

Author

i thought it was my comment lol

Comment on lines 95 to 98
),
)
],
)

Collaborator

looks like your code editor is making some spurious changes (here and elsewhere) that aren't PEP-compliant / black-enforced. we'll want to fix these before any merge.

)
]
)

Collaborator

Suggested change

…x` package, min_quote_length is now a `direct_quotations` parameter (not a constant), added a better example for testing linebreaks that function as closing quotes