
add coordination ruler #13337

Draft
india-kerle wants to merge 13 commits into master

Conversation

@india-kerle commented on Feb 19, 2024

Description

This PR adds two files:

  • spacy/pipeline/coordinationruler.py: This file contains 3 simple coordination splitting rules and a coordination_splitter factory that lets users add the component as a pipe, either with the default splitting rules or with their own.
  • spacy/tests/pipeline/test_coordinationruler.py: This file contains tests associated with each method of the CoordinationSplitter class.

It does NOT include any documentation changes, as these will be added once the PR is closer to final.
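
For reviewers skimming the description, a minimal usage sketch of what this enables; the registered factory name ("coordination_splitter") and the way custom rules would be supplied are assumptions based on the description above, not taken from the diff:

```python
import spacy

# Any pipeline that produces a dependency parse; en_core_web_sm is just an example.
nlp = spacy.load("en_core_web_sm")

# Add the component with its default splitting rules (factory name assumed).
splitter = nlp.add_pipe("coordination_splitter")

doc = nlp("I have experience with data analysis and visualisation.")
# The component would apply its splitting rules to the doc; users who need
# different behaviour would register their own rules through the component's API.
```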

A few questions:

  • I've expanded the initial splitting rules very slightly to make them more generalisable to full sentences rather than the original skill spans. Should I add additional generalisable splitting rules? There is also a very specific skill-splitting function, i.e. one where the token skill must be at the end of the phrase.
  • I made this a factory as opposed to a function component because I thought it would be nice for users to be able to add their own custom rules - thoughts?

Checklist

  • I confirm that I have the right to submit this contribution under the project's MIT license.
  • I ran the tests, and all new and existing tests passed.
  • My changes don't require a change to the documentation, or if they do, I've added all required information.

@india-kerle marked this pull request as ready for review on February 19, 2024, 12:49
@honnibal (Member) commented:

Thanks! Really excited to have something like this in the library.

I've expanded the initial splitting rules very slightly to make them more generalisable to full sentences rather than the original skill spans. Should I add additional generalisable splitting rules? There is also a very specific skill-splitting function, i.e. one where the token skill must be at the end of the phrase.

I think one construction that will be especially useful to people is coordination of modifiers in noun phrases. This could be coordination of adjectives, or nouns themselves. Section 2.2 of this thesis has a nice background on one type of construction that will be important to think about, compound nouns: https://www.researchgate.net/profile/Mark-Lauer-2/publication/2784243_Designing_Statistical_Language_Learners_Experiments_on_Noun_Compounds/links/53f9ccf60cf2e3cbf5604ec4/Designing-Statistical-Language-Learners-Experiments-on-Noun-Compounds.pdf

In general we'd like to detect and process stuff like "green and red apples" into "green apples" and "red apples". But we can have deeper nesting than that: stuff like "hot and cold chicken soup", which ends up as "hot chicken soup" and "cold chicken soup". Ultimately we're going to trust the tree structure in the parser (which isn't always fantastic on these things, due to limitations in the training data annotation) but we still want to have some concept of the range of tree shapes so we can make the test cases for them.

I would suggest first focussing on the cases where we have coordination inside a noun phrase. These will be the ones most useful for entity recognition. If we can enumerate the main construction cases we want to cover, we can then put together the target trees for them, and then test for those. For the tests, we definitely want to specify the dependency parse as part of the test case rather than letting it be predicted by the model. This way the test describes the tree, and also if we have different versions of the model the test doesn't break because it predicted something unexpected.
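
A sketch of what such a test could look like, with the parse specified by hand rather than predicted. The head/dep attachments follow spaCy's English scheme, and the expected return value assumes split_noun_coordination returns the rewritten phrases as strings; both are assumptions rather than details confirmed by the diff:

```python
import spacy
from spacy.tokens import Doc
from spacy.pipeline.coordinationruler import split_noun_coordination  # module path from this PR

def test_split_adjective_coordination():
    nlp = spacy.blank("en")
    # "green and red apples", with the dependency parse given explicitly so the
    # test never depends on what a trained model happens to predict.
    words = ["green", "and", "red", "apples"]
    heads = [3, 0, 0, 3]                      # "and"/"red" attach to the first conjunct "green"
    deps = ["amod", "cc", "conj", "ROOT"]
    pos = ["ADJ", "CCONJ", "ADJ", "NOUN"]
    doc = Doc(nlp.vocab, words=words, heads=heads, deps=deps, pos=pos)

    assert split_noun_coordination(doc) == ["green apples", "red apples"]
```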

I made this a factory as opposed to a function component because I thought it would be nice for users to be able to add their own custom rules - thoughts?

Yes the extensibility is definitely good. Arguably we also want to support matcher or dependency matcher patterns directly, but this could be done via a function that takes the patterns as an argument.
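
One way the "function that takes the patterns as an argument" idea could look, sketched with hypothetical names (make_pattern_rule is not part of the PR, and the pattern itself is only illustrative):

```python
from spacy.matcher import DependencyMatcher
from spacy.tokens import Doc

def make_pattern_rule(patterns):
    """Hypothetical helper: turn DependencyMatcher patterns into a splitting rule."""
    def rule(doc: Doc):
        matcher = DependencyMatcher(doc.vocab)
        matcher.add("COORDINATION", patterns)
        # A real rule would turn these matches into split phrases; here we
        # simply return the matched token indices.
        return matcher(doc)
    return rule

# Illustrative pattern: a noun with an adjectival modifier that has a conjunct.
pattern = [
    {"RIGHT_ID": "noun", "RIGHT_ATTRS": {"POS": "NOUN"}},
    {"LEFT_ID": "noun", "REL_OP": ">", "RIGHT_ID": "modifier", "RIGHT_ATTRS": {"DEP": "amod"}},
    {"LEFT_ID": "modifier", "REL_OP": ">", "RIGHT_ID": "conjunct", "RIGHT_ATTRS": {"DEP": "conj"}},
]
rule = make_pattern_rule([pattern])
```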

@svlandeg added the labels enhancement (Feature requests and improvements) and feat / pipeline (Feature: Processing pipeline and components) on Feb 22, 2024
```python
from typing import List, Callable, Optional, Union
from pydantic import BaseModel, validator
import re
import en_core_web_sm
```
Member commented:

We'll want to find another solution for this, because we don't want to require all users to have exactly this model in their environment.
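
One hedged option: drop the module-level model import and rely on the nlp object that spaCy passes to every factory, so the component works with whatever pipeline it is added to. The constructor arguments below are assumptions, not the PR's actual signature:

```python
from spacy.language import Language

@Language.factory("coordination_splitter")  # factory name assumed from the description
def make_coordination_splitter(nlp: Language, name: str):
    # spaCy hands the pipeline's own `nlp` object to the factory, so there is
    # no need for a hard dependency on en_core_web_sm at import time.
    return CoordinationSplitter(nlp.vocab, name=name)
```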

@svlandeg marked this pull request as draft on February 26, 2024, 09:37
```python
from ..tokens import Doc
from ..language import Language
from ..vocab import Vocab
from .pipe import Pipe
```
Member commented:

Could you run isort on all files? (the test suite will fail otherwise)

Comment on lines 12 to 14
```python
def split_noun_coordination(doc: Doc) -> Union[List[str], None]:
    """Identifies and splits phrases with multiple nouns, a modifier
    and a conjunction.
```
Author commented:

FYI @honnibal

```python
    return spacy.blank("en")


### CONSTRUCTION CASES ###
```
Author commented:

This doesn't account for cases like "water and power meters and electrical sockets".
