Normalization issues #642

Open
6 of 9 tasks
maxaalexeeva opened this issue Jun 14, 2022 · 8 comments

@maxaalexeeva
Contributor

maxaalexeeva commented Jun 14, 2022

  • number + preposition of time/place (in) mistakenly labeled as measurement:
ex. 1. Popular varieties in the wet season were Sahel 202 (65% of farmers) and Sahel 201 (30%), while 60% and 92% grew Sahel 108 in 2012DS and 2013DS, respectively.
ex. 2. released as Sahel 108 in Senegal in 1994

upd 1: attempted to add a constraint here with a negative lookahead (?! [entity = /B-LOC/]), but maybe entities are not available at this point in the pipeline?
upd 2: this seems to work (?![tag = /NNP|CD/]), but is there a better solution?

  • pluses are included as B-MEASUREMENT: feature or a bug? (norms are correct, e.g., 100.0 kg/ha for + 100 kg ha-1):
das days after sowing, Fert fertilizer treatment, with F1: recommended dose (80 kg N ha−1), i.e., 200 kg ha−1 NPK (15.15.15) at sowing + 100 kg ha−1 urea at 20 das + 50 kg ha−1 urea at 50 das. F2: F1/4 (20 kg N ha−1)
  • range not extracted:
Ex. 1. Rice grain yield measured at maturity ranged from 2.7 t ha-1 to 7.1 t ha-1 , with an average of 4.8 t ha-1 .
Ex. 2 rice grain yields were between 8.8 t ha-1 and 9.2 t ha-1 ( i.e. about 1 t ha-1 more than in the 1998WS
  • how to normalize two separate values with shared unit, e.g., 9 (t/ha) and 10 t/ha below?
- Potential yield of all the varieties in the Senegal River delta was estimated at 9 and 10 t / ha in wet and dry seasons , respectively , and potential yield was taken as 8 t / ha for both seasons in the middle valley .
- rice yield will increase from 3600 in 2000-2009 to 4500 kg ha-1 in 2090-2099 ( Fig. 4a ) .

Probably similar to the three values in the next example. Maybe extract as one unit and split with an action? (I am unsure whether it will be easy to differentiate these from ranges, which will need to stay unsplit):

Target yields on average were set to 6.4 , 7.9 , and 7.1 t / ha in 2011WS , 2012DS , and 2013DS , respectively ( Table 1 ) .
The planned timing for the first split was 23 days after sowing ( from 3 to 13 August in the 1999WS and from 14 to 25 August in the 2000WS , see Tab .
  • slash- and dash-separated dates:
- The areas sown for this 2021/2022 wintering campaign are.
- rice yield will increase from 3600 in 2000-2009 to 4500 kg ha-1 in 2090-2099 ( Fig. 4a ) .
  • more units come up split by the type of fertilizer, what to do other than keep adding them to vocab?
P and K concentrations in irrigation and floodwater were estimated at 0.1 mg P l-1 and 3.2 mg K l-1 .
  • units separated with a conjunction:
- In plots receiving fertilizer, DAP was applied basally (19.3 and 21.5 kg N and P ha-1 )
  • numerical range is extracted as a date:
an increase in yield from 2700-2800 in 2000s to 3200-3500 kg ha-1 in the 2050s
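
For the shared-unit question above, the "extract as one unit and split with an action" idea could look roughly like the following. This is a Python sketch using regular expressions, not an Odin rule or action: `split_shared_unit`, the `UNIT` list, and the range cues (`between`, `from`, `to`) are assumptions made for illustration, and the last input sentence is made up.

```python
import re

NUMBER = r"\d+(?:\.\d+)?"
# Hypothetical unit list: only the spellings seen in the examples above.
UNIT = r"(?:t\s*/\s*ha|t\s+ha-1|kg\s+ha-1)"

# Two or more numbers joined by ",", "and", ", and", or "to", then one unit.
SHARED_UNIT = re.compile(
    rf"(?:\b(?P<cue>between|from)\s+)?"
    rf"(?<![\d.])(?P<values>{NUMBER}(?:\s*(?:,\s*and|,|and|to)\s*{NUMBER})+)"
    rf"\s*(?P<unit>{UNIT})"
)


def split_shared_unit(text):
    """Return ("value", v, unit) per number, or one ("range", (lo, hi), unit)."""
    results = []
    for m in SHARED_UNIT.finditer(text):
        values = [float(v) for v in re.findall(NUMBER, m.group("values"))]
        # Collapse whitespace so "t / ha" and "t/ha" normalize the same way.
        unit = re.sub(r"\s*/\s*", "/", re.sub(r"\s+", " ", m.group("unit")))
        has_to = re.search(r"\bto\b", m.group("values")) is not None
        is_range = len(values) == 2 and (m.group("cue") is not None or has_to)
        if is_range:
            # Assumed range cue: keep the two numbers together, unsplit.
            results.append(("range", (values[0], values[1]), unit))
        else:
            # Enumeration: every number inherits the shared unit.
            results.extend(("value", v, unit) for v in values)
    return results


print(split_shared_unit("estimated at 9 and 10 t / ha in wet and dry seasons"))
print(split_shared_unit("set to 6.4 , 7.9 , and 7.1 t / ha in 2011WS"))
print(split_shared_unit("yields were between 8 and 9 t ha-1"))
```

The cue check is the weak point: it only separates "between 8 and 9 t ha-1" from "9 and 10 t / ha" by the preceding word, so sentences like the 3600 ... 4500 kg ha-1 example, where a date sits between the two values, would still need their own handling.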
@MihaiSurdeanu
Contributor

Thanks @maxaalexeeva! I'll take a look at these soon.

@maxaalexeeva
Contributor Author

@MihaiSurdeanu I started on these today. I will fiddle with them more tomorrow and will let you know if I run into issues.

@maxaalexeeva
Contributor Author

@MihaiSurdeanu I have addressed several of the issues (#649). I will need your feedback on the unchecked items here. Thanks!

@MihaiSurdeanu
Contributor

Thanks @maxaalexeeva !

  • "pluses are included as B-MEASUREMENT" - this is a feature :) Let's keep it. Recognizing signs is useful, sometimes. And the number normalization handles this.
  • "more units come up split by the type of fertilizer" - let's ignore this. This requires a complicated fix. And I suspect this is not common: this is the first time I have seen a measurement reported this way.
  • "units separated with a conjunction" - can you please add a rule for this? This should not be too complicated: V1 and V2 U1 and U2.

Thanks!
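
One possible reading of the "V1 and V2 U1 and U2" shape, sketched in Python rather than as an Odin rule. It assumes the conjoined parts share a mass prefix and an area/volume suffix (as in "kg N and P ha-1") and that the first value goes with the first nutrient; `pair_conjoined` and the small prefix/suffix lists are hypothetical.

```python
import re

NUMBER = r"\d+(?:\.\d+)?"

# V1 and V2, then a shared prefix, two conjoined nutrient symbols, a shared suffix.
CONJOINED = re.compile(
    rf"(?P<v1>{NUMBER})\s+and\s+(?P<v2>{NUMBER})\s+"
    r"(?P<prefix>kg|mg|g|t)\s+"              # hypothetical prefix list
    r"(?P<n1>[A-Z][A-Za-z0-9]*)\s+and\s+"
    r"(?P<n2>[A-Z][A-Za-z0-9]*)\s+"
    r"(?P<suffix>ha-1|l-1)"                  # hypothetical suffix list
)


def pair_conjoined(text):
    """Distribute the shared unit parts over both values, in textual order."""
    m = CONJOINED.search(text)
    if m is None:
        return []
    prefix, suffix = m.group("prefix"), m.group("suffix")
    return [
        (float(m.group("v1")), f"{prefix} {m.group('n1')} {suffix}"),
        (float(m.group("v2")), f"{prefix} {m.group('n2')} {suffix}"),
    ]


print(pair_conjoined("DAP was applied basally (19.3 and 21.5 kg N and P ha-1 )"))
```

In the grammar itself this would presumably be two number captures and two unit captures paired by position; the sketch only shows the pairing logic.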

@maxaalexeeva
Contributor Author

@MihaiSurdeanu, will try! Thanks! Another thing came up:

Would you also take a look at the first issue (number + preposition of time/place, including my previous updates to it)? My suboptimal solution seemed to work, but I have now found a case where it interferes with a different extraction:

I have an example where very similar spans of text get different pos tags:

200    kg     ha-1    NPK
CD     NN     JJ      NN

vs.

,    2000    kg     ha-1    NPK
,    CD      VBG    JJ      NNP

And because of that NNP on the last token, the rule below (the one I added the constraint to in order to solve the first issue) does not extract the complete unit (kg ha-1):

  - name: measurement-unit
    label: MeasurementUnit
    priority: ${ rulepriority }
    type: token
    pattern: |
      [entity=/B-MEASUREMENT-UNIT/ & !word = "DS"] [entity=/I-MEASUREMENT-UNIT/]* (?![tag = /NNP|CD/])

Any thoughts on what to do? Accept the incomplete unit? Or maybe you know a better solution for the first issue?
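
To make the interaction concrete, here is a small Python simulation of the pattern above over hand-built tokens. It is not Odin code: it assumes the greedy `*` backs off to a shorter match when the negative lookahead fails (which would be consistent with the incomplete unit reported here), and the entity labels on the tokens are assumed, not taken from real pipeline output.

```python
BLOCKED_TAGS = ("NNP", "CD")  # tags rejected by the lookahead (?![tag = /NNP|CD/])


def match_unit(tokens, start):
    """tokens: list of (word, tag, entity). Return exclusive end index or None."""
    word, _, entity = tokens[start]
    # [entity=/B-MEASUREMENT-UNIT/ & !word = "DS"]
    if entity != "B-MEASUREMENT-UNIT" or word == "DS":
        return None
    # [entity=/I-MEASUREMENT-UNIT/]* taken greedily first
    end = start + 1
    while end < len(tokens) and tokens[end][2] == "I-MEASUREMENT-UNIT":
        end += 1
    # Back off one token at a time until the next token passes the lookahead.
    while end > start:
        if end == len(tokens) or tokens[end][1] not in BLOCKED_TAGS:
            return end
        end -= 1
    return None


def unit_text(tokens):
    end = match_unit(tokens, 0)
    return None if end is None else " ".join(w for w, _, _ in tokens[:end])


# Same words, different POS tags, as in the two spans above.
first = [("kg", "NN", "B-MEASUREMENT-UNIT"),
         ("ha-1", "JJ", "I-MEASUREMENT-UNIT"),
         ("NPK", "NN", "O")]
second = [("kg", "VBG", "B-MEASUREMENT-UNIT"),
          ("ha-1", "JJ", "I-MEASUREMENT-UNIT"),
          ("NPK", "NNP", "O")]

print(unit_text(first))   # full unit: NPK is tagged NN, lookahead passes
print(unit_text(second))  # truncated: NPK is tagged NNP, match backs off to "kg"
```

Under these assumptions, the truncation comes from the tag on the token after the unit, not from the tag on "kg" itself.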

@MihaiSurdeanu
Contributor

Hi Masha,
I think the "kg" issue can be fixed in the POS postProcessing method here:
https://github.com/clulab/processors/blob/master/main/src/main/scala/org/clulab/processors/clu/CluProcessor.scala#L595

@maxaalexeeva
Contributor Author

@MihaiSurdeanu, the kg part actually does get extracted even with the wrong POS tag; the issue seems to be the unusual token NPK, which in the second case gets tagged as NNP. That interferes with my constraint in the measurement-unit rule. The constraint is supposed to disallow units immediately followed by a proper noun (NNP) or a number (CD), to solve the first issue in the original post (released as Sahel 108 in Senegal in 1994). Is it worth adding this one token to the post-processing method?

@MihaiSurdeanu
Contributor

Probably Ok to ignore... This seems too specific.
