Analyzing with TreeBuilderProcess: Can't instantiate abstract class TreeBuilderProcess with abstract method run #1179

Open
GideonK opened this issue Aug 1, 2022 · 6 comments
@GideonK
GideonK commented Aug 1, 2022

Description
Attempting to analyze a CLTK sentence using cltk.dependency.processes.TreeBuilderProcess, added to the pipeline variable of class NLP in order to produce a dependency graph, produces an error stating that the abstract class TreeBuilderProcess cannot be instantiated with the abstract method "run".

To Reproduce

  1. Install Python version 3.10.4
  2. Install CLTK version 1.1.5 with dependencies using pip in virtualenv
  3. In a REPL, run the following code:
from cltk import NLP
from cltk.dependency.processes import TreeBuilderProcess
nlp = NLP(language="lat", suppress_banner=True)
nlp.pipeline.add_process(TreeBuilderProcess)
from cltk.languages.example_texts import get_example_text
doc = nlp.analyze(text=get_example_text("lat"))
  4. See error
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/user/.virtualenvs/textproc/lib/python3.10/site-packages/cltk/nlp.py", line 140, in analyze
    a_process = self._get_process_object(process)
  File "/home/user/.virtualenvs/textproc/lib/python3.10/site-packages/cltk/nlp.py", line 116, in _get_process_object
    a_process = process_object(self.language.iso_639_3_code)
TypeError: Can't instantiate abstract class TreeBuilderProcess with abstract method run

Expected behavior
This follows the example commands as listed on the following URL: https://docs.cltk.org/en/latest/cltk.dependency.html

According to the documentation, cltk.dependency.processes.TreeBuilderProcess is a "Process that takes a doc containing sentences of CLTK words and returns a dependency tree for each sentence."

Expected behavior is no error output, with the ability to use the "doc" object as a reference for dependency graphs (trees) of the example text in question.

Example language "got" leads to the same output.

Desktop (please complete the following information):

Pop!_OS 22.04 LTS jammy (ID_LIKE ubuntu debian)

@GideonK GideonK added the bug label Aug 1, 2022
@clemsciences
Member

This is clearly an error in the library. The run method should be overridden, and it is not. We will look into why it was designed this way.
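The error itself is standard Python behaviour rather than anything specific to CLTK: a class that inherits an abstract method without overriding it cannot be instantiated. A minimal sketch of that mechanism follows. The classes `Process`, `UnfinishedProcess` and `FinishedProcess` are hypothetical stand-ins written for illustration, not CLTK's actual definitions.

```python
from abc import ABC, abstractmethod


class Process(ABC):
    """Hypothetical stand-in for a pipeline process base class."""

    def __init__(self, language: str) -> None:
        self.language = language

    @abstractmethod
    def run(self, input_doc: dict) -> dict:
        """Subclasses must override this method."""


class UnfinishedProcess(Process):
    """Does not override run(), so the class remains abstract."""


class FinishedProcess(Process):
    """Overrides run(), so the class can be instantiated."""

    def run(self, input_doc: dict) -> dict:
        # Copy the input and record which process handled it.
        output_doc = dict(input_doc)
        output_doc["processed_by"] = type(self).__name__
        return output_doc


def try_instantiate(cls: type, language: str):
    """Return an instance, or the TypeError message if cls is still abstract."""
    try:
        return cls(language)
    except TypeError as err:
        return str(err)
```

Calling `try_instantiate(UnfinishedProcess, "lat")` returns a TypeError message of the same kind as the one in the traceback above, while `try_instantiate(FinishedProcess, "lat")` returns a usable object.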

@kylepjohnson kylepjohnson self-assigned this Aug 1, 2022
@kylepjohnson
Member

Thanks, Clément.

I wrote this a few years ago, but it looks like I never finished it. TreeBuilderProcess is not referenced anywhere else in the docs. If I recall correctly, my intention was that this would be a generic form of what we do with StanzaProcess, but only for dependency information. For the run() method that @clemsciences references, you can see one here:

def run(self, input_doc: Doc) -> Doc:

@GideonK if you would explain what your goal is, we will help you as best we can.

@GideonK
Author

GideonK commented Aug 1, 2022

@GideonK if you would explain what your goal is, we will help you as best we can.

Thank you very much for replying. I am interested in extracting person entities (subjects and direct/indirect objects) with associated verbs from Latin - predicate/argument structure, in a sense, but where persons interact with each other expressed in various syntactic patterns (e.g. "A accused B"). I am not a Latin expert, but I work on these texts as a computational linguist. Therefore, I'm looking at various different angles to achieve this, including producing dependency graphs with their labels, but also morphosyntactic features.

I have started with named entities but I also have a problem with the NER - it seems to look for a config file that doesn't exist. But I see that there is an open issue on NER already, which may address this (I have to study it more carefully) - I'm also aware of the proper_names.txt file. In any case, features such as dependency labels and morphosyntactic features provide much needed information for the task I have in mind.

I am able to produce the full analysis that outputs everything including definitions. This seems to also produce output from which the graphs can be deduced with a little scripting.

@kylepjohnson
Member

Hey, I'll offer a quick response --

various syntactic patterns (e.g. "A accused B").

For things like actor-agent relationships, you could infer these from the case and/or dependency information that we already provide for Latin. This notebook illustrates how to work with our CLTK Doc object. We currently get this information from the Stanza project and put it into this Doc object.
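As a rough illustration of inferring "A accused B" patterns from dependency information, here is a minimal sketch. It does not use CLTK's own classes: `Token` is a hypothetical minimal structure, its field names are assumptions, and the example sentence is annotated by hand with Universal Dependencies labels (`nsubj`, `obj`, `root`). Real parser output may use other labels, such as passive subjects, which this sketch ignores.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional, Tuple


@dataclass
class Token:
    """Hypothetical minimal token; field names are assumptions for illustration."""
    index: int      # position of the token in the sentence
    string: str     # surface form
    upos: str       # universal part-of-speech tag
    relation: str   # dependency label, e.g. "nsubj", "obj", "root"
    governor: int   # index of the head token, -1 for the root


Triple = Tuple[Optional[str], str, Optional[str]]


def extract_triples(sentence: List[Token]) -> List[Triple]:
    """Collect (subject, verb, object) triples from one parsed sentence."""
    subjects: Dict[int, str] = {}
    objects: Dict[int, str] = {}
    # First pass: index subjects and objects by the verb that governs them.
    for tok in sentence:
        if tok.relation == "nsubj":
            subjects[tok.governor] = tok.string
        elif tok.relation == "obj":
            objects[tok.governor] = tok.string
    # Second pass: emit a triple for every verb with at least one argument.
    triples: List[Triple] = []
    for tok in sentence:
        if tok.upos == "VERB" and (tok.index in subjects or tok.index in objects):
            triples.append(
                (subjects.get(tok.index), tok.string, objects.get(tok.index))
            )
    return triples


# Hand-annotated example: "Cicero Catilinam accusavit"
sentence = [
    Token(0, "Cicero", "PROPN", "nsubj", 2),
    Token(1, "Catilinam", "PROPN", "obj", 2),
    Token(2, "accusavit", "VERB", "root", -1),
]
triples = extract_triples(sentence)
```

Keying the arguments by governor index keeps the lookup independent of word order, which matters for a free-word-order language like Latin.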

seems to look for a config file that doesn't exist.

Our NER for Latin is currently not implemented. @wjbmattingly has contributed a model but I have failed to implement it in a timely fashion.

proper_names.txt

You could use this to make your own NER module with some kind of simple matching.
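A minimal sketch of such matching, assuming the name list holds one name per line (the actual layout of proper_names.txt may differ, and the names below are made up for the example). The functions `load_names` and `find_names` are hypothetical helpers, not part of CLTK.

```python
import re
from typing import Iterable, List, Set


def load_names(lines: Iterable[str]) -> Set[str]:
    """Build a lookup set from lines of text, skipping blank lines."""
    return {line.strip() for line in lines if line.strip()}


def find_names(text: str, names: Set[str]) -> List[str]:
    """Return every token in the text that exactly matches a known name."""
    tokens = re.findall(r"\w+", text)
    return [tok for tok in tokens if tok in names]


# In practice the lines would come from an open file handle.
names = load_names(["Cicero\n", "Catilina\n", "\n", "Caesar\n"])
matches = find_names("Cicero in senatu Catilinam accusavit.", names)
```

Here only "Cicero" is found: the accusative "Catilinam" does not match the listed form "Catilina". Exact string matching misses inflected forms, so comparing lemmas rather than surface tokens is likely to work better for Latin.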

If you were to post a little code of what you're doing or trying to do, we might have some more detailed advice.

@GideonK
Author

GideonK commented Aug 2, 2022

Thank you very much, Kyle. The overarching goal is to extract data points, such as discursive patterns, that can be analyzed statistically in order to gain more insight into the texts. I would rather not say too much, as this is part of an ongoing research project. But one thing I'm looking into is how entities, specifically persons (i.e. agents in this case), interact with each other, e.g. what kinds of verbs are used, followed by further downstream analysis based on predetermined categorisation.

The literature has some interesting approaches regarding relation extraction, word embeddings, sentiment analysis, etc. (not necessarily all for Latin). I think that an integrated NLP environment, as afforded by CLTK, can provide some necessary building blocks, at least for proof-of-concept experimental use, while it can assist downstream tasks by providing useful feature values.

My code is therefore meant to deal with analysing both text and CLTK objects/output, while cross-checking with external documents and producing linguistic patterns that may include frequency counts and other information. So far I was just testing the CLTK functionality, which includes producing dependency graphs.

My example text is rather large, so I was hoping to run focused analyses on it, such as pure NER extraction or dependency graphs. A full analysis takes too much time (several hours) to run more than once. So I am considering analysing this output file programmatically instead of using CLTK objects. I simply produced it per line as follows:

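# Read the input text, one sentence (or passage) per line.
with open(kwargs['text'], 'r') as t:
    self.textfile = t.read().splitlines()
# Build the pipeline once and reuse it for every line.
self.cltk_nlp = NLP(language=self.language, suppress_banner=True)
...
for l in self.textfile:
    sent = self.cltk_nlp.analyze(text=l)
    print(sent)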

Extracting the dependency trees is not yet implemented, as I only tested it on the command-line using the aforementioned commands. I am yet to investigate how to navigate the object in order to extract the information I need. So I can either (perhaps) use the in-memory objects in question while circumventing TreeBuilderProcess, or I can read the analysis output file in memory and use my own data structures. For my initial investigation, I think I can also skip dependencies altogether and just look at the morphosyntactic categories, in which case I should be able to make use of CLTK objects.
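One way to avoid re-running a multi-hour analysis is to cache the per-line results on disk and only analyze lines that have not been seen before. The sketch below is an illustration under assumptions: `analyze_with_cache` is a hypothetical helper, and `analyze` is any callable that turns a line of text into JSON-serializable data. A real version would wrap the CLTK call and copy the needed fields out of the returned object into plain dicts or lists first.

```python
import json
from pathlib import Path
from typing import Any, Callable, Dict, Iterable, List


def analyze_with_cache(
    lines: Iterable[str],
    analyze: Callable[[str], Any],
    cache_path: str,
) -> List[Any]:
    """Analyze each line once, storing results in a JSON Lines cache file."""
    path = Path(cache_path)
    cached: Dict[str, Any] = {}
    # Load results from earlier runs, if the cache file exists.
    if path.exists():
        with path.open(encoding="utf-8") as fh:
            for row in fh:
                record = json.loads(row)
                cached[record["text"]] = record["result"]
    results: List[Any] = []
    # Append new results as they are produced, so an interrupted run
    # keeps everything computed so far.
    with path.open("a", encoding="utf-8") as fh:
        for line in lines:
            if line not in cached:
                cached[line] = analyze(line)
                record = {"text": line, "result": cached[line]}
                fh.write(json.dumps(record, ensure_ascii=False) + "\n")
            results.append(cached[line])
    return results
```

The cache is keyed on the exact line text, so editing a line causes only that line to be re-analyzed on the next run.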

As for NER, I have access to both proper_names.txt as well as another external document referencing the text in question, so I can use all the data together with morphosyntactic features to, hopefully, reliably extract persons.

@clemsciences
Member

Hello Gideon,

I think it would be easier to discuss this on our Discord server: https://discord.gg/ATUDJQX7cg. There I can see how to help you more precisely and how to implement the TreeBuilderProcess class.
