Can Kor be used to identify sections in a document? #171

Open
IP1102 opened this issue Jun 8, 2023 · 3 comments
Labels
enhancement New feature or request


IP1102 commented Jun 8, 2023

To give a brief overview, let's say I want to parse job application CVs. I don't know the structure of the data in advance: people write their CVs in their own styles, and I want to identify the sections covering specific topics such as Skills, Experience, and Education. Can Kor work with this kind of unstructured data?

Owner

eyurtsev commented Jun 9, 2023

Kor can't do that right now.

There might be a way to hack together a solution by prefixing each line with its line number and asking Kor to identify the start_line, end_line, and section name of each section.

But I don't know what quality to expect from this, and there are other approaches that one should try to get extraction results of good enough quality.

Adding this functionality is not out of the question, but it would require some effort, so we'd want to see interest in it from the community.
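
The line-number idea in this comment could be sketched roughly as follows. This is only a minimal illustration and not something Kor provides: the helper names `number_lines` and `slice_sections`, the `section_name` key, and the contents of `model_spans` are all assumptions, and the model call itself is left out. `model_spans` stands in for whatever start_line / end_line / section name records an extraction step might return.

```python
def number_lines(text: str) -> str:
    """Prefix every line with its 1-based line number, e.g. '3: EDUCATION'."""
    return "\n".join(
        f"{i}: {line}" for i, line in enumerate(text.splitlines(), start=1)
    )


def slice_sections(text: str, spans: list[dict]) -> dict[str, str]:
    """Map each section name to the original lines in [start_line, end_line]."""
    lines = text.splitlines()
    sections = {}
    for span in spans:
        start, end = span["start_line"], span["end_line"]
        # Model output can be wrong, so skip spans that fall outside the text.
        if not (1 <= start <= end <= len(lines)):
            continue
        sections[span["section_name"]] = "\n".join(lines[start - 1:end])
    return sections


cv_text = "SKILLS\nPython, SQL\nEDUCATION\nBSc, 2018"

# This numbered text is what would be sent to the model.
prompt_text = number_lines(cv_text)

# Hypothetical model output; the last span is deliberately out of range.
model_spans = [
    {"section_name": "Skills", "start_line": 1, "end_line": 2},
    {"section_name": "Education", "start_line": 3, "end_line": 4},
    {"section_name": "Bogus", "start_line": 3, "end_line": 9},
]

sections = slice_sections(cv_text, model_spans)
```

Slicing the original text by line number, rather than trusting text echoed back by the model, means the extracted sections are always exact copies of the input.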

eyurtsev added the enhancement (New feature or request) label Jun 9, 2023
Author

IP1102 commented Jun 12, 2023

Thanks for the reply, @eyurtsev. Also, when you say there are other approaches for extraction, can you suggest some examples? That would help me work out a solution, and I could contribute by adding this feature to Kor.

@eyurtsev
Owner

If you're trying to use an LLM approach, you could try:

  • Ask the LLM to repeat the original text verbatim, but to add XML tags around each section of interest.
  • Use the edit API from OpenAI, and ask it to add XML tags around each section of interest.
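
Whichever way the tagged text is produced, the post-processing might look something like the sketch below. It assumes the model returns the CV with flat, non-nested tags such as `<skills>...</skills>`. The tag names, the helper functions, and the sample `tagged` string are illustrative assumptions, not output from a real model. Since the model is asked to repeat the text verbatim, the sketch also checks that removing the tags gives back the original text.

```python
import re

# Matches <tag>...</tag> pairs; assumes flat (non-nested) tags.
TAG_PATTERN = re.compile(
    r"<(?P<tag>[A-Za-z_]+)>(?P<body>.*?)</(?P=tag)>", re.DOTALL
)


def extract_tagged_sections(tagged: str) -> list[tuple[str, str]]:
    """Return (tag, text) pairs in the order they appear."""
    return [
        (m.group("tag"), m.group("body").strip())
        for m in TAG_PATTERN.finditer(tagged)
    ]


def strip_tags(tagged: str) -> str:
    """Remove opening and closing tags, keeping the text between them."""
    return re.sub(r"</?[A-Za-z_]+>", "", tagged)


def is_verbatim(original: str, tagged: str) -> bool:
    """Check that the tagged text equals the original, ignoring whitespace."""
    def normalize(s: str) -> str:
        return " ".join(s.split())

    return normalize(strip_tags(tagged)) == normalize(original)


original = "Jane Doe\nSkills\nPython, SQL\nEducation\nBSc, 2018"

# Hypothetical model response with tags added around two sections.
tagged = (
    "Jane Doe\n"
    "<skills>Skills\nPython, SQL</skills>\n"
    "<education>Education\nBSc, 2018</education>"
)

sections = extract_tagged_sections(tagged)
faithful = is_verbatim(original, tagged)
```

A CV that itself contains angle brackets would confuse these simple patterns, so a real pipeline would probably need escaping or less collision-prone delimiters.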

Alternatively, you could generate word-, sentence-, or paragraph-level features and then classify on top of them with logistic regression. The features can be generated using LLMs or other NLP approaches.
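
As a rough, dependency-free illustration of the feature-based route, the sketch below computes a few hand-crafted features per line and trains a tiny binary logistic regression to decide whether a line is a section heading. Everything here is an assumption made for the example: the features, the toy training lines, and the choice to classify lines rather than words or paragraphs. In practice one would more likely use a library implementation and richer features (for example LLM embeddings).

```python
import math


def line_features(line: str) -> list[float]:
    """Hand-crafted, purely illustrative features for one line of text."""
    stripped = line.strip()
    return [
        1.0,                                                  # bias term
        1.0 if len(stripped.split()) <= 3 else 0.0,           # short line
        1.0 if stripped.isupper() else 0.0,                   # all caps
        1.0 if stripped.endswith(":") else 0.0,               # trailing colon
        1.0 if any(c.isdigit() for c in stripped) else 0.0,   # has a digit
    ]


def predict_proba(weights: list[float], features: list[float]) -> float:
    """Probability that the line is a heading (sigmoid of the dot product)."""
    z = sum(w * x for w, x in zip(weights, features))
    return 1.0 / (1.0 + math.exp(-z))


def train(examples, lr: float = 0.5, epochs: int = 500) -> list[float]:
    """Fit logistic regression weights with plain stochastic gradient descent."""
    weights = [0.0] * len(line_features(""))
    for _ in range(epochs):
        for line, label in examples:
            x = line_features(line)
            error = label - predict_proba(weights, x)
            weights = [w + lr * error * xi for w, xi in zip(weights, x)]
    return weights


def split_sections(lines, weights):
    """Group lines under the most recent line classified as a heading."""
    sections = []
    for line in lines:
        if predict_proba(weights, line_features(line)) >= 0.5:
            sections.append((line.strip(), []))
        elif sections:
            sections[-1][1].append(line)
    return sections


# Toy training set: label 1 = heading, 0 = body text.
TRAINING = [
    ("SKILLS", 1),
    ("Python, SQL and Docker for data pipelines", 0),
    ("Education:", 1),
    ("BSc in Computer Science, 2018", 0),
    ("EXPERIENCE", 1),
    ("Software Engineer at Acme, 2019 to 2022", 0),
    ("Projects:", 1),
    ("Built an internal search tool used by 40 people", 0),
]

weights = train(TRAINING)
cv_lines = [
    "WORK HISTORY",
    "Led a team of 5 engineers on a payments platform",
    "Languages:",
    "English and Spanish, both fluent",
]
sections = split_sections(cv_lines, weights)
```

The classifier only finds section boundaries here. Mapping each heading to a canonical topic such as Skills or Education would be a second step, for instance a multi-class classifier over the same kind of features.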

One of the issues you'll probably run into with PDFs is layout analysis, i.e., figuring out how best to map the content of the PDF into text. This step may be critical for getting good quality, but it really depends on your problem.
