To give a brief overview, let's say I want to parse job application CVs. I don't know the structure of the data, i.e. various people write their CV in their own style and I want to identify sections belonging to specific topics such as Skills, Experience, Education, etc. Can Kor work with these kinds of unstructured data?
There might be a way of hacking a solution by prefixing each line with its line number and asking Kor to identify the start_line, end_line and section name for each section.
But I don't know what kind of quality to expect from this, and there are other approaches that one should try in order to get extraction results of good enough quality.
Adding this functionality is not out of the question, but would require some effort so we'd want to see interest in this from the community.
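For anyone who wants to try the line-number hack, here is a minimal sketch of the pre- and post-processing around the extraction call. It does not call Kor or any LLM; `SectionSpan`, `number_lines` and `slice_sections` are hypothetical helper names, and the spans in the example are hand-written stand-ins for whatever the extraction step would return.

```python
from dataclasses import dataclass


@dataclass
class SectionSpan:
    """One extracted section (hypothetical shape of the extraction result)."""
    name: str
    start_line: int  # 1-based, inclusive
    end_line: int    # 1-based, inclusive


def number_lines(text: str) -> str:
    """Prefix every line with its 1-based line number before sending it to the model."""
    return "\n".join(
        f"{i}: {line}" for i, line in enumerate(text.splitlines(), start=1)
    )


def slice_sections(text: str, spans: list[SectionSpan]) -> dict[str, str]:
    """Map each section name back to the original (unnumbered) text of its line range."""
    lines = text.splitlines()
    sections: dict[str, str] = {}
    for span in spans:
        # Skip spans that are inverted or point outside the document.
        if not (1 <= span.start_line <= span.end_line <= len(lines)):
            continue
        sections[span.name] = "\n".join(lines[span.start_line - 1 : span.end_line])
    return sections


cv = "Jane Doe\nSkills\nPython, SQL\nEducation\nBSc Physics"
numbered = number_lines(cv)  # this numbered text is what the model would see

# Hand-written stand-in for the model's answer (an assumption, not real output).
spans = [SectionSpan("Skills", 2, 3), SectionSpan("Education", 4, 5)]
sections = slice_sections(cv, spans)
```

Slicing the original text by line range, rather than asking the model to reproduce each section, means the returned sections cannot contain rewritten or hallucinated text; the only thing the model can get wrong is the boundaries.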
Thanks for the reply, @eyurtsev. Also, when you say there are other approaches for extraction, can you suggest some examples? This will help me figure out some solutions, and I can contribute to adding this feature to Kor.
If you're trying to use an LLM approach, you could try:
Ask the LLM to repeat the original text verbatim, but to add XML tags around each section of interest.
Use the edit API from OpenAI, and ask it to add XML tags around each section of interest.
Alternatively, you could generate word-, sentence- or paragraph-level features and then classify on top of them with logistic regression. The features can be generated using LLMs or other NLP approaches.
One of the issues that you'll probably run into with PDFs is layout analysis, i.e., figuring out how to map the content of the PDF into text in the best way. This step may be critical to getting good quality, but it really depends on your problem.
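As a rough sketch of how the output of the XML-tagging idea could be consumed: the code below pulls the tagged sections out of the model's response and checks that the response is still the original text once the tags are removed. The tag names and the helper functions are assumptions for illustration, and the `tagged` string is a hand-written stand-in for a model response.

```python
import re

# Hypothetical tag names; use whatever sections you ask the model to mark.
SECTION_TAGS = ("skills", "experience", "education")


def extract_tagged_sections(tagged_text, tags=SECTION_TAGS):
    """Return {tag: [section text, ...]} for every tag that appears in the response."""
    sections = {}
    for tag in tags:
        name = re.escape(tag)
        pattern = re.compile(rf"<{name}>(.*?)</{name}>", re.DOTALL | re.IGNORECASE)
        matches = [m.strip() for m in pattern.findall(tagged_text)]
        if matches:
            sections[tag] = matches
    return sections


def strip_tags(tagged_text, tags=SECTION_TAGS):
    """Remove the opening and closing section tags, leaving the enclosed text."""
    alternation = "|".join(re.escape(t) for t in tags)
    return re.sub(rf"</?(?:{alternation})>", "", tagged_text, flags=re.IGNORECASE)


def is_verbatim(original, tagged_text, tags=SECTION_TAGS):
    """Check that the model only added tags, ignoring whitespace differences."""
    def normalize(s):
        return " ".join(s.split())
    return normalize(strip_tags(tagged_text, tags)) == normalize(original)


original = "Jane Doe\nSkills: Python, SQL\nEducation: BSc Physics"
# Hand-written stand-in for the model's tagged response (an assumption).
tagged = (
    "Jane Doe\n"
    "<skills>Skills: Python, SQL</skills>\n"
    "<education>Education: BSc Physics</education>"
)

sections = extract_tagged_sections(tagged)
ok = is_verbatim(original, tagged)
```

The verbatim check matters because a model asked to repeat text may paraphrase or drop parts of it; rejecting responses where `is_verbatim` is false is a cheap way to catch that.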