Integration: Improve Chunk splitter + Relationships between chunks for python files / repositories #13446

Open. Wants to merge 6 commits into base: main


@jimysancho jimysancho commented May 12, 2024

Description

A new feature has been added using rag-pychunk, a Python library that chunks your Python files by leveraging the syntax of the Python language to improve two things:

  • Chunk size: make the chunk size dynamic, so that a whole function, a whole class method, a whole class, or a whole block of code stays in the same chunk.
  • Chunk relationships: create relationships between chunks other than Parent-Child and Prev-Next. NodeRelationship.REFERENCE has been created to hold these relationships. This type of relationship can also be used elsewhere, for example in PDFs when one chunk references another (example: "In section 3.7 there is an explanation ..." -> section 3.7 would need to be a NodeRelationship.REFERENCE for the "self" node).
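As a rough illustration of the "chunk size" idea above, here is a minimal sketch that uses only the standard-library `ast` module. It is not rag-pychunk's implementation; the function name `chunk_python_source`, the `SAMPLE` string, and the dictionary layout of each chunk are assumptions made for this example.

```python
from __future__ import annotations

import ast

# Hypothetical sample input, used only to illustrate the idea.
SAMPLE = """import os
X = 1

def f(a):
    return a + X

class A:
    def a1(self):
        return f(1)
"""


def chunk_python_source(source: str) -> list[dict]:
    """Return one chunk per top-level function or class.

    Consecutive top-level statements that are neither functions nor
    classes are grouped together into a single "block" chunk.
    """
    tree = ast.parse(source)
    lines = source.splitlines()
    chunks: list[dict] = []
    pending: list[ast.stmt] = []  # statements waiting to be emitted as a block

    def flush() -> None:
        # Emit the pending statements as one "block" chunk.
        if pending:
            start, end = pending[0].lineno, pending[-1].end_lineno
            text = "\n".join(lines[start - 1:end])
            chunks.append({"kind": "block", "name": None, "text": text})
            pending.clear()

    for node in tree.body:
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef)):
            flush()
            # Start at the first decorator, if any, so it stays with its definition.
            start = min([node.lineno] + [d.lineno for d in node.decorator_list])
            kind = "class" if isinstance(node, ast.ClassDef) else "function"
            text = "\n".join(lines[start - 1:node.end_lineno])
            chunks.append({"kind": kind, "name": node.name, "text": text})
        else:
            pending.append(node)
    flush()
    return chunks


chunks = chunk_python_source(SAMPLE)
for chunk in chunks:
    print(chunk["kind"], chunk["name"], len(chunk["text"].splitlines()))
```

Each chunk ends where its syntactic unit ends, so the chunk size varies with the code instead of being a fixed number of characters or tokens.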

Motivation: leverage Python's syntax to improve the chunking and relationship stages of the RAG pipeline. The same logic can be applied to other programming languages, since each of them has a defined syntax.

New dependency: rag-pychunk library.

Fixes # (issue)

New Package?

Did I fill in the tool.llamahub section in the pyproject.toml and provide a detailed README.md for my new integration or package?

  • Yes
  • No

Version Bump?

Did I bump the version in the pyproject.toml file of the package I am updating? (Except for the llama-index-core package)

  • Yes
  • No

Type of Change

Please delete options that are not relevant.

  • New feature (non-breaking change which adds functionality)
  • This change requires a documentation update

How Has This Been Tested?

Please describe the tests that you ran to verify your changes. Provide instructions so we can reproduce. Please also list any relevant details for your test configuration

  • Added new notebook (that tests end-to-end)

Suggested Checklist:

  • I have performed a self-review of my own code
  • I have commented my code, particularly in hard-to-understand areas
  • My changes generate no new warnings

@dosubot dosubot bot added the size:L This PR changes 100-499 lines, ignoring generated files. label May 12, 2024

@logan-markewich
Collaborator

I'm not sure we NEED to add a new relationship type for this? Wouldn't the proper relationship just be parent or source?

I also think this should go as a new integration, rather than in core (i.e. in llama-index-integrations/node_parser)

@jimysancho
Author

I think it is needed because of this:

  • Imagine you have a class, let's call it class A, with several methods: a1, a2, .... The methods will be child nodes of the parent node class A. Now imagine that method a2 is called inside method a1. Since class A is the parent of both of these methods, it wouldn't make sense to use the parent relationship to describe how the two methods relate to each other.
  • Now imagine the following: you have two functions, function X and function Y, and function Y is called inside function X. Here it wouldn't make sense either to have a PARENT-CHILD relationship between those nodes. The same applies to class A and its methods from the previous example: what if class A is called in another file? What if its methods are called in other files?
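The two situations above can be sketched with the standard-library `ast` module. This is only an illustration of how call-based "reference" edges differ from parent-child edges, not the code in this PR. The helper names `collect_definitions` and `find_references`, the `SAMPLE` string, and the returned dictionary shape are assumptions made for this example. It only resolves direct calls by name and `self.<method>()` calls within a single file.

```python
from __future__ import annotations

import ast

FUNC_DEFS = (ast.FunctionDef, ast.AsyncFunctionDef)

# Hypothetical sample mirroring the examples above: a1 calls a2, and x calls y.
SAMPLE = """class A:
    def a1(self):
        return self.a2() + 1

    def a2(self):
        return 1

def y():
    return 2

def x():
    return y() + A().a1()
"""


def collect_definitions(tree: ast.Module) -> dict:
    """Map a qualified name to (definition node, owning class name or None)."""
    defs = {}
    for node in tree.body:
        if isinstance(node, FUNC_DEFS):
            defs[node.name] = (node, None)
        elif isinstance(node, ast.ClassDef):
            defs[node.name] = (node, None)
            for item in node.body:
                if isinstance(item, FUNC_DEFS):
                    defs[f"{node.name}.{item.name}"] = (item, node.name)
    return defs


def find_references(source: str) -> dict[str, set[str]]:
    """For each function or method, list the known definitions it calls."""
    defs = collect_definitions(ast.parse(source))
    refs: dict[str, set[str]] = {}
    for name, (node, owner) in defs.items():
        if isinstance(node, ast.ClassDef):
            continue  # classes are only targets here; their methods are the callers
        targets: set[str] = set()
        for call in ast.walk(node):
            if not isinstance(call, ast.Call):
                continue
            func = call.func
            if isinstance(func, ast.Name) and func.id in defs:
                # Direct call by name, e.g. y() or A()
                targets.add(func.id)
            elif (
                owner is not None
                and isinstance(func, ast.Attribute)
                and isinstance(func.value, ast.Name)
                and func.value.id == "self"
                and f"{owner}.{func.attr}" in defs
            ):
                # Call to a sibling method, e.g. self.a2()
                targets.add(f"{owner}.{func.attr}")
        targets.discard(name)  # ignore direct recursion
        if targets:
            refs[name] = targets
    return refs


refs = find_references(SAMPLE)
for caller, callees in sorted(refs.items()):
    print(caller, "->", sorted(callees))
```

In this sample, `A.a1` and `A.a2` share the same parent (class A), yet the edge from `A.a1` to `A.a2` only appears as a call reference. Calls made through an instance, such as `A().a1()`, and calls across files would need extra resolution that this sketch does not attempt.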

If the SOURCE relationship is suited to this kind of relationship, then there would be no need to create REFERENCE relationships. But if SOURCE only answers the question "where does the self node belong?", then it would be better to define REFERENCE, because of the situations I've explained above.
