[Bug]: CodeSplitter takes exactly 1 argument (2 given) #13521

Open
framsouza opened this issue May 15, 2024 · 7 comments
Labels
bug Something isn't working triage Issue needs to be triaged/prioritized

Comments


framsouza commented May 15, 2024

Bug Description

CodeSplitter doesn't seem to be working as expected

Version

0.10.36

Steps to Reproduce

from pathlib import Path

from llama_index.core.node_parser import CodeSplitter
from llama_index.readers.file import FlatReader

documents = FlatReader().load_data(Path("index.py"))
splitter = CodeSplitter(
    language="python",
    chunk_lines=40,  # lines per chunk
    chunk_lines_overlap=15,  # lines overlap between chunks
    max_chars=1500,  # max chars per chunk
)
nodes = splitter.get_nodes_from_documents(documents)
nodes[0].text

Relevant Logs/Tracebacks

python test.py 
Could not get parser for language python. Check https://github.com/grantjenks/py-tree-sitter-languages#license for a list of valid languages.
Traceback (most recent call last):
  File "/Users/framsouza/git-assistant/test.py", line 6, in <module>
    splitter = CodeSplitter(
               ^^^^^^^^^^^^^
  File "/Users/framsouza/git-assistant/lib/python3.11/site-packages/llama_index/core/node_parser/text/code.py", line 63, in __init__
    parser = tree_sitter_languages.get_parser(language)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "tree_sitter_languages/core.pyx", line 19, in tree_sitter_languages.core.get_parser
  File "tree_sitter_languages/core.pyx", line 14, in tree_sitter_languages.core.get_language
TypeError: __init__() takes exactly 1 argument (2 given)
@framsouza framsouza added bug Something isn't working triage Issue needs to be triaged/prioritized labels May 15, 2024

dosubot bot commented May 15, 2024

Hello @framsouza! 👋 I'm Dosu, a bot here to lend a hand with bugs, answer questions, and guide you on your journey to becoming a contributor, all while we wait for a human maintainer to chime in. Delighted to meet you! I'm diving into your issue and will get back to you with a full answer shortly.

@logan-markewich
Collaborator

@framsouza yea this seems like an issue with the tree-sitter version. I don't actually know what the solution is yet, but a band-aid fix was pip install -U "tree-sitter<0.22.0"
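
Before and after applying the pin, it can help to confirm which versions are actually installed in the active environment. The following is a minimal sketch using only the standard library; the helper names (`installed_versions`, `release_tuple`, `is_below`) are made up for illustration, and the version comparison is deliberately simplistic (it ignores pre-release suffixes).

```python
from importlib.metadata import PackageNotFoundError, version

# Distribution names taken from the package listing in this thread.
PACKAGES = ["tree-sitter", "tree-sitter-languages", "llama-index-core"]


def installed_versions(names):
    """Return {package name: version string, or None if not installed}."""
    found = {}
    for name in names:
        try:
            found[name] = version(name)
        except PackageNotFoundError:
            found[name] = None
    return found


def release_tuple(version_string):
    """Turn '0.21.3' into (0, 21, 3); stops at the first non-numeric part."""
    parts = []
    for piece in version_string.split("."):
        if not piece.isdigit():
            break
        parts.append(int(piece))
    return tuple(parts)


def is_below(version_string, bound):
    """True if version_string sorts strictly before bound (numeric parts only)."""
    return release_tuple(version_string) < release_tuple(bound)


if __name__ == "__main__":
    for name, ver in installed_versions(PACKAGES).items():
        print(f"{name:24} {ver or 'not installed'}")
    ts = installed_versions(["tree-sitter"])["tree-sitter"]
    if ts is not None and not is_below(ts, "0.22.0"):
        print("tree-sitter is 0.22.0 or newer, so the pin is not in effect")
```

Running this inside the same virtual environment as the failing script shows whether the pin took effect there.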

@framsouza
Author

framsouza commented May 15, 2024

hey @logan-markewich , i just gave it a try:

tree-sitter                             0.21.3
tree-sitter-go                          0.21.0
tree-sitter-languages                   1.10.2
llama-index                             0.10.36
llama-index-core                        0.10.36

pipeline = IngestionPipeline(
    transformations=[
        CodeSplitter(language="go", chunk_lines=20, chunk_lines_overlap=5),
        Settings.embed_model
    ],
    vector_store=get_es_vector_store()
)

Now I get a different error:

Data loaded from local directory.
Starting the pipeline...
Parsing nodes:   0%|                                                                                                           | 0/2002 [00:00<?, ?it/s]
Traceback (most recent call last):
  File "/Users/framsouza/git-assistant/index.py", line 166, in <module>
    main()
  File "/Users/framsouza/git-assistant/index.py", line 159, in main
    pipeline.run(show_progress=True, documents=documents)
  File "/Users/framsouza/git-assistant/lib/python3.11/site-packages/llama_index/core/ingestion/pipeline.py", line 734, in run
    nodes = run_transformations(
            ^^^^^^^^^^^^^^^^^^^^
  File "/Users/framsouza/git-assistant/lib/python3.11/site-packages/llama_index/core/ingestion/pipeline.py", line 124, in run_transformations
    nodes = transform(nodes, **kwargs)
            ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/framsouza/git-assistant/lib/python3.11/site-packages/llama_index/core/node_parser/interface.py", line 127, in __call__
    return self.get_nodes_from_documents(nodes, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/framsouza/git-assistant/lib/python3.11/site-packages/llama_index/core/node_parser/interface.py", line 76, in get_nodes_from_documents
    nodes = self._parse_nodes(documents, show_progress=show_progress, **kwargs)
            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/framsouza/git-assistant/lib/python3.11/site-packages/llama_index/core/node_parser/interface.py", line 145, in _parse_nodes
    splits = self.split_text(node.get_content())
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/framsouza/git-assistant/lib/python3.11/site-packages/llama_index/core/node_parser/text/code.py", line 161, in split_text
    raise ValueError(f"Could not parse code with language {self.language}.")
ValueError: Could not parse code with language go.

I can see Go in the list of supported languages.

@logan-markewich
Collaborator

@framsouza Seems like the if statement at the top is failing

[image: screenshot not preserved]

@framsouza
Author

This is the code. I'm moving from SentenceSplitter to CodeSplitter since I'm ingesting source code.


dosubot bot commented May 15, 2024

The issues you're encountering with CodeSplitter in your LlamaIndex setup seem to stem from two main areas: the initialization of CodeSplitter and the processing of Go language code.

  1. Initialization Error: The traceback shows the TypeError being raised inside tree_sitter_languages.core.get_language, which CodeSplitter reaches through tree_sitter_languages.get_parser(language). That points to a version mismatch between tree-sitter and tree_sitter_languages rather than to the arguments passed to the CodeSplitter constructor.

  2. Processing Go Language Code: The ValueError indicates a problem with parsing Go language code. This could be due to the tree_sitter_languages package not supporting Go or a misconfiguration in your setup. Ensure that:

    • The tree_sitter_languages package is correctly installed and supports Go.
    • You have the correct version of tree_sitter and tree_sitter_languages that includes Go language support. If Go is not supported, you may need to manually load a Go language grammar for tree-sitter.

For both issues, verify that you're using compatible versions of llama-index, tree-sitter, and tree_sitter_languages. If the problem persists, consider manually initializing a tree-sitter Parser for Go and passing it to the CodeSplitter constructor, ensuring you have a Go language grammar available for tree-sitter.

Lastly, ensure your error handling in the implementation is robust, especially for unsupported languages or parsing errors, to gracefully manage these situations.
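
As a rough illustration of that last point: the second traceback shows `split_text` raising `ValueError` when a file cannot be parsed, which aborts the whole run on the first bad document. One possible approach, sketched below under assumptions, is to catch that error per document and fall back to plain line windows. The helpers `line_chunks` and `split_or_fallback` are hypothetical names and not part of LlamaIndex; the sketch works on raw text and does not plug into `IngestionPipeline` by itself.

```python
def line_chunks(text, chunk_lines=40, overlap=15):
    """Plain line-window chunking, intended only as a fallback when parsing fails."""
    if overlap >= chunk_lines:
        raise ValueError("overlap must be smaller than chunk_lines")
    lines = text.splitlines()
    step = chunk_lines - overlap
    chunks = []
    for start in range(0, len(lines), step):
        chunks.append("\n".join(lines[start:start + chunk_lines]))
        # Stop once the current window already reaches the end of the text.
        if start + chunk_lines >= len(lines):
            break
    return chunks


def split_or_fallback(split_text, text, **fallback_kwargs):
    """Try a parser-based splitter first; on ValueError use line windows.

    Returns (chunks, parsed), where parsed is False if the fallback was used.
    """
    try:
        return split_text(text), True
    except ValueError:
        return line_chunks(text, **fallback_kwargs), False
```

With a constructed splitter, the call would look like `split_or_fallback(splitter.split_text, source_text, chunk_lines=20, overlap=5)`. Counting how many documents come back with `parsed` set to False also gives a quick sense of whether the parser fails for every file or only for some.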



alleywind commented May 21, 2024

  1. Create a clean Python environment.
  2. Install llama_index.

My version is 0.10.37, and this is my code:

from pathlib import Path

import tree_sitter_java as tsjava
from llama_index.core.node_parser import CodeSplitter
from llama_index.readers.file.flat.base import FlatReader
from tree_sitter import Language, Parser

CODEBASE_DIR = "your code"

# Build the Java parser directly instead of letting CodeSplitter look it up by name
JAVA_LANGUAGE = Language(tsjava.language())
parser = Parser(JAVA_LANGUAGE)

language = "java"
documents = FlatReader().load_data(Path(CODEBASE_DIR))
splitter = CodeSplitter(
    parser=parser,
    language=language,
    chunk_lines=40,  # lines per chunk
    chunk_lines_overlap=15,  # lines overlap between chunks
    max_chars=1500,  # max chars per chunk
)

nodes = splitter.get_nodes_from_documents(documents)
print(len(nodes))

it works!
