
[Bug]: MarkdownElementNodeProcessor does not pass 'exclude_llm_metadata_keys' and 'exclude_embed_metadata_keys' from document to nodes #13468

Closed
OmriNach opened this issue May 13, 2024 · 3 comments · Fixed by #13567
Labels
bug Something isn't working triage Issue needs to be triaged/prioritized

Comments

OmriNach commented May 13, 2024

Bug Description

When setting 'excluded_llm_metadata_keys' and 'excluded_embed_metadata_keys' on documents, these lists are normally passed down to child nodes.

However, MarkdownElementNodeParser (which inherits from BaseElementNodeParser) does not exhibit this behaviour: it does not pass these two attributes on to the nodes it creates.

Version

0.10.30

Steps to Reproduce

from llama_index.core.ingestion import IngestionPipeline
from llama_index.core.node_parser import TokenTextSplitter, MarkdownElementNodeParser
from llama_index.readers.file import PyMuPDFReader, MarkdownReader

markdown_reader = MarkdownReader()
pdf_reader = PyMuPDFReader()
path = 'example_markdown_file.md'
documents = markdown_reader.load_data(path)
for doc in documents:
    doc.metadata['source_id'] = '661add59c4e6f825668d0b93'
    doc.metadata['source_name'] = 'guidelines'
    doc.metadata['source_type'] = 'pdf'
    doc.metadata['source_url'] = 'https://www.guidelines.com'
    doc.metadata['source_title'] = 'Pharyngitis'
    doc.metadata['source_author'] = 'John Doe'
    doc.metadata['source_date'] = '2022-01-01'
    doc.excluded_embed_metadata_keys = ['source_id', 'source_date']
    doc.excluded_llm_metadata_keys = ['source_id', 'source_date']

print(documents[0].excluded_embed_metadata_keys)
print(documents[0].excluded_llm_metadata_keys)

pipeline = IngestionPipeline(transformations=[TokenTextSplitter()])
pipeline_md = IngestionPipeline(transformations=[MarkdownElementNodeParser()])
nodes = pipeline.run(documents=documents)
nodes_md = pipeline_md.run(documents=documents)

print('Token Text Splitter:', nodes[0].excluded_embed_metadata_keys)
print('Markdown Element Node Parser:', nodes_md[0].excluded_embed_metadata_keys)
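Until this is fixed upstream, one workaround that avoids modifying library code is to re-apply the documents' exclusion lists to the parsed nodes after the pipeline runs. The sketch below is hypothetical (`copy_exclusions` is not a llama_index API); the stand-in classes exist only to keep the example self-contained, and the real helper would operate on llama_index's `Document` and `TextNode` objects, which expose the same two attributes:

```python
from dataclasses import dataclass, field
from typing import List

# Minimal stand-ins for llama_index's Document and TextNode, only so the
# sketch runs without the library installed.
@dataclass
class Doc:
    excluded_embed_metadata_keys: List[str] = field(default_factory=list)
    excluded_llm_metadata_keys: List[str] = field(default_factory=list)

@dataclass
class Node:
    excluded_embed_metadata_keys: List[str] = field(default_factory=list)
    excluded_llm_metadata_keys: List[str] = field(default_factory=list)

def copy_exclusions(document, nodes):
    """Re-apply a source document's metadata exclusion lists to its nodes."""
    for node in nodes:
        # list(...) so each node gets its own copy, not a shared reference
        node.excluded_embed_metadata_keys = list(document.excluded_embed_metadata_keys)
        node.excluded_llm_metadata_keys = list(document.excluded_llm_metadata_keys)
    return nodes
```

In the reproduction above, calling such a helper on `nodes_md` after `pipeline_md.run(...)` would restore the exclusions the parser dropped.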

Relevant Logs/Tracebacks

No response

@OmriNach OmriNach added bug Something isn't working triage Issue needs to be triaged/prioritized labels May 13, 2024
dosubot bot commented May 13, 2024

The issue you've encountered with MarkdownElementNodeParser not passing exclude_llm_metadata_keys and exclude_embed_metadata_keys from the document to its child nodes is indeed due to its implementation: unlike parsers that inherit from MetadataAwareTextSplitter, MarkdownElementNodeParser does not handle these exclusion keys when building nodes, because it focuses on parsing markdown content into elements without carrying over per-document metadata settings.

To address this, you can modify the _build_node_from_split method within MarkdownElementNodeParser to filter out the unwanted metadata keys before they are added to the node. Here's how you could implement such a filter:

def _build_node_from_split(
    self,
    text_split: str,
    node: BaseNode,
    metadata: dict,
) -> TextNode:
    """Build node from single text split."""
    node = build_nodes_from_splits([text_split], node, id_func=self.id_func)[0]

    if self.include_metadata:
        # Filter out specific metadata keys (hardcoded here for illustration)
        filtered_metadata = {
            key: value
            for key, value in metadata.items()
            if key not in ['source_id', 'source_date']
        }
        node.metadata = {**node.metadata, **filtered_metadata}

    return node

In this example, filtered_metadata is created by excluding 'source_id' and 'source_date' from the metadata dictionary before it's assigned to node.metadata. This modification directly addresses the issue by ensuring that these keys are not passed to child nodes.
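Rather than hardcoding 'source_id' and 'source_date', the same filter could be driven by whatever exclusion list is in play. A dependency-free sketch of that filtering step (the helper name `filter_excluded` is illustrative, not a llama_index function):

```python
def filter_excluded(metadata: dict, excluded_keys: list) -> dict:
    """Return a copy of metadata with every key in excluded_keys dropped."""
    excluded = set(excluded_keys)  # set lookup keeps this O(1) per key
    return {k: v for k, v in metadata.items() if k not in excluded}
```

Passing the document's own excluded_llm_metadata_keys / excluded_embed_metadata_keys here would keep the behavior configurable instead of baked in.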

Please note, this solution requires modifying the LlamaIndex codebase, which might not be ideal for all situations and would need to be reapplied after any updates to the library.


@OmriNach
Author

@logan-markewich can you implement a fix that does not require modifying the source code? Ideally, all node parsers / transformations would inherit these attributes from the base document.

@OmriNach
Author

It comes down to these two methods, from what I can tell:

build_nodes_from_splits in node_utils:

elif isinstance(document, Document):
    node = TextNode(
        id_=id_func(i, document),
        text=text_chunk,
        embedding=document.embedding,
        excluded_embed_metadata_keys=document.excluded_embed_metadata_keys,
        excluded_llm_metadata_keys=document.excluded_llm_metadata_keys,
        metadata_seperator=document.metadata_seperator,
        metadata_template=document.metadata_template,
        text_template=document.text_template,
        relationships=relationships,
    )
    nodes.append(node)

versus:
get_node_from_elements in base_node.py:

for node in nodes:
    if metadata_inherited:
        node.metadata.update(metadata_inherited)
return [node for node in nodes if len(node.text) > 0]
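Comparing the two snippets: the first copies the exclusion lists from the Document into each new TextNode, while the second inherits only the metadata dict. A fix would extend that inheritance loop so the exclusion lists travel with the metadata. The sketch below is illustrative, not the merged code from #13567; the attribute names follow llama_index, but FakeNode is a stand-in so the example runs on its own:

```python
from dataclasses import dataclass, field

@dataclass
class FakeNode:
    # Minimal stand-in for llama_index's TextNode, to keep the sketch runnable.
    text: str = "chunk"
    metadata: dict = field(default_factory=dict)
    excluded_embed_metadata_keys: list = field(default_factory=list)
    excluded_llm_metadata_keys: list = field(default_factory=list)

def inherit_metadata(nodes, metadata_inherited=None,
                     excluded_embed_keys=None, excluded_llm_keys=None):
    """Sketch: propagate inherited metadata *and* the exclusion lists."""
    for node in nodes:
        if metadata_inherited:
            node.metadata.update(metadata_inherited)
        # The step that is missing today: carry the parent's exclusion
        # lists along with the inherited metadata.
        if excluded_embed_keys is not None:
            node.excluded_embed_metadata_keys = list(excluded_embed_keys)
        if excluded_llm_keys is not None:
            node.excluded_llm_metadata_keys = list(excluded_llm_keys)
    return [node for node in nodes if len(node.text) > 0]
```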

pratiksinghchauhan added a commit to pratiksinghchauhan/llama_index that referenced this issue May 17, 2024
…embed_metadata_keys' from Nodes to MarkdownElementNodeProcessor and UnstructuredElementNodeParser