
[Bug]: MarkdownElementNodeProcessor does not pass 'exclude_llm_metadata_keys' and 'exclude_embed_metadata_keys' from document to nodes #13468

Closed
OmriNach opened this issue May 13, 2024 · 3 comments · Fixed by #13567
Labels
bug Something isn't working triage Issue needs to be triaged/prioritized

Comments

OmriNach commented May 13, 2024

Bug Description

When setting 'excluded_llm_metadata_keys' and 'excluded_embed_metadata_keys' on documents, these lists are normally passed down to child nodes.

However, MarkdownElementNodeParser (which inherits from BaseElementNodeParser) does not exhibit this behaviour: it does not pass these two attributes on to the nodes it creates.

Version

0.10.30

Steps to Reproduce

from llama_index.core.ingestion import IngestionPipeline
from llama_index.core.node_parser import TokenTextSplitter, MarkdownElementNodeParser
from llama_index.readers.file import PyMuPDFReader, MarkdownReader

markdown_reader = MarkdownReader()
pdf_reader = PyMuPDFReader()
path = 'example_markdown_file.md'
documents = markdown_reader.load_data(path)
for doc in documents:
    doc.metadata['source_id'] = '661add59c4e6f825668d0b93'
    doc.metadata['source_name'] = 'guidelines'
    doc.metadata['source_type'] = 'pdf'
    doc.metadata['source_url'] = 'https://www.guidelines.com'
    doc.metadata['source_title'] = 'Pharyngitis'
    doc.metadata['source_author'] = 'John Doe'
    doc.metadata['source_date'] = '2022-01-01'
    doc.excluded_embed_metadata_keys = ['source_id', 'source_date']
    doc.excluded_llm_metadata_keys = ['source_id', 'source_date']

print(documents[0].excluded_embed_metadata_keys)
print(documents[0].excluded_llm_metadata_keys)

pipeline = IngestionPipeline(transformations=[TokenTextSplitter()])
pipeline_md = IngestionPipeline(transformations=[MarkdownElementNodeParser()])
nodes = pipeline.run(documents=documents)
nodes_md = pipeline_md.run(documents=documents)

print('Token Text Splitter:', nodes[0].excluded_embed_metadata_keys)
print('Markdown Element Node Parser:', nodes_md[0].excluded_embed_metadata_keys)
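Until this is fixed upstream, one workaround that avoids modifying library code is to re-apply the documents' exclusion lists to the parsed nodes after the pipeline runs. The sketch below is hypothetical (`copy_exclusions` is not a llama_index API); the stand-in classes exist only to keep the example self-contained, and the real helper would operate on llama_index's `Document` and `TextNode` objects, which expose the same two attributes:

```python
from dataclasses import dataclass, field
from typing import List

# Minimal stand-ins for llama_index's Document and TextNode, only so the
# sketch runs without the library installed.
@dataclass
class Doc:
    excluded_embed_metadata_keys: List[str] = field(default_factory=list)
    excluded_llm_metadata_keys: List[str] = field(default_factory=list)

@dataclass
class Node:
    excluded_embed_metadata_keys: List[str] = field(default_factory=list)
    excluded_llm_metadata_keys: List[str] = field(default_factory=list)

def copy_exclusions(document, nodes):
    """Re-apply a source document's metadata exclusion lists to its nodes."""
    for node in nodes:
        # list(...) so each node gets its own copy, not a shared reference
        node.excluded_embed_metadata_keys = list(document.excluded_embed_metadata_keys)
        node.excluded_llm_metadata_keys = list(document.excluded_llm_metadata_keys)
    return nodes
```

In the reproduction above, calling such a helper on `nodes_md` after `pipeline_md.run(...)` would restore the exclusions the parser dropped.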

Relevant Logs/Tracebacks

No response

@OmriNach OmriNach added bug Something isn't working triage Issue needs to be triaged/prioritized labels May 13, 2024
dosubot bot commented May 13, 2024

The issue you've encountered with MarkdownElementNodeParser not passing exclude_llm_metadata_keys and exclude_embed_metadata_keys from the document to its child nodes is indeed due to its implementation: unlike parsers that inherit from MetadataAwareTextSplitter, MarkdownElementNodeParser does not handle these exclusion keys when building nodes, because it focuses on parsing markdown content into elements without carrying over per-document metadata settings.

To address this, you can modify the _build_node_from_split method within MarkdownElementNodeParser to filter out the unwanted metadata keys before they are added to the node. Here's how you could implement such a filter:

def _build_node_from_split(
    self,
    text_split: str,
    node: BaseNode,
    metadata: dict,
) -> TextNode:
    """Build node from single text split."""
    node = build_nodes_from_splits([text_split], node, id_func=self.id_func)[0]

    if self.include_metadata:
        # Filter out specific metadata keys (hardcoded here for illustration)
        filtered_metadata = {
            key: value
            for key, value in metadata.items()
            if key not in ['source_id', 'source_date']
        }
        node.metadata = {**node.metadata, **filtered_metadata}

    return node

In this example, filtered_metadata is created by excluding 'source_id' and 'source_date' from the metadata dictionary before it's assigned to node.metadata. This modification directly addresses the issue by ensuring that these keys are not passed to child nodes.
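Rather than hardcoding 'source_id' and 'source_date', the same filter could be driven by whatever exclusion list is in play. A dependency-free sketch of that filtering step (the helper name `filter_excluded` is illustrative, not a llama_index function):

```python
def filter_excluded(metadata: dict, excluded_keys: list) -> dict:
    """Return a copy of metadata with every key in excluded_keys dropped."""
    excluded = set(excluded_keys)  # set lookup keeps this O(1) per key
    return {k: v for k, v in metadata.items() if k not in excluded}
```

Passing the document's own excluded_llm_metadata_keys / excluded_embed_metadata_keys here would keep the behavior configurable instead of baked in.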

Please note, this solution requires modifying the LlamaIndex codebase, which might not be ideal for all situations and would need to be reapplied after any updates to the library.


@OmriNach
Author

@logan-markewich can you implement a fix that does not require modifying the source code? Ideally, all node parsers / transformations would inherit these attributes from the base document.

@OmriNach
Author

It comes down to these two methods, from what I can tell:

build_nodes_from_splits in node_utils:

elif isinstance(document, Document):
    node = TextNode(
        id_=id_func(i, document),
        text=text_chunk,
        embedding=document.embedding,
        excluded_embed_metadata_keys=document.excluded_embed_metadata_keys,
        excluded_llm_metadata_keys=document.excluded_llm_metadata_keys,
        metadata_seperator=document.metadata_seperator,
        metadata_template=document.metadata_template,
        text_template=document.text_template,
        relationships=relationships,
    )
    nodes.append(node)

versus:
get_node_from_elements in base_node.py:

for node in nodes:
    if metadata_inherited:
        node.metadata.update(metadata_inherited)
return [node for node in nodes if len(node.text) > 0]
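Comparing the two snippets: the first copies the exclusion lists from the Document into each new TextNode, while the second inherits only the metadata dict. A fix would extend that inheritance loop so the exclusion lists travel with the metadata. The sketch below is illustrative, not the merged code from #13567; the attribute names follow llama_index, but FakeNode is a stand-in so the example runs on its own:

```python
from dataclasses import dataclass, field

@dataclass
class FakeNode:
    # Minimal stand-in for llama_index's TextNode, to keep the sketch runnable.
    text: str = "chunk"
    metadata: dict = field(default_factory=dict)
    excluded_embed_metadata_keys: list = field(default_factory=list)
    excluded_llm_metadata_keys: list = field(default_factory=list)

def inherit_metadata(nodes, metadata_inherited=None,
                     excluded_embed_keys=None, excluded_llm_keys=None):
    """Sketch: propagate inherited metadata *and* the exclusion lists."""
    for node in nodes:
        if metadata_inherited:
            node.metadata.update(metadata_inherited)
        # The step that is missing today: carry the parent's exclusion
        # lists along with the inherited metadata.
        if excluded_embed_keys is not None:
            node.excluded_embed_metadata_keys = list(excluded_embed_keys)
        if excluded_llm_keys is not None:
            node.excluded_llm_metadata_keys = list(excluded_llm_keys)
    return [node for node in nodes if len(node.text) > 0]
```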

pratiksinghchauhan added a commit to pratiksinghchauhan/llama_index that referenced this issue May 17, 2024
…embed_metadata_keys' from Nodes to MarkdownElementNodeProcessor and UnstructuredElementNodeParser