
implement deep search graph #260

Open
VinciGit00 opened this issue May 17, 2024 · 6 comments
Labels: enhancement (New feature or request), feature request

VinciGit00 (Owner) commented May 17, 2024

Is your feature request related to a problem? Please describe.
Implement the deep search graph properly.

Describe the solution you'd like
[Screenshot 2024-05-17 at 10:05:35]

f-aguzzi (Contributor) commented:

I'll help you with this as we discussed yesterday. Let's go 🤝

DiTo97 (Contributor) commented May 18, 2024

@VinciGit00 the main question is whether each cascading deep search graph should have its own specific config, or whether a single config is propagated $n$ times through the chain
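For reference, the two options expressed as a rough plain-Python sketch (the helper names and config keys are illustrative, not from the codebase):

```python
import copy

def propagate_single_config(base_config: dict, n: int) -> list[dict]:
    """Option A: one config propagated n times through the chain."""
    return [copy.deepcopy(base_config) for _ in range(n)]

def per_level_configs(configs: list[dict]) -> list[dict]:
    """Option B: every cascading level carries its own specific config."""
    return [copy.deepcopy(cfg) for cfg in configs]

base = {"llm": {"model": "gpt-4o-mini"}, "max_links": 10}
levels_a = propagate_single_config(base, n=3)
levels_b = per_level_configs([base, {**base, "max_links": 5}, {**base, "max_links": 2}])
```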

mayurdb (Contributor) commented May 18, 2024

@VinciGit00 can you please add some details on what you are planning to do on top of what I have already contributed?

Based on my experience implementing and testing the deep search, these are the areas I feel need the most attention (a rough sketch of point 1 follows the list):

  1. Finding out the most relevant links for the search given a large number of links on a webpage
  2. Accurately deciding if a webpage has any answer related to the question
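To make point 1 concrete, here is a rough stand-in for a link reranker that scores candidate links by token overlap with the query; a real implementation would more likely use embeddings or an LLM reranker, and all names here are illustrative:

```python
import re

def _tokens(text: str) -> set[str]:
    return set(re.findall(r"[a-z0-9]+", text.lower()))

def score_link(query: str, anchor_text: str, url: str) -> float:
    # Fraction of query tokens that appear in the anchor text or URL.
    query_tokens = _tokens(query)
    candidate_tokens = _tokens(anchor_text) | _tokens(url)
    return len(query_tokens & candidate_tokens) / max(len(query_tokens), 1)

def top_k_links(query: str, links: list[tuple[str, str]], k: int = 5) -> list[str]:
    # links holds (anchor_text, url) pairs scraped from the current page.
    ranked = sorted(links, key=lambda pair: score_link(query, *pair), reverse=True)
    return [url for _, url in ranked[:k]]
```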

DiTo97 (Contributor) commented May 18, 2024

@mayurdb coping with both your points might require two cascading conditional nodes in each block:

one deciding whether the information collected so far is enough to answer the initial query; if it is not, the most relevant links are fetched and the other conditional node weighs them, deciding whether any additional link might help answer the query with undiscovered information, or whether to simply return with the incomplete information.
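Very roughly, the two decisions could chain like this (plain-Python sketch; the helper functions are trivial stand-ins for whatever LLM-based checks the conditional nodes would actually run):

```python
def has_enough_information(query: str, collected_info: str) -> bool:
    # Stand-in for the first conditional node (in practice an LLM/RAG check).
    return len(collected_info) > 0

def weigh_links(query: str, collected_info: str, candidate_links: list[str]) -> list[str]:
    # Stand-in for the second conditional node: keep only links that look
    # likely to add undiscovered information.
    return [link for link in candidate_links if link not in collected_info]

def next_step(query: str, collected_info: str, candidate_links: list[str]) -> dict:
    if has_enough_information(query, collected_info):
        return {"action": "answer", "links": []}
    promising = weigh_links(query, collected_info, candidate_links)
    if promising:
        return {"action": "crawl_deeper", "links": promising}
    # Nothing promising left: return with the incomplete information we have.
    return {"action": "answer_incomplete", "links": []}
```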

f-aguzzi (Contributor) commented:

I had a talk with @VinciGit00 a few days ago. Some of the nodes are likely to receive an overhaul to their inner workings, but without any breaking changes to their outer interface. He didn't like the idea of having to add signals right into the base node, at the core of the node system. That's why he rolled back one of @mayurdb's commits.

We then tried to figure out whether there could be a way to reimplement deep scraping using the original nodes, without signals or runtime loops. My initial idea was to use recursion, but the graph engine doesn't seem to handle it well: it would cause stack overflows even with stop conditions in place.

I then considered readapting Mayur's original idea of iterating on a fetch/parse/rag/search_link sequence for each link through a GraphIterator; however, instead of using a runtime loop for the n layers, the graph could be pre-built with iteration in the constructor. That way, the resulting graph would still be a tree, without loops, and wouldn't mess with the current graph engine. Vinci suggested adding a conditional node for early returns.
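A minimal sketch of the pre-building idea, using a throwaway Node dataclass instead of the real node classes (all names here are placeholders):

```python
from dataclasses import dataclass, field

@dataclass
class Node:
    name: str
    children: list["Node"] = field(default_factory=list)

def build_layer(depth: int) -> Node:
    # One fetch/parse/rag/search_link sequence, as run by a graph iterator.
    return Node(f"graph_iterator[depth={depth}]",
                [Node("fetch"), Node("parse"), Node("rag"), Node("search_link")])

def build_deep_search_tree(max_depth: int) -> Node:
    # Unroll the layers up front: the result is a plain tree, with no runtime loops.
    root = build_layer(0)
    current = root
    for depth in range(1, max_depth):
        next_layer = build_layer(depth)
        # A conditional node sits between layers so the graph can return early.
        current.children.append(Node(f"conditional_early_return[{depth}]", [next_layer]))
        current = next_layer
    return root
```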

The last thing we need to figure out before implementing this is what conditions to put in place for the early return. Ideally, we could implement two modes for the graph (the stop condition for each is sketched after this list):

  • a "cheap mode" that returns as soon as any relevant information is found
  • an "accurate mode" that keeps on crawling as long as there are relevant links, even if some relevant information has been found at the current level

This goes beyond deep scraping: if this configuration works, however inelegant it may be in its current form, it would further prove that @PeriniM's modular system is expressive enough for complex scraping operations.

As for the actual implementation, Vinci asked me to have another talk with him before writing any code, which will take another couple of days. I'll be glad to have Mayur as a co-author and/or reviewer.

Closing note: the right-side diagram is wrong. It's not a chain of deep scrapers, but of graph iterators running deep scrapers. Between each layer of graph iterators, there are merge nodes both for links and answers, and a conditional node to provide an early return. I feel like there's something off with the left diagram too, but I'd need to look at the original drawings we made on paper.

DiTo97 (Contributor) commented May 18, 2024

@f-aguzzi I exchanged some ideas with @VinciGit00 as well; we should collect all the thoughts and devise the most sensible solution without requiring a major refactoring effort on the current modular graph execution engine.

You could get the best of both the cheap and accurate modes if you decouple responsibilities.

A very rough pseudo-implementation might look like the following, but we would have to simplify it to match our nodes:

```mermaid
stateDiagram-v2
    note left of deep_search_graph
    - base_endpoint
    - max_rounds
    - max_subgraphs
    - user_query
    end note

    state deep_search_graph {
        [*] --> plan_node

        note right of plan_node
        - available_information
        - available_endpoints
        - visited_endpoints
        - rounds
        - user_query
        - early_exit := rounds EQ max_rounds
        - missing_information_query
        end note

        state rounds_expired <<choice>>
        state enough_information <<choice>>

        plan_node --> conditional_node_rounds
        conditional_node_rounds --> rounds_expired 
        rounds_expired --> conditional_node : if early_exit false
        rounds_expired --> merge_answers_node_final : if early_exit true
        conditional_node --> enough_information

        enough_information --> parallel_search_graph: not enough information
        enough_information --> merge_answers_node_final : enough information

        state parallel_search_graph {
            state fork_state <<fork>>
            state join_state <<join>>
            
            [*] --> rerank_link_node

            note right of rerank_link_node
            - missing_information_query
            - endpoints
            end note

            rerank_link_node --> graph_iterator_node

            note left of graph_iterator_node
            - k EQ max_subgraphs
            - top_k_endpoints
            end note

            graph_iterator_node --> fork_state

            fork_state --> search_graph_1
            fork_state --> search_graph_2
            fork_state --> search_graph_k

            state search_graph_1 {
                [*] --> fetch_node
                fetch_node --> parse_node
                parse_node --> RAG_node
                parse_node --> search_link_node
                RAG_node --> generate_answers_node
                search_link_node --> [*]
                generate_answers_node --> [*]
            }

            search_graph_1 --> join_state
            search_graph_2 --> join_state
            search_graph_k --> join_state
            
            join_state --> merge_answers_node

            note right of merge_answers_node
            - information
            - endpoints

            FIXME: how'd we handle concat of link candidates?
            end note

            merge_answers_node --> [*]
        }

        parallel_search_graph --> plan_node

        merge_answers_node_final --> [*]
    }
```
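For readability, roughly the same outer loop in plain Python; every helper below is a trivial stand-in for the corresponding node in the diagram, not an existing API:

```python
def plan_node(query: str, information: str) -> dict:
    # Stand-in: decide whether the collected information answers the query
    # and, if not, phrase what is still missing.
    return {"enough_information": bool(information),
            "missing_information_query": query}

def rerank_link_node(missing_query: str, endpoints: list[str]) -> list[str]:
    # Stand-in: order candidate links by relevance to the missing information.
    return sorted(endpoints)

def search_graph(url: str, missing_query: str) -> dict:
    # Stand-in for one fetch -> parse -> rag -> generate_answers / search_link run.
    return {"answer": f"(answer fragment from {url})", "links": []}

def merge_answers_node(information: str, answers: list[str]) -> str:
    return " ".join([information, *answers]).strip()

def deep_search_graph(user_query: str, base_endpoint: str,
                      max_rounds: int = 3, max_subgraphs: int = 5) -> str:
    information, endpoints, visited = "", [base_endpoint], set()
    for _ in range(max_rounds):                        # conditional_node_rounds
        plan = plan_node(user_query, information)
        if plan["enough_information"]:                 # conditional_node (enough_information)
            break
        candidates = [e for e in endpoints if e not in visited]
        top_k = rerank_link_node(plan["missing_information_query"], candidates)[:max_subgraphs]
        visited.update(top_k)
        # graph_iterator_node: one fetch/parse/rag/search_link subgraph per link.
        results = [search_graph(u, plan["missing_information_query"]) for u in top_k]
        information = merge_answers_node(information, [r["answer"] for r in results])
        endpoints += [link for r in results for link in r["links"]]
    return information                                 # merge_answers_node_final

print(deep_search_graph("what does this project do?", "https://example.com"))
```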
