
implement deep search graph #260

Open
VinciGit00 opened this issue May 17, 2024 · 6 comments
Labels: enhancement (New feature or request), feature request

VinciGit00 (Owner) commented May 17, 2024

Is your feature request related to a problem? Please describe.
Implement the deep search graph properly.

Describe the solution you'd like
[Screenshot 2024-05-17 at 10:05:35]

f-aguzzi (Contributor) commented:

I'll help you with this as we discussed yesterday. Let's go 🤝

DiTo97 (Contributor) commented May 18, 2024

@VinciGit00 the main question is whether each cascading deep search graph should have its own specific config, or whether a single config is propagated $n$ times through the chain
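For reference, the two options expressed as a rough plain-Python sketch (the helper names and config keys are illustrative, not from the codebase):

```python
import copy

def propagate_single_config(base_config: dict, n: int) -> list[dict]:
    """Option A: one config propagated n times through the chain."""
    return [copy.deepcopy(base_config) for _ in range(n)]

def per_level_configs(configs: list[dict]) -> list[dict]:
    """Option B: every cascading level carries its own specific config."""
    return [copy.deepcopy(cfg) for cfg in configs]

base = {"llm": {"model": "gpt-4o-mini"}, "max_links": 10}
levels_a = propagate_single_config(base, n=3)
levels_b = per_level_configs([base, {**base, "max_links": 5}, {**base, "max_links": 2}])
```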

mayurdb (Contributor) commented May 18, 2024

@VinciGit00 can you please add some details on what you are planning to do on top of what I have already contributed?

Based on my experience implementing and testing the deep search, these are the areas I feel need the most attention (a rough sketch of point 1 follows the list):

  1. Finding out the most relevant links for the search given a large number of links on a webpage
  2. Accurately deciding if a webpage has any answer related to the question
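To make point 1 concrete, here is a rough stand-in for a link reranker that scores candidate links by token overlap with the query; a real implementation would more likely use embeddings or an LLM reranker, and all names here are illustrative:

```python
import re

def _tokens(text: str) -> set[str]:
    return set(re.findall(r"[a-z0-9]+", text.lower()))

def score_link(query: str, anchor_text: str, url: str) -> float:
    # Fraction of query tokens that appear in the anchor text or URL.
    query_tokens = _tokens(query)
    candidate_tokens = _tokens(anchor_text) | _tokens(url)
    return len(query_tokens & candidate_tokens) / max(len(query_tokens), 1)

def top_k_links(query: str, links: list[tuple[str, str]], k: int = 5) -> list[str]:
    # links holds (anchor_text, url) pairs scraped from the current page.
    ranked = sorted(links, key=lambda pair: score_link(query, *pair), reverse=True)
    return [url for _, url in ranked[:k]]
```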

DiTo97 (Contributor) commented May 18, 2024

@mayurdb coping with both your points might require two cascading conditional nodes in each block:

one deciding whether the information collected so far is enough to answer the initial query; if it is not, the most relevant links are fetched and the other conditional node weighs them, deciding whether any additional link might help answer the query with undiscovered information, or whether to simply return with the incomplete information.
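Very roughly, the two decisions could chain like this (plain-Python sketch; the helper functions are trivial stand-ins for whatever LLM-based checks the conditional nodes would actually run):

```python
def has_enough_information(query: str, collected_info: str) -> bool:
    # Stand-in for the first conditional node (in practice an LLM/RAG check).
    return len(collected_info) > 0

def weigh_links(query: str, collected_info: str, candidate_links: list[str]) -> list[str]:
    # Stand-in for the second conditional node: keep only links that look
    # likely to add undiscovered information.
    return [link for link in candidate_links if link not in collected_info]

def next_step(query: str, collected_info: str, candidate_links: list[str]) -> dict:
    if has_enough_information(query, collected_info):
        return {"action": "answer", "links": []}
    promising = weigh_links(query, collected_info, candidate_links)
    if promising:
        return {"action": "crawl_deeper", "links": promising}
    # Nothing promising left: return with the incomplete information we have.
    return {"action": "answer_incomplete", "links": []}
```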

f-aguzzi (Contributor) commented:

I had a talk with @VinciGit00 a few days ago. Some of the nodes are likely to receive an overhaul to their inner workings, but without any breaking changes to their outer interface. He didn't like the idea of having to add signals right into the base node, at the core of the node system. That's why he rolled back one of @mayurdb's commits.

We then tried to figure out whether there could be a way to reimplement deep scraping using the original nodes, without signals or runtime loops. My initial idea was to use recursion, but the graph engine doesn't seem to handle it well: it would cause stack overflows even with stop conditions in place.

I then considered readapting Mayur's original idea of iterating on a fetch/parse/rag/search_link sequence for each link through a GraphIterator; however, instead of using a runtime loop for the n layers, the graph could be pre-built with iteration in the constructor. That way, the resulting graph would still be a tree, without loops, and wouldn't mess with the current graph engine. Vinci suggested adding a conditional node for early returns.
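A minimal sketch of the pre-building idea, using a throwaway Node dataclass instead of the real node classes (all names here are placeholders):

```python
from dataclasses import dataclass, field

@dataclass
class Node:
    name: str
    children: list["Node"] = field(default_factory=list)

def build_layer(depth: int) -> Node:
    # One fetch/parse/rag/search_link sequence, as run by a graph iterator.
    return Node(f"graph_iterator[depth={depth}]",
                [Node("fetch"), Node("parse"), Node("rag"), Node("search_link")])

def build_deep_search_tree(max_depth: int) -> Node:
    # Unroll the layers up front: the result is a plain tree, with no runtime loops.
    root = build_layer(0)
    current = root
    for depth in range(1, max_depth):
        next_layer = build_layer(depth)
        # A conditional node sits between layers so the graph can return early.
        current.children.append(Node(f"conditional_early_return[{depth}]", [next_layer]))
        current = next_layer
    return root
```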

The last thing we need to figure out before implementing this is what conditions to put in place for the early return. Ideally, we could implement two modes for the graph (the stop condition for each is sketched after this list):

  • a "cheap mode" that returns as soon as any relevant information is found
  • an "accurate mode" that keeps on crawling as long as there are relevant links, even if some relevant information has been found at the current level

This goes beyond deep scraping: if this configuration works, however inelegant it may be in its current form, it would further prove that @PeriniM's modular system is expressive enough for complex scraping operations.

As for the actual implementation, Vinci asked me to have another talk with him before writing any code, which will take another couple of days. I'll be glad to have Mayur as a co-author and/or reviewer.

Closing note: the right-side diagram is wrong. It's not a chain of deep scrapers, but of graph iterators running deep scrapers. Between each layer of graph iterators, there are merge nodes both for links and answers, and a conditional node to provide an early return. I feel like there's something off with the left diagram too, but I'd need to look at the original drawings we made on paper.

DiTo97 (Contributor) commented May 18, 2024

@f-aguzzi I exchanged some ideas with @VinciGit00 as well; we should collect all the thoughts and devise the most sensible solution without requiring a major refactoring effort on the current modular graph execution engine.

You could get the best of both the cheap and accurate modes if you decouple responsibilities.

A very rough pseudo-implementation might look like the following, but we would have to simplify it to match our nodes:

```mermaid
stateDiagram-v2
    note left of deep_search_graph
    - base_endpoint
    - max_rounds
    - max_subgraphs
    - user_query
    end note

    state deep_search_graph {
        [*] --> plan_node

        note right of plan_node
        - available_information
        - available_endpoints
        - visited_endpoints
        - rounds
        - user_query
        - early_exit := rounds EQ max_rounds
        - missing_information_query
        end note

        state rounds_expired <<choice>>
        state enough_information <<choice>>

        plan_node --> conditional_node_rounds
        conditional_node_rounds --> rounds_expired 
        rounds_expired --> conditional_node : if early_exit false
        rounds_expired --> merge_answers_node_final : if early_exit true
        conditional_node --> enough_information

        enough_information --> parallel_search_graph: not enough information
        enough_information --> merge_answers_node_final : enough information

        state parallel_search_graph {
            state fork_state <<fork>>
            state join_state <<join>>
            
            [*] --> rerank_link_node

            note right of rerank_link_node
            - missing_information_query
            - endpoints
            end note

            rerank_link_node --> graph_iterator_node

            note left of graph_iterator_node
            - k EQ max_subgraphs
            - top_k_endpoints
            end note

            graph_iterator_node --> fork_state

            fork_state --> search_graph_1
            fork_state --> search_graph_2
            fork_state --> search_graph_k

            state search_graph_1 {
                [*] --> fetch_node
                fetch_node --> parse_node
                parse_node --> RAG_node
                parse_node --> search_link_node
                RAG_node --> generate_answers_node
                search_link_node --> [*]
                generate_answers_node --> [*]
            }

            search_graph_1 --> join_state
            search_graph_2 --> join_state
            search_graph_k --> join_state
            
            join_state --> merge_answers_node

            note right of merge_answers_node
            - information
            - endpoints

            FIXME: how'd we handle concat of link candidates?
            end note

            merge_answers_node --> [*]
        }

        parallel_search_graph --> plan_node

        merge_answers_node_final --> [*]
    }
```
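For readability, roughly the same outer loop in plain Python; every helper below is a trivial stand-in for the corresponding node in the diagram, not an existing API:

```python
def plan_node(query: str, information: str) -> dict:
    # Stand-in: decide whether the collected information answers the query
    # and, if not, phrase what is still missing.
    return {"enough_information": bool(information),
            "missing_information_query": query}

def rerank_link_node(missing_query: str, endpoints: list[str]) -> list[str]:
    # Stand-in: order candidate links by relevance to the missing information.
    return sorted(endpoints)

def search_graph(url: str, missing_query: str) -> dict:
    # Stand-in for one fetch -> parse -> rag -> generate_answers / search_link run.
    return {"answer": f"(answer fragment from {url})", "links": []}

def merge_answers_node(information: str, answers: list[str]) -> str:
    return " ".join([information, *answers]).strip()

def deep_search_graph(user_query: str, base_endpoint: str,
                      max_rounds: int = 3, max_subgraphs: int = 5) -> str:
    information, endpoints, visited = "", [base_endpoint], set()
    for _ in range(max_rounds):                        # conditional_node_rounds
        plan = plan_node(user_query, information)
        if plan["enough_information"]:                 # conditional_node (enough_information)
            break
        candidates = [e for e in endpoints if e not in visited]
        top_k = rerank_link_node(plan["missing_information_query"], candidates)[:max_subgraphs]
        visited.update(top_k)
        # graph_iterator_node: one fetch/parse/rag/search_link subgraph per link.
        results = [search_graph(u, plan["missing_information_query"]) for u in top_k]
        information = merge_answers_node(information, [r["answer"] for r in results])
        endpoints += [link for r in results for link in r["links"]]
    return information                                 # merge_answers_node_final

print(deep_search_graph("what does this project do?", "https://example.com"))
```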
