We maintain a curated collection of papers exploring the path towards Deep Research (DR) Agents, focusing on formulating core concepts and mapping the research landscape.
⌛️ Coming soon – Version 2! We’re continuously compiling and updating cutting‑edge insights. Feel free to suggest any related work you find valuable!
Contributions Welcome!
🔥 This project is actively maintained, and we welcome your contributions. If you have any suggestions, such as missing papers or information, please feel free to open an issue or submit a pull request.
✨✨✨ Deep Research Agents: A Systematic Examination And Roadmap
Table of Contents
- Search Engine Integration
- Tool Use
- Architecture & Workflow
- Tuning Methods
- Industrial Applications
- Benchmarks for DR Agents
📊 Search Engine · API vs Browser Comparison
Legend | ✔️ Primary focus | 🟫 Secondary/minor focus | — Not present |
---|
DR Agent | API | Browser | GAIA | HLE | QA | Base Model |
---|---|---|---|---|---|---|
AvaTaR | 🟫 | — | — | — | STaRK | Claude‑3‑Opus, GPT‑4 |
CoSearchAgent | ✔️ | — | — | — | — | GPT‑3.5‑turbo |
MMAC-Copilot | ✔️ | — | ✔️ | — | — | GPT‑3.5, GPT‑4 |
Storm | 🟫 | — | — | — | FreshWiki | GPT‑3.5‑turbo |
OpenResearcher | ✔️ | — | — | — | Private QA | DeepSeek‑V2‑Chat |
The AI Scientist | ✔️ | — | — | — | MLE‑Bench | GPT‑4o, o1‑mini, o1‑preview |
Gemini DR | ✔️ | ✔️ | — | ✔️ | GPQA | Gemini‑2.0‑Flash |
Agent Laboratory | ✔️ | — | — | — | MLE‑Bench | GPT‑4o, o1‑preview |
Search‑o1 | ✔️ | — | — | — | GPQA·NQ·TriviaQA | QwQ‑32B‑preview |
Agentic Reasoning | ✔️ | — | — | — | GPQA | DeepSeek‑R1, Qwen2.5 |
AutoAgent | — | ✔️ | ✔️ | — | — | Claude‑Sonnet‑3.5 |
Grok DeepSearch | ✔️ | ✔️ | — | — | GPQA | Grok 3 |
OpenAI DR | — | ✔️ | ✔️ | ✔️ | ✔️ | GPT‑o3 |
Perplexity DR | ✔️ | 🟫 | — | ✔️ | SimpleQA | Flexible |
AgentRxiv | ✔️ | — | — | — | GPQA·MedQA | GPT‑4o‑mini |
Agent‑R1 | ✔️ | — | — | — | HotpotQA | Qwen2.5‑1.5B‑Inst |
AutoGLM Rumination | — | ✔️ | — | — | GPQA | GLM‑Z1‑Air |
Copilot Researcher | — | ✔️ | — | — | — | o3‑mini |
H2O.ai DR | ✔️ | ✔️ | ✔️ | — | — | h2ogpt‑oasst1‑512‑12b |
Manus | ✔️ | ✔️ | — | — | — | Claude3.5, GPT‑4o |
OpenManus | ✔️ | ✔️ | — | — | — | Claude3.5, GPT‑4o |
OWL | ✔️ | ✔️ | ✔️ | — | — | DeepSeek‑R1, Gemini‑2.5‑Pro, GPT‑4o |
R1‑Searcher | 🟫 | — | — | — | 2WikiMultiHopQA, HotpotQA | Llama3.1‑8B‑Inst, Qwen2.5‑7B |
ReSearch | 🟫 | — | — | — | 2WikiMultiHopQA, HotpotQA | Qwen2.5‑7B, Qwen2.5‑7B‑Inst |
Search‑R1 | 🟫 | — | — | — | 2WikiMultiHopQA, HotpotQA, NQ, TriviaQA | Llama3.2‑3B, Qwen2.5‑3B/7B |
DeepResearcher | — | ✔️ | — | — | HotpotQA, NQ, TriviaQA | Qwen2.5‑7B‑Inst |
Genspark Super Agent | ✔️ | ✔️ | ✔️ | — | — | Mixture of 9 LLMs |
WebThinker | ✔️ | — | ✔️ | ✔️ | GPQA, WebWalkerQA | QwQ‑32B |
SWIRL | — | ✔️ | — | — | HotpotQA, BeerQA | Gemma 2‑27B |
SimpleDeepSearcher | — | ✔️ | ✔️ | — | 2WikiMultiHopQA | Qwen2.5‑7B/32B‑Inst, DeepSeek‑R1‑Distill‑Qwen‑32B, QwQ‑32B |
Suna AI | ✔️ | ✔️ | — | — | — | GPT‑4o, Claude |
AgenticSeek | — | ✔️ | — | — | — | GPT‑4o, DeepSeek‑R1, Claude |
Alita | ✔️ | ✔️ | ✔️ | — | PathVQA | GPT‑4o, Claude‑Sonnet‑4 |
DeerFlow | ✔️ | — | — | — | — | Doubao‑1.5‑Pro‑32k, DeepSeek‑R1, GPT‑4o, Qwen |
PANGU DEEPDIVER | ✔️ | — | — | — | C-SimpleQA, HotpotQA, ProxyQA | Pangu‑7B‑Reasoner |
WebDancer | ✔️ | ✔️ | ✔️ | — | GAIA, WebWalkerQA | WebDancer‑QwQ-32B |
📊 Tool Use Capabilities Comparison
Legend | ✔️ Involved | 🟫 Not disclosed | — Not present |
---|
DR Agent | Code Interp. | Data Analytics | Multimodal | MCP |
---|---|---|---|---|
CoSearchAgent | — | ✔️ | — | — |
Storm | ✔️ | — | — | — |
The AI Scientist | ✔️ | — | — | — |
Agent Laboratory | ✔️ | — | — | — |
Agentic Reasoning | ✔️ | — | — | — |
AutoAgent | ✔️ | — | ✔️ | — |
Genspark DR | ✔️ | ✔️ | ✔️ | ✔️ |
Grok DeepSearch | ✔️ | ✔️ | ✔️ | 🟫 |
OpenAI DR | ✔️ | ✔️ | ✔️ | ✔️ |
Perplexity DR | ✔️ | ✔️ | ✔️ | 🟫 |
Agent‑R1 | ✔️ | — | — | — |
AutoGLM Rumination | ✔️ | — | ✔️ | 🟫 |
Copilot Researcher | ✔️ | ✔️ | 🟫 | 🟫 |
Manus | ✔️ | ✔️ | ✔️ | 🟫 |
OpenManus | ✔️ | ✔️ | — | ✔️ |
OWL | ✔️ | ✔️ | ✔️ | ✔️ |
H2O.ai DR | ✔️ | ✔️ | ✔️ | 🟫 |
Genspark Super Agent | ✔️ | ✔️ | ✔️ | ✔️ |
WebThinker | ✔️ | — | — | — |
Suna AI | ✔️ | ✔️ | — | — |
AgenticSeek | ✔️ | ✔️ | — | — |
Alita | ✔️ | 🟫 | 🟫 | — |
DeerFlow | ✔️ | ✔️ | — | — |
- AvaTaR: Optimizing LLM Agents for Tool Usage via Contrastive Reasoning
- Agent Laboratory: Using LLM Agents as Research Assistants
- CoSearchAgent: A Lightweight Collaborative Search Agent with Large Language Models
- AgentRxiv: Towards Collaborative Autonomous Research
- The AI Scientist: Towards Fully Automated Open-Ended Scientific Discovery
- OpenResearcher: Unleashing AI for Accelerated Scientific Research
- Search‑o1: Agentic Search‑Enhanced Large Reasoning Models
- Agentic Reasoning: Reasoning LLMs with Tools for the Deep Research
- Agent‑R1: Training Powerful LLM Agents with End‑to‑End Reinforcement Learning
- R1‑Searcher: Incentivizing the Search Capability in LLMs via Reinforcement Learning
- Search‑R1: Training LLMs to Reason and Leverage Search Engines with Reinforcement Learning
- DeepResearcher: Scaling Deep Research via Reinforcement Learning in Real‑world Environments
- WebThinker: Empowering Large Reasoning Models with Deep Research Capability
- OpenAI Deep Research
- Google Gemini Deep Research
- Grok DeepSearch
- Perplexity Deep Research
- ReSearch: Learning to Reason with Search for LLMs via Reinforcement Learning
- Synthetic Data Generation & Multi-Step RL for Reasoning & Tool Use
- H2O.ai Deep Research
- SimpleDeepSearcher: Deep Information Seeking via Web-Powered Reasoning Trajectory Synthesis
- WebDancer: Towards Autonomous Information Seeking Agency
- Suna AI
- Alita: Generalist Agent Enabling Scalable Agentic Reasoning with Minimal Predefinition and Maximal Self‑Evolution
- AgenticSeek
- AutoGLM Rumination
- Assisting in Writing Wikipedia-like Articles From Scratch with Large Language Models
- AutoAgent: A Fully-Automated and Zero-Code Framework for LLM Agents
- OpenManus
- Manus
- OWL
- Pangu DeepDiver: Adaptive Search Intensity Scaling via Open-Web Reinforcement Learning
- Copilot Researcher
- Genspark Super Agent
- DeerFlow
- MMAC-Copilot
📊 Tuning Methods Comparison
Legend | ✔️ Implemented | 🟫 Details Unknown | — Not present |
---|
DR Agent | SFT | RL | Base Model | Data | Reward Design |
---|---|---|---|---|---|
Gemini DR | 🟫 | 🟫 | Gemini‑2.0‑Flash | — | 🟫 |
Grok DeepSearch | — | 🟫 | Grok 3 | — | 🟫 |
OpenAI DR | — | 🟫 | GPT‑o3 | — | 🟫 |
Agent‑R1 | — | PPO · REINFORCE++ · GRPO | Qwen2.5‑1.5B‑In | HotpotQA | Rule‑Outcome |
AutoGLM Rumination | 🟫 | 🟫 | GLM‑Z1‑Air | — | 🟫 |
H2O.ai DR | ✔️ | 🟫 | h2ogpt‑oasst1‑512‑12b | — | 🟫 |
Copilot Researcher | 🟫 | 🟫 | o3‑mini | — | — |
ReSearch | — | GRPO | Qwen2.5‑7B‑In / 32B‑In | 2WikiMultiHopQA | Rule‑Outcome |
R1‑Searcher | ✔️ | REINFORCE++ · GRPO | Qwen2.5‑7B‑In / Llama3.1‑8B‑In | 2WikiMultiHopQA · HotpotQA | Rule‑Outcome |
Search‑R1 | ✔️ | PPO · GRPO | Qwen2.5‑3B/7B / Llama3.2‑3B‑In | NQ · HotpotQA | Rule‑Outcome |
DeepResearcher | — | GRPO | Qwen2.5‑7B‑In | NQ · HotpotQA | Rule‑Outcome |
Genspark Super Agent | — | — | Mixture of Agents | — | — |
WebThinker | ✔️ | Iterative Online DPO | QwQ‑32B | Expert Dataset | Rule‑Outcome |
SWIRL | — | Offline‑RL | Gemma 2‑27B | HotpotQA | — |
SimpleDeepSearcher | ✔️ | DPO · REINFORCE++ | Qwen2.5‑7B/32B‑In · DeepSeek‑Distilled‑Qwen‑32B · QwQ‑32B | NQ · HotpotQA · 2WikiMultiHopQA · Musique · SimpleQA · MultiHop‑RAG | Process‑based reward |
PANGU DEEPDIVER | ✔️ | GRPO | Pangu‑7B‑Reasoner | WebPuzzle | Rule‑Outcome |
WebDancer | ✔️ | GRPO | QwQ‑32B | CrawlQA · E2HQA | Rule‑Outcome |
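The "Rule‑Outcome" entries in the Reward Design column are commonly understood as rewards computed by a fixed rule on the final answer, rather than by a learned reward model. As a rough illustration only (this is not the implementation of any agent listed above), the sketch below gives a rollout a reward of 1.0 when its final answer exactly matches a gold answer after normalization. The `<answer>` tag format, the function names, and the normalization steps are all assumptions made for this example.

```python
import re
import string
from typing import List, Optional


def normalize_answer(text: str) -> str:
    """Lowercase, drop ASCII punctuation and English articles, collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def extract_answer(response: str) -> Optional[str]:
    """Return the text of the last <answer>...</answer> pair, or None if absent.

    The tag format is hypothetical; real systems define their own output format.
    """
    matches = re.findall(r"<answer>(.*?)</answer>", response, flags=re.DOTALL)
    return matches[-1].strip() if matches else None


def rule_outcome_reward(response: str, gold_answers: List[str]) -> float:
    """Exact-match outcome reward: 1.0 on a normalized match, otherwise 0.0."""
    predicted = extract_answer(response)
    if predicted is None:
        # No parseable final answer, so the rollout earns nothing.
        return 0.0
    predicted_norm = normalize_answer(predicted)
    matched = any(predicted_norm == normalize_answer(g) for g in gold_answers)
    return 1.0 if matched else 0.0


if __name__ == "__main__":
    rollout = "<think>search, then read</think><answer>The Eiffel Tower.</answer>"
    print(rule_outcome_reward(rollout, ["Eiffel Tower"]))
```

Individual papers may instead use F1 overlap, add a separate format reward, or combine several rules; consult each paper for its actual reward definition.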
📊 QA Benchmarks (HotpotQA / 2WikiMultiHopQA / NQ / TriviaQA / GPQA)
DR Agent | Base Model | Hotpot | 2Wiki | NQ | TQ | GPQA |
---|---|---|---|---|---|---|
Search‑o1 | QwQ‑32B‑preview | 57.3 | 71.4 | 49.7 | 74.1 | 57.9 |
Agentic Reasoning | DeepSeek‑R1, Qwen2.5 | — | — | — | — | 67.0 |
Grok DeepSearch | Grok 3 | — | — | — | — | 84.6 |
AgentRxiv | GPT‑4o‑mini | — | — | — | — | 41.0 |
R1‑Searcher | Qwen2.5‑7B‑Base | 71.9 | 63.8 | — | — | — |
ReSearch | Qwen2.5‑7B‑Base | 30.0 | 29.7 | — | — | — |
ReSearch | Qwen2.5‑7B‑In | 63.6 | 54.2 | — | — | — |
ReSearch | Qwen2.5‑32B‑Base | 64.3 | 45.6 | — | — | — |
ReSearch | Qwen2.5‑32B‑In | 67.7 | 50.0 | — | — | — |
Search‑R1 | Llama3.2‑3B‑Base | 30.0 | 29.7 | 43.1 | 61.2 | — |
Search‑R1 | Llama3.2‑3B‑In | 31.4 | 23.3 | 35.7 | 57.8 | — |
Search‑R1 | Qwen2.5‑7B‑Base | 28.3 | 27.3 | 39.6 | 58.2 | — |
Search‑R1 | Qwen2.5‑7B‑In | 34.5 | 36.9 | 40.9 | 55.2 | — |
DeepResearcher | Qwen2.5‑7B‑In | 64.3 | 66.6 | 61.9 | 85.0 | — |
Genspark Super Agent | Mixture of Agents | — | — | — | — | — |
WebThinker | QwQ‑32B | — | — | — | — | 68.7 |
SimpleDeepSearcher | Qwen2.5‑7B‑In | — | 68.1 | — | — | — |
SimpleDeepSearcher | Qwen2.5‑32B‑In | 70.5 | — | — | — | — |
SimpleDeepSearcher | DeepSeek‑R1‑Distill‑Qwen‑32B | 68.1 | — | — | — | — |
SimpleDeepSearcher | QwQ‑32B | 73.5 | — | — | — | — |
SWIRL | Gemma 2‑27B | 72.0 | — | — | — | — |
📊 GAIA (Test and Val) & HLE Benchmarks
DR Agent | Base Model | GAIA L-1 | L-2 | L-3 | Avg. | HLE | Split |
---|---|---|---|---|---|---|---|
MMAC-Copilot | GPT‑3.5, GPT‑4 | 45.16 | 20.75 | 6.12 | 25.91 | — | Test |
H2O.ai DR | Claude‑3.7‑Sonnet | 89.25 | 79.87 | 61.22 | 79.73 | — | Test |
Alita | Claude‑Sonnet‑4, GPT‑4o | 92.47 | 71.70 | 55.10 | 75.42 | — | Test |
AutoAgent | Claude‑Sonnet‑3.5 | 71.70 | 53.50 | 26.90 | 55.20 | — | Dev |
OpenAI DR | GPT‑o3‑custom | 78.70 | 73.20 | 58.00 | 67.40 | 26.6 | Dev |
Perplexity DR | Flexible | — | — | — | — | 21.1 | Dev |
Manus | Claude 3.5, GPT‑4o | 86.50 | 70.10 | 57.70 | 71.4 | — | Dev |
OWL | Claude‑3.7‑Sonnet | 84.90 | 68.60 | 42.30 | 69.70 | — | Dev |
H2O.ai DR | h2ogpt‑oasst1‑512‑12b | 67.92 | 67.44 | 42.31 | 63.64 | — | Dev |
Genspark Super Agent | Claude 3 Opus | 87.8 | 72.7 | 58.8 | 73.1 | — | Dev |
WebThinker | QwQ‑32B | 53.8 | 44.2 | 16.7 | 44.7 | 13.0 | Dev |
WebDancer | QwQ‑32B | 61.5 | 50.0 | 25.0 | 51.5 | — | Dev |
SimpleDeepSearcher | QwQ‑32B | 50.5 | 45.8 | 13.8 | 43.9 | — | Dev |
Alita | Claude‑Sonnet‑4, GPT‑4o | 75.15 | — | 87.27 | — | — | Dev |
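The average column for the Test rows appears to be a question-weighted mean of the three level scores, not a simple mean. The sketch below reproduces it under one assumption that is not stated in this document: that the GAIA test split contains 93, 159, and 49 questions at Levels 1, 2, and 3. The function and constant names are illustrative only.

```python
from typing import Sequence

# Assumed per-level question counts for the GAIA test split (Level 1, 2, 3).
# These numbers are not given in this README; verify them against the benchmark.
GAIA_TEST_COUNTS = (93, 159, 49)


def weighted_average(level_scores: Sequence[float], level_counts: Sequence[int]) -> float:
    """Combine per-level accuracies (in %) into one score, weighted by question count."""
    if len(level_scores) != len(level_counts):
        raise ValueError("level_scores and level_counts must have the same length")
    total_questions = sum(level_counts)
    weighted_sum = sum(score * count for score, count in zip(level_scores, level_counts))
    return weighted_sum / total_questions


if __name__ == "__main__":
    # MMAC-Copilot, Test split: the table reports an average of 25.91.
    print(round(weighted_average((45.16, 20.75, 6.12), GAIA_TEST_COUNTS), 2))  # 25.91
    # H2O.ai DR, Test split: the table reports an average of 79.73.
    print(round(weighted_average((89.25, 79.87, 61.22), GAIA_TEST_COUNTS), 2))  # 79.73
```

Because the inputs are already rounded to two decimals, a recomputed average can differ from the reported one in the last digit. Dev rows may be evaluated on the full validation set or on a subset of it, so the same counts should not be assumed to apply there.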
If you find this work helpful, please cite our paper:
@misc{huang2025deepresearchagentssystematic,
title={Deep Research Agents: A Systematic Examination And Roadmap},
author={Yuxuan Huang and Yihang Chen and Haozheng Zhang and Kang Li and Meng Fang and Linyi Yang and Xiaoguang Li and Lifeng Shang and Songcen Xu and Jianye Hao and Kun Shao and Jun Wang},
year={2025},
eprint={2506.18096},
archivePrefix={arXiv},
primaryClass={cs.AI},
url={https://arxiv.org/abs/2506.18096},
}