feat(websearch): Add Exa search provider, make Tavily/Exa API Base URL configurable, and expand the web search tool docs #7359

piexian wants to merge 4 commits into AstrBotDevs:master
Conversation
- Add the Exa search provider with three tools:
  - `web_search_exa`: semantic search supporting 5 search types and 6 vertical domains
  - `exa_extract_web_page`: extract full page text via the `/contents` endpoint
  - `exa_find_similar`: find semantically similar pages via the `/findSimilar` endpoint
- Make the Tavily and Exa API Base URLs configurable in the WebUI, for proxies and self-hosted instances
- Add a configurable `timeout` parameter (minimum 30s) to all web search tools
- Extend `MessageList.vue` citation parsing to Exa/BoCha/findSimilar
- Update config metadata, i18n, routes, and hooks
- Update the Chinese and English user docs with tool parameter notes for Tavily/BoCha/Baidu AI Search
Hey - I've found 2 issues, and left some high level feedback:
- The minimum-timeout enforcement logic (`if timeout < 30: timeout = 30`) is duplicated across many tools (`fetch_url`, Tavily/BoCha/Exa helpers, etc.); consider extracting a small utility (or a module-level `MIN_TIMEOUT` constant plus helper) to centralize this behavior and avoid inconsistencies (e.g., `_web_search_exa` currently lacks the clamp).
- The Exa API key missing error message is in Chinese in `_get_exa_key` while other user-facing errors in this module are English; aligning these messages to a consistent language will make debugging and UX more coherent.
- The lists of supported web-search tools for reference extraction are now duplicated in multiple places (e.g., `astr_agent_hooks._extract_web_search_refs`, dashboard routes, and `MessageList.vue`), which makes it easy to miss a spot when adding new providers; consider centralizing this mapping or deriving it from a shared config to keep UI and backend behavior in sync.
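For the point about duplicated tool lists, one way to centralize the mapping is a single shared constant that the hooks and dashboard routes import (and from which the frontend list could be generated). This is a sketch under the assumption that such a shared module does not yet exist; the module path and the name `WEB_SEARCH_REF_TOOLS` are hypothetical:

```python
# Hypothetical shared module, e.g. astrbot/core/web_search_tools.py.
# Backend hooks and dashboard routes would import this single source of
# truth; the MessageList.vue list could be generated from it or served
# to the frontend via an API route.
WEB_SEARCH_REF_TOOLS: frozenset[str] = frozenset(
    {
        "web_search_tavily",
        "web_search_bocha",
        "web_search_exa",
        "exa_find_similar",
    }
)


def is_ref_tool(tool_name: str) -> bool:
    """Return True if this tool's results should be parsed for <ref> citations."""
    return tool_name in WEB_SEARCH_REF_TOOLS
```

Adding a new provider then means touching exactly one set, rather than hunting down every hard-coded list.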
## Individual Comments
### Comment 1
<location path="astrbot/builtin_stars/web_searcher/main.py" line_range="159-168" />
<code_context>

```diff
    self,
    cfg: AstrBotConfig,
    payload: dict,
+   timeout: int = 30,
) -> list[SearchResult]:
    """Search using the Tavily search engine."""
```

</code_context>
<issue_to_address>
**suggestion:** Normalize the timeout value inside `_web_search_exa` for consistency and safety.
Other helpers (`_get_from_url`, `_web_search_tavily`, `_extract_tavily`, `_web_search_bocha`, `_extract_exa`, `_find_similar_exa`) all enforce a minimum 30s timeout internally, while `_web_search_exa` relies on its caller (`search_from_exa`) to clamp the value. If `_web_search_exa` is reused elsewhere, it may see much smaller timeouts and behave inconsistently. Please add the same `if timeout < 30: timeout = 30` guard at the top of `_web_search_exa` to align behavior and avoid unexpectedly short timeouts.
Suggested implementation:
```python
) -> list[SearchResult]:
"""使用 Exa 搜索引擎进行搜索"""
if timeout < 30:
timeout = 30
```
If the `_web_search_exa` signature or docstring differ slightly (e.g., different Chinese text or no docstring), adjust the SEARCH pattern to match the actual function header and insert:
if timeout < 30:
timeout = 30
as the first statement in the function body, immediately after any docstring, to keep behavior consistent with `_get_from_url`, `_web_search_tavily`, `_extract_tavily`, `_web_search_bocha`, `_extract_exa`, and `_find_similar_exa`.
</issue_to_address>
### Comment 2
<location path="astrbot/builtin_stars/web_searcher/main.py" line_range="68" />
<code_context>

```diff
    """Clean text: strip whitespace, newlines, etc."""
    return text.strip().replace("\n", " ").replace("\r", " ").replace("  ", " ")
-async def _get_from_url(self, url: str) -> str:
+async def _get_from_url(self, url: str, timeout: int = 30) -> str:
    """Fetch web page content."""
```

</code_context>
<issue_to_address>
**issue (complexity):** Consider extracting shared helpers for timeout handling, base-URL construction, and Exa HTTP requests to remove duplication and make the web search code easier to maintain.
You can keep all the new functionality while cutting a lot of duplication with a few small helpers. The main hot spots are timeout handling, base URL construction, and Exa HTTP calls.
### 1. Centralize timeout normalization
The `if timeout < 30: timeout = 30` pattern is repeated many times.
Add a helper:
```python
def _normalize_timeout(self, timeout: int | None, minimum: int = 30) -> aiohttp.ClientTimeout:
if timeout is None:
timeout = minimum
elif timeout < minimum:
timeout = minimum
return aiohttp.ClientTimeout(total=timeout)
```
Then use it at call sites instead of repeating the logic:
```python
async def _web_search_tavily(self, cfg: AstrBotConfig, payload: dict, timeout: int = 30) -> list[SearchResult]:
tavily_key = await self._get_tavily_key(cfg)
base_url = self._tavily_base_url(cfg)
url = f"{base_url}/search"
header = {
"Authorization": f"Bearer {tavily_key}",
"Content-Type": "application/json",
}
timeout_obj = self._normalize_timeout(timeout)
async with aiohttp.ClientSession(trust_env=True) as session:
async with session.post(url, json=payload, headers=header, timeout=timeout_obj) as response:
...
```
And for tools you can drop the inline clamp:
```python
@llm_tool(name="fetch_url")
async def fetch_website_content(self, event: AstrMessageEvent, url: str, timeout: int = 30) -> str:
timeout_obj = self._normalize_timeout(timeout)
resp = await self._get_from_url(url, timeout_obj.total)
return resp
```
(or just pass `timeout_obj` through if you adjust `_get_from_url`).
### 2. Extract base URL helpers for providers
The Tavily and Exa base URL logic is repeated.
Add:
```python
def _tavily_base_url(self, cfg: AstrBotConfig) -> str:
return (
cfg.get("provider_settings", {})
.get("websearch_tavily_base_url", "https://api.tavily.com")
.rstrip("/")
)
def _exa_base_url(self, cfg: AstrBotConfig) -> str:
return (
cfg.get("provider_settings", {})
.get("websearch_exa_base_url", "https://api.exa.ai")
.rstrip("/")
)
```
Then simplify call sites:
```python
base_url = self._tavily_base_url(cfg)
url = f"{base_url}/search"
```
```python
base_url = self._exa_base_url(cfg)
url = f"{base_url}/contents"
```
This removes duplication and keeps provider-specific config in one place.
### 3. Consolidate Exa HTTP request logic
`_web_search_exa`, `_extract_exa`, and `_find_similar_exa` all repeat the same HTTP boilerplate. You can pull that out into one internal helper that deals with key retrieval, base URL, headers, timeout, and error handling:
```python
async def _exa_request(
self,
cfg: AstrBotConfig,
path: str,
payload: dict,
timeout: int = 30,
) -> dict:
exa_key = await self._get_exa_key(cfg)
base_url = self._exa_base_url(cfg)
url = f"{base_url}/{path.lstrip('/')}"
header = {
"x-api-key": exa_key,
"Content-Type": "application/json",
}
timeout_obj = self._normalize_timeout(timeout)
async with aiohttp.ClientSession(trust_env=True) as session:
async with session.post(url, json=payload, headers=header, timeout=timeout_obj) as response:
if response.status != 200:
reason = await response.text()
raise Exception(
f"Exa request to {path} failed: {reason}, status: {response.status}",
)
return await response.json()
```
Then each high-level method only shapes payload and maps results:
```python
async def _web_search_exa(
self,
cfg: AstrBotConfig,
payload: dict,
timeout: int = 30,
) -> list[SearchResult]:
data = await self._exa_request(cfg, "search", payload, timeout=timeout)
results: list[SearchResult] = []
for item in data.get("results", []):
results.append(
SearchResult(
title=item.get("title", ""),
url=item.get("url", ""),
snippet=(item.get("text") or "")[:500],
)
)
return results
```
```python
async def _extract_exa(
self, cfg: AstrBotConfig, payload: dict, timeout: int = 30
) -> list[dict]:
data = await self._exa_request(cfg, "contents", payload, timeout=timeout)
results: list[dict] = data.get("results", [])
if not results:
raise ValueError("Error: Exa content extraction does not return any results.")
return results
```
```python
async def _find_similar_exa(
self, cfg: AstrBotConfig, payload: dict, timeout: int = 30
) -> list[SearchResult]:
data = await self._exa_request(cfg, "findSimilar", payload, timeout=timeout)
results: list[SearchResult] = []
for item in data.get("results", []):
results.append(
SearchResult(
title=item.get("title", ""),
url=item.get("url", ""),
snippet=(item.get("text") or "")[:500],
)
)
return results
```
This way, if you change headers, auth, or error handling, you only touch `_exa_request`.
### 4. Optional: small helpers for repeated validation
If you want to further simplify the public tool methods, a couple of tiny validators can keep them focused on behavior rather than plumbing.
For example, Exa config check and clamping:
```python
def _ensure_exa_config(self, cfg: AstrBotConfig) -> None:
if not cfg.get("provider_settings", {}).get("websearch_exa_key", []):
raise ValueError("Error: Exa API key is not configured in AstrBot.")
def _clamp_results(self, value: int, minimum: int, maximum: int) -> int:
return max(minimum, min(value, maximum))
```
Usage:
```python
@llm_tool("web_search_exa")
async def search_from_exa(..., max_results: int = 10, ...):
...
cfg = self.context.get_config(umo=event.unified_msg_origin)
self._ensure_exa_config(cfg)
max_results = self._clamp_results(max_results, 1, 100)
...
```
These changes keep all functionality (timeouts, base URLs, Exa/Tavily features) but reduce repetition and make future changes safer and easier.
</issue_to_address>
Code Review
This pull request introduces the Exa search provider, adding tools for semantic search, web page extraction, and finding similar links. It also adds support for configurable base URLs for Tavily and Exa, and implements a minimum 30-second timeout across various search and extraction tools. Feedback includes addressing a potential IndexError during Exa API key rotation, reusing aiohttp sessions for efficiency, improving error handling when no extraction results are found, and refactoring tool management logic to reduce duplication.
```python
key = exa_keys[self.exa_key_index]
self.exa_key_index = (self.exa_key_index + 1) % len(exa_keys)
```
In `_get_exa_key`, if the length of the configured API key list changes (for example, the user deletes some keys in the dashboard) while `self.exa_key_index` still holds a larger stale value, indexing `exa_keys[self.exa_key_index]` directly can raise an `IndexError`. Take the index modulo the current list length before accessing it to keep the index safe.
```diff
-key = exa_keys[self.exa_key_index]
-self.exa_key_index = (self.exa_key_index + 1) % len(exa_keys)
+self.exa_key_index %= len(exa_keys)
+key = exa_keys[self.exa_key_index]
+self.exa_key_index = (self.exa_key_index + 1) % len(exa_keys)
```
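The same rotation pattern can be exercised in isolation. A minimal self-contained sketch (the `KeyRotator` class is illustrative, not from the PR) showing why re-clamping first survives a shrinking key list:

```python
class KeyRotator:
    """Round-robin over an API key list that may shrink between calls."""

    def __init__(self) -> None:
        self.index = 0

    def next_key(self, keys: list[str]) -> str:
        if not keys:
            raise ValueError("Error: no API keys configured.")
        # Re-clamp first, so a stale index left over from a longer,
        # older list can never raise IndexError after keys were removed.
        self.index %= len(keys)
        key = keys[self.index]
        self.index = (self.index + 1) % len(keys)
        return key
```

Without the `self.index %= len(keys)` guard, shrinking the list while the stored index points past its new end would raise `IndexError` on the next call.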
```python
    "x-api-key": exa_key,
    "Content-Type": "application/json",
}
async with aiohttp.ClientSession(trust_env=True) as session:
```
```python
if not results:
    raise ValueError(
        "Error: Exa content extraction does not return any results.",
    )
return results
```
```python
tool_set.remove_tool("web_search_exa")
tool_set.remove_tool("exa_extract_web_page")
tool_set.remove_tool("exa_find_similar")
```
Pull request overview
This PR extends AstrBot’s web search capabilities by adding an Exa provider (semantic search + extraction + similar-page discovery), making Tavily/Exa API base URLs configurable for proxy/self-hosted endpoints, and updating the dashboard and docs to reflect the expanded toolchain and citation parsing.
Changes:
- Add Exa as a new `websearch_provider`, including the `web_search_exa`, `exa_extract_web_page`, and `exa_find_similar` LLM tools.
- Make Tavily/Exa API Base URL configurable and thread it through web search + URL extraction/KB upload flows.
- Update WebUI citation/ref parsing and expand websearch documentation (ZH/EN) plus config metadata i18n.
Reviewed changes
Copilot reviewed 13 out of 13 changed files in this pull request and generated 8 comments.
Show a summary per file
| File | Description |
|---|---|
| docs/zh/use/websearch.md | Expands ZH docs for default/Tavily/Exa/BoCha/Baidu tool parameters and configuration. |
| docs/en/use/websearch.md | Expands EN docs for provider options, tool parameters, and Base URL configuration. |
| dashboard/src/i18n/locales/zh-CN/features/config-metadata.json | Adds metadata strings for Tavily/Exa Base URL + Exa key. |
| dashboard/src/i18n/locales/en-US/features/config-metadata.json | Adds metadata strings for Tavily/Exa Base URL + Exa key. |
| dashboard/src/i18n/locales/ru-RU/features/config-metadata.json | Adds metadata strings for Tavily/Exa Base URL + Exa key. |
| dashboard/src/components/chat/MessageList.vue | Updates supported tool-call parsing to recognize Exa + findSimilar results for refs. |
| astrbot/dashboard/routes/live_chat.py | Extends supported tool list for extracting <ref> citations (Exa + findSimilar). |
| astrbot/dashboard/routes/chat.py | Extends supported tool list for extracting <ref> citations (Exa + findSimilar). |
| astrbot/core/knowledge_base/parsers/url_parser.py | Adds Tavily Base URL support to the KB URL extractor wrapper. |
| astrbot/core/knowledge_base/kb_helper.py | Plumbs Tavily Base URL into KB “upload from URL” extraction. |
| astrbot/core/config/default.py | Adds new provider settings defaults + metadata for Exa and Base URLs. |
| astrbot/core/astr_agent_hooks.py | Extends webchat citation-injection logic to Exa + findSimilar tools. |
| astrbot/builtin_stars/web_searcher/main.py | Implements Exa tools, adds configurable Base URLs, and adds per-tool optional timeout support. |
```python
header = HEADERS
header.update({"User-Agent": random.choice(USER_AGENTS)})
```
_get_from_url mutates the shared HEADERS dict via header = HEADERS + header.update(...). Since HEADERS is a module-level constant used elsewhere, this can cause cross-request header leakage and race conditions under concurrency. Use a per-request copy (e.g., HEADERS.copy() and then set User-Agent) instead of mutating the global dict.
```diff
-header = HEADERS
-header.update({"User-Agent": random.choice(USER_AGENTS)})
+header = HEADERS.copy()
+header["User-Agent"] = random.choice(USER_AGENTS)
```
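The hazard here is that `header = HEADERS` only binds a second name to the same dict object. A self-contained demonstration (the `HEADERS`/`USER_AGENTS` values below are stand-ins, not the module's actual constants):

```python
import random

HEADERS = {"Accept": "text/html"}  # shared module-level constant
USER_AGENTS = ["UA-1", "UA-2"]


def build_headers_buggy() -> dict:
    header = HEADERS  # alias, NOT a copy
    header.update({"User-Agent": random.choice(USER_AGENTS)})
    return header  # side effect: HEADERS itself is now polluted


def build_headers_fixed() -> dict:
    header = HEADERS.copy()  # per-request shallow copy
    header["User-Agent"] = random.choice(USER_AGENTS)
    return header  # HEADERS stays untouched
```

Under concurrency the buggy variant also races: two requests can interleave their `update` calls on the shared dict, so one request may be sent with the other's `User-Agent`.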
```python
base_url = (
    cfg.get("provider_settings", {})
    .get("websearch_tavily_base_url", "https://api.tavily.com")
    .rstrip("/")
)
if timeout < 30:
    timeout = 30
url = f"{base_url}/search"
```
websearch_*_base_url values are used directly after rstrip('/'), but user-entered config often includes whitespace. If the value is " https://api..." or empty, requests will fail with invalid URLs. Consider normalizing with .strip().rstrip('/') and validating that the result starts with http:// or https:// before building endpoint URLs (and raise a clear ValueError if invalid).
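A hedged sketch of the suggested normalization (the function name and the fall-back-to-default behavior are illustrative choices, not part of the PR):

```python
def normalize_base_url(raw: str, default: str) -> str:
    """Strip whitespace and trailing slashes from a user-entered base URL.

    Falls back to the default when the value is empty, and rejects values
    without an http(s) scheme with a clear error instead of letting the
    request fail later with an opaque invalid-URL exception.
    """
    url = (raw or "").strip().rstrip("/")
    if not url:
        return default.rstrip("/")
    if not url.startswith(("http://", "https://")):
        raise ValueError(f"Invalid base URL (expected http/https): {raw!r}")
    return url
```

Applying this once at config-load time (e.g., in `__init__`) surfaces bad input immediately rather than on the first search request.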
```python
def __init__(
    self, tavily_keys: list[str], tavily_base_url: str = "https://api.tavily.com"
) -> None:
    """
    Initialize the URL extractor.

    Args:
        tavily_keys: list of Tavily API keys
        tavily_base_url: Tavily API base URL
    """
    if not tavily_keys:
        raise ValueError("Error: Tavily API keys are not configured.")

    self.tavily_keys = tavily_keys
    self.tavily_key_index = 0
    self.tavily_key_lock = asyncio.Lock()
    self.tavily_base_url = tavily_base_url.rstrip("/")
```
URLExtractor stores tavily_base_url after only rstrip('/'). If the configured base URL contains leading/trailing whitespace or is empty, later requests will fail with an invalid URL. Normalize with .strip().rstrip('/') and validate scheme (http/https) early in __init__ so errors are surfaced clearly when loading config.
```diff
 AstrBot 支持 5 种网页搜索源接入方式:`默认`、`Tavily`、`百度 AI 搜索`、`BoCha`、`Exa`。

-前者使用 AstrBot 内置的网页搜索请求器请求 Google、Bing、搜狗搜索引擎,在能够使用 Google 的网络环境下表现最佳。**我们推荐使用 Tavily**。
+前者使用 AstrBot 内置的网页搜索请求器请求 Google、Bing、搜狗搜索引擎,在能够使用 Google 的网络环境下表现最佳。**我们推荐使用 Tavily 或 Exa**。
```
This sentence says the built-in/default provider queries Google, but the current implementation of the default provider uses Bing/Sogou (no Google engine is present). Please update the docs to match the actual engines to avoid misleading users.
```diff
 part.tool_calls.forEach(toolCall => {
-  // Check whether this is a web_search_tavily tool call
-  if (toolCall.name !== 'web_search_tavily' || !toolCall.result) {
+  // Check whether this is a web search tool call
+  const supportedTools = ['web_search_tavily', 'web_search_bocha', 'web_search_exa', 'exa_find_similar'];
+  if (!supportedTools.includes(toolCall.name) || !toolCall.result) {
```
supportedTools is re-created as a new array for every toolCall iteration. Move it outside the loops (e.g., as a module-level constant or a const defined once per method) to avoid repeated allocations and make the supported-tool list easier to maintain in one place.
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
- Refactor web_search_utils.py into a layered structure; add build_web_search_refs() and _extract_ref_indices() to extract citation indices from <ref> tags
- Simplify the ref extraction in chat.py/live_chat.py to a call to build_web_search_refs()
- Add getMessageRefs() to MessageList.vue as a frontend fallback extraction when the backend returns no refs
- Fix the message-save condition logic in chat.py
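A minimal sketch of extracting citation indices from `<ref>` tags, as a regex-based guess at the behavior (the real `_extract_ref_indices` in web_search_utils.py may differ in details such as ordering or deduplication):

```python
import re

# Matches <ref>N</ref> with optional surrounding whitespace inside the tag.
_REF_PATTERN = re.compile(r"<ref>\s*(\d+)\s*</ref>")


def extract_ref_indices(text: str) -> list[int]:
    """Return the distinct citation indices referenced via <ref>N</ref>
    tags, in first-seen order."""
    seen: list[int] = []
    for match in _REF_PATTERN.finditer(text):
        idx = int(match.group(1))
        if idx not in seen:
            seen.append(idx)
    return seen
```

With the indices in hand, `build_web_search_refs()` could then map each index back to its stored search result to produce the source list shown in the UI.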
Motivation

Changes

- Exa search provider: three new `@llm_tool` tools
  - `web_search_exa`: semantic search with 5 search types (auto/neural/fast/instant/deep) and 6 vertical domains (company/people/research paper/news/personal site/financial report)
  - `exa_extract_web_page`: extract full page content via the `/contents` endpoint
  - `exa_find_similar`: find semantically similar pages via the `/findSimilar` endpoint
- Configurable API Base URL: the Tavily and Exa base URLs can be customized in the WebUI; the change covers the `web_searcher`, `url_parser`, and `kb_helper` paths
- Optional timeout: AstrBot's built-in web search tools accept an optional `timeout` parameter, defaulting to 30 seconds
- Config metadata i18n: new config entries and conditional-rendering metadata in `default.py`, with `en-US`/`ru-RU`/`zh-CN` locales updated in sync
- Tool management and shared logic cleanup: `astr_agent_hooks.py` adds `<ref>index</ref>` citation hints for search tools in WebChat, helping the model emit traceable source markers
- Citation pipeline completion:
  - `chat.py`/`live_chat.py` share the web search reference extraction logic
  - `<ref>` parsing in `MessageList.vue` supports Exa / BoCha / `exa_find_similar` instead of only `web_search_tavily`, and messages without `<ref>` tags can still fall back to showing the source list
- Tests: new `tests/unit/test_web_search_utils.py` covering search result mapping, favicon pass-through, explicit `<ref>` hits, and the no-`<ref>` fallback scenarios
- Docs: the Chinese and English `websearch.md` now document the tools and parameters for `default`/`Tavily`/`Baidu AI Search`/`BoCha`/`Exa`

This is not a breaking change.
Screenshots or test results

Local verification commands:

Checklist
Summary by Sourcery
Add Exa as a configurable web search provider alongside Tavily and BoCha, extend web search tools with timeouts and base URL settings, and update UI, knowledge base integration, and docs accordingly.
New Features:
Enhancements:
Documentation: