diff --git a/.github/workflows/crawler.yml b/.github/workflows/crawler.yml index cb0e231..bec1508 100644 --- a/.github/workflows/crawler.yml +++ b/.github/workflows/crawler.yml @@ -21,7 +21,7 @@ jobs: - name: Set up Python uses: actions/setup-python@v4 with: - python-version: "3.9" + python-version: "3.10" - name: Install dependencies run: | diff --git a/README-Cherry-Studio.md b/README-Cherry-Studio.md new file mode 100644 index 0000000..bd59711 --- /dev/null +++ b/README-Cherry-Studio.md @@ -0,0 +1,144 @@ +# TrendRadar × Cherry Studio 部署指南 🍒 + +> **适合人群**:零编程基础的用户 +> **客户端**:Cherry Studio(免费开源 GUI 客户端) + +--- + +## 📥 第一步:下载 Cherry Studio + +### Windows 用户 + +访问官网下载:https://cherry-ai.com/ +或直接下载:[Cherry-Studio-Windows.exe](https://github.com/kangfenmao/cherry-studio/releases/latest) + +### Mac 用户 + +访问官网下载:https://cherry-ai.com/ +或直接下载:[Cherry-Studio-Mac.dmg](https://github.com/kangfenmao/cherry-studio/releases/latest) + + +--- + +## 📦 第二步:获取项目代码 + +为什么需要获取项目代码? + +AI 分析功能需要读取项目中的新闻数据才能工作。无论你使用 GitHub Actions 还是 Docker 部署,爬虫生成的新闻数据都保存在项目的 output 目录中。因此,在配置 MCP 服务器之前,需要先获取完整的项目代码(包含数据文件)。 + +根据你的技术水平,可以选择以下任一方式获取:: + +### 方法一:Git Clone(推荐给技术用户) + +如果你熟悉 Git,可以使用以下命令克隆项目: + +```bash +git clone https://github.com/你的用户名/你的项目名.git +cd 你的项目名 +``` + +**优点**: + +- 可以随时拉取一个命令就可以更新最新数据到本地了(`git pull`) + +### 方法二:直接下载 ZIP 压缩包(推荐给初学者) + + +1. **访问 GitHub 项目页面** + + - 项目链接:`https://github.com/你的用户名/你的项目名` + +2. **下载压缩包** + + - 点击绿色的 "Code" 按钮 + - 选择 "Download ZIP" + - 或直接访问:`https://github.com/你的用户名/你的项目名/archive/refs/heads/master.zip` + + +**注意事项**: + +- 步骤稍微麻烦,后续更新数据需要重复上面步骤,然后覆盖本地数据(output 目录) + +--- + +## 🚀 第三步:一键部署 MCP 服务器 + +### Windows 用户 + +1. **双击运行**项目文件夹中的 `setup-windows.bat` +2. **等待安装完成** +3. **记录显示的配置信息**(命令路径和参数) + +### Mac 用户 + +1. **打开终端**(在启动台搜索"终端") +2. **拖拽**项目文件夹中的 `setup-mac.sh` 到终端窗口 +3. **按回车键** +4. **记录显示的配置信息** + +--- + +## 🔧 第四步:配置 Cherry Studio + +### 1. 打开设置 + +启动 Cherry Studio,点击右上角 ⚙️ **设置** 按钮 + +### 2. 添加 MCP 服务器 + +在设置页面找到:**MCP** → 点击 **添加** + +### 3. 填写配置(重要!) + +根据刚才的安装脚本显示的信息填写 + +### 4. 保存并启用 + +- 点击 **保存** 按钮 +- 确保 MCP 服务器列表中的开关是 **开启** 状态 ✅ + +--- + +## ✅ 第五步:验证是否成功 + +### 1. 测试连接 + +在 Cherry Studio 的对话框中输入: + +``` +帮我爬取最新的新闻 +``` + +### 2. 成功标志 + +如果配置成功,AI 会: + +- ✅ 调用 TrendRadar 工具 +- ✅ 返回真实的新闻数据 +- ✅ 显示平台、标题、排名等信息 + + +--- + +## 🎯 进阶配置 + +### HTTP 模式(可选) + +如果需要远程访问或多客户端共享,可以使用 HTTP 模式: + +#### Windows + +双击运行 `start-http.bat` + +#### Mac + +```bash +./start-http.sh +``` + +然后在 Cherry Studio 中配置: + +``` +类型: streamableHttp +URL: http://localhost:3333/mcp +``` diff --git a/README-MCP-FAQ.md b/README-MCP-FAQ.md new file mode 100644 index 0000000..6e30dbc --- /dev/null +++ b/README-MCP-FAQ.md @@ -0,0 +1,442 @@ +# TrendRadar MCP 工具使用问答 + +> AI 提问指南 - 如何通过对话使用新闻热点分析工具 + +## ⚙️ 默认设置说明(重要!) 
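
> 先看一个直观示意:当你只说"给我看看最新的新闻"、不附加任何要求时,AI 发起的工具调用大致等价于下面这段(工具名与参数名取自本项目 mcp_server/server.py,仅为示意):

```python
# 示意:默认参数下的一次调用
get_latest_news(
    platforms=None,      # 不指定平台 → 使用 config.yaml 中配置的全部平台
    limit=50,            # 默认最多返回 50 条
    include_url=False,   # 默认不返回链接,节省 token
)
```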
+ +默认采用以下优化策略,主要是为了节约 AI token 消耗: + +| 默认设置 | 说明 | 如何调整 | +| -------------- | --------------------------------------- | ------------------------------------- | +| **限制条数** | 默认返回 50 条新闻 | 对话中说"返回前 10 条"或"给我 100 条" | +| **时间范围** | 默认查询今天的数据 | 说"查询昨天"或"最近一周" | +| **URL 链接** | 默认不返回链接(节省约 160 tokens/条) | 说"需要链接"或"包含 URL" | +| **关键词列表** | 默认不使用 frequency_words.txt 过滤新闻 | 只有调用"趋势话题"工具时才使用 | + +**⚠️ 重要:** AI 模型的选择直接影响工具调用效果,AI 越智能,调用越准确。当你解除上面的限制,比如从今天的查询,放宽到一周的查询,首先你要在本地有一周的数据,其次,token 消耗量可能会倍增(为什么说可能,比如我查询 分析'苹果'最近一周的热度趋势,如果一周中没多少苹果的新闻,那么 token消耗量可能反而很少) + + +## 💰 AI 模型 + +下面我以 **[硅基流动](https://cloud.siliconflow.cn)** 平台作为例子,里面有很多大模型可选择。在开发和测试本项目的过程中,我使用本平台进行了许多的功能测试和验证。 + +### 📊 注册方式对比 + +| 注册方式 | 无邀请链接直接注册 | 含有邀请链接注册 | +|:-------:|:-------:|:-----------------:| +| 注册链接 | [siliconflow.cn](https://cloud.siliconflow.cn) | [邀请链接](https://cloud.siliconflow.cn/i/fqnyVaIU) | +| 免费额度 | 0 tokens | **2000万 tokens** (≈14元) | +| 额外福利 | ❌ | ✅ 邀请者也获得2000万tokens | + +> 💡 **提示**:上面的赠送额度,应该可以询问 **200次以上** + + +### 🚀 快速开始 + +#### 1️⃣ 注册并获取 API 密钥 + +1. 使用上方链接完成注册 +2. 访问 [API 密钥管理页面](https://cloud.siliconflow.cn/me/account/ak) +3. 点击「新建 API 密钥」 +4. 复制生成的密钥(请妥善保管) + +#### 2️⃣ 在 Cherry Studio 中配置 + +1. 打开 **Cherry Studio** +2. 进入「模型服务」设置 +3. 找到「硅基流动」 +4. 将复制的密钥粘贴到 **[API密钥]** 输入框 +5. 确保右上角勾选框打开后显示为 **绿色** ✅ + +--- + +### ✨ 配置完成! + +现在你可以开始使用本项目,享受稳定快速的 AI 服务了! + +在你测试一次询问后,请立刻去 [硅基流动账单](https://cloud.siliconflow.cn/me/bills) 查询这一次的消耗量,心底有个估算。 + + +## 基础查询 + +### Q1: 如何查看最新的新闻? + +**你可以这样问:** + +- "给我看看最新的新闻" +- "查询今天的热点新闻" +- "获取知乎和微博的最新 10 条新闻" +- "查看最新新闻,需要包含链接" + +**调用的工具:** `get_latest_news` + +**工具返回行为:** + +- MCP 工具会返回所有平台的最新 50 条新闻给 AI +- 不包含 URL 链接(节省 token) + +**AI 展示行为(重要):** + +- ⚠️ **AI 通常会自动总结**,只展示部分新闻(如 TOP 10-20 条) +- ✅ 如果你想看全部 50 条,需要明确要求:"展示所有新闻"或"完整列出所有 50 条" +- 💡 这是 AI 模型的自然行为,不是工具的限制 + +**可以调整:** + +- 指定平台:如"只看知乎的" +- 调整数量:如"返回前 20 条" +- 包含链接:如"需要链接" +- **要求完整展示**:如"展示全部,不要总结" + +--- + +### Q2: 如何查询特定日期的新闻? + +**你可以这样问:** + +- "查询昨天的新闻" +- "看看 3 天前知乎的新闻" +- "2025-10-10 的新闻有哪些" +- "上周一的新闻" +- "给我看看最新新闻"(自动查询今天) + +**调用的工具:** `get_news_by_date` + +**支持的日期格式:** + +- 相对日期:今天、昨天、前天、3 天前 +- 星期:上周一、本周三、last monday +- 绝对日期:2025-10-10、10 月 10 日 + +**工具返回行为:** + +- 不指定日期时自动查询今天(节省 token) +- MCP 工具会返回所有平台的 50 条新闻给 AI +- 不包含 URL 链接 + +**AI 展示行为(重要):** + +- ⚠️ **AI 通常会自动总结**,只展示部分新闻(如 TOP 10-20 条) +- ✅ 如果你想看全部,需要明确要求:"展示所有新闻,不要总结" + +--- + +### Q3: 如何查看我关注的话题频率统计? + +**你可以这样问:** + +- "我关注的词今天出现了多少次" +- "看看我的关注词列表中哪些词最热门" +- "统计一下 frequency_words.txt 中的关注词频率" + +**调用的工具:** `get_trending_topics` + +**重要说明:** + +- 本工具**不是**自动提取新闻热点 +- 而是统计你在 `config/frequency_words.txt` 中设置的**个人关注词** +- 这是一个**可自定义**的列表,你可以根据兴趣添加关注词 + +--- + +## 搜索检索 + +### Q4: 如何搜索包含特定关键词的新闻? + +**你可以这样问:** + +- "搜索包含'人工智能'的新闻" +- "查找关于'特斯拉降价'的报道" +- "搜索马斯克相关的新闻,返回前 20 条" +- "查找'iPhone 16 发布'这条新闻的链接" + +**调用的工具:** `search_news` + +**工具返回行为:** + +- 使用关键词模式搜索 +- 搜索今天的数据 +- MCP 工具会返回最多 50 条结果给 AI +- 不包含 URL 链接 + +**AI 展示行为(重要):** + +- ⚠️ **AI 通常会自动总结**,只展示部分搜索结果 +- ✅ 如果你想看全部,需要明确要求:"展示所有搜索结果" + +**可以调整:** + +- 指定时间范围:如"搜索最近一周的" +- 指定平台:如"只搜索知乎" +- 调整排序:如"按权重排序" +- 包含链接:如"需要链接" + +--- + +### Q5: 如何查找历史相关新闻? + +**你可以这样问:** + +- "查找昨天与'人工智能突破'相关的新闻" +- "搜索上周关于'特斯拉'的历史报道" +- "找出上个月与'ChatGPT'相关的新闻" +- "看看'iPhone 发布会'相关的历史新闻" + +**调用的工具:** `search_related_news_history` + +**工具返回行为:** + +- 搜索昨天的数据 +- 相似度阈值 0.4 +- MCP 工具会返回最多 50 条结果给 AI +- 不包含 URL 链接 + +**AI 展示行为(重要):** + +- ⚠️ **AI 通常会自动总结**,只展示部分相关新闻 +- ✅ 如果你想看全部,需要明确要求:"展示所有相关新闻" + +--- + +## 趋势分析 + +### Q6: 如何分析话题的热度趋势? 
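
这一类问题统一由 `analyze_topic_trend` 工具处理,通过 `analysis_type` 参数在 trend / lifecycle / viral / predict 四种模式间切换。一次典型调用的示意如下(参数名取自 mcp_server/server.py 的工具定义,仅为示意):

```python
# 示意:分析"人工智能"最近 7 天的热度趋势
analyze_topic_trend(
    topic="人工智能",
    analysis_type="trend",  # 可选 trend / lifecycle / viral / predict
    time_range="7d",        # 支持 7d / 24h / 1w / 1m / 2m
    granularity="day",      # 目前仅支持按天(底层数据按天聚合)
)
```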
+ +**你可以这样问:** + +- "分析'人工智能'最近一周的热度趋势" +- "看看'特斯拉'话题是昙花一现还是持续热点" +- "检测今天有哪些突然爆火的话题" +- "预测接下来可能的热点话题" + +**调用的工具:** `analyze_topic_trend` + +**工具返回行为:** + +- 热度趋势模式 +- 分析最近 7 天数据 +- 按天粒度统计 + +**AI 展示行为:** + +- 通常会展示趋势分析结果和图表 +- AI 可能会总结关键发现 + +--- + +## 数据洞察 + +### Q7: 如何对比不同平台对话题的关注度? + +**你可以这样问:** + +- "对比各个平台对'人工智能'话题的关注度" +- "看看哪个平台更新最频繁" +- "分析一下哪些关键词经常一起出现" + +**调用的工具:** `analyze_data_insights` + +**三种洞察模式:** + +| 模式 | 功能 | 示例问法 | +| -------------- | ---------------- | -------------------------- | +| **平台对比** | 对比各平台关注度 | "对比各平台对'AI'的关注度" | +| **活跃度统计** | 统计平台发布频率 | "看看哪个平台更新最频繁" | +| **关键词共现** | 分析关键词关联 | "哪些关键词经常一起出现" | + +**工具返回行为:** + +- 平台对比模式 +- 分析今天的数据 +- 关键词共现最小频次 3 次 + +**AI 展示行为:** + +- 通常会展示分析结果和统计数据 +- AI 可能会总结洞察发现 + +--- + +## 情感分析 + +### Q8: 如何分析新闻的情感倾向? + +**你可以这样问:** + +- "分析一下今天新闻的情感倾向" +- "看看'特斯拉'相关新闻是正面还是负面的" +- "分析各平台对'人工智能'的情感态度" +- "看看'比特币'一周内的情感倾向,选择前 20 条最重要的" + +**调用的工具:** `analyze_sentiment` + +**工具返回行为:** + +- 分析今天的数据 +- MCP 工具会返回最多 50 条新闻给 AI +- 按权重排序(优先展示重要新闻) +- 不包含 URL 链接 + +**AI 展示行为(重要):** + +- ⚠️ 本工具返回 **AI 提示词**,不是直接的情感分析结果 +- AI 会根据提示词生成情感分析报告 +- 通常会展示情感分布、关键发现和代表性新闻 + +**可以调整:** + +- 指定话题:如"关于'特斯拉'" +- 指定时间:如"最近一周" +- 调整数量:如"返回前 20 条" + +--- + +### Q9: 如何查找相似的新闻报道? + +**你可以这样问:** + +- "找出和'特斯拉降价'相似的新闻" +- "查找关于 iPhone 发布的类似报道" +- "看看有没有和这条新闻相似的报道" +- "找相似新闻,需要链接" + +**调用的工具:** `find_similar_news` + +**工具返回行为:** + +- 相似度阈值 0.6 +- MCP 工具会返回最多 50 条结果给 AI +- 不包含 URL 链接 + +**AI 展示行为(重要):** + +- ⚠️ **AI 通常会自动总结**,只展示部分相似新闻 +- ✅ 如果你想看全部,需要明确要求:"展示所有相似新闻" + +--- + +### Q10: 如何生成每日或每周的热点摘要? + +**你可以这样问:** + +- "生成今天的新闻摘要报告" +- "给我一份本周的热点总结" +- "生成过去 7 天的新闻分析报告" + +**调用的工具:** `generate_summary_report` + +**报告类型:** + +- 每日摘要:总结当天的热点新闻 +- 每周摘要:总结一周的热点趋势 + +--- + +## 系统管理 + +### Q11: 如何查看系统配置? + +**你可以这样问:** + +- "查看当前系统配置" +- "显示配置文件内容" +- "有哪些可用的平台?" +- "当前的权重配置是什么?" + +**调用的工具:** `get_current_config` + +**可以查询:** + +- 可用平台列表 +- 爬虫配置(请求间隔、超时设置) +- 权重配置(排名权重、频次权重) +- 通知配置(钉钉、微信) + +--- + +### Q12: 如何检查系统运行状态? + +**你可以这样问:** + +- "检查系统状态" +- "系统运行正常吗?" +- "最后一次爬取是什么时候?" +- "有多少天的历史数据?" + +**调用的工具:** `get_system_status` + +**返回信息:** + +- 系统版本和状态 +- 最后爬取时间 +- 历史数据天数 +- 健康检查结果 + +--- + +### Q13: 如何手动触发爬取任务? + +**你可以这样问:** + +- "请你爬取当前的今日头条的新闻"(临时查询) +- "帮我抓取一下知乎和微博的最新新闻并保存"(持久化) +- "触发一次爬取并保存数据"(持久化) +- "获取 36 氪 的实时数据但不保存"(临时查询) + +**调用的工具:** `trigger_crawl` + +**两种模式:** + +| 模式 | 用途 | 示例 | +| -------------- | -------------------- | -------------------- | +| **临时爬取** | 只返回数据不保存 | "爬取今日头条的新闻" | +| **持久化爬取** | 保存到 output 文件夹 | "抓取并保存知乎新闻" | + +**工具返回行为:** + +- 临时爬取模式(不保存) +- 爬取所有平台 +- 不包含 URL 链接 + +**AI 展示行为(重要):** + +- ⚠️ **AI 通常会总结爬取结果**,只展示部分新闻 +- ✅ 如果你想看全部,需要明确要求:"展示所有爬取的新闻" + +**可以调整:** + +- 指定平台:如"只爬取知乎" +- 保存数据:说"并保存"或"保存到本地" +- 包含链接:说"需要链接" + +--- + +## 💡 使用技巧 + +### 1. 如何让 AI 展示全部数据而不是自动总结? + +**背景**: 有时 AI 会自动总结数据,只展示部分内容,即使工具返回了完整的 50 条数据。 + +**如果 AI 仍然总结,你可以**: + +- **方法 1 - 明确要求**: "请展示全部新闻,不要总结" +- **方法 2 - 指定数量**: "展示所有 50 条新闻" +- **方法 3 - 质疑行为**: "为什么只展示了 15 条?我要看全部" +- **方法 4 - 提前说明**: "查询今天的新闻,完整展示所有结果" + +**注意**: AI 仍可能根据上下文调整展示方式。 + + +### 2. 如何组合使用多个工具? + +**示例:深度分析某个话题** + +1. 先搜索:"搜索'人工智能'相关新闻" +2. 再分析趋势:"分析'人工智能'的热度趋势" +3. 最后情感分析:"分析'人工智能'新闻的情感倾向" + +**示例:追踪某个事件** + +1. 查看最新:"查询今天关于'iPhone'的新闻" +2. 查找历史:"查找上周与'iPhone'相关的历史新闻" +3. 
找相似报道:"找出和'iPhone 发布会'相似的新闻" diff --git a/_image/github-pages.png b/_image/github-pages.png index f0933dd..b52852b 100644 Binary files a/_image/github-pages.png and b/_image/github-pages.png differ diff --git a/docker/Dockerfile b/docker/Dockerfile index 22c74c8..8f09425 100644 --- a/docker/Dockerfile +++ b/docker/Dockerfile @@ -6,50 +6,55 @@ WORKDIR /app ARG TARGETARCH ENV SUPERCRONIC_VERSION=v0.2.34 +# supercronic + locale RUN set -ex && \ apt-get update && \ - apt-get install -y --no-install-recommends curl ca-certificates && \ + apt-get install -y --no-install-recommends curl ca-certificates locales && \ + sed -i -e 's/# zh_CN.UTF-8 UTF-8/zh_CN.UTF-8 UTF-8/' /etc/locale.gen && \ + sed -i -e 's/# en_US.UTF-8 UTF-8/en_US.UTF-8 UTF-8/' /etc/locale.gen && \ + locale-gen && \ + # 根据架构选择并下载 supercronic case ${TARGETARCH} in \ amd64) \ - export SUPERCRONIC_URL=https://github.com/aptible/supercronic/releases/download/${SUPERCRONIC_VERSION}/supercronic-linux-amd64; \ - export SUPERCRONIC_SHA1SUM=e8631edc1775000d119b70fd40339a7238eece14; \ - export SUPERCRONIC=supercronic-linux-amd64; \ - ;; \ + export SUPERCRONIC_URL=https://github.com/aptible/supercronic/releases/download/${SUPERCRONIC_VERSION}/supercronic-linux-amd64; \ + export SUPERCRONIC_SHA1SUM=e8631edc1775000d119b70fd40339a7238eece14; \ + export SUPERCRONIC=supercronic-linux-amd64; \ + ;; \ arm64) \ - export SUPERCRONIC_URL=https://github.com/aptible/supercronic/releases/download/${SUPERCRONIC_VERSION}/supercronic-linux-arm64; \ - export SUPERCRONIC_SHA1SUM=4ab6343b52bf9da592e8b4bb7ae6eb5a8e21b71e; \ - export SUPERCRONIC=supercronic-linux-arm64; \ - ;; \ + export SUPERCRONIC_URL=https://github.com/aptible/supercronic/releases/download/${SUPERCRONIC_VERSION}/supercronic-linux-arm64; \ + export SUPERCRONIC_SHA1SUM=4ab6343b52bf9da592e8b4bb7ae6eb5a8e21b71e; \ + export SUPERCRONIC=supercronic-linux-arm64; \ + ;; \ arm) \ - export SUPERCRONIC_URL=https://github.com/aptible/supercronic/releases/download/${SUPERCRONIC_VERSION}/supercronic-linux-arm; \ - export SUPERCRONIC_SHA1SUM=4ba4cd0da62082056b6def085fa9377d965fbe01; \ - export SUPERCRONIC=supercronic-linux-arm; \ - ;; \ + export SUPERCRONIC_URL=https://github.com/aptible/supercronic/releases/download/${SUPERCRONIC_VERSION}/supercronic-linux-arm; \ + export SUPERCRONIC_SHA1SUM=4ba4cd0da62082056b6def085fa9377d965fbe01; \ + export SUPERCRONIC=supercronic-linux-arm; \ + ;; \ 386) \ - export SUPERCRONIC_URL=https://github.com/aptible/supercronic/releases/download/${SUPERCRONIC_VERSION}/supercronic-linux-386; \ - export SUPERCRONIC_SHA1SUM=80b4fff03a8d7bf2f24a1771f37640337855e949; \ - export SUPERCRONIC=supercronic-linux-386; \ - ;; \ + export SUPERCRONIC_URL=https://github.com/aptible/supercronic/releases/download/${SUPERCRONIC_VERSION}/supercronic-linux-386; \ + export SUPERCRONIC_SHA1SUM=80b4fff03a8d7bf2f24a1771f37640337855e949; \ + export SUPERCRONIC=supercronic-linux-386; \ + ;; \ *) \ - echo "Unsupported architecture: ${TARGETARCH}"; \ - exit 1; \ - ;; \ + echo "Unsupported architecture: ${TARGETARCH}"; \ + exit 1; \ + ;; \ esac && \ echo "Downloading supercronic for ${TARGETARCH} from ${SUPERCRONIC_URL}" && \ # 添加重试机制和超时设置 for i in 1 2 3 4 5; do \ - echo "Download attempt $i/5"; \ - if curl --fail --silent --show-error --location --retry 3 --retry-delay 2 --connect-timeout 30 --max-time 120 -o "$SUPERCRONIC" "$SUPERCRONIC_URL"; then \ - echo "Download successful"; \ - break; \ - else \ - echo "Download attempt $i failed, exit code: $?"; \ - if [ $i -eq 5 ]; then \ - echo "All download 
attempts failed"; \ - exit 1; \ - fi; \ - sleep $((i * 2)); \ - fi; \ + echo "Download attempt $i/5"; \ + if curl --fail --silent --show-error --location --retry 3 --retry-delay 2 --connect-timeout 30 --max-time 120 -o "$SUPERCRONIC" "$SUPERCRONIC_URL"; then \ + echo "Download successful"; \ + break; \ + else \ + echo "Download attempt $i failed, exit code: $?"; \ + if [ $i -eq 5 ]; then \ + echo "All download attempts failed"; \ + exit 1; \ + fi; \ + sleep $((i * 2)); \ + fi; \ done && \ echo "${SUPERCRONIC_SHA1SUM} ${SUPERCRONIC}" | sha1sum -c - && \ chmod +x "$SUPERCRONIC" && \ @@ -57,6 +62,7 @@ RUN set -ex && \ ln -s "/usr/local/bin/${SUPERCRONIC}" /usr/local/bin/supercronic && \ # 验证安装 supercronic -version && \ + # 清理(保留 locales,只删除 curl) apt-get remove -y curl && \ apt-get clean && \ rm -rf /var/lib/apt/lists/* @@ -77,6 +83,10 @@ RUN sed -i 's/\r$//' /entrypoint.sh.tmp && \ ENV PYTHONUNBUFFERED=1 \ CONFIG_PATH=/app/config/config.yaml \ - FREQUENCY_WORDS_PATH=/app/config/frequency_words.txt + FREQUENCY_WORDS_PATH=/app/config/frequency_words.txt \ + LANG=zh_CN.UTF-8 \ + LANGUAGE=zh_CN:zh:en_US:en \ + LC_ALL=zh_CN.UTF-8 \ + PYTHONIOENCODING=utf-8 ENTRYPOINT ["/entrypoint.sh"] \ No newline at end of file diff --git a/mcp_server/__init__.py b/mcp_server/__init__.py new file mode 100644 index 0000000..352560e --- /dev/null +++ b/mcp_server/__init__.py @@ -0,0 +1,7 @@ +""" +TrendRadar MCP Server + +提供基于MCP协议的新闻聚合数据查询和系统管理接口。 +""" + +__version__ = "1.0.0" diff --git a/mcp_server/server.py b/mcp_server/server.py new file mode 100644 index 0000000..6c15097 --- /dev/null +++ b/mcp_server/server.py @@ -0,0 +1,657 @@ +""" +TrendRadar MCP Server - FastMCP 2.0 实现 + +使用 FastMCP 2.0 提供生产级 MCP 工具服务器。 +支持 stdio 和 HTTP 两种传输模式。 +""" + +import json +from typing import List, Optional, Dict + +from fastmcp import FastMCP + +from .tools.data_query import DataQueryTools +from .tools.analytics import AnalyticsTools +from .tools.search_tools import SearchTools +from .tools.config_mgmt import ConfigManagementTools +from .tools.system import SystemManagementTools + + +# 创建 FastMCP 2.0 应用 +mcp = FastMCP('trendradar-news') + +# 全局工具实例(在第一次请求时初始化) +_tools_instances = {} + + +def _get_tools(project_root: Optional[str] = None): + """获取或创建工具实例(单例模式)""" + if not _tools_instances: + _tools_instances['data'] = DataQueryTools(project_root) + _tools_instances['analytics'] = AnalyticsTools(project_root) + _tools_instances['search'] = SearchTools(project_root) + _tools_instances['config'] = ConfigManagementTools(project_root) + _tools_instances['system'] = SystemManagementTools(project_root) + return _tools_instances + + +# ==================== 数据查询工具 ==================== + +@mcp.tool +async def get_latest_news( + platforms: Optional[List[str]] = None, + limit: int = 50, + include_url: bool = False +) -> str: + """ + 获取最新一批爬取的新闻数据,快速了解当前热点 + + Args: + platforms: 平台ID列表,如 ['zhihu', 'weibo', 'douyin'] + - 不指定时:使用 config.yaml 中配置的所有平台 + - 支持的平台来自 config/config.yaml 的 platforms 配置 + - 每个平台都有对应的name字段(如"知乎"、"微博"),方便AI识别 + limit: 返回条数限制,默认50,最大1000 + 注意:实际返回数量可能少于请求值,取决于当前可用的新闻总数 + include_url: 是否包含URL链接,默认False(节省token) + + Returns: + JSON格式的新闻列表 + + **重要:数据展示建议** + 本工具会返回完整的新闻列表(通常50条)给你。但请注意: + - **工具返回**:完整的50条数据 ✅ + - **建议展示**:向用户展示全部数据,除非用户明确要求总结 + - **用户期望**:用户可能需要完整数据,请谨慎总结 + + **何时可以总结**: + - 用户明确说"给我总结一下"或"挑重点说" + - 数据量超过100条时,可先展示部分并询问是否查看全部 + + **注意**:如果用户询问"为什么只显示了部分",说明他们需要完整数据 + """ + tools = _get_tools() + result = tools['data'].get_latest_news(platforms=platforms, limit=limit, include_url=include_url) + return 
json.dumps(result, ensure_ascii=False, indent=2) + + +@mcp.tool +async def get_trending_topics( + top_n: int = 10, + mode: str = 'current' +) -> str: + """ + 获取个人关注词的新闻出现频率统计(基于 config/frequency_words.txt) + + 注意:本工具不是自动提取新闻热点,而是统计你在 config/frequency_words.txt 中 + 设置的个人关注词在新闻中出现的频率。你可以自定义这个关注词列表。 + + Args: + top_n: 返回TOP N关注词,默认10 + mode: 模式选择 + - daily: 当日累计数据统计 + - current: 最新一批数据统计(默认) + + Returns: + JSON格式的关注词频率统计列表 + """ + tools = _get_tools() + result = tools['data'].get_trending_topics(top_n=top_n, mode=mode) + return json.dumps(result, ensure_ascii=False, indent=2) + + +@mcp.tool +async def get_news_by_date( + date_query: Optional[str] = None, + platforms: Optional[List[str]] = None, + limit: int = 50, + include_url: bool = False +) -> str: + """ + 获取指定日期的新闻数据,用于历史数据分析和对比 + + Args: + date_query: 日期查询,可选格式: + - 自然语言: "今天", "昨天", "前天", "3天前" + - 标准日期: "2024-01-15", "2024/01/15" + - 默认值: "今天"(节省token) + platforms: 平台ID列表,如 ['zhihu', 'weibo', 'douyin'] + - 不指定时:使用 config.yaml 中配置的所有平台 + - 支持的平台来自 config/config.yaml 的 platforms 配置 + - 每个平台都有对应的name字段(如"知乎"、"微博"),方便AI识别 + limit: 返回条数限制,默认50,最大1000 + 注意:实际返回数量可能少于请求值,取决于指定日期的新闻总数 + include_url: 是否包含URL链接,默认False(节省token) + + Returns: + JSON格式的新闻列表,包含标题、平台、排名等信息 + + **重要:数据展示建议** + 本工具会返回完整的新闻列表(通常50条)给你。但请注意: + - **工具返回**:完整的50条数据 ✅ + - **建议展示**:向用户展示全部数据,除非用户明确要求总结 + - **用户期望**:用户可能需要完整数据,请谨慎总结 + + **何时可以总结**: + - 用户明确说"给我总结一下"或"挑重点说" + - 数据量超过100条时,可先展示部分并询问是否查看全部 + + **注意**:如果用户询问"为什么只显示了部分",说明他们需要完整数据 + """ + tools = _get_tools() + result = tools['data'].get_news_by_date( + date_query=date_query, + platforms=platforms, + limit=limit, + include_url=include_url + ) + return json.dumps(result, ensure_ascii=False, indent=2) + + + +# ==================== 高级数据分析工具 ==================== + +@mcp.tool +async def analyze_topic_trend( + topic: str, + analysis_type: str = "trend", + time_range: str = "7d", + granularity: str = "day", + threshold: float = 3.0, + time_window: int = 24, + lookback_days: int = 7, + lookahead_hours: int = 6, + confidence_threshold: float = 0.7 +) -> str: + """ + 统一话题趋势分析工具 - 整合多种趋势分析模式 + + Args: + topic: 话题关键词(必需) + analysis_type: 分析类型,可选值: + - "trend": 热度趋势分析(追踪话题的热度变化) + - "lifecycle": 生命周期分析(从出现到消失的完整周期) + - "viral": 异常热度检测(识别突然爆火的话题) + - "predict": 话题预测(预测未来可能的热点) + time_range: 时间范围(trend模式),默认"7d"(7d/24h/1w/1m/2m) + granularity: 时间粒度(trend模式),默认"day"(仅支持 day,因为底层数据按天聚合) + threshold: 热度突增倍数阈值(viral模式),默认3.0 + time_window: 检测时间窗口小时数(viral模式),默认24 + lookback_days: 回溯天数(lifecycle模式),默认7 + lookahead_hours: 预测未来小时数(predict模式),默认6 + confidence_threshold: 置信度阈值(predict模式),默认0.7 + + Returns: + JSON格式的趋势分析结果 + + Examples: + - analyze_topic_trend(topic="人工智能", analysis_type="trend", time_range="7d") + - analyze_topic_trend(topic="特斯拉", analysis_type="lifecycle", lookback_days=7) + - analyze_topic_trend(topic="比特币", analysis_type="viral", threshold=3.0) + - analyze_topic_trend(topic="ChatGPT", analysis_type="predict", lookahead_hours=6) + """ + tools = _get_tools() + result = tools['analytics'].analyze_topic_trend_unified( + topic=topic, + analysis_type=analysis_type, + time_range=time_range, + granularity=granularity, + threshold=threshold, + time_window=time_window, + lookback_days=lookback_days, + lookahead_hours=lookahead_hours, + confidence_threshold=confidence_threshold + ) + return json.dumps(result, ensure_ascii=False, indent=2) + + +@mcp.tool +async def analyze_data_insights( + insight_type: str = "platform_compare", + topic: Optional[str] = None, + date_range: Optional[Dict[str, str]] = None, + min_frequency: int = 
3,
+    top_n: int = 20
+) -> str:
+    """
+    统一数据洞察分析工具 - 整合多种数据分析模式
+
+    Args:
+        insight_type: 洞察类型,可选值:
+            - "platform_compare": 平台对比分析(对比不同平台对话题的关注度)
+            - "platform_activity": 平台活跃度统计(统计各平台发布频率和活跃时间)
+            - "keyword_cooccur": 关键词共现分析(分析关键词同时出现的模式)
+        topic: 话题关键词(可选,platform_compare模式适用)
+        date_range: 日期范围,格式: {"start": "YYYY-MM-DD", "end": "YYYY-MM-DD"}
+        min_frequency: 最小共现频次(keyword_cooccur模式),默认3
+        top_n: 返回TOP N结果(keyword_cooccur模式),默认20
+
+    Returns:
+        JSON格式的数据洞察分析结果
+
+    Examples:
+        - analyze_data_insights(insight_type="platform_compare", topic="人工智能")
+        - analyze_data_insights(insight_type="platform_activity", date_range={...})
+        - analyze_data_insights(insight_type="keyword_cooccur", min_frequency=5, top_n=15)
+    """
+    tools = _get_tools()
+    result = tools['analytics'].analyze_data_insights_unified(
+        insight_type=insight_type,
+        topic=topic,
+        date_range=date_range,
+        min_frequency=min_frequency,
+        top_n=top_n
+    )
+    return json.dumps(result, ensure_ascii=False, indent=2)
+
+
+@mcp.tool
+async def analyze_sentiment(
+    topic: Optional[str] = None,
+    platforms: Optional[List[str]] = None,
+    date_range: Optional[Dict[str, str]] = None,
+    limit: int = 50,
+    sort_by_weight: bool = True,
+    include_url: bool = False
+) -> str:
+    """
+    分析新闻的情感倾向和热度趋势
+
+    Args:
+        topic: 话题关键词(可选),如 "特斯拉";不指定时分析全部新闻
+        platforms: 平台ID列表,如 ['zhihu', 'weibo', 'douyin']
+            - 不指定时:使用 config.yaml 中配置的所有平台
+            - 支持的平台来自 config/config.yaml 的 platforms 配置
+            - 每个平台都有对应的name字段(如"知乎"、"微博"),方便AI识别
+        date_range: 日期范围,格式: {"start": "YYYY-MM-DD", "end": "YYYY-MM-DD"},
+            不指定时默认分析今天的数据
+        limit: 返回新闻数量,默认50,最大100
+            注意:本工具会对新闻标题进行去重(同一标题在不同平台只保留一次),
+            因此实际返回数量可能少于请求的 limit 值
+        sort_by_weight: 是否按热度权重排序,默认True
+        include_url: 是否包含URL链接,默认False(节省token)
+
+    Returns:
+        JSON格式的分析结果,包含情感分布、热度趋势和相关新闻
+
+    **重要:数据展示策略**
+    - 本工具返回完整的分析结果和新闻列表
+    - **默认展示方式**:展示完整的分析结果(包括所有新闻)
+    - 仅在用户明确要求"总结"或"挑重点"时才进行筛选
+    """
+    tools = _get_tools()
+    result = tools['analytics'].analyze_sentiment(
+        topic=topic,
+        platforms=platforms,
+        date_range=date_range,
+        limit=limit,
+        sort_by_weight=sort_by_weight,
+        include_url=include_url
+    )
+    return json.dumps(result, ensure_ascii=False, indent=2)
+
+
+@mcp.tool
+async def find_similar_news(
+    reference_title: str,
+    threshold: float = 0.6,
+    limit: int = 50,
+    include_url: bool = False
+) -> str:
+    """
+    查找与指定新闻标题相似的其他新闻
+
+    Args:
+        reference_title: 参考新闻标题(完整或部分)
+        threshold: 相似度阈值,0-1之间,默认0.6
+            注意:阈值越高匹配越严格,返回结果越少
+        limit: 返回条数限制,默认50,最大100
+            注意:实际返回数量取决于相似度匹配结果,可能少于请求值
+        include_url: 是否包含URL链接,默认False(节省token)
+
+    Returns:
+        JSON格式的相似新闻列表,包含相似度分数
+
+    **重要:数据展示策略**
+    - 本工具返回完整的相似新闻列表
+    - **默认展示方式**:展示全部返回的新闻(包括相似度分数)
+    - 仅在用户明确要求"总结"或"挑重点"时才进行筛选
+    """
+    tools = _get_tools()
+    result = tools['analytics'].find_similar_news(
+        reference_title=reference_title,
+        threshold=threshold,
+        limit=limit,
+        include_url=include_url
+    )
+    return json.dumps(result, ensure_ascii=False, indent=2)
+
+
+@mcp.tool
+async def generate_summary_report(
+    report_type: str = "daily",
+    date_range: Optional[Dict[str, str]] = None
+) -> str:
+    """
+    每日/每周摘要生成器 - 自动生成热点摘要报告
+
+    Args:
+        report_type: 报告类型(daily/weekly)
+        date_range: 自定义日期范围(可选)
+
+    Returns:
+        JSON格式的摘要报告,包含Markdown格式内容
+    """
+    tools = _get_tools()
+    result = tools['analytics'].generate_summary_report(
+        report_type=report_type,
+        date_range=date_range
+    )
+    return json.dumps(result, ensure_ascii=False, indent=2)
+
+
+# ==================== 智能检索工具 ====================
+@mcp.tool
+async def search_news(
+    query: str,
+    search_mode: str = "keyword",
+    date_range: Optional[Dict[str, str]] = None,
+    platforms: Optional[List[str]] = None,
+    limit: int = 50,
+    sort_by: str = "relevance",
+    threshold: float = 0.6,
+    include_url: bool = False
+) -> str:
+    """
+    统一搜索接口,支持多种搜索模式
+
+    Args:
+        query: 搜索关键词或内容片段
+        search_mode: 搜索模式,可选值:
+            - "keyword": 精确关键词匹配(默认,适合搜索特定话题)
+            - "fuzzy": 模糊内容匹配(适合搜索内容片段,会过滤相似度低于阈值的结果)
+            - "entity": 实体名称搜索(适合搜索人物/地点/机构)
+        date_range: 日期范围,格式: {"start": "YYYY-MM-DD", "end": "YYYY-MM-DD"},
+            不指定时默认搜索今天的数据
+        platforms: 平台ID列表,如 ['zhihu', 'weibo', 'douyin']
+            - 不指定时:使用 config.yaml 中配置的所有平台
+            - 支持的平台来自 config/config.yaml 的 platforms 配置
+            - 每个平台都有对应的name字段(如"知乎"、"微博"),方便AI识别
+        limit: 返回条数限制,默认50,最大1000
+            注意:实际返回数量取决于搜索匹配结果(特别是 fuzzy 模式下会过滤低相似度结果)
+        sort_by: 排序方式,默认 "relevance"(按相关度排序)
+        threshold: 相似度阈值(仅fuzzy模式有效),0-1之间,默认0.6
+            注意:阈值越高匹配越严格,返回结果越少
+        include_url: 是否包含URL链接,默认False(节省token)
+
+    Returns:
+        JSON格式的搜索结果,包含标题、平台、排名等信息
+
+    **重要:数据展示策略**
+    - 本工具返回完整的搜索结果列表
+    - **默认展示方式**:展示全部返回的新闻,无需总结或筛选
+    - 仅在用户明确要求"总结"或"挑重点"时才进行筛选
+    """
+    tools = _get_tools()
+    result = tools['search'].search_news_unified(
+        query=query,
+        search_mode=search_mode,
+        date_range=date_range,
+        platforms=platforms,
+        limit=limit,
+        sort_by=sort_by,
+        threshold=threshold,
+        include_url=include_url
+    )
+    return json.dumps(result, ensure_ascii=False, indent=2)
+
+
+@mcp.tool
+async def search_related_news_history(
+    reference_text: str,
+    time_range: str = "yesterday",
+    threshold: float = 0.4,
+    limit: int = 50,
+    include_url: bool = False
+) -> str:
+    """
+    基于种子新闻,在历史数据中搜索相关新闻
+
+    Args:
+        reference_text: 种子新闻的标题或文本片段(完整或部分)
+        time_range: 时间范围,默认 "yesterday"(搜索昨天的数据)
+        threshold: 相关性阈值,0-1之间,默认0.4
+            注意:综合相似度计算(70%关键词重合 + 30%文本相似度)
+            阈值越高匹配越严格,返回结果越少
+        limit: 返回条数限制,默认50,最大100
+            注意:实际返回数量取决于相关性匹配结果,可能少于请求值
+        include_url: 是否包含URL链接,默认False(节省token)
+
+    Returns:
+        JSON格式的相关新闻列表,包含相关性分数和时间分布
+
+    **重要:数据展示策略**
+    - 本工具返回完整的相关新闻列表
+    - **默认展示方式**:展示全部返回的新闻(包括相关性分数)
+    - 仅在用户明确要求"总结"或"挑重点"时才进行筛选
+    """
+    tools = _get_tools()
+    result = tools['search'].search_related_news_history(
+        reference_text=reference_text,
+        time_range=time_range,
+        threshold=threshold,
+        limit=limit,
+        include_url=include_url
+    )
+    return json.dumps(result, ensure_ascii=False, indent=2)
+
+
+# ==================== 配置与系统管理工具 ====================
+
+@mcp.tool
+async def get_current_config(
+    section: str = "all"
+) -> str:
+    """
+    获取当前系统配置
+
+    Args:
+        section: 配置节,可选值:
+            - "all": 所有配置(默认)
+            - "crawler": 爬虫配置
+            - "push": 推送配置
+            - "keywords": 关键词配置
+            - "weights": 权重配置
+
+    Returns:
+        JSON格式的配置信息
+    """
+    tools = _get_tools()
+    result = tools['config'].get_current_config(section=section)
+    return json.dumps(result, ensure_ascii=False, indent=2)
+
+
+@mcp.tool
+async def get_system_status() -> str:
+    """
+    获取系统运行状态和健康检查信息
+
+    返回系统版本、数据统计、缓存状态等信息
+
+    Returns:
+        JSON格式的系统状态信息
+    """
+    tools = _get_tools()
+    result = tools['system'].get_system_status()
+    return json.dumps(result, ensure_ascii=False, indent=2)
+
+
+@mcp.tool
+async def trigger_crawl(
+    platforms: Optional[List[str]] = None,
+    save_to_local: bool = False,
+    include_url: bool = False
+) -> str:
+    """
+    手动触发一次爬取任务(可选持久化)
+
+    Args:
+        platforms: 指定平台ID列表,如 ['zhihu', 'weibo', 'douyin']
+            - 不指定时:使用 config.yaml 中配置的所有平台
+            - 支持的平台来自 config/config.yaml 的 platforms 配置
+            - 每个平台都有对应的name字段(如"知乎"、"微博"),方便AI识别
+            - 注意:失败的平台会在返回结果的 failed_platforms 字段中列出
+        save_to_local: 是否保存到本地 output 目录,默认 False
+        include_url: 是否包含URL链接,默认False(节省token)
+
+    Returns:
+        JSON格式的任务状态信息,包含:
+        - platforms: 成功爬取的平台列表
+        - failed_platforms: 失败的平台列表(如有)
+        - total_news: 爬取的新闻总数
+        - data: 新闻数据
+
+    Examples:
+        - 临时爬取: trigger_crawl(platforms=['zhihu'])
+        - 爬取并保存: trigger_crawl(platforms=['weibo'], save_to_local=True)
+        - 使用默认平台: 
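+          trigger_crawl(platforms=None, save_to_local=False)  # 显式写出默认值,等价于下一行的无参写法: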
trigger_crawl() # 爬取config.yaml中配置的所有平台 + """ + tools = _get_tools() + result = tools['system'].trigger_crawl(platforms=platforms, save_to_local=save_to_local, include_url=include_url) + return json.dumps(result, ensure_ascii=False, indent=2) + + +# ==================== 启动入口 ==================== + +def run_server( + project_root: Optional[str] = None, + transport: str = 'stdio', + host: str = '0.0.0.0', + port: int = 3333 +): + """ + 启动 MCP 服务器 + + Args: + project_root: 项目根目录路径 + transport: 传输模式,'stdio' 或 'http' + host: HTTP模式的监听地址,默认 0.0.0.0 + port: HTTP模式的监听端口,默认 3333 + """ + # 初始化工具实例 + _get_tools(project_root) + + # 打印启动信息 + print() + print("=" * 60) + print(" TrendRadar MCP Server - FastMCP 2.0") + print("=" * 60) + print(f" 传输模式: {transport.upper()}") + + if transport == 'stdio': + print(" 协议: MCP over stdio (标准输入输出)") + print(" 说明: 通过标准输入输出与 MCP 客户端通信") + elif transport == 'http': + print(f" 监听地址: http://{host}:{port}") + print(f" HTTP端点: http://{host}:{port}/mcp") + print(" 协议: MCP over HTTP (生产环境)") + + if project_root: + print(f" 项目目录: {project_root}") + else: + print(" 项目目录: 当前目录") + + print() + print(" 已注册的工具:") + print(" === 基础数据查询(P0核心)===") + print(" 1. get_latest_news - 获取最新新闻") + print(" 2. get_news_by_date - 按日期查询新闻(支持自然语言)") + print(" 3. get_trending_topics - 获取趋势话题") + print() + print(" === 智能检索工具 ===") + print(" 4. search_news - 统一新闻搜索(关键词/模糊/实体)") + print(" 5. search_related_news_history - 历史相关新闻检索") + print() + print(" === 高级数据分析 ===") + print(" 6. analyze_topic_trend - 统一话题趋势分析(热度/生命周期/爆火/预测)") + print(" 7. analyze_data_insights - 统一数据洞察分析(平台对比/活跃度/关键词共现)") + print(" 8. analyze_sentiment - 情感倾向分析") + print(" 9. find_similar_news - 相似新闻查找") + print(" 10. generate_summary_report - 每日/每周摘要生成") + print() + print(" === 配置与系统管理 ===") + print(" 11. get_current_config - 获取当前系统配置") + print(" 12. get_system_status - 获取系统运行状态") + print(" 13. 
trigger_crawl - 手动触发爬取任务") + print("=" * 60) + print() + + # 根据传输模式运行服务器 + if transport == 'stdio': + mcp.run(transport='stdio') + elif transport == 'http': + # HTTP 模式(生产推荐) + mcp.run( + transport='http', + host=host, + port=port, + path='/mcp' # HTTP 端点路径 + ) + else: + raise ValueError(f"不支持的传输模式: {transport}") + + +if __name__ == '__main__': + import sys + import argparse + + parser = argparse.ArgumentParser( + description='TrendRadar MCP Server - 新闻热点聚合 MCP 工具服务器', + formatter_class=argparse.RawDescriptionHelpFormatter, + epilog=""" +使用示例: + # STDIO 模式(用于 Cherry Studio) + uv run python mcp_server/server.py + + # HTTP 模式(适合远程访问) + uv run python mcp_server/server.py --transport http --port 3333 + +Cherry Studio 配置示例: + 设置 > MCP Servers > 添加服务器 + - 名称: TrendRadar + - 类型: STDIO + - 命令: [UV的完整路径] + - 参数: --directory [项目路径] run python mcp_server/server.py + +详细配置教程请查看: README-Cherry-Studio.md + """ + ) + parser.add_argument( + '--transport', + choices=['stdio', 'http'], + default='stdio', + help='传输模式:stdio (默认) 或 http (生产环境)' + ) + parser.add_argument( + '--host', + default='0.0.0.0', + help='HTTP模式的监听地址,默认 0.0.0.0' + ) + parser.add_argument( + '--port', + type=int, + default=3333, + help='HTTP模式的监听端口,默认 3333' + ) + parser.add_argument( + '--project-root', + help='项目根目录路径' + ) + + args = parser.parse_args() + + run_server( + project_root=args.project_root, + transport=args.transport, + host=args.host, + port=args.port + ) diff --git a/mcp_server/services/__init__.py b/mcp_server/services/__init__.py new file mode 100644 index 0000000..81fd84e --- /dev/null +++ b/mcp_server/services/__init__.py @@ -0,0 +1,5 @@ +""" +服务层模块 + +提供数据访问、缓存、解析等核心服务。 +""" diff --git a/mcp_server/services/cache_service.py b/mcp_server/services/cache_service.py new file mode 100644 index 0000000..ce09d00 --- /dev/null +++ b/mcp_server/services/cache_service.py @@ -0,0 +1,136 @@ +""" +缓存服务 + +实现TTL缓存机制,提升数据访问性能。 +""" + +import time +from typing import Any, Optional +from threading import Lock + + +class CacheService: + """缓存服务类""" + + def __init__(self): + """初始化缓存服务""" + self._cache = {} + self._timestamps = {} + self._lock = Lock() + + def get(self, key: str, ttl: int = 900) -> Optional[Any]: + """ + 获取缓存数据 + + Args: + key: 缓存键 + ttl: 存活时间(秒),默认15分钟 + + Returns: + 缓存的值,如果不存在或已过期则返回None + """ + with self._lock: + if key in self._cache: + # 检查是否过期 + if time.time() - self._timestamps[key] < ttl: + return self._cache[key] + else: + # 已过期,删除缓存 + del self._cache[key] + del self._timestamps[key] + return None + + def set(self, key: str, value: Any) -> None: + """ + 设置缓存数据 + + Args: + key: 缓存键 + value: 缓存值 + """ + with self._lock: + self._cache[key] = value + self._timestamps[key] = time.time() + + def delete(self, key: str) -> bool: + """ + 删除缓存 + + Args: + key: 缓存键 + + Returns: + 是否成功删除 + """ + with self._lock: + if key in self._cache: + del self._cache[key] + del self._timestamps[key] + return True + return False + + def clear(self) -> None: + """清空所有缓存""" + with self._lock: + self._cache.clear() + self._timestamps.clear() + + def cleanup_expired(self, ttl: int = 900) -> int: + """ + 清理过期缓存 + + Args: + ttl: 存活时间(秒) + + Returns: + 清理的条目数量 + """ + with self._lock: + current_time = time.time() + expired_keys = [ + key for key, timestamp in self._timestamps.items() + if current_time - timestamp >= ttl + ] + + for key in expired_keys: + del self._cache[key] + del self._timestamps[key] + + return len(expired_keys) + + def get_stats(self) -> dict: + """ + 获取缓存统计信息 + + Returns: + 统计信息字典 + """ + with self._lock: + return { + 
"total_entries": len(self._cache), + "oldest_entry_age": ( + time.time() - min(self._timestamps.values()) + if self._timestamps else 0 + ), + "newest_entry_age": ( + time.time() - max(self._timestamps.values()) + if self._timestamps else 0 + ) + } + + +# 全局缓存实例 +_global_cache = None + + +def get_cache() -> CacheService: + """ + 获取全局缓存实例 + + Returns: + 全局缓存服务实例 + """ + global _global_cache + if _global_cache is None: + _global_cache = CacheService() + return _global_cache diff --git a/mcp_server/services/data_service.py b/mcp_server/services/data_service.py new file mode 100644 index 0000000..ddf1b93 --- /dev/null +++ b/mcp_server/services/data_service.py @@ -0,0 +1,564 @@ +""" +数据访问服务 + +提供统一的数据查询接口,封装数据访问逻辑。 +""" + +import re +from collections import Counter +from datetime import datetime, timedelta +from typing import Dict, List, Optional, Tuple + +from .cache_service import get_cache +from .parser_service import ParserService +from ..utils.errors import DataNotFoundError + + +class DataService: + """数据访问服务类""" + + def __init__(self, project_root: str = None): + """ + 初始化数据服务 + + Args: + project_root: 项目根目录 + """ + self.parser = ParserService(project_root) + self.cache = get_cache() + + def get_latest_news( + self, + platforms: Optional[List[str]] = None, + limit: int = 50, + include_url: bool = False + ) -> List[Dict]: + """ + 获取最新一批爬取的新闻数据 + + Args: + platforms: 平台ID列表,None表示所有平台 + limit: 返回条数限制 + include_url: 是否包含URL链接,默认False(节省token) + + Returns: + 新闻列表 + + Raises: + DataNotFoundError: 数据不存在 + """ + # 尝试从缓存获取 + cache_key = f"latest_news:{','.join(platforms or [])}:{limit}:{include_url}" + cached = self.cache.get(cache_key, ttl=900) # 15分钟缓存 + if cached: + return cached + + # 读取今天的数据 + all_titles, id_to_name, timestamps = self.parser.read_all_titles_for_date( + date=None, + platform_ids=platforms + ) + + # 获取最新的文件时间 + if timestamps: + latest_timestamp = max(timestamps.values()) + fetch_time = datetime.fromtimestamp(latest_timestamp) + else: + fetch_time = datetime.now() + + # 转换为新闻列表 + news_list = [] + for platform_id, titles in all_titles.items(): + platform_name = id_to_name.get(platform_id, platform_id) + + for title, info in titles.items(): + # 取第一个排名 + rank = info["ranks"][0] if info["ranks"] else 0 + + news_item = { + "title": title, + "platform": platform_id, + "platform_name": platform_name, + "rank": rank, + "timestamp": fetch_time.strftime("%Y-%m-%d %H:%M:%S") + } + + # 条件性添加 URL 字段 + if include_url: + news_item["url"] = info.get("url", "") + news_item["mobileUrl"] = info.get("mobileUrl", "") + + news_list.append(news_item) + + # 按排名排序 + news_list.sort(key=lambda x: x["rank"]) + + # 限制返回数量 + result = news_list[:limit] + + # 缓存结果 + self.cache.set(cache_key, result) + + return result + + def get_news_by_date( + self, + target_date: datetime, + platforms: Optional[List[str]] = None, + limit: int = 50, + include_url: bool = False + ) -> List[Dict]: + """ + 按指定日期获取新闻 + + Args: + target_date: 目标日期 + platforms: 平台ID列表,None表示所有平台 + limit: 返回条数限制 + include_url: 是否包含URL链接,默认False(节省token) + + Returns: + 新闻列表 + + Raises: + DataNotFoundError: 数据不存在 + + Examples: + >>> service = DataService() + >>> news = service.get_news_by_date( + ... target_date=datetime(2025, 10, 10), + ... platforms=['zhihu'], + ... limit=20 + ... 
) + """ + # 尝试从缓存获取 + date_str = target_date.strftime("%Y-%m-%d") + cache_key = f"news_by_date:{date_str}:{','.join(platforms or [])}:{limit}:{include_url}" + cached = self.cache.get(cache_key, ttl=1800) # 30分钟缓存 + if cached: + return cached + + # 读取指定日期的数据 + all_titles, id_to_name, timestamps = self.parser.read_all_titles_for_date( + date=target_date, + platform_ids=platforms + ) + + # 转换为新闻列表 + news_list = [] + for platform_id, titles in all_titles.items(): + platform_name = id_to_name.get(platform_id, platform_id) + + for title, info in titles.items(): + # 计算平均排名 + avg_rank = sum(info["ranks"]) / len(info["ranks"]) if info["ranks"] else 0 + + news_item = { + "title": title, + "platform": platform_id, + "platform_name": platform_name, + "rank": info["ranks"][0] if info["ranks"] else 0, + "avg_rank": round(avg_rank, 2), + "count": len(info["ranks"]), + "date": date_str + } + + # 条件性添加 URL 字段 + if include_url: + news_item["url"] = info.get("url", "") + news_item["mobileUrl"] = info.get("mobileUrl", "") + + news_list.append(news_item) + + # 按排名排序 + news_list.sort(key=lambda x: x["rank"]) + + # 限制返回数量 + result = news_list[:limit] + + # 缓存结果(历史数据缓存更久) + self.cache.set(cache_key, result) + + return result + + def search_news_by_keyword( + self, + keyword: str, + date_range: Optional[Tuple[datetime, datetime]] = None, + platforms: Optional[List[str]] = None, + limit: Optional[int] = None + ) -> Dict: + """ + 按关键词搜索新闻 + + Args: + keyword: 搜索关键词 + date_range: 日期范围 (start_date, end_date) + platforms: 平台过滤列表 + limit: 返回条数限制(可选) + + Returns: + 搜索结果字典 + + Raises: + DataNotFoundError: 数据不存在 + """ + # 确定搜索日期范围 + if date_range: + start_date, end_date = date_range + else: + # 默认搜索今天 + start_date = end_date = datetime.now() + + # 收集所有匹配的新闻 + results = [] + platform_distribution = Counter() + + # 遍历日期范围 + current_date = start_date + while current_date <= end_date: + try: + all_titles, id_to_name, _ = self.parser.read_all_titles_for_date( + date=current_date, + platform_ids=platforms + ) + + # 搜索包含关键词的标题 + for platform_id, titles in all_titles.items(): + platform_name = id_to_name.get(platform_id, platform_id) + + for title, info in titles.items(): + if keyword.lower() in title.lower(): + # 计算平均排名 + avg_rank = sum(info["ranks"]) / len(info["ranks"]) if info["ranks"] else 0 + + results.append({ + "title": title, + "platform": platform_id, + "platform_name": platform_name, + "ranks": info["ranks"], + "count": len(info["ranks"]), + "avg_rank": round(avg_rank, 2), + "url": info.get("url", ""), + "mobileUrl": info.get("mobileUrl", ""), + "date": current_date.strftime("%Y-%m-%d") + }) + + platform_distribution[platform_id] += 1 + + except DataNotFoundError: + # 该日期没有数据,继续下一天 + pass + + # 下一天 + current_date += timedelta(days=1) + + if not results: + raise DataNotFoundError( + f"未找到包含关键词 '{keyword}' 的新闻", + suggestion="请尝试其他关键词或扩大日期范围" + ) + + # 计算统计信息 + total_ranks = [] + for item in results: + total_ranks.extend(item["ranks"]) + + avg_rank = sum(total_ranks) / len(total_ranks) if total_ranks else 0 + + # 限制返回数量(如果指定) + total_found = len(results) + if limit is not None and limit > 0: + results = results[:limit] + + return { + "results": results, + "total": len(results), + "total_found": total_found, + "statistics": { + "platform_distribution": dict(platform_distribution), + "avg_rank": round(avg_rank, 2), + "keyword": keyword + } + } + + def get_trending_topics( + self, + top_n: int = 10, + mode: str = "current" + ) -> Dict: + """ + 获取个人关注词的新闻出现频率统计 + + 注意:本工具基于 config/frequency_words.txt 中的个人关注词列表进行统计, + 
而不是自动从新闻中提取热点话题。用户可以自定义这个关注词列表。 + + Args: + top_n: 返回TOP N关注词 + mode: 模式 - daily(当日累计), current(最新一批) + + Returns: + 关注词频率统计字典 + + Raises: + DataNotFoundError: 数据不存在 + """ + # 尝试从缓存获取 + cache_key = f"trending_topics:{top_n}:{mode}" + cached = self.cache.get(cache_key, ttl=1800) # 30分钟缓存 + if cached: + return cached + + # 读取今天的数据 + all_titles, id_to_name, timestamps = self.parser.read_all_titles_for_date() + + if not all_titles: + raise DataNotFoundError( + "未找到今天的新闻数据", + suggestion="请确保爬虫已经运行并生成了数据" + ) + + # 加载关键词配置 + word_groups = self.parser.parse_frequency_words() + + # 根据mode选择要处理的标题数据 + titles_to_process = {} + + if mode == "daily": + # daily模式:处理当天所有累计数据 + titles_to_process = all_titles + + elif mode == "current": + # current模式:只处理最新一批数据(最新时间戳的文件) + if timestamps: + # 找出最新的时间戳 + latest_timestamp = max(timestamps.values()) + + # 重新读取,只获取最新时间的数据 + # 这里我们通过timestamps字典反查找最新文件对应的平台 + latest_titles, _, _ = self.parser.read_all_titles_for_date() + + # 由于read_all_titles_for_date返回所有文件的合并数据, + # 我们需要通过timestamps来过滤出最新批次 + # 简化实现:使用当前所有数据作为最新批次 + # (更精确的实现需要解析服务支持按时间过滤) + titles_to_process = latest_titles + else: + titles_to_process = all_titles + + else: + raise ValueError( + f"不支持的模式: {mode}。支持的模式: daily, current" + ) + + # 统计词频 + word_frequency = Counter() + keyword_to_news = {} + + # 遍历要处理的标题 + for platform_id, titles in titles_to_process.items(): + for title in titles.keys(): + # 对每个关键词组进行匹配 + for group in word_groups: + all_words = group.get("required", []) + group.get("normal", []) + + for word in all_words: + if word and word in title: + word_frequency[word] += 1 + + if word not in keyword_to_news: + keyword_to_news[word] = [] + keyword_to_news[word].append(title) + + # 获取TOP N关键词 + top_keywords = word_frequency.most_common(top_n) + + # 构建话题列表 + topics = [] + for keyword, frequency in top_keywords: + matched_news = keyword_to_news.get(keyword, []) + + topics.append({ + "keyword": keyword, + "frequency": frequency, + "matched_news": len(set(matched_news)), # 去重后的新闻数量 + "trend": "stable", # TODO: 需要历史数据来计算趋势 + "weight_score": 0.0 # TODO: 需要实现权重计算 + }) + + # 构建结果 + result = { + "topics": topics, + "generated_at": datetime.now().strftime("%Y-%m-%d %H:%M:%S"), + "mode": mode, + "total_keywords": len(word_frequency), + "description": self._get_mode_description(mode) + } + + # 缓存结果 + self.cache.set(cache_key, result) + + return result + + def _get_mode_description(self, mode: str) -> str: + """获取模式描述""" + descriptions = { + "daily": "当日累计统计", + "current": "最新一批统计" + } + return descriptions.get(mode, "未知模式") + + def get_current_config(self, section: str = "all") -> Dict: + """ + 获取当前系统配置 + + Args: + section: 配置节 - all/crawler/push/keywords/weights + + Returns: + 配置字典 + + Raises: + FileParseError: 配置文件解析错误 + """ + # 尝试从缓存获取 + cache_key = f"config:{section}" + cached = self.cache.get(cache_key, ttl=3600) # 1小时缓存 + if cached: + return cached + + # 解析配置文件 + config_data = self.parser.parse_yaml_config() + word_groups = self.parser.parse_frequency_words() + + # 根据section返回对应配置 + if section == "all" or section == "crawler": + crawler_config = { + "enable_crawler": config_data.get("crawler", {}).get("enable_crawler", True), + "use_proxy": config_data.get("crawler", {}).get("use_proxy", False), + "request_interval": config_data.get("crawler", {}).get("request_interval", 1), + "retry_times": 3, + "platforms": [p["id"] for p in config_data.get("platforms", [])] + } + + if section == "all" or section == "push": + push_config = { + "enable_notification": config_data.get("notification", 
{}).get("enable_notification", True), + "enabled_channels": [], + "message_batch_size": config_data.get("notification", {}).get("message_batch_size", 20), + "push_window": config_data.get("notification", {}).get("push_window", {}) + } + + # 检测已配置的通知渠道 + webhooks = config_data.get("notification", {}).get("webhooks", {}) + if webhooks.get("feishu_url"): + push_config["enabled_channels"].append("feishu") + if webhooks.get("dingtalk_url"): + push_config["enabled_channels"].append("dingtalk") + if webhooks.get("wework_url"): + push_config["enabled_channels"].append("wework") + + if section == "all" or section == "keywords": + keywords_config = { + "word_groups": word_groups, + "total_groups": len(word_groups) + } + + if section == "all" or section == "weights": + weights_config = { + "rank_weight": config_data.get("weight", {}).get("rank_weight", 0.6), + "frequency_weight": config_data.get("weight", {}).get("frequency_weight", 0.3), + "hotness_weight": config_data.get("weight", {}).get("hotness_weight", 0.1) + } + + # 组装结果 + if section == "all": + result = { + "crawler": crawler_config, + "push": push_config, + "keywords": keywords_config, + "weights": weights_config + } + elif section == "crawler": + result = crawler_config + elif section == "push": + result = push_config + elif section == "keywords": + result = keywords_config + elif section == "weights": + result = weights_config + else: + result = {} + + # 缓存结果 + self.cache.set(cache_key, result) + + return result + + def get_system_status(self) -> Dict: + """ + 获取系统运行状态 + + Returns: + 系统状态字典 + """ + # 获取数据统计 + output_dir = self.parser.project_root / "output" + + total_storage = 0 + oldest_record = None + latest_record = None + total_news = 0 + + if output_dir.exists(): + # 遍历日期文件夹 + for date_folder in output_dir.iterdir(): + if date_folder.is_dir(): + # 解析日期 + try: + date_str = date_folder.name + # 格式: YYYY年MM月DD日 + date_match = re.match(r'(\d{4})年(\d{2})月(\d{2})日', date_str) + if date_match: + folder_date = datetime( + int(date_match.group(1)), + int(date_match.group(2)), + int(date_match.group(3)) + ) + + if oldest_record is None or folder_date < oldest_record: + oldest_record = folder_date + if latest_record is None or folder_date > latest_record: + latest_record = folder_date + + except: + pass + + # 计算存储大小 + for item in date_folder.rglob("*"): + if item.is_file(): + total_storage += item.stat().st_size + + # 读取版本信息 + version_file = self.parser.project_root / "version" + version = "unknown" + if version_file.exists(): + try: + with open(version_file, "r") as f: + version = f.read().strip() + except: + pass + + return { + "system": { + "version": version, + "project_root": str(self.parser.project_root) + }, + "data": { + "total_storage": f"{total_storage / 1024 / 1024:.2f} MB", + "oldest_record": oldest_record.strftime("%Y-%m-%d") if oldest_record else None, + "latest_record": latest_record.strftime("%Y-%m-%d") if latest_record else None, + }, + "cache": self.cache.get_stats(), + "health": "healthy" + } diff --git a/mcp_server/services/parser_service.py b/mcp_server/services/parser_service.py new file mode 100644 index 0000000..6bd2969 --- /dev/null +++ b/mcp_server/services/parser_service.py @@ -0,0 +1,355 @@ +""" +文件解析服务 + +提供txt格式新闻数据和YAML配置文件的解析功能。 +""" + +import re +from pathlib import Path +from typing import Dict, List, Tuple, Optional +from datetime import datetime + +import yaml + +from ..utils.errors import FileParseError, DataNotFoundError +from .cache_service import get_cache + + +class ParserService: + """文件解析服务类""" + + def 
__init__(self, project_root: str = None): + """ + 初始化解析服务 + + Args: + project_root: 项目根目录,默认为当前目录的父目录 + """ + if project_root is None: + # 获取当前文件所在目录的父目录的父目录 + current_file = Path(__file__) + self.project_root = current_file.parent.parent.parent + else: + self.project_root = Path(project_root) + + # 初始化缓存服务 + self.cache = get_cache() + + @staticmethod + def clean_title(title: str) -> str: + """ + 清理标题文本 + + Args: + title: 原始标题 + + Returns: + 清理后的标题 + """ + # 移除多余空白 + title = re.sub(r'\s+', ' ', title) + # 移除特殊字符 + title = title.strip() + return title + + def parse_txt_file(self, file_path: Path) -> Tuple[Dict, Dict]: + """ + 解析单个txt文件的标题数据 + + Args: + file_path: txt文件路径 + + Returns: + (titles_by_id, id_to_name) 元组 + - titles_by_id: {platform_id: {title: {ranks, url, mobileUrl}}} + - id_to_name: {platform_id: platform_name} + + Raises: + FileParseError: 文件解析错误 + """ + if not file_path.exists(): + raise FileParseError(str(file_path), "文件不存在") + + titles_by_id = {} + id_to_name = {} + + try: + with open(file_path, "r", encoding="utf-8") as f: + content = f.read() + sections = content.split("\n\n") + + for section in sections: + if not section.strip() or "==== 以下ID请求失败 ====" in section: + continue + + lines = section.strip().split("\n") + if len(lines) < 2: + continue + + # 解析header: id | name 或 id + header_line = lines[0].strip() + if " | " in header_line: + parts = header_line.split(" | ", 1) + source_id = parts[0].strip() + name = parts[1].strip() + id_to_name[source_id] = name + else: + source_id = header_line + id_to_name[source_id] = source_id + + titles_by_id[source_id] = {} + + # 解析标题行 + for line in lines[1:]: + if line.strip(): + try: + title_part = line.strip() + rank = None + + # 提取排名 + if ". " in title_part and title_part.split(". ")[0].isdigit(): + rank_str, title_part = title_part.split(". 
", 1) + rank = int(rank_str) + + # 提取 MOBILE URL + mobile_url = "" + if " [MOBILE:" in title_part: + title_part, mobile_part = title_part.rsplit(" [MOBILE:", 1) + if mobile_part.endswith("]"): + mobile_url = mobile_part[:-1] + + # 提取 URL + url = "" + if " [URL:" in title_part: + title_part, url_part = title_part.rsplit(" [URL:", 1) + if url_part.endswith("]"): + url = url_part[:-1] + + title = self.clean_title(title_part.strip()) + ranks = [rank] if rank is not None else [1] + + titles_by_id[source_id][title] = { + "ranks": ranks, + "url": url, + "mobileUrl": mobile_url, + } + + except Exception as e: + # 忽略单行解析错误 + continue + + except Exception as e: + raise FileParseError(str(file_path), str(e)) + + return titles_by_id, id_to_name + + def get_date_folder_name(self, date: datetime = None) -> str: + """ + 获取日期文件夹名称 + + Args: + date: 日期对象,默认为今天 + + Returns: + 文件夹名称,格式: YYYY年MM月DD日 + """ + if date is None: + date = datetime.now() + return date.strftime("%Y年%m月%d日") + + def read_all_titles_for_date( + self, + date: datetime = None, + platform_ids: Optional[List[str]] = None + ) -> Tuple[Dict, Dict, Dict]: + """ + 读取指定日期的所有标题文件(带缓存) + + Args: + date: 日期对象,默认为今天 + platform_ids: 平台ID列表,None表示所有平台 + + Returns: + (all_titles, id_to_name, all_timestamps) 元组 + - all_titles: {platform_id: {title: {ranks, url, mobileUrl, ...}}} + - id_to_name: {platform_id: platform_name} + - all_timestamps: {filename: timestamp} + + Raises: + DataNotFoundError: 数据不存在 + """ + # 生成缓存键 + date_str = self.get_date_folder_name(date) + platform_key = ','.join(sorted(platform_ids)) if platform_ids else 'all' + cache_key = f"read_all_titles:{date_str}:{platform_key}" + + # 尝试从缓存获取 + # 对于历史数据(非今天),使用更长的缓存时间(1小时) + # 对于今天的数据,使用较短的缓存时间(15分钟),因为可能有新数据 + is_today = (date is None) or (date.date() == datetime.now().date()) + ttl = 900 if is_today else 3600 # 15分钟 vs 1小时 + + cached = self.cache.get(cache_key, ttl=ttl) + if cached: + return cached + + # 缓存未命中,读取文件 + date_folder = self.get_date_folder_name(date) + txt_dir = self.project_root / "output" / date_folder / "txt" + + if not txt_dir.exists(): + raise DataNotFoundError( + f"未找到 {date_folder} 的数据目录", + suggestion="请先运行爬虫或检查日期是否正确" + ) + + all_titles = {} + id_to_name = {} + all_timestamps = {} + + # 读取所有txt文件 + txt_files = sorted(txt_dir.glob("*.txt")) + + if not txt_files: + raise DataNotFoundError( + f"{date_folder} 没有数据文件", + suggestion="请等待爬虫任务完成" + ) + + for txt_file in txt_files: + try: + titles_by_id, file_id_to_name = self.parse_txt_file(txt_file) + + # 更新id_to_name + id_to_name.update(file_id_to_name) + + # 合并标题数据 + for platform_id, titles in titles_by_id.items(): + # 如果指定了平台过滤 + if platform_ids and platform_id not in platform_ids: + continue + + if platform_id not in all_titles: + all_titles[platform_id] = {} + + for title, info in titles.items(): + if title in all_titles[platform_id]: + # 合并排名 + all_titles[platform_id][title]["ranks"].extend(info["ranks"]) + else: + all_titles[platform_id][title] = info.copy() + + # 记录文件时间戳 + all_timestamps[txt_file.name] = txt_file.stat().st_mtime + + except Exception as e: + # 忽略单个文件的解析错误,继续处理其他文件 + print(f"Warning: 解析文件 {txt_file} 失败: {e}") + continue + + if not all_titles: + raise DataNotFoundError( + f"{date_folder} 没有有效的数据", + suggestion="请检查数据文件格式或重新运行爬虫" + ) + + # 缓存结果 + result = (all_titles, id_to_name, all_timestamps) + self.cache.set(cache_key, result) + + return result + + def parse_yaml_config(self, config_path: str = None) -> dict: + """ + 解析YAML配置文件 + + Args: + config_path: 配置文件路径,默认为 config/config.yaml + + Returns: + 
配置字典 + + Raises: + FileParseError: 配置文件解析错误 + """ + if config_path is None: + config_path = self.project_root / "config" / "config.yaml" + else: + config_path = Path(config_path) + + if not config_path.exists(): + raise FileParseError(str(config_path), "配置文件不存在") + + try: + with open(config_path, "r", encoding="utf-8") as f: + config_data = yaml.safe_load(f) + return config_data + except Exception as e: + raise FileParseError(str(config_path), str(e)) + + def parse_frequency_words(self, words_file: str = None) -> List[Dict]: + """ + 解析关键词配置文件 + + Args: + words_file: 关键词文件路径,默认为 config/frequency_words.txt + + Returns: + 词组列表 + + Raises: + FileParseError: 文件解析错误 + """ + if words_file is None: + words_file = self.project_root / "config" / "frequency_words.txt" + else: + words_file = Path(words_file) + + if not words_file.exists(): + return [] + + word_groups = [] + + try: + with open(words_file, "r", encoding="utf-8") as f: + for line in f: + line = line.strip() + if not line or line.startswith("#"): + continue + + # 使用 | 分隔符 + parts = [p.strip() for p in line.split("|")] + if not parts: + continue + + group = { + "required": [], + "normal": [], + "filter_words": [] + } + + for part in parts: + if not part: + continue + + words = [w.strip() for w in part.split(",")] + for word in words: + if not word: + continue + if word.endswith("+"): + # 必须词 + group["required"].append(word[:-1]) + elif word.endswith("!"): + # 过滤词 + group["filter_words"].append(word[:-1]) + else: + # 普通词 + group["normal"].append(word) + + if group["required"] or group["normal"]: + word_groups.append(group) + + except Exception as e: + raise FileParseError(str(words_file), str(e)) + + return word_groups diff --git a/mcp_server/tools/__init__.py b/mcp_server/tools/__init__.py new file mode 100644 index 0000000..6996540 --- /dev/null +++ b/mcp_server/tools/__init__.py @@ -0,0 +1,5 @@ +""" +MCP 工具模块 + +包含所有MCP工具的实现。 +""" diff --git a/mcp_server/tools/analytics.py b/mcp_server/tools/analytics.py new file mode 100644 index 0000000..5d84b7e --- /dev/null +++ b/mcp_server/tools/analytics.py @@ -0,0 +1,1989 @@ +""" +高级数据分析工具 + +提供热度趋势分析、平台对比、关键词共现、情感分析等高级分析功能。 +""" + +import re +from collections import Counter, defaultdict +from datetime import datetime, timedelta +from typing import Dict, List, Optional +from difflib import SequenceMatcher + +from ..services.data_service import DataService +from ..utils.validators import ( + validate_platforms, + validate_limit, + validate_keyword, + validate_top_n, + validate_date_range +) +from ..utils.errors import MCPError, InvalidParameterError, DataNotFoundError + + +def calculate_news_weight(news_data: Dict, rank_threshold: int = 5) -> float: + """ + 计算新闻权重(用于排序) + + 基于 main.py 的权重算法实现,综合考虑: + - 排名权重 (60%):新闻在榜单中的排名 + - 频次权重 (30%):新闻出现的次数 + - 热度权重 (10%):高排名出现的比例 + + Args: + news_data: 新闻数据字典,包含 ranks 和 count 字段 + rank_threshold: 高排名阈值,默认5 + + Returns: + 权重分数(0-100之间的浮点数) + """ + ranks = news_data.get("ranks", []) + if not ranks: + return 0.0 + + count = news_data.get("count", len(ranks)) + + # 权重配置(与 config.yaml 保持一致) + RANK_WEIGHT = 0.6 + FREQUENCY_WEIGHT = 0.3 + HOTNESS_WEIGHT = 0.1 + + # 1. 排名权重:Σ(11 - min(rank, 10)) / 出现次数 + rank_scores = [] + for rank in ranks: + score = 11 - min(rank, 10) + rank_scores.append(score) + + rank_weight = sum(rank_scores) / len(ranks) if ranks else 0 + + # 2. 频次权重:min(出现次数, 10) × 10 + frequency_weight = min(count, 10) * 10 + + # 3. 
热度加成:高排名次数 / 总出现次数 × 100 + high_rank_count = sum(1 for rank in ranks if rank <= rank_threshold) + hotness_ratio = high_rank_count / len(ranks) if ranks else 0 + hotness_weight = hotness_ratio * 100 + + # 综合权重 + total_weight = ( + rank_weight * RANK_WEIGHT + + frequency_weight * FREQUENCY_WEIGHT + + hotness_weight * HOTNESS_WEIGHT + ) + + return total_weight + + +class AnalyticsTools: + """高级数据分析工具类""" + + def __init__(self, project_root: str = None): + """ + 初始化分析工具 + + Args: + project_root: 项目根目录 + """ + self.data_service = DataService(project_root) + + def analyze_data_insights_unified( + self, + insight_type: str = "platform_compare", + topic: Optional[str] = None, + date_range: Optional[Dict[str, str]] = None, + min_frequency: int = 3, + top_n: int = 20 + ) -> Dict: + """ + 统一数据洞察分析工具 - 整合多种数据分析模式 + + Args: + insight_type: 洞察类型,可选值: + - "platform_compare": 平台对比分析(对比不同平台对话题的关注度) + - "platform_activity": 平台活跃度统计(统计各平台发布频率和活跃时间) + - "keyword_cooccur": 关键词共现分析(分析关键词同时出现的模式) + topic: 话题关键词(可选,platform_compare模式适用) + date_range: 日期范围,格式: {"start": "YYYY-MM-DD", "end": "YYYY-MM-DD"} + min_frequency: 最小共现频次(keyword_cooccur模式),默认3 + top_n: 返回TOP N结果(keyword_cooccur模式),默认20 + + Returns: + 数据洞察分析结果字典 + + Examples: + - analyze_data_insights_unified(insight_type="platform_compare", topic="人工智能") + - analyze_data_insights_unified(insight_type="platform_activity", date_range={...}) + - analyze_data_insights_unified(insight_type="keyword_cooccur", min_frequency=5) + """ + try: + # 参数验证 + if insight_type not in ["platform_compare", "platform_activity", "keyword_cooccur"]: + raise InvalidParameterError( + f"无效的洞察类型: {insight_type}", + suggestion="支持的类型: platform_compare, platform_activity, keyword_cooccur" + ) + + # 根据洞察类型调用相应方法 + if insight_type == "platform_compare": + return self.compare_platforms( + topic=topic, + date_range=date_range + ) + elif insight_type == "platform_activity": + return self.get_platform_activity_stats( + date_range=date_range + ) + else: # keyword_cooccur + return self.analyze_keyword_cooccurrence( + min_frequency=min_frequency, + top_n=top_n + ) + + except MCPError as e: + return { + "success": False, + "error": e.to_dict() + } + except Exception as e: + return { + "success": False, + "error": { + "code": "INTERNAL_ERROR", + "message": str(e) + } + } + + def analyze_topic_trend_unified( + self, + topic: str, + analysis_type: str = "trend", + time_range: str = "7d", + granularity: str = "day", + threshold: float = 3.0, + time_window: int = 24, + lookback_days: int = 7, + lookahead_hours: int = 6, + confidence_threshold: float = 0.7 + ) -> Dict: + """ + 统一话题趋势分析工具 - 整合多种趋势分析模式 + + Args: + topic: 话题关键词(必需) + analysis_type: 分析类型,可选值: + - "trend": 热度趋势分析(追踪话题的热度变化) + - "lifecycle": 生命周期分析(从出现到消失的完整周期) + - "viral": 异常热度检测(识别突然爆火的话题) + - "predict": 话题预测(预测未来可能的热点) + time_range: 时间范围(trend模式),默认"7d"(7d/24h/1w/1m/2m) + granularity: 时间粒度(trend模式),默认"day"(hour/day) + threshold: 热度突增倍数阈值(viral模式),默认3.0 + time_window: 检测时间窗口小时数(viral模式),默认24 + lookback_days: 回溯天数(lifecycle模式),默认7 + lookahead_hours: 预测未来小时数(predict模式),默认6 + confidence_threshold: 置信度阈值(predict模式),默认0.7 + + Returns: + 趋势分析结果字典 + + Examples: + - analyze_topic_trend_unified(topic="人工智能", analysis_type="trend", time_range="7d") + - analyze_topic_trend_unified(topic="特斯拉", analysis_type="lifecycle", lookback_days=7) + - analyze_topic_trend_unified(topic="比特币", analysis_type="viral", threshold=3.0) + - analyze_topic_trend_unified(topic="ChatGPT", analysis_type="predict", lookahead_hours=6) + """ + try: + # 参数验证 + topic = 
validate_keyword(topic) + + if analysis_type not in ["trend", "lifecycle", "viral", "predict"]: + raise InvalidParameterError( + f"无效的分析类型: {analysis_type}", + suggestion="支持的类型: trend, lifecycle, viral, predict" + ) + + # 根据分析类型调用相应方法 + if analysis_type == "trend": + return self.get_topic_trend_analysis( + topic=topic, + time_range=time_range, + granularity=granularity + ) + elif analysis_type == "lifecycle": + return self.analyze_topic_lifecycle( + topic=topic, + lookback_days=lookback_days + ) + elif analysis_type == "viral": + # viral模式不需要topic参数,使用通用检测 + return self.detect_viral_topics( + threshold=threshold, + time_window=time_window + ) + else: # predict + # predict模式不需要topic参数,使用通用预测 + return self.predict_trending_topics( + lookahead_hours=lookahead_hours, + confidence_threshold=confidence_threshold + ) + + except MCPError as e: + return { + "success": False, + "error": e.to_dict() + } + except Exception as e: + return { + "success": False, + "error": { + "code": "INTERNAL_ERROR", + "message": str(e) + } + } + + def get_topic_trend_analysis( + self, + topic: str, + time_range: str = "7d", + granularity: str = "day" + ) -> Dict: + """ + 热度趋势分析 - 追踪特定话题的热度变化趋势 + + Args: + topic: 话题关键词 + time_range: 时间范围,格式:7d(7天)、24h(24小时)、1w(1周)、1m(1个月)、2m(2个月) + granularity: 时间粒度,仅支持 day(天) + + Returns: + 趋势分析结果字典 + + Examples: + 用户询问示例: + - "帮我分析一下'人工智能'这个话题最近一周的热度趋势" + - "查看'比特币'过去一周的热度变化" + - "看看'iPhone'最近7天的趋势如何" + - "分析'特斯拉'最近一个月的热度趋势" + - "查看'ChatGPT'过去2个月的趋势变化" + + 代码调用示例: + >>> tools = AnalyticsTools() + >>> # 分析7天趋势 + >>> result = tools.get_topic_trend_analysis( + ... topic="人工智能", + ... time_range="7d", + ... granularity="day" + ... ) + >>> # 分析1个月趋势 + >>> result = tools.get_topic_trend_analysis( + ... topic="特斯拉", + ... time_range="1m", + ... granularity="day" + ... 
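)
+ >>> # 补充示例(仅为用法示意,假设本地已有相应日期的数据):
+ >>> # time_range 也支持 "24h" 等写法,内部会换算为天数(24h ≈ 1 天)
+ >>> result = tools.get_topic_trend_analysis(
+ ... topic="ChatGPT",
+ ... time_range="24h"
+ ... 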
) + >>> print(result['trend_data']) + """ + try: + # 验证参数 + topic = validate_keyword(topic) + + # 验证粒度参数(只支持day) + if granularity != "day": + from ..utils.errors import InvalidParameterError + raise InvalidParameterError( + f"不支持的粒度参数: {granularity}", + suggestion="当前仅支持 'day' 粒度,因为底层数据按天聚合" + ) + + # 解析时间范围 + days = self._parse_time_range(time_range) + + # 收集趋势数据 + trend_data = [] + start_date = datetime.now() - timedelta(days=days) + current_date = start_date + + while current_date <= datetime.now(): + try: + all_titles, _, _ = self.data_service.parser.read_all_titles_for_date( + date=current_date + ) + + # 统计该时间点的话题出现次数 + count = 0 + matched_titles = [] + + for _, titles in all_titles.items(): + for title in titles.keys(): + if topic.lower() in title.lower(): + count += 1 + matched_titles.append(title) + + trend_data.append({ + "date": current_date.strftime("%Y-%m-%d"), + "count": count, + "sample_titles": matched_titles[:3] # 只保留前3个样本 + }) + + except DataNotFoundError: + trend_data.append({ + "date": current_date.strftime("%Y-%m-%d"), + "count": 0, + "sample_titles": [] + }) + + # 按天增加时间 + current_date += timedelta(days=1) + + # 计算趋势指标 + counts = [item["count"] for item in trend_data] + + if len(counts) >= 2: + # 计算涨跌幅度 + first_non_zero = next((c for c in counts if c > 0), 0) + last_count = counts[-1] + + if first_non_zero > 0: + change_rate = ((last_count - first_non_zero) / first_non_zero) * 100 + else: + change_rate = 0 + + # 找到峰值时间 + max_count = max(counts) + peak_index = counts.index(max_count) + peak_time = trend_data[peak_index]["date"] + else: + change_rate = 0 + peak_time = None + max_count = 0 + + return { + "success": True, + "topic": topic, + "time_range": time_range, + "granularity": granularity, + "trend_data": trend_data, + "statistics": { + "total_mentions": sum(counts), + "average_mentions": round(sum(counts) / len(counts), 2) if counts else 0, + "peak_count": max_count, + "peak_time": peak_time, + "change_rate": round(change_rate, 2) + }, + "trend_direction": "上升" if change_rate > 10 else "下降" if change_rate < -10 else "稳定" + } + + except MCPError as e: + return { + "success": False, + "error": e.to_dict() + } + except Exception as e: + return { + "success": False, + "error": { + "code": "INTERNAL_ERROR", + "message": str(e) + } + } + + def compare_platforms( + self, + topic: Optional[str] = None, + date_range: Optional[Dict[str, str]] = None + ) -> Dict: + """ + 平台对比分析 - 对比不同平台对同一话题的关注度 + + Args: + topic: 话题关键词(可选,不指定则对比整体活跃度) + date_range: 日期范围,格式: {"start": "YYYY-MM-DD", "end": "YYYY-MM-DD"} + + Returns: + 平台对比分析结果 + + Examples: + 用户询问示例: + - "对比一下各个平台对'人工智能'话题的关注度" + - "看看知乎和微博哪个平台更关注科技新闻" + - "分析各平台今天的热点分布" + + 代码调用示例: + >>> tools = AnalyticsTools() + >>> result = tools.compare_platforms( + ... topic="人工智能", + ... date_range={"start": "2025-10-01", "end": "2025-10-11"} + ... 
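)
+ >>> # 补充示例(示意:不传 date_range 时默认只统计今天的数据;话题词仅为假设):
+ >>> result = tools.compare_platforms(
+ ... topic="新能源"
+ ... 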
) + >>> print(result['platform_stats']) + """ + try: + # 参数验证 + if topic: + topic = validate_keyword(topic) + date_range_tuple = validate_date_range(date_range) + + # 确定日期范围 + if date_range_tuple: + start_date, end_date = date_range_tuple + else: + start_date = end_date = datetime.now() + + # 收集各平台数据 + platform_stats = defaultdict(lambda: { + "total_news": 0, + "topic_mentions": 0, + "unique_titles": set(), + "top_keywords": Counter() + }) + + # 遍历日期范围 + current_date = start_date + while current_date <= end_date: + try: + all_titles, id_to_name, _ = self.data_service.parser.read_all_titles_for_date( + date=current_date + ) + + for platform_id, titles in all_titles.items(): + platform_name = id_to_name.get(platform_id, platform_id) + + for title in titles.keys(): + platform_stats[platform_name]["total_news"] += 1 + platform_stats[platform_name]["unique_titles"].add(title) + + # 如果指定了话题,统计包含话题的新闻 + if topic and topic.lower() in title.lower(): + platform_stats[platform_name]["topic_mentions"] += 1 + + # 提取关键词(简单分词) + keywords = self._extract_keywords(title) + platform_stats[platform_name]["top_keywords"].update(keywords) + + except DataNotFoundError: + pass + + current_date += timedelta(days=1) + + # 转换为可序列化的格式 + result_stats = {} + for platform, stats in platform_stats.items(): + coverage_rate = 0 + if stats["total_news"] > 0: + coverage_rate = (stats["topic_mentions"] / stats["total_news"]) * 100 + + result_stats[platform] = { + "total_news": stats["total_news"], + "topic_mentions": stats["topic_mentions"], + "unique_titles": len(stats["unique_titles"]), + "coverage_rate": round(coverage_rate, 2), + "top_keywords": [ + {"keyword": k, "count": v} + for k, v in stats["top_keywords"].most_common(5) + ] + } + + # 找出各平台独有的热点 + unique_topics = self._find_unique_topics(platform_stats) + + return { + "success": True, + "topic": topic, + "date_range": { + "start": start_date.strftime("%Y-%m-%d"), + "end": end_date.strftime("%Y-%m-%d") + }, + "platform_stats": result_stats, + "unique_topics": unique_topics, + "total_platforms": len(result_stats) + } + + except MCPError as e: + return { + "success": False, + "error": e.to_dict() + } + except Exception as e: + return { + "success": False, + "error": { + "code": "INTERNAL_ERROR", + "message": str(e) + } + } + + def analyze_keyword_cooccurrence( + self, + min_frequency: int = 3, + top_n: int = 20 + ) -> Dict: + """ + 关键词共现分析 - 分析哪些关键词经常同时出现 + + Args: + min_frequency: 最小共现频次 + top_n: 返回TOP N关键词对 + + Returns: + 关键词共现分析结果 + + Examples: + 用户询问示例: + - "分析一下哪些关键词经常一起出现" + - "看看'人工智能'经常和哪些词一起出现" + - "找出今天新闻中的关键词关联" + + 代码调用示例: + >>> tools = AnalyticsTools() + >>> result = tools.analyze_keyword_cooccurrence( + ... min_frequency=5, + ... top_n=15 + ... 
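)
+ >>> # 补充示例(示意:本工具只统计今天的数据;min_frequency 越大,返回的词对越少):
+ >>> result = tools.analyze_keyword_cooccurrence(
+ ... min_frequency=3,
+ ... top_n=10
+ ... 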
) + >>> print(result['cooccurrence_pairs']) + """ + try: + # 参数验证 + min_frequency = validate_limit(min_frequency, default=3, max_limit=100) + top_n = validate_top_n(top_n, default=20) + + # 读取今天的数据 + all_titles, _, _ = self.data_service.parser.read_all_titles_for_date() + + # 关键词共现统计 + cooccurrence = Counter() + keyword_titles = defaultdict(list) + + for platform_id, titles in all_titles.items(): + for title in titles.keys(): + # 提取关键词 + keywords = self._extract_keywords(title) + + # 记录每个关键词出现的标题 + for kw in keywords: + keyword_titles[kw].append(title) + + # 计算两两共现 + if len(keywords) >= 2: + for i, kw1 in enumerate(keywords): + for kw2 in keywords[i+1:]: + # 统一排序,避免重复 + pair = tuple(sorted([kw1, kw2])) + cooccurrence[pair] += 1 + + # 过滤低频共现 + filtered_pairs = [ + (pair, count) for pair, count in cooccurrence.items() + if count >= min_frequency + ] + + # 排序并取TOP N + top_pairs = sorted(filtered_pairs, key=lambda x: x[1], reverse=True)[:top_n] + + # 构建结果 + result_pairs = [] + for (kw1, kw2), count in top_pairs: + # 找出同时包含两个关键词的标题样本 + titles_with_both = [ + title for title in keyword_titles[kw1] + if kw2 in self._extract_keywords(title) + ] + + result_pairs.append({ + "keyword1": kw1, + "keyword2": kw2, + "cooccurrence_count": count, + "sample_titles": titles_with_both[:3] + }) + + return { + "success": True, + "cooccurrence_pairs": result_pairs, + "total_pairs": len(result_pairs), + "min_frequency": min_frequency, + "generated_at": datetime.now().strftime("%Y-%m-%d %H:%M:%S") + } + + except MCPError as e: + return { + "success": False, + "error": e.to_dict() + } + except Exception as e: + return { + "success": False, + "error": { + "code": "INTERNAL_ERROR", + "message": str(e) + } + } + + def analyze_sentiment( + self, + topic: Optional[str] = None, + platforms: Optional[List[str]] = None, + date_range: Optional[Dict[str, str]] = None, + limit: int = 50, + sort_by_weight: bool = True, + include_url: bool = False + ) -> Dict: + """ + 情感倾向分析 - 生成用于 AI 情感分析的结构化提示词 + + 本工具收集新闻数据并生成优化的 AI 提示词,你可以将其发送给 AI 进行深度情感分析。 + + Args: + topic: 话题关键词(可选),只分析包含该关键词的新闻 + platforms: 平台过滤列表(可选),如 ['zhihu', 'weibo'] + date_range: 日期范围(可选),格式: {"start": "YYYY-MM-DD", "end": "YYYY-MM-DD"} + 不指定则默认查询今天的数据 + limit: 返回新闻数量限制,默认50,最大100 + sort_by_weight: 是否按权重排序,默认True(推荐) + include_url: 是否包含URL链接,默认False(节省token) + + Returns: + 包含 AI 提示词和新闻数据的结构化结果 + + Examples: + 用户询问示例: + - "分析一下今天新闻的情感倾向" + - "看看'特斯拉'相关新闻是正面还是负面的" + - "分析各平台对'人工智能'的情感态度" + - "看看'特斯拉'相关新闻是正面还是负面的,请选择一周内的前10条新闻来分析" + + 代码调用示例: + >>> tools = AnalyticsTools() + >>> # 分析今天的特斯拉新闻,返回前10条 + >>> result = tools.analyze_sentiment( + ... topic="特斯拉", + ... limit=10 + ... ) + >>> # 分析一周内的特斯拉新闻,返回前10条按权重排序 + >>> result = tools.analyze_sentiment( + ... topic="特斯拉", + ... date_range={"start": "2025-10-06", "end": "2025-10-13"}, + ... limit=10 + ... 
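)
+ >>> # 补充示例(示意:按平台过滤;平台 ID 需与本地 output 数据一致,此处仅为假设):
+ >>> result = tools.analyze_sentiment(
+ ... topic="特斯拉",
+ ... platforms=["zhihu", "weibo"],
+ ... limit=10
+ ... 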
) + >>> print(result['ai_prompt']) # 获取生成的提示词 + """ + try: + # 参数验证 + if topic: + topic = validate_keyword(topic) + platforms = validate_platforms(platforms) + limit = validate_limit(limit, default=50) + + # 处理日期范围 + if date_range: + date_range_tuple = validate_date_range(date_range) + start_date, end_date = date_range_tuple + else: + # 默认今天 + start_date = end_date = datetime.now() + + # 收集新闻数据(支持多天) + all_news_items = [] + current_date = start_date + + while current_date <= end_date: + try: + all_titles, id_to_name, _ = self.data_service.parser.read_all_titles_for_date( + date=current_date, + platform_ids=platforms + ) + + # 收集该日期的新闻 + for platform_id, titles in all_titles.items(): + platform_name = id_to_name.get(platform_id, platform_id) + for title, info in titles.items(): + # 如果指定了话题,只收集包含话题的标题 + if topic and topic.lower() not in title.lower(): + continue + + news_item = { + "platform": platform_name, + "title": title, + "ranks": info.get("ranks", []), + "count": len(info.get("ranks", [])), + "date": current_date.strftime("%Y-%m-%d") + } + + # 条件性添加 URL 字段 + if include_url: + news_item["url"] = info.get("url", "") + news_item["mobileUrl"] = info.get("mobileUrl", "") + + all_news_items.append(news_item) + + except DataNotFoundError: + # 该日期没有数据,继续下一天 + pass + + # 下一天 + current_date += timedelta(days=1) + + if not all_news_items: + time_desc = "今天" if start_date == end_date else f"{start_date.strftime('%Y-%m-%d')} 至 {end_date.strftime('%Y-%m-%d')}" + raise DataNotFoundError( + f"未找到相关新闻({time_desc})", + suggestion="请尝试其他话题、日期范围或平台" + ) + + # 去重(同一标题只保留一次) + unique_news = {} + for item in all_news_items: + key = f"{item['platform']}::{item['title']}" + if key not in unique_news: + unique_news[key] = item + else: + # 合并 ranks(如果同一新闻在多天出现) + existing = unique_news[key] + existing["ranks"].extend(item["ranks"]) + existing["count"] = len(existing["ranks"]) + + deduplicated_news = list(unique_news.values()) + + # 按权重排序(如果启用) + if sort_by_weight: + deduplicated_news.sort( + key=lambda x: calculate_news_weight(x), + reverse=True + ) + + # 限制返回数量 + selected_news = deduplicated_news[:limit] + + # 生成 AI 提示词 + ai_prompt = self._create_sentiment_analysis_prompt( + news_data=selected_news, + topic=topic + ) + + # 构建时间范围描述 + if start_date == end_date: + time_range_desc = start_date.strftime("%Y-%m-%d") + else: + time_range_desc = f"{start_date.strftime('%Y-%m-%d')} 至 {end_date.strftime('%Y-%m-%d')}" + + result = { + "success": True, + "method": "ai_prompt_generation", + "summary": { + "total_found": len(deduplicated_news), + "returned_count": len(selected_news), + "requested_limit": limit, + "duplicates_removed": len(all_news_items) - len(deduplicated_news), + "topic": topic, + "time_range": time_range_desc, + "platforms": list(set(item["platform"] for item in selected_news)), + "sorted_by_weight": sort_by_weight + }, + "ai_prompt": ai_prompt, + "news_sample": selected_news, + "usage_note": "请将 ai_prompt 字段的内容发送给 AI 进行情感分析" + } + + # 如果返回数量少于请求数量,增加提示 + if len(selected_news) < limit and len(deduplicated_news) >= limit: + result["note"] = "返回数量少于请求数量是因为去重逻辑(同一标题在不同平台只保留一次)" + elif len(deduplicated_news) < limit: + result["note"] = f"在指定时间范围内仅找到 {len(deduplicated_news)} 条匹配的新闻" + + return result + + except MCPError as e: + return { + "success": False, + "error": e.to_dict() + } + except Exception as e: + return { + "success": False, + "error": { + "code": "INTERNAL_ERROR", + "message": str(e) + } + } + + def _create_sentiment_analysis_prompt( + self, + news_data: List[Dict], + topic: Optional[str] + ) 
-> str: + """ + 创建情感分析的 AI 提示词 + + Args: + news_data: 新闻数据列表(已排序和限制数量) + topic: 话题关键词 + + Returns: + 格式化的 AI 提示词 + """ + # 按平台分组 + platform_news = defaultdict(list) + for item in news_data: + platform_news[item["platform"]].append({ + "title": item["title"], + "date": item.get("date", "") + }) + + # 构建提示词 + prompt_parts = [] + + # 1. 任务说明 + if topic: + prompt_parts.append(f"请分析以下关于「{topic}」的新闻标题的情感倾向。") + else: + prompt_parts.append("请分析以下新闻标题的情感倾向。") + + prompt_parts.append("") + prompt_parts.append("分析要求:") + prompt_parts.append("1. 识别每条新闻的情感倾向(正面/负面/中性)") + prompt_parts.append("2. 统计各情感类别的数量和百分比") + prompt_parts.append("3. 分析不同平台的情感差异") + prompt_parts.append("4. 总结整体情感趋势") + prompt_parts.append("5. 列举典型的正面和负面新闻样本") + prompt_parts.append("") + + # 2. 数据概览 + prompt_parts.append(f"数据概览:") + prompt_parts.append(f"- 总新闻数:{len(news_data)}") + prompt_parts.append(f"- 覆盖平台:{len(platform_news)}") + + # 时间范围 + dates = set(item.get("date", "") for item in news_data if item.get("date")) + if dates: + date_list = sorted(dates) + if len(date_list) == 1: + prompt_parts.append(f"- 时间范围:{date_list[0]}") + else: + prompt_parts.append(f"- 时间范围:{date_list[0]} 至 {date_list[-1]}") + + prompt_parts.append("") + + # 3. 按平台展示新闻 + prompt_parts.append("新闻列表(按平台分类,已按重要性排序):") + prompt_parts.append("") + + for platform, items in sorted(platform_news.items()): + prompt_parts.append(f"【{platform}】({len(items)} 条)") + for i, item in enumerate(items, 1): + title = item["title"] + date_str = f" [{item['date']}]" if item.get("date") else "" + prompt_parts.append(f"{i}. {title}{date_str}") + prompt_parts.append("") + + # 4. 输出格式说明 + prompt_parts.append("请按以下格式输出分析结果:") + prompt_parts.append("") + prompt_parts.append("## 情感分布统计") + prompt_parts.append("- 正面:XX条 (XX%)") + prompt_parts.append("- 负面:XX条 (XX%)") + prompt_parts.append("- 中性:XX条 (XX%)") + prompt_parts.append("") + prompt_parts.append("## 平台情感对比") + prompt_parts.append("[各平台的情感倾向差异]") + prompt_parts.append("") + prompt_parts.append("## 整体情感趋势") + prompt_parts.append("[总体分析和关键发现]") + prompt_parts.append("") + prompt_parts.append("## 典型样本") + prompt_parts.append("正面新闻样本:") + prompt_parts.append("[列举3-5条]") + prompt_parts.append("") + prompt_parts.append("负面新闻样本:") + prompt_parts.append("[列举3-5条]") + + return "\n".join(prompt_parts) + + def find_similar_news( + self, + reference_title: str, + threshold: float = 0.6, + limit: int = 50, + include_url: bool = False + ) -> Dict: + """ + 相似新闻查找 - 基于标题相似度查找相关新闻 + + Args: + reference_title: 参考标题 + threshold: 相似度阈值(0-1之间) + limit: 返回条数限制,默认50 + include_url: 是否包含URL链接,默认False(节省token) + + Returns: + 相似新闻列表 + + Examples: + 用户询问示例: + - "找出和'特斯拉降价'相似的新闻" + - "查找关于iPhone发布的类似报道" + - "看看有没有和这条新闻相似的报道" + + 代码调用示例: + >>> tools = AnalyticsTools() + >>> result = tools.find_similar_news( + ... reference_title="特斯拉宣布降价", + ... threshold=0.6, + ... limit=10 + ... 
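)
+ >>> # 补充示例(示意:threshold 推荐 0.5-0.8;若没有达到阈值的新闻会返回错误提示):
+ >>> result = tools.find_similar_news(
+ ... reference_title="iPhone 新品发布",
+ ... threshold=0.5
+ ... 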
) + >>> print(result['similar_news']) + """ + try: + # 参数验证 + reference_title = validate_keyword(reference_title) + + if not 0 <= threshold <= 1: + raise InvalidParameterError( + "threshold 必须在 0 到 1 之间", + suggestion="推荐值:0.5-0.8" + ) + + limit = validate_limit(limit, default=50) + + # 读取数据 + all_titles, id_to_name, _ = self.data_service.parser.read_all_titles_for_date() + + # 计算相似度 + similar_items = [] + + for platform_id, titles in all_titles.items(): + platform_name = id_to_name.get(platform_id, platform_id) + + for title, info in titles.items(): + if title == reference_title: + continue + + # 计算相似度 + similarity = self._calculate_similarity(reference_title, title) + + if similarity >= threshold: + news_item = { + "title": title, + "platform": platform_id, + "platform_name": platform_name, + "similarity": round(similarity, 3), + "rank": info["ranks"][0] if info["ranks"] else 0 + } + + # 条件性添加 URL 字段 + if include_url: + news_item["url"] = info.get("url", "") + + similar_items.append(news_item) + + # 按相似度排序 + similar_items.sort(key=lambda x: x["similarity"], reverse=True) + + # 限制数量 + result_items = similar_items[:limit] + + if not result_items: + raise DataNotFoundError( + f"未找到相似度超过 {threshold} 的新闻", + suggestion="请降低相似度阈值或尝试其他标题" + ) + + result = { + "success": True, + "summary": { + "total_found": len(similar_items), + "returned_count": len(result_items), + "requested_limit": limit, + "threshold": threshold, + "reference_title": reference_title + }, + "similar_news": result_items + } + + if len(similar_items) < limit: + result["note"] = f"相似度阈值 {threshold} 下仅找到 {len(similar_items)} 条相似新闻" + + return result + + except MCPError as e: + return { + "success": False, + "error": e.to_dict() + } + except Exception as e: + return { + "success": False, + "error": { + "code": "INTERNAL_ERROR", + "message": str(e) + } + } + + def search_by_entity( + self, + entity: str, + entity_type: Optional[str] = None, + limit: int = 50, + sort_by_weight: bool = True + ) -> Dict: + """ + 实体识别搜索 - 搜索包含特定人物/地点/机构的新闻 + + Args: + entity: 实体名称 + entity_type: 实体类型(person/location/organization),可选 + limit: 返回条数限制,默认50,最大200 + sort_by_weight: 是否按权重排序,默认True + + Returns: + 实体相关新闻列表 + + Examples: + 用户询问示例: + - "搜索马斯克相关的新闻" + - "查找关于特斯拉公司的报道,返回前20条" + - "看看北京有什么新闻" + + 代码调用示例: + >>> tools = AnalyticsTools() + >>> result = tools.search_by_entity( + ... entity="马斯克", + ... entity_type="person", + ... limit=20 + ... 
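)
+ >>> # 补充示例(示意:省略 entity_type 时按 "auto" 处理,结果默认按权重排序):
+ >>> result = tools.search_by_entity(
+ ... entity="北京",
+ ... limit=10
+ ... 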
) + >>> print(result['related_news']) + """ + try: + # 参数验证 + entity = validate_keyword(entity) + limit = validate_limit(limit, default=50) + + if entity_type and entity_type not in ["person", "location", "organization"]: + raise InvalidParameterError( + f"无效的实体类型: {entity_type}", + suggestion="支持的类型: person, location, organization" + ) + + # 读取数据 + all_titles, id_to_name, _ = self.data_service.parser.read_all_titles_for_date() + + # 搜索包含实体的新闻 + related_news = [] + entity_context = Counter() # 统计实体周边的词 + + for platform_id, titles in all_titles.items(): + platform_name = id_to_name.get(platform_id, platform_id) + + for title, info in titles.items(): + if entity in title: + url = info.get("url", "") + mobile_url = info.get("mobileUrl", "") + ranks = info.get("ranks", []) + count = len(ranks) + + related_news.append({ + "title": title, + "platform": platform_id, + "platform_name": platform_name, + "url": url, + "mobileUrl": mobile_url, + "ranks": ranks, + "count": count, + "rank": ranks[0] if ranks else 999 + }) + + # 提取实体周边的关键词 + keywords = self._extract_keywords(title) + entity_context.update(keywords) + + if not related_news: + raise DataNotFoundError( + f"未找到包含实体 '{entity}' 的新闻", + suggestion="请尝试其他实体名称" + ) + + # 移除实体本身 + if entity in entity_context: + del entity_context[entity] + + # 按权重排序(如果启用) + if sort_by_weight: + related_news.sort( + key=lambda x: calculate_news_weight(x), + reverse=True + ) + else: + # 按排名排序 + related_news.sort(key=lambda x: x["rank"]) + + # 限制返回数量 + result_news = related_news[:limit] + + return { + "success": True, + "entity": entity, + "entity_type": entity_type or "auto", + "related_news": result_news, + "total_found": len(related_news), + "returned_count": len(result_news), + "sorted_by_weight": sort_by_weight, + "related_keywords": [ + {"keyword": k, "count": v} + for k, v in entity_context.most_common(10) + ] + } + + except MCPError as e: + return { + "success": False, + "error": e.to_dict() + } + except Exception as e: + return { + "success": False, + "error": { + "code": "INTERNAL_ERROR", + "message": str(e) + } + } + + def generate_summary_report( + self, + report_type: str = "daily", + date_range: Optional[Dict[str, str]] = None + ) -> Dict: + """ + 每日/每周摘要生成器 - 自动生成热点摘要报告 + + Args: + report_type: 报告类型(daily/weekly) + date_range: 自定义日期范围(可选) + + Returns: + Markdown格式的摘要报告 + + Examples: + 用户询问示例: + - "生成今天的新闻摘要报告" + - "给我一份本周的热点总结" + - "生成过去7天的新闻分析报告" + + 代码调用示例: + >>> tools = AnalyticsTools() + >>> result = tools.generate_summary_report( + ... report_type="daily" + ... 
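)
+ >>> # 补充示例(示意:weekly 模式默认统计最近 7 天,也可通过 date_range 自定义区间):
+ >>> result = tools.generate_summary_report(
+ ... report_type="weekly"
+ ... 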
) + >>> print(result['markdown_report']) + """ + try: + # 参数验证 + if report_type not in ["daily", "weekly"]: + raise InvalidParameterError( + f"无效的报告类型: {report_type}", + suggestion="支持的类型: daily, weekly" + ) + + # 确定日期范围 + if date_range: + date_range_tuple = validate_date_range(date_range) + start_date, end_date = date_range_tuple + else: + if report_type == "daily": + start_date = end_date = datetime.now() + else: # weekly + end_date = datetime.now() + start_date = end_date - timedelta(days=6) + + # 收集数据 + all_keywords = Counter() + all_platforms_news = defaultdict(int) + all_titles_list = [] + + current_date = start_date + while current_date <= end_date: + try: + all_titles, id_to_name, _ = self.data_service.parser.read_all_titles_for_date( + date=current_date + ) + + for platform_id, titles in all_titles.items(): + platform_name = id_to_name.get(platform_id, platform_id) + all_platforms_news[platform_name] += len(titles) + + for title in titles.keys(): + all_titles_list.append({ + "title": title, + "platform": platform_name, + "date": current_date.strftime("%Y-%m-%d") + }) + + # 提取关键词 + keywords = self._extract_keywords(title) + all_keywords.update(keywords) + + except DataNotFoundError: + pass + + current_date += timedelta(days=1) + + # 生成报告 + report_title = f"{'每日' if report_type == 'daily' else '每周'}新闻热点摘要" + date_str = f"{start_date.strftime('%Y-%m-%d')}" if report_type == "daily" else f"{start_date.strftime('%Y-%m-%d')} 至 {end_date.strftime('%Y-%m-%d')}" + + # 构建Markdown报告 + markdown = f"""# {report_title} + +**报告日期**: {date_str} +**生成时间**: {datetime.now().strftime('%Y-%m-%d %H:%M:%S')} + +--- + +## 📊 数据概览 + +- **总新闻数**: {len(all_titles_list)} +- **覆盖平台**: {len(all_platforms_news)} +- **热门关键词数**: {len(all_keywords)} + +## 🔥 TOP 10 热门话题 + +""" + + # 添加TOP 10关键词 + for i, (keyword, count) in enumerate(all_keywords.most_common(10), 1): + markdown += f"{i}. 
**{keyword}** - 出现 {count} 次\n" + + # 平台分析 + markdown += "\n## 📱 平台活跃度\n\n" + sorted_platforms = sorted(all_platforms_news.items(), key=lambda x: x[1], reverse=True) + + for platform, count in sorted_platforms: + markdown += f"- **{platform}**: {count} 条新闻\n" + + # 趋势变化(如果是周报) + if report_type == "weekly": + markdown += "\n## 📈 趋势分析\n\n" + markdown += "本周热度持续的话题(样本数据):\n\n" + + # 简单的趋势分析 + top_keywords = [kw for kw, _ in all_keywords.most_common(5)] + for keyword in top_keywords: + markdown += f"- **{keyword}**: 持续热门\n" + + # 添加样本新闻(按权重选择,确保确定性) + markdown += "\n## 📰 精选新闻样本\n\n" + + # 确定性选取:按标题的权重排序,取前5条 + # 这样相同输入总是返回相同结果 + if all_titles_list: + # 计算每条新闻的权重分数(基于关键词出现次数) + news_with_scores = [] + for news in all_titles_list: + # 简单权重:统计包含TOP关键词的次数 + score = 0 + title_lower = news['title'].lower() + for keyword, count in all_keywords.most_common(10): + if keyword.lower() in title_lower: + score += count + news_with_scores.append((news, score)) + + # 按权重降序排序,权重相同则按标题字母顺序(确保确定性) + news_with_scores.sort(key=lambda x: (-x[1], x[0]['title'])) + + # 取前5条 + sample_news = [item[0] for item in news_with_scores[:5]] + + for news in sample_news: + markdown += f"- [{news['platform']}] {news['title']}\n" + + markdown += "\n---\n\n*本报告由 TrendRadar MCP 自动生成*\n" + + return { + "success": True, + "report_type": report_type, + "date_range": { + "start": start_date.strftime("%Y-%m-%d"), + "end": end_date.strftime("%Y-%m-%d") + }, + "markdown_report": markdown, + "statistics": { + "total_news": len(all_titles_list), + "platforms_count": len(all_platforms_news), + "keywords_count": len(all_keywords), + "top_keyword": all_keywords.most_common(1)[0] if all_keywords else None + } + } + + except MCPError as e: + return { + "success": False, + "error": e.to_dict() + } + except Exception as e: + return { + "success": False, + "error": { + "code": "INTERNAL_ERROR", + "message": str(e) + } + } + + def get_platform_activity_stats( + self, + date_range: Optional[Dict[str, str]] = None + ) -> Dict: + """ + 平台活跃度统计 - 统计各平台的发布频率和活跃时间段 + + Args: + date_range: 日期范围(可选) + + Returns: + 平台活跃度统计结果 + + Examples: + 用户询问示例: + - "统计各平台今天的活跃度" + - "看看哪个平台更新最频繁" + - "分析各平台的发布时间规律" + + 代码调用示例: + >>> tools = AnalyticsTools() + >>> result = tools.get_platform_activity_stats( + ... date_range={"start": "2025-10-01", "end": "2025-10-11"} + ... 
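)
+ >>> # 补充示例(示意:单日统计;日期仅为假设,需本地存在该日数据):
+ >>> result = tools.get_platform_activity_stats(
+ ... date_range={"start": "2025-10-11", "end": "2025-10-11"}
+ ... 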
) + >>> print(result['platform_activity']) + """ + try: + # 参数验证 + date_range_tuple = validate_date_range(date_range) + + # 确定日期范围 + if date_range_tuple: + start_date, end_date = date_range_tuple + else: + start_date = end_date = datetime.now() + + # 统计各平台活跃度 + platform_activity = defaultdict(lambda: { + "total_updates": 0, + "days_active": set(), + "news_count": 0, + "hourly_distribution": Counter() + }) + + # 遍历日期范围 + current_date = start_date + while current_date <= end_date: + try: + all_titles, id_to_name, timestamps = self.data_service.parser.read_all_titles_for_date( + date=current_date + ) + + for platform_id, titles in all_titles.items(): + platform_name = id_to_name.get(platform_id, platform_id) + + platform_activity[platform_name]["news_count"] += len(titles) + platform_activity[platform_name]["days_active"].add(current_date.strftime("%Y-%m-%d")) + + # 统计更新次数(基于文件数量) + platform_activity[platform_name]["total_updates"] += len(timestamps) + + # 统计时间分布(基于文件名中的时间) + for filename in timestamps.keys(): + # 解析文件名中的小时(格式:HHMM.txt) + match = re.match(r'(\d{2})(\d{2})\.txt', filename) + if match: + hour = int(match.group(1)) + platform_activity[platform_name]["hourly_distribution"][hour] += 1 + + except DataNotFoundError: + pass + + current_date += timedelta(days=1) + + # 转换为可序列化的格式 + result_activity = {} + for platform, stats in platform_activity.items(): + days_count = len(stats["days_active"]) + avg_news_per_day = stats["news_count"] / days_count if days_count > 0 else 0 + + # 找出最活跃的时间段 + most_active_hours = stats["hourly_distribution"].most_common(3) + + result_activity[platform] = { + "total_updates": stats["total_updates"], + "news_count": stats["news_count"], + "days_active": days_count, + "avg_news_per_day": round(avg_news_per_day, 2), + "most_active_hours": [ + {"hour": f"{hour:02d}:00", "count": count} + for hour, count in most_active_hours + ], + "activity_score": round(stats["news_count"] / max(days_count, 1), 2) + } + + # 按活跃度排序 + sorted_platforms = sorted( + result_activity.items(), + key=lambda x: x[1]["activity_score"], + reverse=True + ) + + return { + "success": True, + "date_range": { + "start": start_date.strftime("%Y-%m-%d"), + "end": end_date.strftime("%Y-%m-%d") + }, + "platform_activity": dict(sorted_platforms), + "most_active_platform": sorted_platforms[0][0] if sorted_platforms else None, + "total_platforms": len(result_activity) + } + + except MCPError as e: + return { + "success": False, + "error": e.to_dict() + } + except Exception as e: + return { + "success": False, + "error": { + "code": "INTERNAL_ERROR", + "message": str(e) + } + } + + def analyze_topic_lifecycle( + self, + topic: str, + lookback_days: int = 7 + ) -> Dict: + """ + 话题生命周期分析 - 追踪话题从出现到消失的完整周期 + + Args: + topic: 话题关键词 + lookback_days: 回溯天数 + + Returns: + 话题生命周期分析结果 + + Examples: + 用户询问示例: + - "分析'人工智能'这个话题的生命周期" + - "看看'iPhone'话题是昙花一现还是持续热点" + - "追踪'比特币'话题的热度变化" + + 代码调用示例: + >>> tools = AnalyticsTools() + >>> result = tools.analyze_topic_lifecycle( + ... topic="人工智能", + ... lookback_days=7 + ... 
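)
+ >>> # 补充示例(示意:lookback_days 最大 30;分析结论位于返回值的 analysis 字段下):
+ >>> result = tools.analyze_topic_lifecycle(
+ ... topic="比特币",
+ ... lookback_days=14
+ ... 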
) + >>> print(result['lifecycle_stage']) + """ + try: + # 参数验证 + topic = validate_keyword(topic) + lookback_days = validate_limit(lookback_days, default=7, max_limit=30) + + # 收集话题历史数据 + lifecycle_data = [] + start_date = datetime.now() - timedelta(days=lookback_days) + + current_date = start_date + while current_date <= datetime.now(): + try: + all_titles, _, _ = self.data_service.parser.read_all_titles_for_date( + date=current_date + ) + + # 统计该日的话题出现次数 + count = 0 + for _, titles in all_titles.items(): + for title in titles.keys(): + if topic.lower() in title.lower(): + count += 1 + + lifecycle_data.append({ + "date": current_date.strftime("%Y-%m-%d"), + "count": count + }) + + except DataNotFoundError: + lifecycle_data.append({ + "date": current_date.strftime("%Y-%m-%d"), + "count": 0 + }) + + current_date += timedelta(days=1) + + # 分析生命周期阶段 + counts = [item["count"] for item in lifecycle_data] + + if not any(counts): + raise DataNotFoundError( + f"在过去 {lookback_days} 天内未找到话题 '{topic}'", + suggestion="请尝试其他话题或扩大时间范围" + ) + + # 找到首次出现和最后出现 + first_appearance = next((item["date"] for item in lifecycle_data if item["count"] > 0), None) + last_appearance = next((item["date"] for item in reversed(lifecycle_data) if item["count"] > 0), None) + + # 计算峰值 + max_count = max(counts) + peak_index = counts.index(max_count) + peak_date = lifecycle_data[peak_index]["date"] + + # 计算平均值和标准差(简单实现) + non_zero_counts = [c for c in counts if c > 0] + avg_count = sum(non_zero_counts) / len(non_zero_counts) if non_zero_counts else 0 + + # 判断生命周期阶段 + recent_counts = counts[-3:] # 最近3天 + early_counts = counts[:3] # 前3天 + + if sum(recent_counts) > sum(early_counts): + lifecycle_stage = "上升期" + elif sum(recent_counts) < sum(early_counts) * 0.5: + lifecycle_stage = "衰退期" + elif max_count in recent_counts: + lifecycle_stage = "爆发期" + else: + lifecycle_stage = "稳定期" + + # 分类:昙花一现 vs 持续热点 + active_days = sum(1 for c in counts if c > 0) + + if active_days <= 2 and max_count > avg_count * 2: + topic_type = "昙花一现" + elif active_days >= lookback_days * 0.6: + topic_type = "持续热点" + else: + topic_type = "周期性热点" + + return { + "success": True, + "topic": topic, + "lookback_days": lookback_days, + "lifecycle_data": lifecycle_data, + "analysis": { + "first_appearance": first_appearance, + "last_appearance": last_appearance, + "peak_date": peak_date, + "peak_count": max_count, + "active_days": active_days, + "avg_daily_mentions": round(avg_count, 2), + "lifecycle_stage": lifecycle_stage, + "topic_type": topic_type + } + } + + except MCPError as e: + return { + "success": False, + "error": e.to_dict() + } + except Exception as e: + return { + "success": False, + "error": { + "code": "INTERNAL_ERROR", + "message": str(e) + } + } + + def detect_viral_topics( + self, + threshold: float = 3.0, + time_window: int = 24 + ) -> Dict: + """ + 异常热度检测 - 自动识别突然爆火的话题 + + Args: + threshold: 热度突增倍数阈值 + time_window: 检测时间窗口(小时) + + Returns: + 爆火话题列表 + + Examples: + 用户询问示例: + - "检测今天有哪些突然爆火的话题" + - "看看有没有热度异常的新闻" + - "预警可能的重大事件" + + 代码调用示例: + >>> tools = AnalyticsTools() + >>> result = tools.detect_viral_topics( + ... threshold=3.0, + ... time_window=24 + ... 
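)
+ >>> # 补充示例(示意:threshold 需 >= 1.0,数值越大越严格;以昨天的数据为对比基准):
+ >>> result = tools.detect_viral_topics(
+ ... threshold=5.0
+ ... 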
) + >>> print(result['viral_topics']) + """ + try: + # 参数验证 + if threshold < 1.0: + raise InvalidParameterError( + "threshold 必须大于等于 1.0", + suggestion="推荐值:2.0-5.0" + ) + + time_window = validate_limit(time_window, default=24, max_limit=72) + + # 读取当前和之前的数据 + current_all_titles, _, _ = self.data_service.parser.read_all_titles_for_date() + + # 读取昨天的数据作为基准 + yesterday = datetime.now() - timedelta(days=1) + try: + previous_all_titles, _, _ = self.data_service.parser.read_all_titles_for_date( + date=yesterday + ) + except DataNotFoundError: + previous_all_titles = {} + + # 统计当前的关键词频率 + current_keywords = Counter() + current_keyword_titles = defaultdict(list) + + for _, titles in current_all_titles.items(): + for title in titles.keys(): + keywords = self._extract_keywords(title) + current_keywords.update(keywords) + + for kw in keywords: + current_keyword_titles[kw].append(title) + + # 统计之前的关键词频率 + previous_keywords = Counter() + + for _, titles in previous_all_titles.items(): + for title in titles.keys(): + keywords = self._extract_keywords(title) + previous_keywords.update(keywords) + + # 检测异常热度 + viral_topics = [] + + for keyword, current_count in current_keywords.items(): + previous_count = previous_keywords.get(keyword, 0) + + # 计算增长倍数 + if previous_count == 0: + # 新出现的话题 + if current_count >= 5: # 至少出现5次才认为是爆火 + growth_rate = float('inf') + is_viral = True + else: + continue + else: + growth_rate = current_count / previous_count + is_viral = growth_rate >= threshold + + if is_viral: + viral_topics.append({ + "keyword": keyword, + "current_count": current_count, + "previous_count": previous_count, + "growth_rate": round(growth_rate, 2) if growth_rate != float('inf') else "新话题", + "sample_titles": current_keyword_titles[keyword][:3], + "alert_level": "高" if growth_rate > threshold * 2 else "中" + }) + + # 按增长率排序 + viral_topics.sort( + key=lambda x: x["current_count"] if x["growth_rate"] == "新话题" else x["growth_rate"], + reverse=True + ) + + if not viral_topics: + return { + "success": True, + "viral_topics": [], + "total_detected": 0, + "message": f"未检测到热度增长超过 {threshold} 倍的话题" + } + + return { + "success": True, + "viral_topics": viral_topics, + "total_detected": len(viral_topics), + "threshold": threshold, + "time_window": time_window, + "detection_time": datetime.now().strftime("%Y-%m-%d %H:%M:%S") + } + + except MCPError as e: + return { + "success": False, + "error": e.to_dict() + } + except Exception as e: + return { + "success": False, + "error": { + "code": "INTERNAL_ERROR", + "message": str(e) + } + } + + def predict_trending_topics( + self, + lookahead_hours: int = 6, + confidence_threshold: float = 0.7 + ) -> Dict: + """ + 话题预测 - 基于历史数据预测未来可能的热点 + + Args: + lookahead_hours: 预测未来多少小时 + confidence_threshold: 置信度阈值 + + Returns: + 预测的潜力话题列表 + + Examples: + 用户询问示例: + - "预测接下来6小时可能的热点话题" + - "有哪些话题可能会火起来" + - "早期发现潜力话题" + + 代码调用示例: + >>> tools = AnalyticsTools() + >>> result = tools.predict_trending_topics( + ... lookahead_hours=6, + ... confidence_threshold=0.7 + ... 
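)
+ >>> # 补充示例(示意:confidence_threshold 需在 0-1 之间,调低可获得更多候选话题):
+ >>> result = tools.predict_trending_topics(
+ ... confidence_threshold=0.6
+ ... 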
) + >>> print(result['predicted_topics']) + """ + try: + # 参数验证 + lookahead_hours = validate_limit(lookahead_hours, default=6, max_limit=48) + + if not 0 <= confidence_threshold <= 1: + raise InvalidParameterError( + "confidence_threshold 必须在 0 到 1 之间", + suggestion="推荐值:0.6-0.8" + ) + + # 收集最近3天的数据用于预测 + keyword_trends = defaultdict(list) + + for days_ago in range(3, 0, -1): + date = datetime.now() - timedelta(days=days_ago) + + try: + all_titles, _, _ = self.data_service.parser.read_all_titles_for_date( + date=date + ) + + # 统计关键词 + keywords_count = Counter() + for _, titles in all_titles.items(): + for title in titles.keys(): + keywords = self._extract_keywords(title) + keywords_count.update(keywords) + + # 记录每个关键词的历史数据 + for keyword, count in keywords_count.items(): + keyword_trends[keyword].append(count) + + except DataNotFoundError: + pass + + # 添加今天的数据 + try: + all_titles, _, _ = self.data_service.parser.read_all_titles_for_date() + + keywords_count = Counter() + keyword_titles = defaultdict(list) + + for _, titles in all_titles.items(): + for title in titles.keys(): + keywords = self._extract_keywords(title) + keywords_count.update(keywords) + + for kw in keywords: + keyword_titles[kw].append(title) + + for keyword, count in keywords_count.items(): + keyword_trends[keyword].append(count) + + except DataNotFoundError: + raise DataNotFoundError( + "未找到今天的数据", + suggestion="请等待爬虫任务完成" + ) + + # 预测潜力话题 + predicted_topics = [] + + for keyword, trend_data in keyword_trends.items(): + if len(trend_data) < 2: + continue + + # 简单的线性趋势预测 + # 计算增长率 + recent_value = trend_data[-1] + previous_value = trend_data[-2] if len(trend_data) >= 2 else 0 + + if previous_value == 0: + if recent_value >= 3: + growth_rate = 1.0 + else: + continue + else: + growth_rate = (recent_value - previous_value) / previous_value + + # 判断是否是上升趋势 + if growth_rate > 0.3: # 增长超过30% + # 计算置信度(基于趋势的稳定性) + if len(trend_data) >= 3: + # 检查是否连续增长 + is_consistent = all( + trend_data[i] <= trend_data[i+1] + for i in range(len(trend_data)-1) + ) + confidence = 0.9 if is_consistent else 0.7 + else: + confidence = 0.6 + + if confidence >= confidence_threshold: + predicted_topics.append({ + "keyword": keyword, + "current_count": recent_value, + "growth_rate": round(growth_rate * 100, 2), + "confidence": round(confidence, 2), + "trend_data": trend_data, + "prediction": "上升趋势,可能成为热点", + "sample_titles": keyword_titles.get(keyword, [])[:3] + }) + + # 按置信度和增长率排序 + predicted_topics.sort( + key=lambda x: (x["confidence"], x["growth_rate"]), + reverse=True + ) + + return { + "success": True, + "predicted_topics": predicted_topics[:20], # 返回TOP 20 + "total_predicted": len(predicted_topics), + "lookahead_hours": lookahead_hours, + "confidence_threshold": confidence_threshold, + "prediction_time": datetime.now().strftime("%Y-%m-%d %H:%M:%S"), + "note": "预测基于历史趋势,实际结果可能有偏差" + } + + except MCPError as e: + return { + "success": False, + "error": e.to_dict() + } + except Exception as e: + return { + "success": False, + "error": { + "code": "INTERNAL_ERROR", + "message": str(e) + } + } + + # ==================== 辅助方法 ==================== + + def _parse_time_range(self, time_range: str) -> int: + """解析时间范围字符串为天数""" + match = re.match(r'(\d+)([dhwm])', time_range.lower()) + if not match: + raise InvalidParameterError( + f"无效的时间范围格式: {time_range}", + suggestion="格式示例:7d(7天)、24h(24小时)、1w(1周)、1m(1个月)、2m(2个月)" + ) + + value = int(match.group(1)) + unit = match.group(2) + + if unit == 'h': + return max(1, value // 24) # 转换为天数 + elif unit == 'd': + 
return value + elif unit == 'w': + return value * 7 + elif unit == 'm': + return value * 30 # 1个月按30天计算 + + return value + + def _extract_keywords(self, title: str, min_length: int = 2) -> List[str]: + """ + 从标题中提取关键词(简单实现) + + Args: + title: 标题文本 + min_length: 最小关键词长度 + + Returns: + 关键词列表 + """ + # 移除URL和特殊字符 + title = re.sub(r'http[s]?://\S+', '', title) + title = re.sub(r'[^\w\s]', ' ', title) + + # 简单分词(按空格和常见分隔符) + words = re.split(r'[\s,。!?、]+', title) + + # 过滤停用词和短词 + stopwords = {'的', '了', '在', '是', '我', '有', '和', '就', '不', '人', '都', '一', '一个', '上', '也', '很', '到', '说', '要', '去', '你', '会', '着', '没有', '看', '好', '自己', '这'} + + keywords = [ + word.strip() for word in words + if word.strip() and len(word.strip()) >= min_length and word.strip() not in stopwords + ] + + return keywords + + def _calculate_similarity(self, text1: str, text2: str) -> float: + """ + 计算两个文本的相似度 + + Args: + text1: 文本1 + text2: 文本2 + + Returns: + 相似度分数(0-1之间) + """ + # 使用 SequenceMatcher 计算相似度 + return SequenceMatcher(None, text1, text2).ratio() + + def _find_unique_topics(self, platform_stats: Dict) -> Dict[str, List[str]]: + """ + 找出各平台独有的热点话题 + + Args: + platform_stats: 平台统计数据 + + Returns: + 各平台独有话题字典 + """ + unique_topics = {} + + # 获取每个平台的TOP关键词 + platform_keywords = {} + for platform, stats in platform_stats.items(): + top_keywords = set([kw for kw, _ in stats["top_keywords"].most_common(10)]) + platform_keywords[platform] = top_keywords + + # 找出独有关键词 + for platform, keywords in platform_keywords.items(): + # 找出其他平台的所有关键词 + other_keywords = set() + for other_platform, other_kws in platform_keywords.items(): + if other_platform != platform: + other_keywords.update(other_kws) + + # 找出独有的 + unique = keywords - other_keywords + if unique: + unique_topics[platform] = list(unique)[:5] # 最多5个 + + return unique_topics diff --git a/mcp_server/tools/config_mgmt.py b/mcp_server/tools/config_mgmt.py new file mode 100644 index 0000000..25ab7f5 --- /dev/null +++ b/mcp_server/tools/config_mgmt.py @@ -0,0 +1,66 @@ +""" +配置管理工具 + +实现配置查询和管理功能。 +""" + +from typing import Dict, Optional + +from ..services.data_service import DataService +from ..utils.validators import validate_config_section +from ..utils.errors import MCPError + + +class ConfigManagementTools: + """配置管理工具类""" + + def __init__(self, project_root: str = None): + """ + 初始化配置管理工具 + + Args: + project_root: 项目根目录 + """ + self.data_service = DataService(project_root) + + def get_current_config(self, section: Optional[str] = None) -> Dict: + """ + 获取当前系统配置 + + Args: + section: 配置节 - all/crawler/push/keywords/weights,默认all + + Returns: + 配置字典 + + Example: + >>> tools = ConfigManagementTools() + >>> result = tools.get_current_config(section="crawler") + >>> print(result['crawler']['platforms']) + """ + try: + # 参数验证 + section = validate_config_section(section) + + # 获取配置 + config = self.data_service.get_current_config(section=section) + + return { + "config": config, + "section": section, + "success": True + } + + except MCPError as e: + return { + "success": False, + "error": e.to_dict() + } + except Exception as e: + return { + "success": False, + "error": { + "code": "INTERNAL_ERROR", + "message": str(e) + } + } diff --git a/mcp_server/tools/data_query.py b/mcp_server/tools/data_query.py new file mode 100644 index 0000000..49504da --- /dev/null +++ b/mcp_server/tools/data_query.py @@ -0,0 +1,284 @@ +""" +数据查询工具 + +实现P0核心的数据查询工具。 +""" + +from typing import Dict, List, Optional + +from ..services.data_service import DataService +from ..utils.validators import ( + 
validate_platforms, + validate_limit, + validate_keyword, + validate_date_range, + validate_top_n, + validate_mode, + validate_date_query +) +from ..utils.errors import MCPError + + +class DataQueryTools: + """数据查询工具类""" + + def __init__(self, project_root: str = None): + """ + 初始化数据查询工具 + + Args: + project_root: 项目根目录 + """ + self.data_service = DataService(project_root) + + def get_latest_news( + self, + platforms: Optional[List[str]] = None, + limit: Optional[int] = None, + include_url: bool = False + ) -> Dict: + """ + 获取最新一批爬取的新闻数据 + + Args: + platforms: 平台ID列表,如 ['zhihu', 'weibo'] + limit: 返回条数限制,默认20 + include_url: 是否包含URL链接,默认False(节省token) + + Returns: + 新闻列表字典 + + Example: + >>> tools = DataQueryTools() + >>> result = tools.get_latest_news(platforms=['zhihu'], limit=10) + >>> print(result['total']) + 10 + """ + try: + # 参数验证 + platforms = validate_platforms(platforms) + limit = validate_limit(limit, default=50) + + # 获取数据 + news_list = self.data_service.get_latest_news( + platforms=platforms, + limit=limit, + include_url=include_url + ) + + return { + "news": news_list, + "total": len(news_list), + "platforms": platforms, + "success": True + } + + except MCPError as e: + return { + "success": False, + "error": e.to_dict() + } + except Exception as e: + return { + "success": False, + "error": { + "code": "INTERNAL_ERROR", + "message": str(e) + } + } + + def search_news_by_keyword( + self, + keyword: str, + date_range: Optional[Dict] = None, + platforms: Optional[List[str]] = None, + limit: Optional[int] = None + ) -> Dict: + """ + 按关键词搜索历史新闻 + + Args: + keyword: 搜索关键词(必需) + date_range: 日期范围,格式: {"start": "YYYY-MM-DD", "end": "YYYY-MM-DD"} + platforms: 平台过滤列表 + limit: 返回条数限制(可选,默认返回所有) + + Returns: + 搜索结果字典 + + Example: + >>> tools = DataQueryTools() + >>> result = tools.search_news_by_keyword( + ... keyword="人工智能", + ... date_range={"start": "2025-10-01", "end": "2025-10-11"}, + ... limit=50 + ... 
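)
+ >>> # 补充示例(示意:不传 date_range 和 limit 时的最简调用,默认返回所有匹配结果):
+ >>> result = tools.search_news_by_keyword(
+ ... keyword="新能源汽车"
+ ... 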
) + >>> print(result['total']) + """ + try: + # 参数验证 + keyword = validate_keyword(keyword) + date_range_tuple = validate_date_range(date_range) + platforms = validate_platforms(platforms) + + if limit is not None: + limit = validate_limit(limit, default=100) + + # 搜索数据 + search_result = self.data_service.search_news_by_keyword( + keyword=keyword, + date_range=date_range_tuple, + platforms=platforms, + limit=limit + ) + + return { + **search_result, + "success": True + } + + except MCPError as e: + return { + "success": False, + "error": e.to_dict() + } + except Exception as e: + return { + "success": False, + "error": { + "code": "INTERNAL_ERROR", + "message": str(e) + } + } + + def get_trending_topics( + self, + top_n: Optional[int] = None, + mode: Optional[str] = None + ) -> Dict: + """ + 获取个人关注词的新闻出现频率统计 + + 注意:本工具基于 config/frequency_words.txt 中的个人关注词列表进行统计, + 而不是自动从新闻中提取热点话题。这是一个个人可定制的关注词列表, + 用户可以根据自己的兴趣添加或删除关注词。 + + Args: + top_n: 返回TOP N关注词,默认10 + mode: 模式 - daily(当日累计), current(最新一批), incremental(增量) + + Returns: + 关注词频率统计字典,包含每个关注词在新闻中出现的次数 + + Example: + >>> tools = DataQueryTools() + >>> result = tools.get_trending_topics(top_n=5, mode="current") + >>> print(len(result['topics'])) + 5 + >>> # 返回的是你在 frequency_words.txt 中设置的关注词的频率统计 + """ + try: + # 参数验证 + top_n = validate_top_n(top_n, default=10) + valid_modes = ["daily", "current", "incremental"] + mode = validate_mode(mode, valid_modes, default="current") + + # 获取趋势话题 + trending_result = self.data_service.get_trending_topics( + top_n=top_n, + mode=mode + ) + + return { + **trending_result, + "success": True + } + + except MCPError as e: + return { + "success": False, + "error": e.to_dict() + } + except Exception as e: + return { + "success": False, + "error": { + "code": "INTERNAL_ERROR", + "message": str(e) + } + } + + def get_news_by_date( + self, + date_query: Optional[str] = None, + platforms: Optional[List[str]] = None, + limit: Optional[int] = None, + include_url: bool = False + ) -> Dict: + """ + 按日期查询新闻,支持自然语言日期 + + Args: + date_query: 日期查询字符串(可选,默认"今天"),支持: + - 相对日期:今天、昨天、前天、3天前、yesterday、3 days ago + - 星期:上周一、本周三、last monday、this friday + - 绝对日期:2025-10-10、10月10日、2025年10月10日 + platforms: 平台ID列表,如 ['zhihu', 'weibo'] + limit: 返回条数限制,默认50 + include_url: 是否包含URL链接,默认False(节省token) + + Returns: + 新闻列表字典 + + Example: + >>> tools = DataQueryTools() + >>> # 不指定日期,默认查询今天 + >>> result = tools.get_news_by_date(platforms=['zhihu'], limit=20) + >>> # 指定日期 + >>> result = tools.get_news_by_date( + ... date_query="昨天", + ... platforms=['zhihu'], + ... limit=20 + ... 
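)
+ >>> # 补充示例(示意:支持自然语言日期,如 "3天前";需本地存在该日期的数据):
+ >>> result = tools.get_news_by_date(
+ ... date_query="3天前",
+ ... platforms=['zhihu'],
+ ... limit=20
+ ... 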
) + >>> print(result['total']) + 20 + """ + try: + # 参数验证 - 默认今天 + if date_query is None: + date_query = "今天" + target_date = validate_date_query(date_query) + platforms = validate_platforms(platforms) + limit = validate_limit(limit, default=50) + + # 获取数据 + news_list = self.data_service.get_news_by_date( + target_date=target_date, + platforms=platforms, + limit=limit, + include_url=include_url + ) + + return { + "news": news_list, + "total": len(news_list), + "date": target_date.strftime("%Y-%m-%d"), + "date_query": date_query, + "platforms": platforms, + "success": True + } + + except MCPError as e: + return { + "success": False, + "error": e.to_dict() + } + except Exception as e: + return { + "success": False, + "error": { + "code": "INTERNAL_ERROR", + "message": str(e) + } + } + diff --git a/mcp_server/tools/search_tools.py b/mcp_server/tools/search_tools.py new file mode 100644 index 0000000..a68ee17 --- /dev/null +++ b/mcp_server/tools/search_tools.py @@ -0,0 +1,664 @@ +""" +智能新闻检索工具 + +提供模糊搜索、链接查询、历史相关新闻检索等高级搜索功能。 +""" + +import re +from collections import Counter +from datetime import datetime, timedelta +from difflib import SequenceMatcher +from typing import Dict, List, Optional, Tuple + +from ..services.data_service import DataService +from ..utils.validators import validate_keyword, validate_limit +from ..utils.errors import MCPError, InvalidParameterError, DataNotFoundError + + +class SearchTools: + """智能新闻检索工具类""" + + def __init__(self, project_root: str = None): + """ + 初始化智能检索工具 + + Args: + project_root: 项目根目录 + """ + self.data_service = DataService(project_root) + # 中文停用词列表 + self.stopwords = { + '的', '了', '在', '是', '我', '有', '和', '就', '不', '人', '都', '一', + '一个', '上', '也', '很', '到', '说', '要', '去', '你', '会', '着', '没有', + '看', '好', '自己', '这', '那', '来', '被', '与', '为', '对', '将', '从', + '以', '及', '等', '但', '或', '而', '于', '中', '由', '可', '可以', '已', + '已经', '还', '更', '最', '再', '因为', '所以', '如果', '虽然', '然而' + } + + def search_news_unified( + self, + query: str, + search_mode: str = "keyword", + date_range: Optional[Dict[str, str]] = None, + platforms: Optional[List[str]] = None, + limit: int = 50, + sort_by: str = "relevance", + threshold: float = 0.6, + include_url: bool = False + ) -> Dict: + """ + 统一新闻搜索工具 - 整合多种搜索模式 + + Args: + query: 查询内容(必需)- 关键词、内容片段或实体名称 + search_mode: 搜索模式,可选值: + - "keyword": 精确关键词匹配(默认) + - "fuzzy": 模糊内容匹配(使用相似度算法) + - "entity": 实体名称搜索(自动按权重排序) + date_range: 日期范围,格式: {"start": "YYYY-MM-DD", "end": "YYYY-MM-DD"} + 不指定则默认查询今天 + platforms: 平台过滤列表,如 ['zhihu', 'weibo'] + limit: 返回条数限制,默认50 + sort_by: 排序方式,可选值: + - "relevance": 按相关度排序(默认) + - "weight": 按新闻权重排序 + - "date": 按日期排序 + threshold: 相似度阈值(仅fuzzy模式有效),0-1之间,默认0.6 + include_url: 是否包含URL链接,默认False(节省token) + + Returns: + 搜索结果字典,包含匹配的新闻列表 + + Examples: + - search_news_unified(query="人工智能", search_mode="keyword") + - search_news_unified(query="特斯拉降价", search_mode="fuzzy", threshold=0.4) + - search_news_unified(query="马斯克", search_mode="entity", limit=20) + - search_news_unified(query="iPhone 16发布", search_mode="keyword") + """ + try: + # 参数验证 + query = validate_keyword(query) + + if search_mode not in ["keyword", "fuzzy", "entity"]: + raise InvalidParameterError( + f"无效的搜索模式: {search_mode}", + suggestion="支持的模式: keyword, fuzzy, entity" + ) + + if sort_by not in ["relevance", "weight", "date"]: + raise InvalidParameterError( + f"无效的排序方式: {sort_by}", + suggestion="支持的排序: relevance, weight, date" + ) + + limit = validate_limit(limit, default=50) + threshold = max(0.0, min(1.0, threshold)) + + # 处理日期范围 + if 
date_range: + from ..utils.validators import validate_date_range + date_range_tuple = validate_date_range(date_range) + start_date, end_date = date_range_tuple + else: + # 默认今天 + start_date = end_date = datetime.now() + + # 收集所有匹配的新闻 + all_matches = [] + current_date = start_date + + while current_date <= end_date: + try: + all_titles, id_to_name, timestamps = self.data_service.parser.read_all_titles_for_date( + date=current_date, + platform_ids=platforms + ) + + # 根据搜索模式执行不同的搜索逻辑 + if search_mode == "keyword": + matches = self._search_by_keyword_mode( + query, all_titles, id_to_name, current_date, include_url + ) + elif search_mode == "fuzzy": + matches = self._search_by_fuzzy_mode( + query, all_titles, id_to_name, current_date, threshold, include_url + ) + else: # entity + matches = self._search_by_entity_mode( + query, all_titles, id_to_name, current_date, include_url + ) + + all_matches.extend(matches) + + except DataNotFoundError: + # 该日期没有数据,继续下一天 + pass + + current_date += timedelta(days=1) + + if not all_matches: + time_desc = "今天" if start_date == end_date else f"{start_date.strftime('%Y-%m-%d')} 至 {end_date.strftime('%Y-%m-%d')}" + return { + "success": True, + "results": [], + "total": 0, + "query": query, + "search_mode": search_mode, + "time_range": time_desc, + "message": f"未找到匹配的新闻({time_desc})" + } + + # 统一排序逻辑 + if sort_by == "relevance": + all_matches.sort(key=lambda x: x.get("similarity_score", 1.0), reverse=True) + elif sort_by == "weight": + from .analytics import calculate_news_weight + all_matches.sort(key=lambda x: calculate_news_weight(x), reverse=True) + elif sort_by == "date": + all_matches.sort(key=lambda x: x.get("date", ""), reverse=True) + + # 限制返回数量 + results = all_matches[:limit] + + # 构建时间范围描述 + if start_date == end_date: + time_range_desc = start_date.strftime("%Y-%m-%d") + else: + time_range_desc = f"{start_date.strftime('%Y-%m-%d')} 至 {end_date.strftime('%Y-%m-%d')}" + + result = { + "success": True, + "summary": { + "total_found": len(all_matches), + "returned_count": len(results), + "requested_limit": limit, + "search_mode": search_mode, + "query": query, + "platforms": platforms or "所有平台", + "time_range": time_range_desc, + "sort_by": sort_by + }, + "results": results + } + + if search_mode == "fuzzy": + result["summary"]["threshold"] = threshold + if len(all_matches) < limit: + result["note"] = f"模糊搜索模式下,相似度阈值 {threshold} 仅匹配到 {len(all_matches)} 条结果" + + return result + + except MCPError as e: + return { + "success": False, + "error": e.to_dict() + } + except Exception as e: + return { + "success": False, + "error": { + "code": "INTERNAL_ERROR", + "message": str(e) + } + } + + def _search_by_keyword_mode( + self, + query: str, + all_titles: Dict, + id_to_name: Dict, + current_date: datetime, + include_url: bool + ) -> List[Dict]: + """ + 关键词搜索模式(精确匹配) + + Args: + query: 搜索关键词 + all_titles: 所有标题字典 + id_to_name: 平台ID到名称映射 + current_date: 当前日期 + + Returns: + 匹配的新闻列表 + """ + matches = [] + query_lower = query.lower() + + for platform_id, titles in all_titles.items(): + platform_name = id_to_name.get(platform_id, platform_id) + + for title, info in titles.items(): + # 精确包含判断 + if query_lower in title.lower(): + news_item = { + "title": title, + "platform": platform_id, + "platform_name": platform_name, + "date": current_date.strftime("%Y-%m-%d"), + "similarity_score": 1.0, # 精确匹配,相似度为1 + "ranks": info.get("ranks", []), + "count": len(info.get("ranks", [])), + "rank": info["ranks"][0] if info["ranks"] else 999 + } + + # 条件性添加 URL 字段 + if include_url: + 
news_item["url"] = info.get("url", "") + news_item["mobileUrl"] = info.get("mobileUrl", "") + + matches.append(news_item) + + return matches + + def _search_by_fuzzy_mode( + self, + query: str, + all_titles: Dict, + id_to_name: Dict, + current_date: datetime, + threshold: float, + include_url: bool + ) -> List[Dict]: + """ + 模糊搜索模式(使用相似度算法) + + Args: + query: 搜索内容 + all_titles: 所有标题字典 + id_to_name: 平台ID到名称映射 + current_date: 当前日期 + threshold: 相似度阈值 + + Returns: + 匹配的新闻列表 + """ + matches = [] + + for platform_id, titles in all_titles.items(): + platform_name = id_to_name.get(platform_id, platform_id) + + for title, info in titles.items(): + # 模糊匹配 + is_match, similarity = self._fuzzy_match(query, title, threshold) + + if is_match: + news_item = { + "title": title, + "platform": platform_id, + "platform_name": platform_name, + "date": current_date.strftime("%Y-%m-%d"), + "similarity_score": round(similarity, 4), + "ranks": info.get("ranks", []), + "count": len(info.get("ranks", [])), + "rank": info["ranks"][0] if info["ranks"] else 999 + } + + # 条件性添加 URL 字段 + if include_url: + news_item["url"] = info.get("url", "") + news_item["mobileUrl"] = info.get("mobileUrl", "") + + matches.append(news_item) + + return matches + + def _search_by_entity_mode( + self, + query: str, + all_titles: Dict, + id_to_name: Dict, + current_date: datetime, + include_url: bool + ) -> List[Dict]: + """ + 实体搜索模式(自动按权重排序) + + Args: + query: 实体名称 + all_titles: 所有标题字典 + id_to_name: 平台ID到名称映射 + current_date: 当前日期 + + Returns: + 匹配的新闻列表 + """ + matches = [] + + for platform_id, titles in all_titles.items(): + platform_name = id_to_name.get(platform_id, platform_id) + + for title, info in titles.items(): + # 实体搜索:精确包含实体名称 + if query in title: + news_item = { + "title": title, + "platform": platform_id, + "platform_name": platform_name, + "date": current_date.strftime("%Y-%m-%d"), + "similarity_score": 1.0, + "ranks": info.get("ranks", []), + "count": len(info.get("ranks", [])), + "rank": info["ranks"][0] if info["ranks"] else 999 + } + + # 条件性添加 URL 字段 + if include_url: + news_item["url"] = info.get("url", "") + news_item["mobileUrl"] = info.get("mobileUrl", "") + + matches.append(news_item) + + return matches + + def _calculate_similarity(self, text1: str, text2: str) -> float: + """ + 计算两个文本的相似度 + + Args: + text1: 文本1 + text2: 文本2 + + Returns: + 相似度分数 (0-1之间) + """ + # 使用 difflib.SequenceMatcher 计算序列相似度 + return SequenceMatcher(None, text1.lower(), text2.lower()).ratio() + + def _fuzzy_match(self, query: str, text: str, threshold: float = 0.3) -> Tuple[bool, float]: + """ + 模糊匹配函数 + + Args: + query: 查询文本 + text: 待匹配文本 + threshold: 匹配阈值 + + Returns: + (是否匹配, 相似度分数) + """ + # 直接包含判断 + if query.lower() in text.lower(): + return True, 1.0 + + # 计算整体相似度 + similarity = self._calculate_similarity(query, text) + if similarity >= threshold: + return True, similarity + + # 分词后的部分匹配 + query_words = set(self._extract_keywords(query)) + text_words = set(self._extract_keywords(text)) + + if not query_words or not text_words: + return False, 0.0 + + # 计算关键词重合度 + common_words = query_words & text_words + keyword_overlap = len(common_words) / len(query_words) + + if keyword_overlap >= 0.5: # 50%的关键词重合 + return True, keyword_overlap + + return False, similarity + + def _extract_keywords(self, text: str, min_length: int = 2) -> List[str]: + """ + 从文本中提取关键词 + + Args: + text: 输入文本 + min_length: 最小词长 + + Returns: + 关键词列表 + """ + # 移除URL和特殊字符 + text = re.sub(r'http[s]?://\S+', '', text) + text = re.sub(r'\[.*?\]', '', text) # 移除方括号内容 + + # 
    def _extract_keywords(self, text: str, min_length: int = 2) -> List[str]:
        """
        Extract keywords from text

        Args:
            text: Input text
            min_length: Minimum token length

        Returns:
            List of keywords
        """
        # Strip URLs and special characters
        text = re.sub(r'http[s]?://\S+', '', text)
        text = re.sub(r'\[.*?\]', '', text)  # drop bracketed content

        # Tokenize with a regex over word characters (covers Chinese and English)
        words = re.findall(r'[\w]+', text)

        # Filter stopwords and tokens that are too short
        keywords = [
            word for word in words
            if word and len(word) >= min_length and word not in self.stopwords
        ]

        return keywords

    def _calculate_keyword_overlap(self, keywords1: List[str], keywords2: List[str]) -> float:
        """
        Compute the overlap of two keyword lists

        Args:
            keywords1: First keyword list
            keywords2: Second keyword list

        Returns:
            Overlap score (between 0 and 1)
        """
        if not keywords1 or not keywords2:
            return 0.0

        set1 = set(keywords1)
        set2 = set(keywords2)

        # Jaccard similarity
        intersection = len(set1 & set2)
        union = len(set1 | set2)

        if union == 0:
            return 0.0

        return intersection / union
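    # Worked example for _calculate_keyword_overlap (Jaccard similarity),
    # with hypothetical keyword lists:
    #
    #   keywords1 = ["tesla", "price", "cut"]
    #   keywords2 = ["tesla", "price", "rise"]
    #   intersection = {"tesla", "price"}               -> 2
    #   union        = {"tesla", "price", "cut", "rise"} -> 4
    #   overlap      = 2 / 4 = 0.5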
    def search_related_news_history(
        self,
        reference_text: str,
        time_range: str = "yesterday",
        start_date: Optional[datetime] = None,
        end_date: Optional[datetime] = None,
        threshold: float = 0.4,
        limit: int = 50,
        include_url: bool = False
    ) -> Dict:
        """
        Search historical data for news related to the given reference text

        Args:
            reference_text: Reference news title or content
            time_range: Time-range preset, one of:
                - "yesterday": yesterday
                - "last_week": the last 7 days
                - "last_month": the last 30 days
                - "custom": custom date range (requires start_date and end_date)
            start_date: Custom start date (only used when time_range="custom")
            end_date: Custom end date (only used when time_range="custom")
            threshold: Similarity threshold (between 0 and 1), default 0.4
            limit: Maximum number of results, default 50
            include_url: Whether to include URL links, default False (saves tokens)

        Returns:
            Result dict containing the list of related news

        Example:
            >>> tools = SearchTools()
            >>> result = tools.search_related_news_history(
            ...     reference_text="人工智能技术突破",
            ...     time_range="last_week",
            ...     threshold=0.4,
            ...     limit=50
            ... )
            >>> for news in result['results']:
            ...     print(f"{news['date']}: {news['title']} (相似度: {news['similarity_score']})")
        """
        try:
            # Parameter validation
            reference_text = validate_keyword(reference_text)
            threshold = max(0.0, min(1.0, threshold))
            limit = validate_limit(limit, default=50)

            # Resolve the query date range
            today = datetime.now()

            if time_range == "yesterday":
                search_start = today - timedelta(days=1)
                search_end = today - timedelta(days=1)
            elif time_range == "last_week":
                search_start = today - timedelta(days=7)
                search_end = today - timedelta(days=1)
            elif time_range == "last_month":
                search_start = today - timedelta(days=30)
                search_end = today - timedelta(days=1)
            elif time_range == "custom":
                if not start_date or not end_date:
                    raise InvalidParameterError(
                        "自定义时间范围需要提供 start_date 和 end_date",
                        suggestion="请提供 start_date 和 end_date 参数"
                    )
                search_start = start_date
                search_end = end_date
            else:
                raise InvalidParameterError(
                    f"不支持的时间范围: {time_range}",
                    suggestion="请使用 'yesterday', 'last_week', 'last_month' 或 'custom'"
                )

            # Extract keywords from the reference text
            reference_keywords = self._extract_keywords(reference_text)

            if not reference_keywords:
                raise InvalidParameterError(
                    "无法从参考文本中提取关键词",
                    suggestion="请提供更详细的文本内容"
                )

            # Collect all related news
            all_related_news = []
            current_date = search_start

            while current_date <= search_end:
                try:
                    # Read the data for this date
                    all_titles, id_to_name, _ = self.data_service.parser.read_all_titles_for_date(current_date)

                    # Scan for related news
                    for platform_id, titles in all_titles.items():
                        platform_name = id_to_name.get(platform_id, platform_id)

                        for title, info in titles.items():
                            # Whole-title text similarity
                            title_similarity = self._calculate_similarity(reference_text, title)

                            # Keywords of the candidate title
                            title_keywords = self._extract_keywords(title)

                            # Keyword overlap between reference and title
                            keyword_overlap = self._calculate_keyword_overlap(
                                reference_keywords,
                                title_keywords
                            )

                            # Combined score (70% keyword overlap + 30% text similarity)
                            combined_score = keyword_overlap * 0.7 + title_similarity * 0.3

                            if combined_score >= threshold:
                                news_item = {
                                    "title": title,
                                    "platform": platform_id,
                                    "platform_name": platform_name,
                                    "date": current_date.strftime("%Y-%m-%d"),
                                    "similarity_score": round(combined_score, 4),
                                    "keyword_overlap": round(keyword_overlap, 4),
                                    "text_similarity": round(title_similarity, 4),
                                    "common_keywords": list(set(reference_keywords) & set(title_keywords)),
                                    "rank": info["ranks"][0] if info.get("ranks") else 0
                                }

                                # Conditionally attach URL fields
                                if include_url:
                                    news_item["url"] = info.get("url", "")
                                    news_item["mobileUrl"] = info.get("mobileUrl", "")

                                all_related_news.append(news_item)

                except DataNotFoundError:
                    # No data for this date; continue with the next day
                    pass
                except Exception as e:
                    # Log the error but keep processing the remaining dates
                    print(f"Warning: 处理日期 {current_date.strftime('%Y-%m-%d')} 时出错: {e}")

                # Advance to the next day
                current_date += timedelta(days=1)

            if not all_related_news:
                return {
                    "success": True,
                    "results": [],
                    "total": 0,
                    "query": reference_text,
                    "time_range": time_range,
                    "date_range": {
                        "start": search_start.strftime("%Y-%m-%d"),
                        "end": search_end.strftime("%Y-%m-%d")
                    },
                    "message": "未找到相关新闻"
                }

            # Sort by combined similarity
            all_related_news.sort(key=lambda x: x["similarity_score"], reverse=True)

            # Cap the number of returned items
            results = all_related_news[:limit]

            # Aggregate statistics
            platform_distribution = Counter([news["platform"] for news in all_related_news])
            date_distribution = Counter([news["date"] for news in all_related_news])

            result = {
                "success": True,
                "summary": {
                    "total_found": len(all_related_news),
                    "returned_count": len(results),
                    "requested_limit": limit,
                    "threshold": threshold,
                    "reference_text": reference_text,
                    "reference_keywords": reference_keywords,
                    "time_range": time_range,
                    "date_range": {
                        "start": search_start.strftime("%Y-%m-%d"),
                        "end": search_end.strftime("%Y-%m-%d")
                    }
                },
                "results": results,
                "statistics": {
                    "platform_distribution": dict(platform_distribution),
                    "date_distribution": dict(date_distribution),
                    "avg_similarity": round(
                        sum([news["similarity_score"] for news in all_related_news]) / len(all_related_news),
                        4
                    ) if all_related_news else 0.0
                }
            }

            if len(all_related_news) < limit:
                result["note"] = f"相关性阈值 {threshold} 下仅找到 {len(all_related_news)} 条相关新闻"

            return result

        except MCPError as e:
            return {
                "success": False,
                "error": e.to_dict()
            }
        except Exception as e:
            return {
                "success": False,
                "error": {
                    "code": "INTERNAL_ERROR",
                    "message": str(e)
                }
            }
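# Worked example of the combined score used by search_related_news_history
# (hypothetical values): with keyword_overlap = 0.6 and text_similarity = 0.2,
# combined_score = 0.6 * 0.7 + 0.2 * 0.3 = 0.42 + 0.06 = 0.48,
# which clears the default threshold of 0.4, so the item is kept.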
diff --git a/mcp_server/tools/system.py b/mcp_server/tools/system.py
new file mode 100644
index 0000000..2cf2248
--- /dev/null
+++ b/mcp_server/tools/system.py
@@ -0,0 +1,465 @@
"""
System management tools

Implements system status queries and crawl triggering.
"""

from pathlib import Path
from typing import Dict, List, Optional

from ..services.data_service import DataService
from ..utils.validators import validate_platforms
from ..utils.errors import MCPError, CrawlTaskError


class SystemManagementTools:
    """System management tool class"""

    def __init__(self, project_root: Optional[str] = None):
        """
        Initialize the system management tools

        Args:
            project_root: Project root directory
        """
        self.data_service = DataService(project_root)
        if project_root:
            self.project_root = Path(project_root)
        else:
            # Resolve the project root from this file's location
            current_file = Path(__file__)
            self.project_root = current_file.parent.parent.parent

    def get_system_status(self) -> Dict:
        """
        Get system runtime status and health-check information

        Returns:
            System status dict

        Example:
            >>> tools = SystemManagementTools()
            >>> result = tools.get_system_status()
            >>> print(result['system']['version'])
        """
        try:
            # Fetch the system status from the data service
            status = self.data_service.get_system_status()

            return {
                **status,
                "success": True
            }

        except MCPError as e:
            return {
                "success": False,
                "error": e.to_dict()
            }
        except Exception as e:
            return {
                "success": False,
                "error": {
                    "code": "INTERNAL_ERROR",
                    "message": str(e)
                }
            }

    def trigger_crawl(self, platforms: Optional[List[str]] = None, save_to_local: bool = False, include_url: bool = False) -> Dict:
        """
        Manually trigger a one-off crawl task (optionally persisted)

        Args:
            platforms: Platform IDs to crawl; crawls all platforms when empty
            save_to_local: Whether to save into the local output directory, default False
            include_url: Whether to include URL links, default False (saves tokens)

        Returns:
            Crawl result dict with the news data and, if saved, the file paths

        Example:
            >>> tools = SystemManagementTools()
            >>> # One-off crawl without persisting
            >>> result = tools.trigger_crawl(platforms=['zhihu', 'weibo'])
            >>> print(result['data'])
            >>> # Crawl and save locally
            >>> result = tools.trigger_crawl(platforms=['zhihu'], save_to_local=True)
            >>> print(result['saved_files'])
        """
        try:
            import json
            import time
            import random
            import requests
            from datetime import datetime
            import pytz
            import yaml

            # Parameter validation
            platforms = validate_platforms(platforms)

            # Load the config file
            config_path = self.project_root / "config" / "config.yaml"
            if not config_path.exists():
                raise CrawlTaskError(
                    "配置文件不存在",
                    suggestion=f"请确保配置文件存在: {config_path}"
                )

            # Read the config
            with open(config_path, "r", encoding="utf-8") as f:
                config_data = yaml.safe_load(f)

            # Platform configuration
            all_platforms = config_data.get("platforms", [])
            if not all_platforms:
                raise CrawlTaskError(
                    "配置文件中没有平台配置",
                    suggestion="请检查 config/config.yaml 中的 platforms 配置"
                )

            # Filter down to the requested platforms
            if platforms:
                target_platforms = [p for p in all_platforms if p["id"] in platforms]
                if not target_platforms:
                    raise CrawlTaskError(
                        f"指定的平台不存在: {platforms}",
                        suggestion=f"可用平台: {[p['id'] for p in all_platforms]}"
                    )
            else:
                target_platforms = all_platforms

            # Interval between platform requests (milliseconds)
            request_interval = config_data.get("crawler", {}).get("request_interval", 100)

            # Build the platform ID list
            ids = []
            for platform in target_platforms:
                if "name" in platform:
                    ids.append((platform["id"], platform["name"]))
                else:
                    ids.append(platform["id"])

            print(f"开始临时爬取,平台: {[p.get('name', p['id']) for p in target_platforms]}")

            # Crawl the data
            results = {}
            id_to_name = {}
            failed_ids = []

            for i, id_info in enumerate(ids):
                if isinstance(id_info, tuple):
                    id_value, name = id_info
                else:
                    id_value = id_info
                    name = id_value

                id_to_name[id_value] = name

                # Build the request URL
                url = f"https://newsnow.busiyi.world/api/s?id={id_value}&latest"

                headers = {
                    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36",
                    "Accept": "application/json, text/plain, */*",
                    "Accept-Language": "zh-CN,zh;q=0.9,en;q=0.8",
                    "Connection": "keep-alive",
                    "Cache-Control": "no-cache",
                }

                # Retry mechanism
                max_retries = 2
                retries = 0
                success = False

                while retries <= max_retries and not success:
                    try:
                        response = requests.get(url, headers=headers, timeout=10)
                        response.raise_for_status()

                        data_text = response.text
                        data_json = json.loads(data_text)

                        status = data_json.get("status", "未知")
                        if status not in ["success", "cache"]:
                            raise ValueError(f"响应状态异常: {status}")

                        status_info = "最新数据" if status == "success" else "缓存数据"
                        print(f"获取 {id_value} 成功({status_info})")

                        # Parse the payload
                        results[id_value] = {}
                        for index, item in enumerate(data_json.get("items", []), 1):
                            title = item["title"]
                            url_link = item.get("url", "")
                            mobile_url = item.get("mobileUrl", "")

                            if title in results[id_value]:
                                # Duplicate title: record the additional rank
                                results[id_value][title]["ranks"].append(index)
                            else:
                                results[id_value][title] = {
                                    "ranks": [index],
                                    "url": url_link,
                                    "mobileUrl": mobile_url,
                                }

                        success = True

                    except Exception as e:
                        retries += 1
                        if retries <= max_retries:
                            wait_time = random.uniform(3, 5)
                            print(f"请求 {id_value} 失败: {e}. {wait_time:.2f}秒后重试...")
                            time.sleep(wait_time)
                        else:
                            print(f"请求 {id_value} 失败: {e}")
                            failed_ids.append(id_value)

                # Pause between platform requests
                if i < len(ids) - 1:
                    actual_interval = request_interval + random.randint(-10, 20)
                    actual_interval = max(50, actual_interval)
                    time.sleep(actual_interval / 1000)
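            # Worked example of the jittered pacing above (hypothetical config):
            # with request_interval = 100 ms, the actual interval is drawn from
            # 100 + randint(-10, 20) -> 90..120 ms, floored at 50 ms, so
            # time.sleep(actual_interval / 1000) sleeps for roughly 0.09-0.12 s
            # between platforms.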
            # Format the returned data
            news_data = []
            for platform_id, titles_data in results.items():
                platform_name = id_to_name.get(platform_id, platform_id)
                for title, info in titles_data.items():
                    news_item = {
                        "platform_id": platform_id,
                        "platform_name": platform_name,
                        "title": title,
                        "ranks": info["ranks"]
                    }

                    # Conditionally attach URL fields
                    if include_url:
                        news_item["url"] = info.get("url", "")
                        news_item["mobile_url"] = info.get("mobileUrl", "")

                    news_data.append(news_item)

            # Beijing time
            beijing_tz = pytz.timezone("Asia/Shanghai")
            now = datetime.now(beijing_tz)

            # Build the result
            result = {
                "success": True,
                "task_id": f"crawl_{int(time.time())}",
                "status": "completed",
                "crawl_time": now.strftime("%Y-%m-%d %H:%M:%S"),
                "platforms": list(results.keys()),
                "total_news": len(news_data),
                "failed_platforms": failed_ids,
                "data": news_data,
                "saved_to_local": save_to_local
            }

            # Persist if requested
            if save_to_local:
                try:
                    import re

                    # Helper: clean a title
                    def clean_title(title: str) -> str:
                        """Strip line breaks and collapse whitespace in a title"""
                        if not isinstance(title, str):
                            title = str(title)
                        cleaned_title = title.replace("\n", " ").replace("\r", " ")
                        cleaned_title = re.sub(r"\s+", " ", cleaned_title)
                        cleaned_title = cleaned_title.strip()
                        return cleaned_title

                    # Helper: ensure a directory exists
                    def ensure_directory_exists(directory: str):
                        """Create the directory if it does not exist"""
                        Path(directory).mkdir(parents=True, exist_ok=True)

                    # Format date and time parts for the paths
                    date_folder = now.strftime("%Y年%m月%d日")
                    time_filename = now.strftime("%H时%M分")

                    # txt file path
                    txt_dir = self.project_root / "output" / date_folder / "txt"
                    ensure_directory_exists(str(txt_dir))
                    txt_file_path = txt_dir / f"{time_filename}.txt"

                    # html file path
                    html_dir = self.project_root / "output" / date_folder / "html"
                    ensure_directory_exists(str(html_dir))
                    html_file_path = html_dir / f"{time_filename}.html"

                    # Save the txt file (same format as main.py)
                    with open(txt_file_path, "w", encoding="utf-8") as f:
                        for id_value, title_data in results.items():
                            # "id | name" or just "id"
                            name = id_to_name.get(id_value)
                            if name and name != id_value:
                                f.write(f"{id_value} | {name}\n")
                            else:
                                f.write(f"{id_value}\n")

                            # Sort titles by rank
                            sorted_titles = []
                            for title, info in title_data.items():
                                cleaned = clean_title(title)
                                if isinstance(info, dict):
                                    ranks = info.get("ranks", [])
                                    url = info.get("url", "")
                                    mobile_url = info.get("mobileUrl", "")
                                else:
                                    ranks = info if isinstance(info, list) else []
                                    url = ""
                                    mobile_url = ""

                                rank = ranks[0] if ranks else 1
                                sorted_titles.append((rank, cleaned, url, mobile_url))

                            sorted_titles.sort(key=lambda x: x[0])

                            for rank, cleaned, url, mobile_url in sorted_titles:
                                line = f"{rank}. {cleaned}"
                                if url:
                                    line += f" [URL:{url}]"
                                if mobile_url:
                                    line += f" [MOBILE:{mobile_url}]"
                                f.write(line + "\n")

                            f.write("\n")

                        if failed_ids:
                            f.write("==== 以下ID请求失败 ====\n")
                            for id_value in failed_ids:
                                f.write(f"{id_value}\n")
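                    # Illustrative txt layout produced by the block above
                    # (hypothetical data; example.com URLs are placeholders):
                    #
                    #   zhihu | 知乎
                    #   1. 某条新闻标题 [URL:https://example.com/a] [MOBILE:https://m.example.com/a]
                    #   2. 另一条标题
                    #
                    #   ==== 以下ID请求失败 ====
                    #   weibo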
{cleaned}" + if url: + line += f" [URL:{url}]" + if mobile_url: + line += f" [MOBILE:{mobile_url}]" + f.write(line + "\n") + + f.write("\n") + + if failed_ids: + f.write("==== 以下ID请求失败 ====\n") + for id_value in failed_ids: + f.write(f"{id_value}\n") + + # 保存 html 文件(简化版) + html_content = self._generate_simple_html(results, id_to_name, failed_ids, now) + with open(html_file_path, "w", encoding="utf-8") as f: + f.write(html_content) + + print(f"数据已保存到:") + print(f" TXT: {txt_file_path}") + print(f" HTML: {html_file_path}") + + result["saved_files"] = { + "txt": str(txt_file_path), + "html": str(html_file_path) + } + result["note"] = "数据已持久化到 output 文件夹" + + except Exception as e: + print(f"保存文件失败: {e}") + result["save_error"] = str(e) + result["note"] = "爬取成功但保存失败,数据仅在内存中" + else: + result["note"] = "临时爬取结果,未持久化到output文件夹" + + return result + + except MCPError as e: + return { + "success": False, + "error": e.to_dict() + } + except Exception as e: + import traceback + return { + "success": False, + "error": { + "code": "INTERNAL_ERROR", + "message": str(e), + "traceback": traceback.format_exc() + } + } + + def _generate_simple_html(self, results: Dict, id_to_name: Dict, failed_ids: List, now) -> str: + """生成简化的 HTML 报告""" + html = """ + +