orc-order-v2/doc/TECHNICAL_ARCHITECTURE.md

# 益选-OCR Order Processing System - Technical Architecture Document

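## System Architecture Design

### Architecture Overview

The 益选-OCR Order Processing System uses a layered architecture that follows the single responsibility and open-closed principles to keep the system maintainable, extensible, and testable. The architecture has four main layers: the user interface layer, the business logic layer, the core processing layer, and the data access layer.

### Architecture Design Principles

#### 1. Layered Decoupling
- Layers communicate through clearly defined interfaces
- Upper layers depend on lower layers; lower layers never depend on upper layers
- Layers stay loosely coupled

#### 2. Modular Design
- Each module has a clear responsibility boundary
- High cohesion within a module, low coupling between modules
- Modules can be developed and tested independently

#### 3. Configuration-Driven
- System behavior is controlled through configuration files
- Parameters can be adjusted at runtime
- A flexible configuration management mechanism is provided

#### 4. Error Handling
- Unified exception handling
- Detailed logging and error tracing
- Graceful error recovery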

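## Module Breakdown and Responsibilities

### User Interface Layer (UI Layer)

#### Launcher Module (启动器.py)
**Responsibilities**:
- Provide the graphical user interface
- Coordinate calls to the functional modules
- Display processing progress and results
- Manage the user interaction flow

**Main components**:
- `OCR订单处理系统` class: main application class
- `StatusBar` class: status bar component
- `LogRedirector` class: log redirector
- Various dialogs and preview windows

#### Command-Line Interface (app/cli/)
**Responsibilities**:
- Provide command-line operation
- Support scripted operation
- Implement batch processing

**Submodules**:
- `ocr_cli.py`: command-line interface for OCR recognition
- `excel_cli.py`: command-line interface for Excel processing
- `merge_cli.py`: command-line interface for merging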

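### Business Logic Layer (Service Layer)

#### OCR Service (app/services/ocr_service.py)
**Responsibilities**:
- Coordinate the OCR recognition flow
- Manage the lifecycle of the OCR processor
- Provide OCR-related business logic
- Validate and convert OCR results

**Core methods**:
- `process_image()`: process a single image
- `process_images_batch()`: process images in batch
- `get_unprocessed_images()`: list images that still need processing
- `validate_image()`: check that an image is valid

#### Order Service (app/services/order_service.py)
**Responsibilities**:
- Process Excel order files
- Extract and standardize product information
- Apply barcode mapping rules
- Perform specification and unit conversion

**Core functions**:
- Reading and parsing Excel files
- Extracting and cleaning product information
- Mapping and converting barcodes
- Recognizing specification units automatically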
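
The barcode mapping step can be pictured as a dictionary lookup applied to each extracted product. The sketch below is only an illustration: the `apply_barcode_mapping` name, the `BARCODE_MAPPING` table, and the sample barcodes are assumptions, not the project's actual rules.

```python
from typing import Dict, List

# Hypothetical mapping table: barcode found in the order -> barcode to use
# in the purchase order. The real rules live in the project's config files.
BARCODE_MAPPING: Dict[str, str] = {
    "6901234567890": "6909876543210",
}


def apply_barcode_mapping(products: List[dict], mapping: Dict[str, str]) -> List[dict]:
    """Return copies of the products with mapped barcodes.

    Barcodes without a mapping entry pass through unchanged (whitespace stripped).
    """
    mapped = []
    for product in products:
        updated = dict(product)  # shallow copy so the input is not mutated
        barcode = str(updated.get("barcode", "")).strip()
        updated["barcode"] = mapping.get(barcode, barcode)
        mapped.append(updated)
    return mapped
```

Returning copies rather than mutating in place keeps the original extraction result available for logging or comparison.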

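#### Tobacco Service (app/services/tobacco_service.py)
**Responsibilities**:
- Handle orders from the tobacco industry specifically
- Adapt to the special formats used by tobacco companies
- Apply the rules specific to tobacco orders

#### Merge Service
**Responsibilities**:
- Merge multiple purchase order files
- Aggregate identical products
- Resolve merge conflicts and duplicates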

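### Core Processing Layer (Core Layer)

#### OCR Processing Core (app/core/ocr/)

##### Table OCR Processor (table_ocr.py)
**Responsibilities**:
- Coordinate the OCR recognition flow
- Manage processing records
- Control batch processing logic
- Handle file I/O

**Core components**:
- `OCRProcessor` class: the main OCR processor
- `ProcessedRecordManager` class: processing record manager

**Processing flow**:
1. Validate and preprocess the image
2. Call the Baidu OCR API for recognition
3. Parse the OCR response
4. Generate the Excel file
5. Update the processing record

##### Baidu OCR Client (baidu_ocr.py)
**Responsibilities**:
- Wrap calls to the Baidu OCR API
- Handle API authentication and authorization
- Manage API requests and responses
- Implement retry and error handling

**Core functions**:
- API key management
- Request signature generation
- Table recognition API calls
- Result retrieval and parsing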

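#### Excel Processing Core (app/core/excel/)

##### Excel Processor
**Responsibilities**:
- Read and parse Excel files
- Extract product information
- Clean and standardize data
- Generate the standard purchase order format

**Processing logic**:
1. Read the Excel file
2. Locate the product data region
3. Extract product attributes (barcode, name, specification, quantity, unit price)
4. Apply data cleaning rules
5. Generate standardized output

##### Unit Converter (converter.py)
**Responsibilities**:
- Recognize product specification units automatically
- Perform unit conversion
- Handle complex specification descriptions

**Supported units**:
- Count units: 个, 只, 条, 包, 箱, 件, and others
- Weight units: 克 (g), 千克 (kg), 斤 (500 g), 公斤 (kg), and others
- Volume units: 毫升 (mL), 升 (L), 立方米 (m³), and others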
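
Weight and volume units can be normalized with fixed factors, whereas count units such as 箱 depend on the product and cannot. The sketch below shows one possible shape for that conversion; the `to_base_unit` name and the choice of grams and millilitres as base units are assumptions made for illustration.

```python
from typing import Tuple

# Fixed conversion factors to a base unit (grams for weight, millilitres for volume).
WEIGHT_TO_GRAMS = {"克": 1, "斤": 500, "千克": 1000, "公斤": 1000}
VOLUME_TO_ML = {"毫升": 1, "升": 1000, "立方米": 1_000_000}


def to_base_unit(value: float, unit: str) -> Tuple[float, str]:
    """Convert a weight to grams or a volume to millilitres.

    Count units (个, 箱, ...) have no fixed factor and are returned unchanged.
    """
    if unit in WEIGHT_TO_GRAMS:
        return value * WEIGHT_TO_GRAMS[unit], "克"
    if unit in VOLUME_TO_ML:
        return value * VOLUME_TO_ML[unit], "毫升"
    return value, unit
```

For example, `to_base_unit(2, "斤")` yields 1000 grams, while `to_base_unit(3, "箱")` is returned as is.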

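#### Utility Modules (app/core/utils/)

##### File Utilities (file_utils.py)
**Responsibilities**:
- Wrap file system operations
- Handle and validate paths
- Check file types
- Perform batch file operations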
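
The image validation code later in this document calls `get_file_extension` and `is_file_size_valid`. Their implementations are not shown in the source, so the following is a minimal sketch of what such helpers might look like; the allowed extensions and the size limit are assumed values.

```python
import os

# Assumed values for illustration; the real ones come from the configuration.
ALLOWED_EXTENSIONS = {".jpg", ".jpeg", ".png", ".bmp"}
MAX_SIZE_MB = 4


def get_file_extension(path: str) -> str:
    """Return the lower-cased extension including the dot, e.g. '.jpg'."""
    return os.path.splitext(path)[1].lower()


def is_file_size_valid(path: str, max_size_mb: float) -> bool:
    """True when the file is non-empty and no larger than max_size_mb."""
    size = os.path.getsize(path)
    return 0 < size <= max_size_mb * 1024 * 1024
```

Lower-casing the extension means `Scan.JPG` and `scan.jpg` are treated the same way.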

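##### Logging Utilities (log_utils.py)
**Responsibilities**:
- Configure and manage logging
- Control log levels
- Rotate log files
- Trace and record errors

##### Dialog Utilities (dialog_utils.py)
**Responsibilities**:
- Implement custom dialogs
- Provide user interaction components
- Manage the configuration interface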

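### Data Access Layer

#### Configuration Management (app/config/)

##### Configuration Manager (settings.py)
**Responsibilities**:
- Load and parse configuration files
- Read and modify configuration options
- Validate configuration and apply default values
- Persist configuration

##### Default Configuration (defaults.py)
**Responsibilities**:
- Define the system's default configuration
- Provide a configuration template
- Ensure the configuration is complete

#### Data Storage

##### File System Interface
- **Input directory**: `data/input/` - image files waiting to be processed
- **Output directory**: `data/output/` - processing results and generated Excel files
- **Temporary directory**: `data/temp/` - temporary files
- **Template directory**: `templates/` - Excel template files
- **Configuration directory**: `config/` - configuration files and mapping rules

##### Processing Record Management
- **JSON record file**: `data/output/processed_files.json`
- **Record content**: mapping between processed input files and their outputs
- **Update mechanism**: the record is updated automatically once processing completes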
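
A record file of this kind can be managed with a small wrapper around `json`. The class name `ProcessedRecordManager` comes from this document, but the method names and the flat `{image path: output path}` layout below are assumptions, offered as a sketch rather than the actual implementation.

```python
import json
import os
from typing import Dict, List


class ProcessedRecordManager:
    """Keeps an image path -> output path mapping in a JSON file."""

    def __init__(self, record_file: str):
        self.record_file = record_file
        self.records: Dict[str, str] = self._load()

    def _load(self) -> Dict[str, str]:
        # A missing or unreadable record file is treated as "nothing processed yet".
        if not os.path.exists(self.record_file):
            return {}
        try:
            with open(self.record_file, "r", encoding="utf-8") as f:
                data = json.load(f)
        except (json.JSONDecodeError, OSError):
            return {}
        return data if isinstance(data, dict) else {}

    def is_processed(self, image_path: str) -> bool:
        return image_path in self.records

    def mark_processed(self, image_path: str, output_path: str) -> None:
        # Update the in-memory mapping, then persist it immediately.
        self.records[image_path] = output_path
        directory = os.path.dirname(self.record_file)
        if directory:
            os.makedirs(directory, exist_ok=True)
        with open(self.record_file, "w", encoding="utf-8") as f:
            json.dump(self.records, f, ensure_ascii=False, indent=2)

    def get_unprocessed(self, image_paths: List[str]) -> List[str]:
        return [p for p in image_paths if p not in self.records]
```

Writing the file after every update is simple and safe for small batches; a larger deployment might prefer batching writes.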

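## Core Algorithms and Flows

### OCR Recognition Flow

```mermaid
graph TD
    A[Start] --> B[Validate image]
    B --> C{Image valid?}
    C -->|Yes| D[Check whether already processed]
    C -->|No| E[Return error]
    D --> F{Already processed?}
    F -->|Yes| G[Return existing result]
    F -->|No| H[Call Baidu OCR API]
    H --> I[Parse OCR result]
    I --> J[Generate Excel file]
    J --> K[Update processing record]
    K --> L[Return success]
    E --> M[End]
    G --> M
    L --> M
```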

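#### Image Validation Algorithm
```python
import os

from PIL import Image

# get_file_extension, is_file_size_valid, ALLOWED_EXTENSIONS and MAX_SIZE_MB
# are assumed to be provided elsewhere in the project (e.g. file_utils.py).

def validate_image(image_path: str) -> bool:
    # 1. Check that the file exists
    if not os.path.exists(image_path):
        return False

    # 2. Check the file extension
    ext = get_file_extension(image_path)
    if ext not in ALLOWED_EXTENSIONS:
        return False

    # 3. Check the file size
    if not is_file_size_valid(image_path, MAX_SIZE_MB):
        return False

    # 4. Verify the image format (optional)
    try:
        with Image.open(image_path) as img:
            img.verify()
    except Exception:
        # Any failure to open or verify means the file is not a usable image
        return False

    return True
```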

#### OCR Result Parsing Algorithm
```python
import base64

# extract_table_data, extract_text_content and find_excel_data are
# project helpers that are not shown in this document.

def parse_ocr_result(ocr_response: dict) -> dict:
    result = {
        'tables': [],
        'text': '',
        'excel_data': None
    }

    # 1. Extract table data
    if 'tables_result' in ocr_response:
        for table in ocr_response['tables_result']:
            table_data = extract_table_data(table)
            result['tables'].append(table_data)

    # 2. Extract text content
    if 'words_result' in ocr_response:
        result['text'] = extract_text_content(ocr_response['words_result'])

    # 3. Extract the Excel data (delivered as a base64 string)
    excel_base64 = find_excel_data(ocr_response)
    if excel_base64:
        result['excel_data'] = base64.b64decode(excel_base64)

    return result
```

### Excel Processing Flow

#### Product Information Extraction Algorithm
```python
from typing import Dict, List

import pandas as pd

def extract_product_info(excel_data: pd.DataFrame) -> List[Dict]:
    products = []

    # 1. Identify the header row
    header_row = identify_header_row(excel_data)

    # 2. Determine the column mapping
    column_mapping = map_columns(excel_data.iloc[header_row])

    # 3. Extract product rows (everything below the header)
    for row_idx in range(header_row + 1, len(excel_data)):
        row_data = excel_data.iloc[row_idx]

        product = {
            'barcode': extract_barcode(row_data, column_mapping),
            'name': extract_product_name(row_data, column_mapping),
            'specification': extract_specification(row_data, column_mapping),
            'quantity': extract_quantity(row_data, column_mapping),
            'unit_price': extract_unit_price(row_data, column_mapping),
            'total_price': extract_total_price(row_data, column_mapping)
        }

        # 4. Validate and clean the data
        if validate_product(product):
            cleaned_product = clean_product_data(product)
            products.append(cleaned_product)

    return products
```
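
The helpers `identify_header_row` and `map_columns` are not shown in the source. One plausible approach is keyword matching on header cells, sketched below on plain lists of cell values instead of a DataFrame so that it runs without pandas. The keyword table and the `min_fields` threshold are assumptions for illustration.

```python
from typing import Dict, List, Optional, Sequence

# Hypothetical header keywords per field; real order files may use other wording.
HEADER_KEYWORDS: Dict[str, List[str]] = {
    "barcode": ["条码", "条形码", "barcode"],
    "name": ["商品名称", "品名", "名称"],
    "specification": ["规格"],
    "quantity": ["数量"],
    "unit_price": ["单价"],
    "total_price": ["金额", "总价", "小计"],
}


def map_columns(header_cells: Sequence[object]) -> Dict[str, int]:
    """Map field names to column positions by keyword match on the header text."""
    mapping: Dict[str, int] = {}
    for idx, cell in enumerate(header_cells):
        text = str(cell).strip().lower()
        for field, keywords in HEADER_KEYWORDS.items():
            # The first column that matches a field wins; later matches are ignored.
            if field not in mapping and any(k in text for k in keywords):
                mapping[field] = idx
                break
    return mapping


def identify_header_row(rows: Sequence[Sequence[object]], min_fields: int = 3) -> Optional[int]:
    """Return the index of the first row that matches at least min_fields fields."""
    for row_idx, row in enumerate(rows):
        if len(map_columns(row)) >= min_fields:
            return row_idx
    return None
```

Requiring several matching fields helps skip title rows such as a lone "采购单" cell above the real header.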

#### Specification Unit Recognition Algorithm
```python
import re
from typing import Dict

def parse_specification(spec_text: str) -> Dict:
    result = {
        'original': spec_text,
        'quantity': 1,
        'unit': '个',
        'parsed': False
    }

    # 1. Predefined unit patterns: regex -> (unit, multiplier).
    #    The multiplier is always 1 here and is not used yet.
    unit_patterns = {
        r'(\d+)\s*个': ('个', 1),
        r'(\d+)\s*只': ('只', 1),
        r'(\d+)\s*条': ('条', 1),
        r'(\d+)\s*包': ('包', 1),
        r'(\d+)\s*箱': ('箱', 1),
        r'(\d+)\s*件': ('件', 1),
        r'(\d+)\s*克': ('克', 1),
        r'(\d+)\s*千克': ('千克', 1),
        r'(\d+)\s*斤': ('斤', 1),
        r'(\d+)\s*公斤': ('公斤', 1)
    }

    # 2. Pattern matching (the first matching pattern wins)
    for pattern, (unit, _multiplier) in unit_patterns.items():
        match = re.search(pattern, spec_text)
        if match:
            result['quantity'] = int(match.group(1))
            result['unit'] = unit
            result['parsed'] = True
            break

    # 3. Fall back to complex specification handling
    #    (parse_complex_specification is defined elsewhere in the project)
    if not result['parsed']:
        result = parse_complex_specification(spec_text)

    return result
```
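
The fallback `parse_complex_specification` is not shown in the source. As a sketch under stated assumptions, it could handle pack-style specifications such as `500ml*24` by reading the trailing number as the pack count. The regular expression and the extra `item_size` and `item_unit` keys are hypothetical additions for illustration.

```python
import re
from typing import Dict

# Matches "<size><optional unit><separator><count>", e.g. "500ml*24" or "250g×12".
_PACK_PATTERN = re.compile(
    r'(\d+(?:\.\d+)?)\s*([a-zA-Z\u4e00-\u9fff]*)\s*[*xX×]\s*(\d+)'
)


def parse_complex_specification(spec_text: str) -> Dict:
    """Parse pack-style specifications; returns parsed=False when nothing matches."""
    result = {
        'original': spec_text,
        'quantity': 1,
        'unit': '个',
        'parsed': False,
    }
    match = _PACK_PATTERN.search(spec_text or '')
    if match:
        result['item_size'] = float(match.group(1))  # size of one item, e.g. 500
        result['item_unit'] = match.group(2)         # unit of one item, e.g. "ml"
        result['quantity'] = int(match.group(3))     # items per pack, e.g. 24
        result['parsed'] = True
    return result
```

Because the result keeps the same base keys as `parse_specification`, callers can treat both outputs uniformly and check `parsed` to detect unrecognized text.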

### Purchase Order Merge Algorithm

#### Product Aggregation Algorithm
```python
from typing import Dict, List

def merge_products(products_list: List[List[Dict]]) -> List[Dict]:
    merged_products = {}

    # 1. Collect all products
    for products in products_list:
        for product in products:
            key = generate_product_key(product)

            if key in merged_products:
                # 2. Merge identical products
                merged_products[key]['quantity'] += product['quantity']
                merged_products[key]['total_price'] += product['total_price']
                merged_products[key]['source_files'].append(product.get('source_file', ''))
            else:
                # 3. Add a new product
                merged_products[key] = product.copy()
                merged_products[key]['source_files'] = [product.get('source_file', '')]

    # 4. Convert back to a list
    result = list(merged_products.values())

    # 5. Sort by barcode, falling back to name when the barcode is missing or empty
    result.sort(key=lambda x: x.get('barcode') or x.get('name') or '')

    return result
```
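
The merge relies on `generate_product_key`, which the source does not show. A reasonable assumption is that the barcode identifies a product when present and that name plus specification serves as a fallback; the sketch below follows that assumption and is not necessarily how the project does it.

```python
def generate_product_key(product: dict) -> str:
    """Build a merge key: the barcode when available, else name plus specification."""
    barcode = str(product.get('barcode') or '').strip()
    if barcode:
        return f"barcode:{barcode}"
    name = str(product.get('name') or '').strip()
    spec = str(product.get('specification') or '').strip()
    return f"name:{name}|spec:{spec}"
```

The `barcode:` and `name:` prefixes keep the two key spaces apart, so a product name that happens to look like a barcode cannot collide with a real one.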

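## Data Flow Design

### Main Data Flows

#### 1. OCR Recognition Data Flow
```
Input image → Image validation → OCR API call → Result parsing → Excel generation → Output file
```

#### 2. Excel Processing Data Flow
```
Excel file → Data reading → Product extraction → Data cleaning → Format conversion → Standard purchase order
```

#### 3. Merge Data Flow
```
Multiple purchase orders → Product extraction → Deduplication and aggregation → Conflict handling → Merged result → Output file
```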

### Data Structure Design

#### Product Data Structure
```python
{
    'barcode': str,           # Product barcode
    'name': str,              # Product name
    'specification': str,     # Product specification
    'quantity': int,          # Quantity
    'unit': str,              # Unit
    'unit_price': float,      # Unit price
    'total_price': float,     # Total price
    'source_file': str,       # Source file
    'category': str,          # Product category
    'brand': str              # Brand
}
```

#### Processing Record Data Structure
```python
{
    'image_file': str,        # Input image path
    'output_file': str,       # Output file path
    'processing_time': str,   # Time of processing
    'status': str,            # Processing status
    'error_message': str      # Error message (if any)
}
```

#### OCR Result Data Structure
```python
{
    'tables': List[Dict],     # Table data
    'text': str,              # Text content
    'excel_data': bytes,      # Excel file data
    'confidence': float,      # Recognition confidence
    'processing_time': float  # Processing duration
}
```

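## Key Implementation Techniques

### Concurrency

#### Multithreaded Batch Processing
```python
import logging
from concurrent.futures import ThreadPoolExecutor, as_completed
from typing import Any, Callable, List

logger = logging.getLogger(__name__)

class BatchProcessor:
    def __init__(self, max_workers: int = 4):
        self.max_workers = max_workers
        self.executor = ThreadPoolExecutor(max_workers=max_workers)

    def process_batch(self, items: List[Any], processor_func: Callable[[Any], Any]) -> List[Any]:
        # Submit every item to the thread pool
        futures = [self.executor.submit(processor_func, item) for item in items]

        # Collect results in completion order (not input order)
        results = []
        for future in as_completed(futures):
            try:
                result = future.result()
                results.append(result)
            except Exception as e:
                logger.error(f"Processing failed: {e}")
                results.append(None)

        return results
```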
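
Because `as_completed` yields futures as they finish, the results list above does not necessarily line up with the input order. When callers need to match each result to its input, one option is to record the index of every submitted item. The function name `process_batch_ordered` below is an assumption, and this is a sketch of that variant rather than the project's code.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Any, Callable, List, Optional, Sequence


def process_batch_ordered(
    items: Sequence[Any],
    processor_func: Callable[[Any], Any],
    max_workers: int = 4,
) -> List[Optional[Any]]:
    """Process items concurrently; results[i] corresponds to items[i].

    A failed item produces None at its position instead of aborting the batch.
    """
    results: List[Optional[Any]] = [None] * len(items)
    with ThreadPoolExecutor(max_workers=max_workers) as executor:
        # Remember which input index each future belongs to.
        future_to_index = {
            executor.submit(processor_func, item): index
            for index, item in enumerate(items)
        }
        for future, index in future_to_index.items():
            try:
                results[index] = future.result()
            except Exception:
                results[index] = None
    return results
```

Using the executor as a context manager also shuts the pool down when the batch finishes, so no worker threads are left behind.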

### Error Handling and Retry

#### API Call Retry
```python
import logging
import time

logger = logging.getLogger(__name__)

def call_with_retry(func, max_retries=3, retry_delay=2):
    for attempt in range(max_retries):
        try:
            return func()
        except Exception as e:
            logger.warning(f"Attempt {attempt + 1} failed: {e}")

            if attempt < max_retries - 1:
                # Wait a fixed delay before the next attempt
                time.sleep(retry_delay)
            else:
                logger.error("All retry attempts failed")
                raise
```
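
The retry above waits a fixed delay between attempts. For remote APIs, a common alternative is exponential backoff with a little random jitter, which spreads retries out when the service is under load. The sketch below illustrates that variant; the `call_with_backoff` name and its parameters are assumptions and not part of the original code.

```python
import random
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def call_with_backoff(
    func: Callable[[], T],
    max_retries: int = 3,
    base_delay: float = 1.0,
    max_delay: float = 30.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call func, retrying with exponentially growing delays plus up to 10% jitter."""
    if max_retries < 1:
        raise ValueError("max_retries must be at least 1")
    for attempt in range(max_retries):
        try:
            return func()
        except Exception:
            if attempt == max_retries - 1:
                raise  # out of attempts: surface the last error
            # Delay doubles each attempt (base, 2*base, 4*base, ...), capped at max_delay.
            delay = min(max_delay, base_delay * (2 ** attempt))
            sleep(delay + random.uniform(0, delay * 0.1))
    raise RuntimeError("unreachable")  # the loop always returns or raises
```

The `sleep` parameter exists so tests can pass a stub and avoid real waiting.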

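### Memory Optimization

#### Large File Handling
```python
import pandas as pd
from openpyxl import load_workbook

def process_large_file(file_path: str, chunk_size: int = 1000):
    # pandas.read_excel has no chunksize parameter, so rows are streamed with
    # openpyxl's read-only mode and grouped into DataFrame chunks instead.
    def read_in_chunks():
        workbook = load_workbook(file_path, read_only=True, data_only=True)
        try:
            rows = workbook.active.iter_rows(values_only=True)
            header = next(rows, None)  # first row is treated as the header
            buffer = []
            for row in rows:
                buffer.append(row)
                if len(buffer) >= chunk_size:
                    yield pd.DataFrame(buffer, columns=list(header))
                    buffer = []
            if buffer:
                yield pd.DataFrame(buffer, columns=list(header))
        finally:
            workbook.close()

    # Process chunk by chunk; process_chunk is supplied by the caller's module
    for chunk in read_in_chunks():
        process_chunk(chunk)
```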

### Configuration Management Implementation

#### Dynamic Configuration Loading
```python
import configparser
import os
from typing import Any

class ConfigManager:
    def __init__(self, config_file: str):
        self.config_file = config_file
        self.config = configparser.ConfigParser()
        self.load_config()

    def load_config(self):
        # Read the file when it exists; otherwise create one with defaults
        if os.path.exists(self.config_file):
            self.config.read(self.config_file, encoding='utf-8')
        else:
            self.create_default_config()

    def create_default_config(self):
        # Placeholder: populate defaults (see defaults.py) before persisting
        with open(self.config_file, 'w', encoding='utf-8') as f:
            self.config.write(f)

    def get(self, section: str, option: str, fallback: Any = None) -> Any:
        return self.config.get(section, option, fallback=fallback)

    def getint(self, section: str, option: str, fallback: int = 0) -> int:
        return self.config.getint(section, option, fallback=fallback)

    def getboolean(self, section: str, option: str, fallback: bool = False) -> bool:
        return self.config.getboolean(section, option, fallback=fallback)
```
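
`load_config` falls back to `create_default_config` when the configuration file is missing. As a minimal sketch of how such defaults might be populated and written, the standalone function below uses `ConfigParser.read_dict`. The section and option names are assumptions; only the directory paths are taken from the file system layout described earlier.

```python
import configparser
import os

# Hypothetical default values; section and option names are illustrative only.
DEFAULT_CONFIG = {
    "Paths": {
        "input_folder": "data/input",
        "output_folder": "data/output",
        "temp_folder": "data/temp",
    },
    "Performance": {"max_workers": "4"},
}


def create_default_config(config_file: str) -> configparser.ConfigParser:
    """Build a parser from DEFAULT_CONFIG and write it to config_file."""
    config = configparser.ConfigParser()
    config.read_dict(DEFAULT_CONFIG)
    directory = os.path.dirname(config_file)
    if directory:
        # Create the parent directory on first run.
        os.makedirs(directory, exist_ok=True)
    with open(config_file, "w", encoding="utf-8") as f:
        config.write(f)
    return config
```

Persisting the defaults on first run gives users a complete file to edit instead of an empty one.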

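### Logging System Design

#### Structured Logging
```python
import logging
import os
from logging.handlers import RotatingFileHandler

def setup_logging(log_file: str = 'logs/app.log'):
    # Create the logger
    logger = logging.getLogger(__name__)
    logger.setLevel(logging.DEBUG)

    # Avoid attaching duplicate handlers when called more than once
    if logger.handlers:
        return logger

    # Make sure the log directory exists
    log_dir = os.path.dirname(log_file)
    if log_dir:
        os.makedirs(log_dir, exist_ok=True)

    # Create the file handler with rotation (10 MB per file, 5 backups)
    file_handler = RotatingFileHandler(
        log_file, maxBytes=10*1024*1024, backupCount=5, encoding='utf-8'
    )
    file_handler.setLevel(logging.DEBUG)

    # Create the console handler
    console_handler = logging.StreamHandler()
    console_handler.setLevel(logging.INFO)

    # Create the formatter
    formatter = logging.Formatter(
        '%(asctime)s - %(name)s - %(levelname)s - %(message)s'
    )

    # Attach the formatter to the handlers
    file_handler.setFormatter(formatter)
    console_handler.setFormatter(formatter)

    # Attach the handlers to the logger
    logger.addHandler(file_handler)
    logger.addHandler(console_handler)

    return logger
```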

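### Performance Monitoring and Optimization

#### Processing Time Statistics
```python
import logging
import time
from functools import wraps

logger = logging.getLogger(__name__)

def timing_decorator(func):
    @wraps(func)
    def wrapper(*args, **kwargs):
        # perf_counter is a monotonic clock suited to measuring durations
        start_time = time.perf_counter()
        result = func(*args, **kwargs)
        end_time = time.perf_counter()

        processing_time = end_time - start_time
        logger.info(f"{func.__name__} took {processing_time:.2f} s")

        return result

    return wrapper
```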

### Security Considerations

#### API Key Management
```python
import os
from typing import Optional

from cryptography.fernet import Fernet

class SecureConfig:
    def __init__(self, encryption_key: Optional[bytes] = None):
        self.cipher = Fernet(encryption_key or self._get_or_create_key())

    def _get_or_create_key(self) -> bytes:
        key_file = 'config/.key'
        if os.path.exists(key_file):
            with open(key_file, 'rb') as f:
                return f.read()
        # No key yet: generate one and persist it.
        # The key file should be kept out of version control.
        os.makedirs(os.path.dirname(key_file), exist_ok=True)
        key = Fernet.generate_key()
        with open(key_file, 'wb') as f:
            f.write(key)
        return key

    def encrypt(self, data: str) -> str:
        return self.cipher.encrypt(data.encode()).decode()

    def decrypt(self, encrypted_data: str) -> str:
        return self.cipher.decrypt(encrypted_data.encode()).decode()
```

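This technical architecture document describes the technical implementation of the 益选-OCR Order Processing System: the architecture design, module responsibilities, core algorithm flows, data flow design, and key implementation techniques. The design follows software engineering best practices to keep the system reliable, maintainable, and extensible.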