Research on Quality Analysis and Governance Strategies of Real-World Chinese Electronic Medical Records Data for Knowledge Extraction

Gai Yanrong¹, Zhang Yunqiu², Zhang Hui², Li Chencheng¹, Lu Junrui¹

1. Shenzhen Health Development Research and Data Management Center
2. Jilin University

Corresponding author: Zhang Yunqiu, yunqiu@jlu.edu.cn

DOI: 10.12201/bmr.202511.00077

Notice: Papers posted on the preprint system are intended only for the exchange and sharing of the latest research results. They have not been peer reviewed and are therefore not recommended for directly guiding clinical practice.

Abstract: Objective/Significance: Knowledge extraction from real-world Chinese electronic medical records (EMRs) is currently constrained by a poor fit between the clinical significance and the technical feasibility of annotation rules, low source data quality, and lagging data governance. This study aims to alleviate these bottlenecks and to explore a path for knowledge extraction from Chinese EMRs in real-world scenarios. Methods/Process: Annotation rules covering the major entity and relation types were formulated, experiments were conducted on real-world EMRs using a BERT+BiLSTM+CRF model, and the key issues in EMR data governance were summarized on that basis. Results/Conclusion: On real-world EMRs, the model achieved F1 scores of approximately 0.62 for entity recognition and 0.36 for relation recognition, markedly lower than on public datasets. The data-related causes are mainly non-standard expressions, data sparsity, and terminological differences among departments. The governance-related causes are mainly an imbalance between privacy protection and data utilization, and the lack of both full-process management and quality checks before data are loaded into the repository.

Key words: Chinese Electronic Medical Record; Knowledge Extraction; Named Entity Recognition; Relation Extraction; Data Governance
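
For readers unfamiliar with the metric, the sketch below shows one common way an entity-level F1 score can be computed from BIO-tagged sequences. The abstract does not state how the reported F1 values were calculated, so the strict span-matching rule, the function names (`bio_to_spans`, `precision_recall_f1`), and the labels `Disease` and `Symptom` are all assumptions made for illustration, not the authors' evaluation procedure.

```python
from typing import List, Optional, Set, Tuple

# Hypothetical span representation: (entity type, start index, end index exclusive).
Span = Tuple[str, int, int]


def bio_to_spans(tags: List[str]) -> Set[Span]:
    """Collect entity spans from one BIO-tagged sequence."""
    spans: Set[Span] = set()
    start: Optional[int] = None
    label: Optional[str] = None
    # A trailing "O" sentinel flushes a span that runs to the end of the sequence.
    for i, tag in enumerate(tags + ["O"]):
        # An I- tag continues the open span only if its type matches.
        continues = tag.startswith("I-") and label == tag[2:]
        if continues:
            continue
        if label is not None:
            spans.add((label, start, i))
        # B- opens a new span; a stray I- is also treated as an opening (assumption).
        if tag.startswith(("B-", "I-")):
            start, label = i, tag[2:]
        else:
            start, label = None, None
    return spans


def precision_recall_f1(gold: Set[Span], pred: Set[Span]) -> Tuple[float, float, float]:
    """Strict matching: a predicted span counts only if type and boundaries both agree."""
    true_positives = len(gold & pred)
    precision = true_positives / len(pred) if pred else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


# Toy example with made-up labels; not data from the study.
gold_tags = ["B-Disease", "I-Disease", "O", "B-Symptom", "O"]
pred_tags = ["B-Disease", "I-Disease", "O", "O", "B-Symptom"]
precision, recall, f1 = precision_recall_f1(bio_to_spans(gold_tags), bio_to_spans(pred_tags))
print(precision, recall, f1)
```

Under strict matching, a boundary error such as the shifted `Symptom` span above counts as both a false positive and a false negative, which is one reason non-standard expressions in real-world records can depress F1 sharply.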

Submitted: 2025-11-24

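Copyright: The authors retain sole copyright in this paper; the preprint system holds only the right to preserve it permanently. No one may reuse it without permission.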

No.	Submission date	Identifier
2	2025-10-12	10.12201/bmr.202511.00077V2
1	2025-10-12	10.12201/bmr.202511.00077V1

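Cite as

Gai Yanrong, Zhang Yunqiu, Zhang Hui, Li Chencheng, Lu Junrui. Research on Quality Analysis and Governance Strategies of Real-World Chinese Electronic Medical Records Data for Knowledge Extraction. 2025. biomedRxiv.202511.00077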
