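盖彦蓉, 张云秋, 张慧, 李晨程, 卢浚睿. Research on Quality Analysis and Governance Strategies of Real-World Chinese Electronic Medical Records Data for Knowledge Extraction. 2025. biomedRxiv.202511.00077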
Research on Quality Analysis and Governance Strategies of Real-World Chinese Electronic Medical Records Data for Knowledge Extraction
Corresponding author: Zhang Yunqiu (张云秋), yunqiu@jlu.edu.cn
DOI: 10.12201/bmr.202511.00077
Abstract: Objective/Significance Knowledge extraction from real-world Chinese electronic medical records (EMRs) is currently constrained by a poor fit between the clinical meaning and the technical feasibility of annotation rules, low-quality source data, and lagging data governance. This study aims to ease these bottlenecks and explore a path for knowledge extraction from Chinese EMRs in real-world settings. Methods/Process Annotation rules covering the major entity and relation types were formulated, experiments were run on real-world EMRs with a BERT+BiLSTM+CRF model, and the key problems in EMR data governance were identified from the results. Results/Conclusion The model's F1-scores for entity recognition and relation recognition on real-world EMRs were approximately 0.62 and 0.36, respectively, markedly lower than on public datasets. The main causes rooted in the data itself are non-standard expressions, data sparsity, and terminological differences between departments. The main causes rooted in data governance are an imbalance between privacy protection and data utilization, the lack of full-process management, and the lack of quality checks before data enters the repository.
Keywords: Chinese Electronic Medical Record; Knowledge Extraction; Named Entity Recognition; Relation Extraction; Data Governance
Submitted: 2025-11-24
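The abstract reports F1-scores without stating the scoring protocol on this page. As a minimal sketch of how an entity-level F1 such as the reported 0.62 is commonly computed, the example below assumes BIO-tagged sequences, strict span matching (type and boundaries must both agree), and micro-averaging over all sentences. The function names and the tag types `DIS` and `SYM` are hypothetical and are not taken from the paper.

```python
from typing import List, Set, Tuple

# (entity type, start index, end index exclusive)
Span = Tuple[str, int, int]


def bio_to_spans(tags: List[str]) -> Set[Span]:
    """Collect entity spans from one BIO-tagged sequence.

    Assumption: an I- tag that does not continue a span of the same
    type is ignored rather than treated as the start of a new entity.
    """
    spans: Set[Span] = set()
    start, label = None, None
    for i, tag in enumerate(tags + ["O"]):  # sentinel closes a trailing span
        inside = tag.startswith("I-") and label == tag[2:]
        if not inside:
            if label is not None:
                spans.add((label, start, i))
            if tag.startswith("B-"):
                start, label = i, tag[2:]
            else:
                start, label = None, None
    return spans


def entity_f1(gold: List[List[str]], pred: List[List[str]]) -> float:
    """Micro-averaged F1 over exactly matching entity spans."""
    tp = n_gold = n_pred = 0
    for g_tags, p_tags in zip(gold, pred):
        g, p = bio_to_spans(g_tags), bio_to_spans(p_tags)
        tp += len(g & p)      # spans with identical type and boundaries
        n_gold += len(g)
        n_pred += len(p)
    if tp == 0:
        return 0.0
    precision, recall = tp / n_pred, tp / n_gold
    return 2 * precision * recall / (precision + recall)


# Hypothetical toy data: the prediction finds one of two gold entities.
gold = [["B-DIS", "I-DIS", "O", "B-SYM"]]
pred = [["B-DIS", "I-DIS", "O", "O"]]
score = entity_f1(gold, pred)
```

Under these assumptions a prediction with one wrong boundary counts as both a false positive and a false negative, which is one reason scores on noisy real-world text can fall well below those on curated public datasets.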
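Copyright notice: The authors retain sole copyright of this paper; the preprint system holds only the right to preserve it permanently. No one may reuse it without permission.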