Research on the Construction of a Question-Answer Corpus Dataset for Chinese Medical Knowledge Large Language Models

Corresponding author: LiXiaoying, li.xiaoying@imicams.ac.cn
DOI: 10.12201/bmr.202404.00002
Statement: This article is a preprint and has not been peer-reviewed. It reports new research that has yet to be evaluated and so should not be used to guide clinical practice.
    Abstract: Purpose/Significance: Given the specialized and complex nature of medical knowledge, the performance of large language models (LLMs) in medical question-answering tasks remains unsatisfactory. It is therefore essential to quantitatively assess these LLMs' performance in order to improve their accuracy on medical-domain questions. This paper focuses on constructing a Chinese medical knowledge corpus dataset to improve the accuracy and efficiency of LLMs in handling Chinese medical questions, with the aim of establishing a standardized evaluation benchmark for LLMs in the medical domain. Method/Process: This study developed Q&A datasets encompassing knowledge from Chinese medical papers, explanations of medical terminology, and supplementary questions drawn from the Chinese medical licensing examination as well as from open-source Chinese medical Q&A datasets. These datasets are intended for evaluating LLMs' medical knowledge coverage, comprehension, and generation capabilities. Result/Conclusion: The Chinese medical Q&A corpus datasets enrich the sources of existing datasets and promote objective, comprehensive quantitative evaluation of large models in the medical field. In the near future, additional data such as electronic medical records and data from online health communities will be used to expand this dataset. These efforts will offer stronger AI support for the Healthy China strategy.

    Key words: Large language models; Corpus dataset; Model evaluation

    Submit time: 10 April 2024

    Copyright: The copyright holder for this preprint is the author/funder, who has granted biomedRxiv a license to display the preprint in perpetuity.
  • ID | Submit time | Number
    2 | 2023-10-29 | 10.12201/bmr.202404.00002V2
    1 | 2023-10-29 | 10.12201/bmr.202404.00002V1
Get Citation

LvTingyu, LiXiaoying, LiuYuyang, DuJinhua, LiXinyi, LuoYan, Tangxiaoli, RenHuiling, LiuHui, YinHao. Research on the Construction of a Question-Answer Corpus Dataset for Chinese Medical Knowledge Large Language Models. 2024. biomedRxiv.202404.00002
