Lv Tingyu, Li Xiaoying, Liu Yuyang, Du Jinhua, Li Xinyi, Luo Yan, Tang Xiaoli, Ren Huiling, Liu Hui, Yin Hao. Research on the Construction of a Question-Answer Corpus Dataset for Chinese Medical Knowledge Large Language Models. 2024. biomedRxiv.202404.00002
Research on the Construction of a Question-Answer Corpus Dataset for Chinese Medical Knowledge Large Language Models
Corresponding author: Li Xiaoying, li.xiaoying@imicams.ac.cn
DOI: 10.12201/bmr.202404.00002
Abstract: Purpose/Significance: Given the specialized and complex nature of medical knowledge, the performance of large language models (LLMs) in medical question-answering tasks remains unsatisfactory. It is therefore essential to assess these LLMs' performance quantitatively in order to improve their accuracy in answering medical questions. This paper focuses on constructing a Chinese medical knowledge corpus dataset to improve the accuracy and efficiency of LLMs in handling Chinese medical questions, with the aim of establishing a standardized evaluation benchmark for LLMs in the medical domain. Method/Process: This study developed Q&A datasets covering knowledge from Chinese medical papers and explanations of medical terminology, supplemented with questions from the Chinese medical licensing examination and from open-source Chinese medical Q&A datasets. These datasets can be used to evaluate LLMs' medical knowledge coverage, comprehension, and generation capabilities. Result/Conclusion: The Chinese medical Q&A corpus datasets enrich the existing data sources and support objective, comprehensive quantitative evaluation of large models in the medical field. In the near future, additional data, such as electronic medical records and data from online health communities, will be used to expand this dataset. These efforts will offer stronger AI support for the Healthy China strategy.
Key words: Large language models; Corpus dataset; Model evaluation
Submit time: 10 April 2024
Copyright: The copyright holder for this preprint is the author/funder, who has granted biomedRxiv a license to display the preprint in perpetuity.