Visualization of Prescription Composition Similarity Based on the TF-IDF Algorithm (1)
Abstract: Objective To construct a data mining system for traditional Chinese medicine (TCM) prescriptions that visually presents prescription attributes and the similarity between prescriptions, providing a reference for prescription research and application. Methods A web crawler framework and manual entry were used to collect a number of classical prescriptions. A Chinese word segmentation tool and manual curation were used to split the prescription information into attributes: name, function, source, medicinal composition, dosage, dosage unit, processing method, contraindications and recommendations, and indications. A corpus vocabulary was then constructed. In a Python 3.5 environment, the TF-IDF algorithm was used to calculate the similarity between prescriptions, the results were verified against functions and indications, and d3.js was used for visual display. Results Word segmentation and manual curation yielded 7710 prescriptions of various types, comprising 8957 medicinal ingredients. The constructed system visualizes information such as similarity and prescription composition. Prescriptions with high similarity were also similar in function and indications. Conclusion The TCM prescription data mining system constructed in this study can intuitively display prescription information, the relationships between prescriptions and their medicinal ingredients, and the similarity between prescriptions.
Keywords: prescriptions; TF-IDF algorithm; similarity; visualization; TCM prescription data mining system
CLC number: R289.1; R2-05 Document code: A Article ID: 1005-5304(2019)07-0104-05
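Some TCM prescriptions contain very similar medicinal ingredients: their compositions differ only slightly and are largely the same overall. Such similar prescriptions share some underlying connection in their efficacy when treating a particular disease pattern or class of disease patterns. Finding, among all prescriptions, those similar to a given prescription can provide a multi-dimensional reference for medication [1]. Similarity analysis of prescriptions is therefore a good way to uncover these relationships. Current similarity models for TCM prescriptions analyze similarity mainly from two perspectives: composition and efficacy.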
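In this study, the TF-IDF (term frequency-inverse document frequency) algorithm was applied in a Python 3.5 environment to calculate prescription similarity. All prescriptions are treated as a single collection, the medicinal ingredients of each prescription are treated as keywords, and a vocabulary is constructed; after the TF-IDF values are computed, similarity is calculated from the coefficient matrix. On this basis, a TCM prescription data mining system was built to visually present prescription attributes and the similarity between prescriptions, providing a reference for prescription research and application.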
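The idea of treating each prescription as a document whose keywords are its ingredients can be illustrated with a minimal sketch. This is not the authors' implementation: the prescription and ingredient names are hypothetical placeholders, the smoothed IDF variant is one common choice among several, and cosine similarity is assumed as the similarity measure because the text does not name one.

```python
import math
from collections import Counter

# Hypothetical toy data: placeholder names, not taken from the study's corpus.
prescriptions = {
    "formula_a": ["herb_1", "herb_2", "herb_3", "herb_4"],
    "formula_b": ["herb_1", "herb_2", "herb_3", "herb_5"],
    "formula_c": ["herb_6", "herb_7"],
}


def tfidf_vectors(docs):
    """Return (vocabulary, {name: TF-IDF vector}) for ingredient lists."""
    vocab = sorted({herb for herbs in docs.values() for herb in herbs})
    n_docs = len(docs)
    # Document frequency: number of prescriptions containing each ingredient.
    df = Counter(herb for herbs in docs.values() for herb in set(herbs))
    # Smoothed IDF (an assumption; other variants exist).
    idf = {h: math.log((1 + n_docs) / (1 + df[h])) + 1 for h in vocab}
    vectors = {}
    for name, herbs in docs.items():
        counts = Counter(herbs)
        total = len(herbs) or 1  # guard against an empty ingredient list
        vectors[name] = [counts[h] / total * idf[h] for h in vocab]
    return vocab, vectors


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


vocab, vectors = tfidf_vectors(prescriptions)
sim_ab = cosine_similarity(vectors["formula_a"], vectors["formula_b"])
sim_ac = cosine_similarity(vectors["formula_a"], vectors["formula_c"])
print("a vs b:", round(sim_ab, 3))
print("a vs c:", round(sim_ac, 3))
```

In this sketch, two prescriptions that share most ingredients score higher than two with no ingredients in common, which score zero. Computing the measure for every pair gives a symmetric similarity matrix of the kind a visualization layer could consume.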
(Authors: 郭文龙, 罗熊, 姜惠娟, 谢永红, 陈茂建; source: http://www.100md.com)
1 Prescription similarity calculation method