[1] LIANG Jingdong, CUI Bingjian, JIANG Haiyan, et al. Sentence similarity computing based on word2vec and LSTM and its application in rice FAQ question-answering system[J]. Journal of Nanjing Agricultural University, 2018, 41(5): 946-953. [doi:10.7685/jnau.201801055]

Sentence similarity computing based on word2vec and LSTM and its application in rice FAQ question-answering system

Journal of Nanjing Agricultural University [ISSN:1000-2030/CN:32-1148/S]

Volume:
41
Issue:
2018, No. 5
Pages:
946-953
Section:
Publication date:
2018-09-20

Article Information

Title:
Sentence similarity computing based on word2vec and LSTM and its application in rice FAQ question-answering system
Author(s):
LIANG Jingdong (1), CUI Bingjian (1), JIANG Haiyan (1, 2), SHEN Yi (1), XIE Yuancheng (1)
1. College of Information Science and Technology, Nanjing Agricultural University, Nanjing 210095, China;
2. National Engineering and Technology Center for Information Agriculture, Nanjing Agricultural University, Nanjing 210095, China
Keywords:
rice; question-answering system; frequently asked question; word2vec; long short-term memory; deep learning
Classification number:
S126
DOI:
10.7685/jnau.201801055
Abstract:
[Objectives] A rice FAQ (frequently asked question) question-answering system answers questions that farmers encounter while growing rice. Its core is question similarity computation, which matches a user's question against the questions in the FAQ. Because traditional sentence similarity algorithms generally have low accuracy, this study uses deep learning to compute question similarity and thereby improve the accuracy of the system's answers. [Methods] A sentence similarity model based on word2vec and an LSTM (long short-term memory) neural network was built, consisting of an input layer, an embedding layer, an LSTM layer, a fully connected layer and an output layer. The 3,007 questions in the rice FAQ were classified and combined into 32,072 question pairs, and each pair was labeled as similar or not to form the training and test data. A word2vec model trained on an agricultural-domain corpus was used to map the training data into vectors, which served as input for training the sentence similarity model. [Results] The model was validated on the test set and compared with three other sentence similarity methods: a method based on HowNet, a method based on the cosine distance of word vectors, and a method based on word2vec and a CNN (convolutional neural network). A sample check of the computed similarities showed that this model's results agreed better with human intuition. In terms of accuracy and ROC (receiver operating characteristic) curves, the model was also clearly superior to the other three methods, reaching an accuracy of 93.1%. [Conclusions] The model built in this study significantly improves the accuracy of sentence similarity computation. The rice FAQ question-answering system developed on this model can accurately match users' questions with the questions in the rice FAQ and helps farmers better solve problems encountered in rice production.
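To make the matching step concrete, the following is a minimal sketch of one of the baselines named in the abstract, cosine similarity over word vectors, applied to FAQ matching. It is not the authors' LSTM model. The toy vectors, the English tokens, the choice to average word vectors into a sentence vector, and the function names (`sentence_vector`, `cosine`, `best_match`) are all assumptions made for illustration; a real system would segment Chinese questions into words and load vectors from a trained word2vec model.

```python
import math

# Hypothetical 3-dimensional toy vectors standing in for word2vec embeddings.
TOY_VECTORS = {
    "rice": [1.0, 0.0, 0.0],
    "seedling": [0.8, 0.2, 0.0],
    "blast": [0.0, 1.0, 0.0],
    "disease": [0.1, 0.9, 0.0],
    "fertilizer": [0.0, 0.0, 1.0],
    "nitrogen": [0.0, 0.1, 0.9],
}


def sentence_vector(tokens, vectors):
    """Average the vectors of the known tokens; return None if none are known."""
    known = [vectors[t] for t in tokens if t in vectors]
    if not known:
        return None
    dim = len(known[0])
    return [sum(v[i] for v in known) / len(known) for i in range(dim)]


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def best_match(question, faq_questions, vectors):
    """Return the FAQ question most similar to `question`, with its score."""
    q_vec = sentence_vector(question, vectors)
    best, best_score = None, -1.0
    if q_vec is None:
        return best, best_score
    for candidate in faq_questions:
        c_vec = sentence_vector(candidate, vectors)
        if c_vec is None:
            continue  # skip FAQ entries with no known words
        score = cosine(q_vec, c_vec)
        if score > best_score:
            best, best_score = candidate, score
    return best, best_score


# Hypothetical pre-tokenized FAQ questions and user question.
faq = [["rice", "blast", "disease"], ["nitrogen", "fertilizer"]]
match, score = best_match(["rice", "disease"], faq, TOY_VECTORS)
print(match, round(score, 3))
```

In the system described above, the similarity function would instead be the trained word2vec and LSTM model, with the same outer loop selecting the FAQ question that scores highest against the user's question.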



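Memo

Received: 2018-01-31.
Funding: National Key Research and Development Program of China (2016YFD0300607); Fundamental Research Funds for the Central Universities, Key Project of Independent Innovation (KYZ201550, KYZ201548).
About the author: LIANG Jingdong, associate professor; research interests: data mining and agricultural informatization. E-mail: ljd@njau.edu.cn.
Corresponding author: XIE Yuancheng, associate professor, PhD; research interests: pattern recognition and bioinformatics. E-mail: xieych@163.com.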