Automatic Keyword Extraction from Documents Using Conditional Random Fields

Zhang, Chengzhi. Automatic Keyword Extraction from Documents Using Conditional Random Fields. Journal of Computational Information Systems, 2008, vol. 4, n. 3, pp. 1169-1180. [Journal article (Paginated)]



English abstract

Keywords are a subset of the words or phrases in a document that describe its meaning. Many text mining applications can take advantage of them. Unfortunately, a large portion of documents still have no keywords assigned, and manual assignment of high-quality keywords is expensive, time-consuming, and error-prone. Many algorithms and systems have therefore been proposed to help people extract keywords automatically. The Conditional Random Fields (CRF) model is a state-of-the-art sequence labeling method that can use the features of documents more fully and effectively. Keyword extraction, in turn, can be treated as a string labeling task. In this paper, keyword extraction based on CRF is proposed and implemented. As far as we know, using the CRF model for keyword extraction has not been investigated previously. Experimental results show that the CRF model outperforms other machine learning methods, such as support vector machines and multiple linear regression, in the keyword extraction task.
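
To make the labeling view concrete, the sketch below turns a tokenized sentence and a set of known keyphrases into one label per token, which is the kind of training data a sequence labeler such as a CRF consumes. The B/I/O tag set, the function name label_tokens, and the sample sentence are assumptions made for illustration; the abstract does not state which tag set or features the paper actually uses.

```python
from typing import List, Sequence


def label_tokens(tokens: Sequence[str],
                 keyphrases: Sequence[Sequence[str]]) -> List[str]:
    """Tag each token as B (begins a keyphrase), I (inside one) or O (outside).

    Hypothetical helper: the B/I/O scheme is an assumption, not the
    labeling scheme described in the paper.
    """
    lowered = [t.lower() for t in tokens]
    labels = ["O"] * len(tokens)
    # Match longer phrases first so shorter ones cannot claim their tokens.
    for phrase in sorted(keyphrases, key=len, reverse=True):
        target = [w.lower() for w in phrase]
        n = len(target)
        if n == 0:
            continue
        for start in range(len(lowered) - n + 1):
            span_is_free = all(lab == "O" for lab in labels[start:start + n])
            if span_is_free and lowered[start:start + n] == target:
                labels[start] = "B"
                for i in range(start + 1, start + n):
                    labels[i] = "I"
    return labels


tokens = "Conditional random fields are used for keyword extraction".split()
keyphrases = [["conditional", "random", "fields"], ["keyword", "extraction"]]
labels = label_tokens(tokens, keyphrases)
# labels == ["B", "I", "I", "O", "O", "O", "B", "I"]
```

A trained sequence model would predict such labels for unseen documents, and the keyphrases are read back by joining each B token with the I tokens that follow it.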

Item type: Journal article (Paginated)
Keywords: Keywords Extraction; Conditional Random Fields; Automatic Indexing; Machine Learning
Subjects: L. Information technology and library technology > LL. Automated language processing.
Depositing user: Chengzhi Zhang
Date deposited: 23 Sep 2008
Last modified: 02 Oct 2014 12:12



