Examining learning algorithms for text classification in digital libraries

Fahmi, Ismail. Examining learning algorithms for text classification in digital libraries, 2004. Master's thesis, University of Groningen, Netherlands. [Thesis]



English abstract

Information presentation in a digital library plays an important role, especially in improving the usability of collections and helping users get started with a collection. One approach is to provide an overview through large topical category hierarchies associated with the documents of a collection. But as the amount of information grows, this manual classification becomes a new problem for users: navigating the hierarchy can be a time-consuming and frustrating process. In this master's thesis, we examine the performance of machine learning algorithms for automatic text classification. We examine three learning algorithms, namely ID3, Instance-Based Learning, and Naive Bayes, for classifying documents according to their category hierarchies. We focus on effectiveness measures such as recall, precision, the F1-measure, error, and the learning curve when learning from a manually classified metadata collection from the Indonesian Digital Library Network (IndonesiaDLN), and we compare the results with an examination of the Reuters-21578 dataset. We summarize which algorithm is most suitable for the digital library collection and how the algorithms perform on these datasets.
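The effectiveness measures named in the abstract can be illustrated with a minimal sketch. It uses the standard per-category definitions of precision, recall, and F1, plus the overall error rate. The abstract does not say how the thesis averages scores across categories, so none of that is shown here. The function names and the sample labels are hypothetical and chosen only for illustration.

```python
def precision_recall_f1(true_labels, predicted_labels, category):
    """Return (precision, recall, F1) for a single category.

    Standard definitions; how the thesis aggregates over categories
    is not stated in the abstract and is not assumed here.
    """
    pairs = list(zip(true_labels, predicted_labels))
    tp = sum(1 for t, p in pairs if t == category and p == category)
    fp = sum(1 for t, p in pairs if t != category and p == category)
    fn = sum(1 for t, p in pairs if t == category and p != category)

    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


def error_rate(true_labels, predicted_labels):
    """Fraction of documents assigned the wrong category."""
    wrong = sum(1 for t, p in zip(true_labels, predicted_labels) if t != p)
    return wrong / len(true_labels)


# Hypothetical labels for five documents (illustrative only).
true_labels = ["law", "law", "health", "health", "law"]
predicted_labels = ["law", "health", "health", "law", "law"]

p, r, f1 = precision_recall_f1(true_labels, predicted_labels, "law")
err = error_rate(true_labels, predicted_labels)
print(p, r, f1, err)
```

A learning curve, in this setting, would be obtained by recomputing such scores while training on progressively larger subsets of the collection.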

Item type: Thesis (UNSPECIFIED)
Keywords: dataset; algorithms; digital library; software
Subjects: A. Theoretical and general aspects of libraries and information.
Depositing user: Imam Budi Prasetiawan
Date deposited: 14 Apr 2007
Last modified: 02 Oct 2014 12:07
URI: http://hdl.handle.net/10760/9315



