Üstverinin Tam-Metin Bilgi Erişim Performansı Üzerindeki Etkisi: Küçük Ölçekli Türkçe Külliyat Üzerinde Deneysel Bir Araştırma / Impact of Metadata on Full-text Information Retrieval Performance: An Experimental Research on a Small Scale Turkish Corpus

Çapkın, Çağdaş Üstverinin Tam-Metin Bilgi Erişim Performansı Üzerindeki Etkisi: Küçük Ölçekli Türkçe Külliyat Üzerinde Deneysel Bir Araştırma / Impact of Metadata on Full-text Information Retrieval Performance: An Experimental Research on a Small Scale Turkish Corpus. Türk Kütüphaneciliği, 2016, vol. 30, n. 4, pp. 678-701. [Journal article (Paginated)]

English abstract

Information institutions use text-based information retrieval systems to store, index, and retrieve metadata, full text, or both (hybrid content). The aim of this research was to evaluate the impact of these content types on information retrieval performance. For this purpose, metadata (MIR), full-text (FIR), and hybrid (HIR) content information retrieval systems were developed with the default Lucene information retrieval model for a small-scale Turkish corpus. To evaluate the performance of these three systems, "precision-recall" and "normalized recall" tests were conducted. The experimental findings showed no significant difference between MIR and FIR in mean average precision (MAP). The MAP of HIR, on the other hand, was significantly higher than that of both MIR and FIR. When information retrieval performance was evaluated from a user-centered perspective, the normalized recall of MIR and HIR was significantly higher than that of FIR. Additionally, there were no significant differences between the systems in the mean number of relevant documents retrieved. Processing different content types such as metadata and full text brings both advantages and disadvantages for information retrieval systems in terms of term management. Hybrid content processing (HIR) brought the advantages together and improved information retrieval performance. [There is an extended English summary at the end of the article.]
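
As a reading aid for the two measures named in the abstract, here is a minimal Python sketch of mean average precision and of the classical rank-based normalized recall. The function names and the toy document identifiers are hypothetical, and the normalized recall formula shown is the textbook rank-based one, which may differ from the variant actually used in the article.

```python
from typing import Iterable, Sequence, Set, Tuple


def average_precision(ranking: Sequence[str], relevant: Set[str]) -> float:
    """Average of precision@k over the ranks k where a relevant document appears."""
    if not relevant:
        return 0.0
    hits = 0
    precision_sum = 0.0
    for k, doc_id in enumerate(ranking, start=1):
        if doc_id in relevant:
            hits += 1
            precision_sum += hits / k
    # Relevant documents that were never retrieved contribute zero.
    return precision_sum / len(relevant)


def mean_average_precision(runs: Iterable[Tuple[Sequence[str], Set[str]]]) -> float:
    """Mean of the per-query average precision; one (ranking, relevant) pair per query."""
    runs = list(runs)
    return sum(average_precision(r, rel) for r, rel in runs) / len(runs)


def normalized_recall(ranking: Sequence[str], relevant: Set[str]) -> float:
    """Rank-based normalized recall: 1.0 for an ideal ranking, 0.0 for the worst.

    Assumption: `ranking` orders the whole collection, so every relevant
    document has a rank.
    """
    n_docs = len(ranking)
    ranks = [k for k, doc_id in enumerate(ranking, start=1) if doc_id in relevant]
    n_rel = len(ranks)
    if n_rel == 0 or n_rel == n_docs:
        raise ValueError("normalized recall is undefined for this input")
    ideal_rank_sum = n_rel * (n_rel + 1) // 2  # ranks 1..n_rel
    return 1.0 - (sum(ranks) - ideal_rank_sum) / (n_rel * (n_docs - n_rel))


# Toy example: four documents, two of them relevant.
toy_ranking = ["d1", "d2", "d3", "d4"]
toy_relevant = {"d1", "d3"}
toy_ap = average_precision(toy_ranking, toy_relevant)
toy_rnorm = normalized_recall(toy_ranking, toy_relevant)
```

Average precision rewards systems that place relevant documents early in the list, while normalized recall compares the observed ranks of the relevant documents with the best and worst possible orderings of the collection.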

Turkish abstract

Bilgi kurumları üstveri, tam-metin veya hem üstveri hem de tam-metin (melez) içerikleri depolamak, dizinlemek ve eriştirmek için metin tabanlı bilgi erişim sistemleri kullanmaktadır. Araştırmanın amacı, bu içeriklerin bilgi erişim performansı üzerindeki etkisini değerlendirmektir. Bu amaçla, küçük ölçekli bir Türkçe külliyat için varsayılan Lucene bilgi erişim modelini kullanan üstveri (ÜBES), tam-metin (TBES) ve melez (MBES) içerik bilgi erişim sistemleri geliştirilmiştir. Bu üç sistemin performansını değerlendirmek için "duyarlılık - anma" ve "normalize sıralama" testleri yapılmıştır. Deneysel bulgular, ÜBES ve TBES arasında ortalama duyarlılık performansında anlamlı bir fark olmadığını göstermiştir. Diğer taraftan, MBES’in ortalama duyarlılık performansı ÜBES ve TBES’ten anlamlı olarak yüksektir. Bilgi erişim performansı kullanıcı-merkezli olarak değerlendirildiğinde, ÜBES ve MBES’in normalize sıralama performansları TBES’e göre anlamlı olarak yüksektir. Ayrıca, üç bilgi erişim sisteminin eriştiği ilgili doküman ortalamaları arasında anlamlı bir farka ulaşılamamıştır. Bilgi erişim sistemlerinde üstveri ve tam-metin gibi farklı türlerdeki içeriklerin işlenmesinde terim yönetimi bakımından bazı avantajlar ve dezavantajlar bulunmaktadır. Melez içerik işleme (MBES), avantajları bir araya getirmiş ve bilgi erişim performansını artırmıştır.

Item type: Journal article (Paginated)
Keywords: Bilgi erişim; dizinleme; otomatik dizinleme; üstveri; performans değerlendirme; Türk Kütüphaneciliği; Apache Lucene; information retrieval; indexing; automatic indexing; metadata; performance evaluation; Turkish Librarianship
Subjects: I. Information treatment for information services > IC. Index languages, processes and schemes.
L. Information technology and library technology > LM. Automatic text retrieval.
L. Information technology and library technology > LR. OPAC systems.
L. Information technology and library technology > LS. Search engines.
Depositing user: Dr. Çağdaş ÇAPKIN
Date deposited: 18 Jan 2017 20:19
Last modified: 18 Jan 2017 20:19
URI: http://hdl.handle.net/10760/30523
