A three-year study on the freshness of Web search engine databases

Lewandowski, Dirk. A three-year study on the freshness of Web search engine databases. 2008 [Preprint]

English abstract

This paper deals with one aspect of the index quality of search engines: index freshness. The purpose is to analyse the update strategies of the major Web search engines Google, Yahoo, and MSN/Live.com. We tested how these engines handle the updates of 40 daily updated pages and 30 irregularly updated pages. We used data from a time span of six weeks in the years 2005, 2006, and 2007. We found that the best search engine in terms of up-to-dateness changes over the years and that none of the engines has an ideal solution for index freshness. Frequency distributions for the pages' ages are skewed, which means that search engines do differentiate between often- and seldom-updated pages. This is confirmed by the difference between the average ages of daily updated pages and our control group of pages. Indexing patterns are often irregular, and there seems to be no clear policy regarding when to revisit Web pages. A major problem identified in our research is the delay in making crawled pages available for searching, which differs from one engine to another.
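The age statistics mentioned in the abstract can be illustrated with a minimal sketch. It is not the paper's procedure: the function names, the sample dates, and the one-day threshold for calling a copy "fresh" are assumptions made for illustration only. The age of an indexed copy is taken here as the number of days between the day the engine was checked and the date of the page version the engine held.

```python
from datetime import date
from statistics import mean, median
from typing import Dict, List


def age_in_days(checked_on: date, indexed_version: date) -> int:
    """Days between the check and the page version held by the engine."""
    return (checked_on - indexed_version).days


def summarize(ages: List[int]) -> Dict[str, float]:
    """Mean age, median age, and share of copies at most one day old."""
    fresh = sum(1 for a in ages if a <= 1)  # assumed threshold, not from the paper
    return {
        "mean": mean(ages),
        "median": median(ages),
        "fresh_share": fresh / len(ages),
    }


# Hypothetical observations for one engine: (day of check, date of indexed version).
observations = [
    (date(2007, 2, 15), date(2007, 2, 15)),
    (date(2007, 2, 15), date(2007, 2, 14)),
    (date(2007, 2, 15), date(2007, 2, 14)),
    (date(2007, 2, 15), date(2007, 2, 12)),
    (date(2007, 2, 15), date(2007, 1, 26)),
]

ages = [age_in_days(checked, version) for checked, version in observations]
summary = summarize(ages)
print(ages, summary)
```

In this made-up sample the mean age is well above the median, which is the kind of skewed distribution the abstract describes: most copies are recent, while a few old ones pull the average up.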

Item type: Preprint
Keywords: search engines; online information retrieval; World Wide Web; index freshness
Subjects: H. Information sources, supports, channels. > HQ. Web pages.
L. Information technology and library technology > LS. Search engines.
Depositing user: Dirk Lewandowski
Date deposited: 19 Jan 2008
Last modified: 02 Oct 2014 12:10
URI: http://hdl.handle.net/10760/11024

