Avaluació de processos de reconeixement d’entitats (NER) com a complement a interfícies de recuperació d’informació en dipòsits digitals complexos

Vidal-Santos, Gerard. Avaluació de processos de reconeixement d’entitats (NER) com a complement a interfícies de recuperació d’informació en dipòsits digitals complexos. 2018. BA thesis, Universitat de Barcelona. [Thesis]

This is the latest version of this item.

Text (Treball final de grau)
VidalSantos_TFG_2018.pdf - Accepted version
Available under License Creative Commons Attribution.

English abstract

The aim of the study is to explore the use of unsupervised Named-Entity Recognition (NER) processes to generate descriptive metadata capable of assisting information retrieval interfaces in large-scale digital collections and of supporting the construction of more diverse knowledge representation models in academic libraries. To this end, the study reviews selected experiences and canonical literature on automated subject heading creation in library and archive environments, as a counterweight to the overuse of search engines as the main access points for retrieving assets in catalogs and digital collections. It focuses on the guidelines established by two articles that address this task from complementary points of view:
• van Hooland S, de Wilde M, Verborgh R, Steiner T, Van de Walle R. Exploring entity recognition and disambiguation for cultural heritage collections. Digital Scholarship in the Humanities. 2013 Nov 1;30(2):262–79.
• Zeng M. Using a Semantic Analysis Tool to Generate Subject Access Points: A Study Using Panofsky’s Theory and Two Research Samples. Knowledge Organization. 2014 Jan 1;440–51.
The first provides the tools to generate named entities from large samples of text and establishes the parameters for assessing the suitability of these entities at a quantitative level. The second provides guidelines for analyzing the quality of those results through a three-layer framework (identification, description, interpretation) based on Erwin Panofsky’s work on the analysis and interpretation of pictorial works. On this premise, a work environment is built to extract and analyze the entities detected by DBpedia Spotlight (the NER service used for extraction) in a random collection of bibliographic records drawn from a thesis aggregator (Open Access Theses and Dissertations).
The results show that these processes greatly increase the number of descriptive access points, allowing users to browse more effectively through better-contextualized records when the extracted entities are combined with the keywords already indexed. The entities nevertheless lack the consistency needed to pass the quality filter established in the evaluation table. This setback only partly limits the possibility of improving the visibility of records in large collections by these means: if the logical constructions offered by the semantic base behind the extraction service are taken into account in iterative cataloging processes, they provide a cost-effective way of building more diverse knowledge graphs that connect manually or automatically indexed keywords to other nodes in the linked open data (LOD) cloud.
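The extraction and combination steps described in the abstract can be illustrated with a minimal Python sketch. It is not the work environment built in the thesis: the endpoint URL (the public DBpedia Spotlight demo service), the helper names `build_request`, `extract_entities` and `new_access_points`, the confidence value and the sample record are all assumptions made for illustration. The sample payload is invented and only mimics the `Resources` / `@URI` shape of Spotlight's JSON output.

```python
from urllib.parse import urlencode
from urllib.request import Request

# Assumption: the public demo endpoint; the thesis does not state which deployment was used.
SPOTLIGHT_URL = "https://api.dbpedia-spotlight.org/en/annotate"


def build_request(text, confidence=0.5):
    """Prepare (but do not send) an annotation request for one bibliographic record."""
    query = urlencode({"text": text, "confidence": confidence})
    return Request(f"{SPOTLIGHT_URL}?{query}", headers={"Accept": "application/json"})


def extract_entities(payload):
    """Return the DBpedia URIs in a Spotlight-style response, deduplicated in order."""
    uris = []
    for resource in payload.get("Resources", []):
        uri = resource["@URI"]
        if uri not in uris:
            uris.append(uri)
    return uris


def new_access_points(keywords, entity_uris):
    """Labels of extracted entities that are not already among the indexed keywords."""
    known = {keyword.strip().lower() for keyword in keywords}
    fresh = []
    for uri in entity_uris:
        # Derive a readable label from the last path segment of the URI.
        label = uri.rsplit("/", 1)[-1].replace("_", " ")
        if label.lower() not in known:
            fresh.append(label)
    return fresh


# Hypothetical response for an invented record (not data from the thesis).
SAMPLE = {
    "Resources": [
        {"@URI": "http://dbpedia.org/resource/Catalan_cuisine", "@surfaceForm": "Catalan cuisine"},
        {"@URI": "http://dbpedia.org/resource/Barcelona", "@surfaceForm": "Barcelona"},
        {"@URI": "http://dbpedia.org/resource/Barcelona", "@surfaceForm": "Barcelona"},
    ]
}

if __name__ == "__main__":
    uris = extract_entities(SAMPLE)
    print(new_access_points(["Gastronomy", "Barcelona"], uris))
```

In a live setting the prepared request would be sent with `urllib.request.urlopen` and the body decoded with `json.load`; the sketch stops short of that so it runs without network access. Counting the labels returned by `new_access_points` per record gives one simple quantitative measure of the added access points.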

Catalan abstract (English translation)

The aim of this study is the automatic creation of subject headings by means of entity recognition techniques (Named-Entity Recognition, NER) over a set of bibliographic records, extracted from a doctoral thesis aggregator, that can be related directly or indirectly to the world of cooking, in order to determine the validity of those headings in supporting the development of robust knowledge representation models on platforms that aggregate academic content. To this end, the study selectively reviews the existing literature on experiences with this kind of technology in library and archival settings, focusing especially on the guidelines established by two articles that address the task from complementary points of view:
• The first, Exploring entity recognition and disambiguation for cultural heritage collections (van Hooland et al., 2013), proposes evaluating three entity extraction services by means of a tool built specifically to run them in a controlled environment.
• The second, Using a Semantic Analysis Tool to Generate Subject Access Points: A Study Using Panofsky's Theory and Two Research Samples (Zeng, 2014), sets out guidelines, based on three levels (identification, description, interpretation), for analyzing the results that this kind of processing generates.
Following the guidelines set by this literature, a work environment is built to extract and analyze the entities detected by DBpedia Spotlight (the NER service used for extraction) in the set of bibliographic records.
The results show that, quantitatively, the tool surfaces a large number of descriptors (personal or corporate names, events, geographic locations and subjects) that contextualize the records better when combined with the keywords already indexed, even though the entity generation process lacks the consistency needed to pass the quality filter established in the evaluation table. This setback, however, only partly constrains the possibility of improving the visibility of records in retrieval processes. The extraction service is powered by a semantic database built from Wikipedia, which allows logical constructions over the nodes related to each extracted entity. When this kind of technical process is coordinated with that semantic base, enriching a record's context from the Linked Open Data (LOD) cloud can establish a permanent point of contact for expanding filtering and discovery options in complex digital collections and, at the same time, allow critical review of previously indexed materials to improve their user experience.
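The descriptor groups named in the abstract (personal or corporate names, events, geographic locations and subjects) could be derived from the type information that DBpedia Spotlight attaches to each entity. The following Python sketch shows one way to do so. The mapping from DBpedia ontology classes to descriptor groups, the helper names `categorize` and `group_descriptors`, and the sample entities are assumptions for illustration, not the classification used in the thesis.

```python
from collections import defaultdict

# Assumed mapping from DBpedia ontology classes to the abstract's descriptor groups.
# Order matters: the first matching class decides the group.
CATEGORY_BY_TYPE = [
    ("DBpedia:Person", "personal name"),
    ("DBpedia:Organisation", "corporate name"),
    ("DBpedia:Event", "event"),
    ("DBpedia:Place", "geographic location"),
]


def categorize(resource):
    """Assign one descriptor group to a Spotlight-style resource dictionary."""
    # "@types" is treated as a comma-separated string; it may be empty or missing.
    types = {t.strip() for t in resource.get("@types", "").split(",") if t.strip()}
    for dbpedia_type, category in CATEGORY_BY_TYPE:
        if dbpedia_type in types:
            return category
    # Untyped entities are treated as plain subjects (an assumption of this sketch).
    return "subject"


def group_descriptors(resources):
    """Group surface forms by descriptor category."""
    groups = defaultdict(list)
    for resource in resources:
        groups[categorize(resource)].append(resource["@surfaceForm"])
    return dict(groups)


# Hypothetical entities for an invented record (not data from the thesis).
SAMPLE_RESOURCES = [
    {"@surfaceForm": "Ferran Adrià", "@types": "DBpedia:Agent,DBpedia:Person,Schema:Person"},
    {"@surfaceForm": "Barcelona", "@types": "DBpedia:Place,DBpedia:City,Schema:Place"},
    {"@surfaceForm": "molecular gastronomy", "@types": ""},
]

if __name__ == "__main__":
    print(group_descriptors(SAMPLE_RESOURCES))
```

Grouping by category in this way makes it easier to offer each group as a separate filter in a retrieval interface, and to inspect which groups fail the quality criteria most often.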

Item type: Thesis (UNSPECIFIED)
Keywords: Automatic analysis; Automatic indexing; Information retrieval; Linked Open Data; LOD; Anàlisi automàtica; Indexació automàtica; Recuperació d'informació.
Subjects: I. Information treatment for information services > IL. Semantic web
L. Information technology and library technology > LL. Automated language processing.
L. Information technology and library technology > LM. Automatic text retrieval.
Depositing user: Gerard Vidal Santos
Date deposited: 04 Mar 2019 19:36
Last modified: 04 Mar 2019 19:36
URI: http://hdl.handle.net/10760/33692

References

CLARK, Jason A., 2008. Making Patron Data Work Harder: User Search Terms as Access Points? Code4Lib Journal. 1 June 2008. No. 3, p. 78. http://journal.code4lib.org/articles/11355

DE WILDE, Max and HENGCHEN, Simon, 2017. Semantic Enrichment of a Multilingual Archive with Linked Open Data. [online]. 2017. [Accessed 16 April 2018]. Available from: https://helda.helsinki.fi/handle/10138/233900

DRABINSKI, Emily, 2013. Queering the Catalog: Queer Theory and the Politics of Correction. Brooklyn Library Faculty Publications [online]. 1 January 2013. Available from: https://digitalcommons.liu.edu/brooklyn_libfacpubs/9

GARDNER, Sue Ann, 2012. Cresting toward the Sea Change. Library Resources & Technical Services [online]. 1 April 2012. Vol. 56, no. 2, p. 64–79. [Accessed 22 April 2018]. DOI 10/gdcck5. Available from: https://journals.ala.org/lrts/article/view/5565

GHAPHERY, Jimmy, OWENS, Emily, COGHILL, Donna, GARIEPY, Laura, HODGE, Megan, MCNULTY, Thomas and WHITE, Erin, 2016. Building Bridges with Logs: Collaborative Conversations about Discovery across Library Departments. The Code4Lib Journal [online]. 25 April 2016. No. 32. [Accessed 2 February 2018]. Available from: http://journal.code4lib.org/articles/11355

GRACY, Karen F., 2015. Archival description and linked data: a preliminary study of opportunities and implementation challenges. Archival Science [online]. September 2015. Vol. 15, no. 3, p. 239–294. [Accessed 22 April 2018]. DOI 10.1007/s10502-014-9216-2. Available from: http://link.springer.com/10.1007/s10502-014-9216-2

HEARST, Marti, 2011. User interfaces for search. In: Modern Information Retrieval [online]. 2. New York: Addison Wesley. p. 21–55. ISBN 978-0-321-41691-9. Available from: http://grupoweb.upf.edu/mir2ed/pdf/chapter2.pdf

KITCHIN, Rob, 2017. Thinking critically about and researching algorithms. Information, Communication & Society. 2 January 2017. Vol. 20, no. 1, p. 14–29. DOI 10/gc3hsj.

MEDELYAN, Olena, MILNE, David, LEGG, Catherine and WITTEN, Ian H., 2009. Mining meaning from Wikipedia. International Journal of Human-Computer Studies [online]. September 2009. Vol. 67, no. 9, p. 716–754. [Accessed 18 February 2018]. DOI 10/bfnk2v. Available from: http://linkinghub.elsevier.com/retrieve/pii/S1071581909000561

NADEAU, David and SEKINE, Satoshi, 2009. A survey of named entity recognition and classification. In: Named Entities: Recognition, classification and use [online]. Amsterdam ; Philadelphia: John Benjamins Pub. Company. p. 3–28. 19. [Accessed 15 May 2018]. ISBN 978-90-272-8922-3. Available from: https://nlp.cs.nyu.edu/sekine/papers/li07.pdf

NOBLE, Safiya Umoja, 2018. Algorithms of oppression: how search engines reinforce racism. ISBN 978-1-4798-3724-3.

REIDSMA, Matthew, [no date]. Algorithmic Bias in Library Discovery Systems. [online]. [Accessed 17 February 2018]. Available from: https://matthew.reidsrow.com/articles/173

SANJAY, Desale and KUMBHAR, Rajendra, 2013. Research on Automatic Classification of Documents in Library Environment: A Literature Review. Knowledge Organization. 1 January 2013. Vol. 40, p. 295. Available from: https://www.researchgate.net/publication/268505273_Research_on_Automatic_Classifcation_of_Documents_in_Library_Environment_A_Literature_Review

SADLER, Bess and BOURG, Chris, 2015. Feminism and the Future of Library Discovery. The Code4Lib Journal [online]. 15 April 2015. No. 28. [Accessed 1 February 2018]. Available from: http://journal.code4lib.org/articles/10425

SAWSAA, Ahlam and JOAN, Lu, 2014. Using Natural Language Programming (NLP) Technology to Model Domain Ontology OTO by Extracting Occupational Therapy Concepts. Knowledge Organization [online]. 2014. Vol. 46, no. 6, p. 452–464. [Accessed 9 May 2018]. Available from: http://eprints.hud.ac.uk/id/eprint/24354/

SMIRAGLIA, Richard P. and CAI, Xin, 2017. Tracking the Evolution of Clustering, Machine Learning, Automatic Indexing and Automatic Classification in Knowledge Organization. Knowledge Organization : KO; Wuerzburg [online]. 2017. Vol. 44, no. 3. [Accessed 10 December 2017]. Available from: https://search.proquest.com/lisa/docview/1927133404/CB9C71BDF158428EPQ/1

PAZ, Anita, 2013. In cerca del significato: la parola scritta nell’epoca di Google. [online]. 1 July 2013. No. 2. [Accessed 9 February 2018]. DOI 10/gcwfgj. Available from: https://www.jlis.it/article/view/8798

SMITH-YOSHIMURA, Karen, 2018. Are distributed models for vocabulary maintenance viable? – Hanging Together. [online]. 2018. [Accessed 16 April 2018]. Available from: http://hangingtogether.org/?p=6672

TRAMULLAS, Jesús, [2015]. Hispana: una revisión crítica | blok de bid. Blok de BiD [online]. [Accessed 29 April 2018]. Available from: http://www.ub.edu/blokdebid/ca/node/593

VÁLLEZ, Mari, PEDRAZA-JIMÉNEZ, Rafael, CODINA, Lluís, BLANCO, Saúl and ROVIRA, Cristòfol, 2015. Updating controlled vocabularies by analysing query logs. Online Information Review [online]. 9 November 2015. Vol. 39, no. 7, p. 870–884. [Accessed 30 April 2018]. DOI 10/f78wgq. Available from: http://www.emeraldinsight.com/doi/10.1108/OIR-06-2015-0180

VAN HOOLAND, Seth, DE WILDE, Max, VERBORGH, Ruben, STEINER, Thomas and VAN DE WALLE, Rik, 2013. Exploring entity recognition and disambiguation for cultural heritage collections. Digital Scholarship in the Humanities. 1 November 2013. Vol. 30, no. 2, p. 262–279. DOI 10.1093/llc/fqt067

YI, Kwan and MAI CHAN, Lois, 2009. Linking folksonomy to Library of Congress subject headings: an exploratory study. Journal of Documentation [online]. 16 October 2009. Vol. 65, no. 6, p. 872–900. [Accessed 9 November 2017]. DOI 10.1108/00220410910998906. Available from: http://www.emeraldinsight.com/doi/10.1108/00220410910998906

ZENG, Marcia, 2014. Using a Semantic Analysis Tool to Generate Subject Access Points: A Study Using Panofsky’s Theory and Two Research Samples. Knowledge Organization. 1 January 2014. p. 440–451

