Avaluació de processos de reconeixement d’entitats (NER) com a complement a interfícies de recuperació d’informació en dipòsits digitals complexos

Vidal-Santos, Gerard. Avaluació de processos de reconeixement d’entitats (NER) com a complement a interfícies de recuperació d’informació en dipòsits digitals complexos. 2018. BA thesis, Universitat de Barcelona. [Thesis]

This is the latest version of this item.

Text (Treball final de grau)
VidalSantos_TFG_2018.pdf - Accepted version
Available under License Creative Commons Attribution.

English abstract

The aim of the study is to explore the use of unsupervised Named-Entity Recognition (NER) processes to generate descriptive metadata capable of assisting information retrieval interfaces in large-scale digital collections and of supporting the construction of more diverse knowledge representation models in academic libraries. For this purpose the study reviews experiences and canonical literature on the automated creation of subject headings in library and archive environments, as a tool to offset the overuse of search engines as the main access points for retrieving assets in catalogs and digital collections. It focuses on the guidelines established by two articles that address this task from two complementary points of view:
• van Hooland S, de Wilde M, Verborgh R, Steiner T, Van de Walle R. Exploring entity recognition and disambiguation for cultural heritage collections. Digital Scholarship in the Humanities. 2013 Nov 1;30(2):262–79.
• Zeng M. Using a Semantic Analysis Tool to Generate Subject Access Points: A Study Using Panofsky’s Theory and Two Research Samples. Knowledge Organization. 2014 Jan 1;440–51.
The first provides the tools to generate named entities in large-scale text samples and establishes the parameters for assessing the suitability of these entities at a quantitative level. The second provides guidelines for analyzing the quality of those results by developing a three-layer framework (identification-description-interpretation) based on Erwin Panofsky’s work on the analysis and interpretation of pictorial works. A work environment is built on this premise to extract and analyze the entities detected by DBpedia Spotlight (the NER service used for extraction) in a random collection of bibliographic records extracted from a thesis aggregator (Open Access Theses and Dissertations).
The results show a large quantitative improvement in the descriptive access points provided by these processes, allowing users to browse more effectively through better contextualized records when the entities are combined with the keywords already indexed, even though the entities lack the consistency needed to pass the quality filter established in the evaluation table. This setback, however, only partly limits the possibility of improving the visibility of records in large collections by these means, provided that the logical constructions offered by the semantic base behind the extraction service are taken into account in iterative cataloging processes. This establishes an iterative and cost-effective way of building more diverse knowledge graphs that connect manually or automatically indexed keywords to other nodes in the linked open data (LOD) cloud.
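
The extraction step described in this abstract can be sketched roughly as follows. This is a minimal illustration, not the work environment built in the thesis: the endpoint URL, the `confidence` value, the function names, and the shape of the JSON payload (a `Resources` list with `@URI`, `@surfaceForm`, `@types`, and `@similarityScore` fields) are assumptions based on the public DBpedia Spotlight web service, and the sample payload is hand-written rather than real service output.

```python
import json
from urllib.parse import urlencode
from urllib.request import Request, urlopen

# Assumption: the public Spotlight endpoint. The thesis does not state
# which deployment or parameters were actually used.
SPOTLIGHT_URL = "https://api.dbpedia-spotlight.org/en/annotate"


def build_request(text, confidence=0.5):
    """Build (but do not send) an annotate request for one record's text."""
    query = urlencode({"text": text, "confidence": confidence})
    return Request(f"{SPOTLIGHT_URL}?{query}",
                   headers={"Accept": "application/json"})


def parse_entities(payload):
    """Turn a Spotlight-style JSON payload into a list of entity dicts."""
    entities = []
    for res in payload.get("Resources", []):
        # "@types" is assumed to be a comma-separated string, possibly empty.
        types = [t for t in res.get("@types", "").split(",") if t]
        entities.append({
            "uri": res["@URI"],
            "surface_form": res["@surfaceForm"],
            "types": types,
            "similarity": float(res.get("@similarityScore", 0.0)),
        })
    return entities


def annotate(text, confidence=0.5):
    """Send the request and parse the response (requires network access)."""
    with urlopen(build_request(text, confidence), timeout=30) as resp:
        return parse_entities(json.load(resp))


# Hypothetical payload for illustration only; not taken from the study.
SAMPLE_PAYLOAD = {
    "Resources": [
        {"@URI": "http://dbpedia.org/resource/Barcelona",
         "@surfaceForm": "Barcelona",
         "@types": "DBpedia:Place,DBpedia:City",
         "@similarityScore": "0.99"},
        {"@URI": "http://dbpedia.org/resource/Gastronomy",
         "@surfaceForm": "gastronomy",
         "@types": "",
         "@similarityScore": "0.87"},
    ]
}

if __name__ == "__main__":
    for entity in parse_entities(SAMPLE_PAYLOAD):
        print(entity["surface_form"], "->", entity["uri"])
```

Keeping the parsing separate from the network call means the same function can be run over stored responses, which suits the iterative cataloging the abstract describes.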

Catalan abstract (English translation)

The aim of this study is the automatic creation of subject headings by means of named-entity recognition (NER) techniques on a set of bibliographic records, extracted from a doctoral thesis aggregator, that can be related directly or indirectly to the world of cooking, in order to determine the validity of these headings in supporting the development of robust knowledge representation models on academic content aggregation platforms. To this end, the study selectively compiles the existing literature on experiences with this type of technology in library and archival settings, focusing especially on the guidelines established by two articles that address this task from two complementary points of view:
• The first, Exploring entity recognition and disambiguation for cultural heritage collections (van Hooland et al., 2013), proposes the evaluation of three entity extraction services by means of a tool created expressly to operate in a controlled environment.
• The second, Using a Semantic Analysis Tool to Generate Subject Access Points: A Study Using Panofsky's Theory and Two Research Samples (Zeng, 2014), establishes guidelines, based on three levels (identification-description-interpretation), for analyzing the results that this type of processing generates.
Following the guidelines set out in the literature, a work environment is built to extract and analyze the entities detected by DBpedia Spotlight (the NER service used for the extraction) in the set of bibliographic records.
The results show that, at a quantitative level, the tool makes visible a large number of descriptors (personal or corporate names, events, geographic locations, and subjects) that allow the records to be better contextualized when combined with the keywords already indexed, even though the processes that generate these entities lack the consistency needed to pass the quality filter established in the evaluation table. This setback, however, only partly limits the possibility of improving the visibility of records in retrieval processes. The extraction service is managed by a semantic base, a semantic database built from Wikipedia that allows logical constructions to be made from the nodes related to the extracted entity. When this type of technical process is coordinated with that base, the possibility of continuing to improve a record's context from the linked open data (LOD) cloud can establish a permanent point of contact for further increasing the filtering and discovery options in complex digital collections and, at the same time, allow a critical review of previously indexed materials to improve their user experience.
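
The combination of extracted entities with already indexed keywords, grouped into the descriptor types named in this abstract (personal or corporate names, events, geographic locations, and subjects), could look something like the sketch below. The mapping from DBpedia ontology types to descriptor groups, the function names, and the sample record are all assumptions made for illustration; the thesis does not publish its own mapping here.

```python
from collections import defaultdict

# Assumption: an illustrative mapping from DBpedia ontology types to the
# descriptor groups mentioned in the abstract. Anything unmapped is
# treated as a subject.
CATEGORY_BY_TYPE = {
    "DBpedia:Person": "names",
    "DBpedia:Organisation": "names",
    "DBpedia:Event": "events",
    "DBpedia:Place": "places",
}


def categorize(entity_types):
    """Return the descriptor group for the first recognized type."""
    for entity_type in entity_types:
        if entity_type in CATEGORY_BY_TYPE:
            return CATEGORY_BY_TYPE[entity_type]
    return "subjects"


def merge_access_points(keywords, entities):
    """Combine indexed keywords with extracted entities, skipping duplicates.

    `entities` is a list of (label, types) pairs; comparison with existing
    keywords is case-insensitive.
    """
    seen = {keyword.casefold() for keyword in keywords}
    merged = defaultdict(list)
    merged["keywords"] = list(keywords)
    for label, types in entities:
        key = label.casefold()
        if key in seen:
            continue  # already an access point for this record
        seen.add(key)
        merged[categorize(types)].append(label)
    return dict(merged)


if __name__ == "__main__":
    # Hypothetical record, not taken from the study's sample.
    record_keywords = ["Gastronomy", "Catalonia"]
    record_entities = [
        ("Ferran Adrià", ["DBpedia:Person"]),
        ("Catalonia", ["DBpedia:Place"]),
        ("Barcelona", ["DBpedia:Place"]),
        ("Molecular gastronomy", []),
    ]
    print(merge_access_points(record_keywords, record_entities))
```

Counting the entries added per group for each record would give a simple quantitative measure of how many new access points the extraction contributes beyond the existing keywords.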

Item type:	Thesis (UNSPECIFIED)
Keywords:	Automatic analysis; Automatic indexing; Information retrieval; Linked Open Data; LOD; Anàlisi automàtica; Indexació automàtica; Recuperació d'informació.
Subjects:	I. Information treatment for information services > IL. Semantic web; L. Information technology and library technology > LL. Automated language processing; L. Information technology and library technology > LM. Automatic text retrieval.
Depositing user:	Gerard Vidal Santos
Date deposited:	04 Mar 2019 19:36
Last modified:	04 Mar 2019 19:36
URI:	http://hdl.handle.net/10760/33692