Extracción de información de documentos PDF para su uso en la indización automática de e-books (Extracting information from PDF documents for use in automatic indexing of e-books)

Gil-Leiva, Isidoro, Spotti Lopes Fujita, Mariângela, Marques Redigolo, Franciele and Ferreira Saran, Jordan Extracción de información de documentos PDF para su uso en la indización automática de e-books (Extracting information from PDF documents for use in automatic indexing of e-books). TransInformação, 2022, vol. 34. [Journal article (Unpaginated)]


English abstract

The number of electronic books entering libraries in PDF format grows every day, complicating and making almost unfeasible some processes traditionally carried out manually by librarians, such as the assignment of subjects. In this context, it is necessary to design and develop applications that assist librarians. With this in mind, we present in this work an evaluation of tools for extracting information from books in PDF format that could later be used as raw material for an automatic indexing system. To do this, we carried out a first evaluation of five software tools (PDFMiner.six, PDFAct, PDF-extract, PDFExtract, and Grobib). Because PDFAct achieved the best performance, we then carried out a second evaluation of its ability to identify and extract information from the books, such as titles, indexes, sections, titles of tables and graphs, and bibliographic references, all of which are relevant to any indexing system. We conclude that none of the evaluated tools adequately extracts the different parts of PDF books, although PDFAct performed better than the rest.

Spanish abstract

El número de libros electrónicos que ingresan en las bibliotecas en formato PDF cada día es mayor, complicando y haciendo casi inviables algunos procesos realizados tradicionalmente de forma manual por los bibliotecarios, como es la asignación de materias. En este contexto, se hace necesario el diseño y desarrollo de aplicaciones que asistan a los bibliotecarios. Teniendo esto en consideración, presentamos en este trabajo la evaluación de herramientas de extracción de información de libros en PDF que podrían usarse posteriormente como materia prima para un sistema de indización automática. Para ello, realizamos una primera evaluación de cinco softwares (PDFMiner.six, PDFAct, PDF-extract, PDFExtract y Grobib) y, posteriormente, como PDFAct consiguió el mejor rendimiento, hicimos una segunda evaluación para averiguar su capacidad para identificar y extraer informaciones de los libros, tales como títulos, índices, secciones, títulos de tablas y gráficos y referencias bibliográficas, informaciones relevantes para cualquier sistema de indización. Se concluye que ninguna de las herramientas evaluadas extrae adecuadamente las diferentes partes de libros en PDF, si bien, PDFAct ha logrado un rendimiento superior al del resto.

Item type: Journal article (Unpaginated)
Keywords: Evaluación de software. Indización automática. Software evaluation. Automatic indexing. PDFMiner.six. PDFAct. PDF-extract. PDFExtract. Grobib.
Subjects: L. Information technology and library technology > LJ. Software.
L. Information technology and library technology > LL. Automated language processing.
Depositing user: Isidoro Gil Leiva
Date deposited: 21 Nov 2023 11:04
Last modified: 21 Nov 2023 11:04
URI: http://hdl.handle.net/10760/45080


