Prompt engineering for bibliographic web-scraping

Blazquez Ochando, Manuel, Prieto-Gutierrez, Juan-Jose and Ovalle-Perandones, Maria-Antonia. Prompt engineering for bibliographic web-scraping. Scientometrics, 2025, vol. 130, pp. 3433-3453. [Journal article (Paginated)]

Prompt engineering for bibliographic web-scraping.pdf - Published version
Available under License Creative Commons Attribution.

English abstract

Bibliographic catalogues store millions of data records. Computer techniques such as web-scraping allow this data to be extracted efficiently and accurately. The recent emergence of ChatGPT facilitates the development of suitable prompts for configuring scrapers that identify and extract information from databases. The aim of this article is to define how to use prompt engineering efficiently to build a suitable data entry model capable of generating, in a single interaction with ChatGPT-4o, a fully functional web-scraper written in PHP and adapted to the case of bibliographic catalogues. As a demonstration, the bibliographic catalogue of the National Library of Spain, with a dataset of thousands of records, is used. The findings present an effective model for developing AI-assisted web-scraping programs with the minimum possible interaction. The results obtained with the model indicate that using prompts with large language models (LLMs) can improve the quality of scraping by understanding specific contexts and patterns and adapting to different formats and styles of presenting bibliographic information.
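
The paper demonstrates a single-prompt workflow in which ChatGPT-4o returns a complete PHP scraper for a bibliographic catalogue. As an illustration only, the sketch below shows the kind of program such a prompt targets: it fetches a catalogue results page with cURL, extracts title, author and year fields with DOMXPath, and writes them to CSV. The search URL, the HTML class names and the field set are hypothetical placeholders, not the National Library of Spain's real markup and not the code generated in the paper.

<?php
// Illustrative sketch only: a minimal PHP scraper of the kind the paper
// describes generating with ChatGPT-4o. The catalogue URL and the XPath
// expressions below are assumptions, not the authors' actual code.

// Hypothetical catalogue search URL (placeholder).
$baseUrl = 'https://example-catalogue.test/search?q=';
$query   = urlencode('quijote');

// Fetch the results page with cURL.
$ch = curl_init($baseUrl . $query);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
curl_setopt($ch, CURLOPT_USERAGENT, 'bibliographic-scraper-demo/1.0');
$html = curl_exec($ch);
curl_close($ch);

if ($html === false) {
    exit("Request failed\n");
}

// Parse the HTML and extract bibliographic fields with XPath.
$dom = new DOMDocument();
@$dom->loadHTML($html);           // suppress warnings from malformed markup
$xpath = new DOMXPath($dom);

$records = [];
// Hypothetical markup: each record is a div.record with title/author/year spans.
foreach ($xpath->query('//div[@class="record"]') as $node) {
    $records[] = [
        'title'  => trim($xpath->evaluate('string(.//span[@class="title"])',  $node)),
        'author' => trim($xpath->evaluate('string(.//span[@class="author"])', $node)),
        'year'   => trim($xpath->evaluate('string(.//span[@class="year"])',   $node)),
    ];
}

// Export the extracted records as CSV, one row per bibliographic record.
$out = fopen('records.csv', 'w');
fputcsv($out, ['title', 'author', 'year']);
foreach ($records as $r) {
    fputcsv($out, $r);
}
fclose($out);

echo count($records) . " records extracted\n";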

Item type: Journal article (Paginated)
Keywords: Prompts, Scraping, Bibliographic catalogs, LLM, ChatGPT
Subjects: H. Information sources, supports, channels. > HR. Portals.
H. Information sources, supports, channels. > HS. Repositories.
Depositing user: Juan José Prieto-Gutiérrez
Date deposited: 26 Oct 2025 08:11
Last modified: 26 Oct 2025 08:11
URI: http://hdl.handle.net/10760/47235
