Research Data Evaluation And Selection Criteria For Preservation In Data Repositories

Taghizadeh Milani, Kimiya, Karbala Aghaei Kamran, Masoumeh and Ghaebi, Amir. Research Data Evaluation And Selection Criteria For Preservation In Data Repositories. Human Information Interaction, 2025, vol. 11, n. 4, pp. 90-111. [Journal article (Paginated)]


English abstract

The rapid proliferation of digital data in the research landscape has underscored the critical need for sustainable data curation strategies, especially for the long-term preservation of valuable datasets. Research data repositories, as key infrastructures for data stewardship, face mounting challenges in determining which datasets should be preserved for future reuse, validation, and scientific advancement. Given the constraints of storage, funding, and technical resources, not all data generated by research activities can or should be preserved indefinitely. Consequently, defining rigorous, transparent, and contextually appropriate evaluation and selection criteria has emerged as a vital concern within the broader scope of research data management (RDM) and digital curation. This study aims to identify and categorize the key criteria used to evaluate and select research data for long-term preservation in repositories. Through a systematic review of existing literature and practices, it offers a conceptual framework that supports repository managers, librarians, archivists, and data stewards in making informed and consistent decisions about which data to retain. The research further addresses the implications of these criteria for policy development, data sharing, and the FAIR data principles (Findable, Accessible, Interoperable, and Reusable). Ultimately, the study contributes to improving data lifecycle management strategies and to ensuring that preserved data retain their scientific, legal, ethical, and cultural value.

Methods and Material

This research adopted a qualitative content analysis approach based on a systematic literature review. The primary goal was to identify, classify, and synthesize the evaluation and selection criteria applied by data repositories in preserving research data. The review focused on peer-reviewed journal articles, white papers, policy documents, and institutional guidelines published between 2000 and 2024.
Major databases such as Scopus, Web of Science, ScienceDirect, and Google Scholar were searched using combinations of keywords including "research data preservation," "data selection criteria," "data curation," and "digital repositories." Literature was included if it contained explicit or implicit discussion of the assessment or selection of research data for long-term storage, including frameworks, models, or institutional case studies. A total of 67 relevant documents were identified and analyzed. Through iterative coding and constant comparison, the evaluation criteria were distilled into thematic clusters such as scientific value, legal and ethical considerations, technical characteristics, economic feasibility, data usability, and policy alignment.

Results and Discussion

The findings of this study, based on a systematic review of the literature and a meta-synthesis of previous studies, identify a comprehensive set of criteria and components for evaluating and selecting research data for retention in data repositories. These criteria are categorized into eight main components: data preparation, data quality, physical conditions and technical features, metadata management and features, ethical principles of data, document-related criteria, compliance with FAIR principles, and repository policies and issues. The "Data Preparation" component highlights indicators such as data cleaning, data scale, presence of missing data, and evaluation of survey biases. It emphasizes the necessity of eliminating errors and inconsistencies, assessing the scale of the data, and addressing missing values, as well as identifying and evaluating biases in survey data, such as sampling errors, non-response, and other confounding factors. The "Data Quality" component includes indicators such as accuracy, reliability, completeness, validity, documentation of limitations, and timeliness of data.
Accuracy and correctness of information must be carefully assessed, and data reliability should be evaluated based on how the data were produced and analyzed. Completeness refers to the presence of all necessary elements in the dataset, and validity relates to the soundness of data collection tools and the extent to which findings reflect reality. Acknowledging study limitations helps clarify weaknesses, and up-to-date data are valued for their relevance in terms of collection time. The remaining components and their indicators are as follows:

Physical Conditions and Technical Features of Data: data formats, future readability, software required for access, and compatibility with technical standards.

Metadata Management and Features: presence of sufficient metadata, use of standardized structures for data description, supplementary documentation, and the information necessary for data reuse.

Ethical Principles of Data: protection of participants' privacy, anonymization or encryption of sensitive information, obtaining informed consent, and respect for intellectual property rights.

Document-Related Criteria: association of data with specific research projects, traceability of data to published scholarly articles, and documentation of data collection methods.

Compliance with FAIR Principles: findability, accessibility, interoperability, and reusability of the data.

Repository Policies and Issues: adherence to legal requirements and repository policies, access licenses, data sharing conditions, and security considerations for data storage.

Together, these eight components and their corresponding indicators provide a comprehensive, evidence-based framework for evaluating and selecting research data for long-term retention in data repositories.
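As a purely illustrative sketch of how a repository manager might operationalize the eight components above as a screening checklist, consider the following Python fragment. The component keys, indicator names, and the fraction-of-indicators scoring rule are all hypothetical simplifications introduced here for illustration; they are not part of the study's framework.

```python
# Illustrative checklist sketch of the eight evaluation components.
# Component keys, indicator names, and the scoring rule are hypothetical,
# not taken from the paper; real indicators would be far richer.

FRAMEWORK = {
    "data_preparation": ["cleaned", "scale_assessed", "missing_data_handled", "biases_evaluated"],
    "data_quality": ["accurate", "reliable", "complete", "valid", "limitations_documented", "timely"],
    "technical_features": ["standard_format", "future_readable", "software_documented"],
    "metadata": ["sufficient_metadata", "standardized_schema", "reuse_documentation"],
    "ethics": ["privacy_protected", "informed_consent", "ip_respected"],
    "documentation": ["linked_to_project", "traceable_to_publications", "methods_documented"],
    "fair": ["findable", "accessible", "interoperable", "reusable"],
    "repository_policy": ["legal_compliance", "license_defined", "secure_storage"],
}

def evaluate(dataset_checks: dict) -> dict:
    """Return the fraction of satisfied indicators per component."""
    scores = {}
    for component, indicators in FRAMEWORK.items():
        met = sum(1 for i in indicators if dataset_checks.get(i, False))
        scores[component] = met / len(indicators)
    return scores

# A dataset that satisfies only a few indicators.
example = {"cleaned": True, "accurate": True, "findable": True, "accessible": True}
scores = evaluate(example)
print(scores["fair"])  # 2 of 4 FAIR indicators met -> 0.5
```

In practice such scores would feed a human appraisal decision rather than an automatic threshold, since several components (ethics, legal compliance) are pass/fail rather than gradable.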
Discussion and Conclusion

This paper emphasizes the role of each component in the evaluation and selection of research data for storage in data repositories. In the data preparation phase, accuracy in data cleaning and screening, particularly in quantitative research, is crucial. Challenges such as missing data and potential biases, including sampling errors, can complicate analyses and reduce data quality, so adherence to precise standards in cleaning and verifying data is essential. In evaluating data quality, the accuracy and precision of information, reliability, and completeness of the data are key criteria; data that are properly collected and analyzed facilitate more effective research and reuse. For both qualitative and quantitative data, the use of standardized formats and compatibility with various systems are significant technical issues that affect storage quality. Metadata documentation also plays a critical role: metadata provides essential information about the data, enhancing transparency, collaboration, and trust. Furthermore, adhering to ethical principles, such as obtaining informed consent from participants and protecting their privacy during data use, helps maintain public trust and prevents misuse of data. The paper also stresses alignment with the FAIR principles (Findability, Accessibility, Interoperability, and Reusability); adherence to these principles ensures that data remain effective and accessible for future use. Finally, policies governing data repositories must consider user needs and technical limitations so that high-value research data are preserved for future use.
The study concludes that the evaluation and selection of research data for storage should be conducted carefully and according to standardized criteria to improve the quality and effectiveness of data use in future research. It also proposes practical recommendations, such as developing data evaluation guidelines, training data specialists, and implementing technological tools to enhance data evaluation and storage processes.

Item type: Journal article (Paginated)
Keywords: Research data, research data management, research data evaluation, research data selection, data management, data repositories
Subjects: D. Libraries as physical collections. > DC. Public libraries.
Depositing user: HII Journal Human Information Interaction
Date deposited: 30 Jan 2026 17:55
Last modified: 30 Jan 2026 17:55
URI: http://hdl.handle.net/10760/47552


