Repository of NSF Funded Publications and Data Sets: "Back of Envelope" 15 year Cost Estimate

Plale, Beth and Kouper, Inna and McDonald, Robert and Seiffert, Kurt and Konkiel, Stacy. Repository of NSF Funded Publications and Data Sets: "Back of Envelope" 15 year Cost Estimate, 2013. [Technical report]

This is the latest version of this item.

Plale-2013-NSF-repository-estimate.pdf - Published version
Available under License Creative Commons Attribution Non-commercial.


English abstract

In this back-of-envelope study we calculate the 15-year fixed and variable costs of setting up and running a data repository (or database) to store and serve the publications and datasets derived from research funded by the National Science Foundation (NSF). Costs are computed on a yearly basis using a fixed estimate of the number of papers published each year that list NSF as their funding agency. We assume each paper has one dataset and estimate the size of that dataset based on experience. By our estimates, the number of papers generated each year is 64,340. The average dataset size across all seven NSF directorates is 32 gigabytes (GB). The total amount of data added to the repository is about two petabytes (PB) per year, or 30 PB over 15 years. The architecture of the data/paper repository is based on a hierarchical storage model that combines fast disk for rapid access with tape for highly reliable, cost-efficient long-term storage. Data are ingested through workflows of the kind used in university institutional repositories, which add metadata and ensure data integrity. Average fixed costs are approximately $0.90/GB over the 15-year span. Variable costs are estimated on a sliding scale from $150 down to $100 per new dataset for up-front curation, or $4.87 to $3.22 per GB. Variable costs reflect a 3% annual decrease in curation costs, as efficiency gains and automated metadata and provenance capture are anticipated to reduce what are now largely manual curation efforts. The total projected cost of the data and paper repository is estimated at $167,000,000 over 15 years of operation, curating close to one million datasets and one million papers. After 15 years and 30 PB of data accumulated and curated, we estimate the cost per gigabyte at $5.56. This $167 million cost is a direct cost in that it does not include federally allowable indirect costs return (ICR).
After 15 years, it is reasonable to assume that some datasets will be compressed and rarely accessed. Others may be deemed no longer valuable, e.g., because they are replaced by more accurate results. Therefore, at some point the data growth in the repository will need to be adjusted by use of strategic preservation.
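
The headline figures in the abstract can be sanity-checked with a few lines of arithmetic. The sketch below is not the report's cost model: it only recomputes the data volume, the dataset count, the 3% annual decline in per-dataset curation cost, and the overall cost per gigabyte from the numbers quoted above. The function names are hypothetical, and the use of decimal units (1 PB = 1,000,000 GB) is an assumption, since the abstract does not state which convention it uses.

```python
# Back-of-envelope check of figures quoted in the abstract.
# This is an illustrative sketch, not the report's actual cost model.

PAPERS_PER_YEAR = 64_340        # from the abstract; one dataset per paper
AVG_DATASET_GB = 32             # from the abstract
YEARS = 15                      # from the abstract
TOTAL_COST_USD = 167_000_000    # from the abstract (direct costs only)
FIRST_YEAR_CURATION_USD = 150   # from the abstract, per new dataset
ANNUAL_DECLINE = 0.03           # from the abstract
GB_PER_PB = 1_000_000           # ASSUMPTION: decimal petabytes


def annual_volume_pb():
    """Data added per year, in PB, at one dataset per paper."""
    return PAPERS_PER_YEAR * AVG_DATASET_GB / GB_PER_PB


def total_datasets():
    """Datasets curated over the full period."""
    return PAPERS_PER_YEAR * YEARS


def curation_cost_per_dataset(year):
    """Per-dataset curation cost in a given year (1-based),
    assuming the 3% decline compounds annually from $150."""
    return FIRST_YEAR_CURATION_USD * (1 - ANNUAL_DECLINE) ** (year - 1)


def cost_per_gb(total_pb):
    """Total projected cost spread over the accumulated volume."""
    return TOTAL_COST_USD / (total_pb * GB_PER_PB)


if __name__ == "__main__":
    print(f"Volume per year: {annual_volume_pb():.2f} PB")
    print(f"Volume over {YEARS} years: {annual_volume_pb() * YEARS:.1f} PB")
    print(f"Datasets curated: {total_datasets():,}")
    print(f"Curation cost in year {YEARS}: "
          f"${curation_cost_per_dataset(YEARS):.2f} per dataset")
    print(f"Cost per GB at 30 PB: ${cost_per_gb(30):.2f}")
```

Under these assumptions the yearly volume comes out slightly above 2 PB, the dataset count slightly below one million, and the year-15 curation cost close to $100, all in line with the rounded figures in the abstract. The per-gigabyte figure agrees with the quoted $5.56 up to rounding. The sketch does not attempt to reproduce the $167 million total, which depends on details of the fixed-cost model given only in the full report.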

Item type: Technical report
Keywords: repositories, cost modeling, data management, data preservation
Subjects: F. Management. > FF. Funding.
H. Information sources, supports, channels. > HL. Databases and database Networking.
H. Information sources, supports, channels. > HS. Repositories.
Depositing user: Stacy Konkiel
Date deposited: 01 Sep 2013 02:26
Last modified: 02 Oct 2014 12:27



