Plale, Beth; Seiffert, Kurt; McDonald, Robert; Konkiel, Stacy; and Kouper, Inna. Repository of NSF Funded Publications and Data Sets: "Back of Envelope" 15 year Cost Estimate. 2013 [Technical report]
Plale-2013-NSF-repository-estimate.pdf - Published version (133kB). Available under License Creative Commons Attribution Non-commercial.
English abstract
In this back-of-envelope study we calculate the 15-year fixed and variable costs of setting up and running a data repository (or database) to store and serve the publications and datasets derived from research funded by the National Science Foundation (NSF). Costs are computed on a yearly basis using a fixed estimate of the number of papers published each year that list NSF as their funding agency. We assume each paper has one dataset and estimate the size of that dataset based on experience. By our estimates, the number of papers generated each year is 64,340. The average dataset size over all seven directorates of NSF is 32 gigabytes (GB). The total amount of data added to the repository is two petabytes (PB) per year, or 30 PB over 15 years. The architecture of the data/paper repository is based on a hierarchical storage model that uses a combination of fast disk for rapid access and tape for high reliability and cost-efficient long-term storage. Data are ingested through workflows that are used in university institutional repositories, which add metadata and ensure data integrity. Average fixed costs are approximately $0.90/GB over the 15-year span. Variable costs are estimated on a sliding scale of $150 to $100 per new dataset for up-front curation, or $4.87 to $3.22 per GB. Variable costs reflect a 3% annual decrease in curation costs, as efficiency gains and automated metadata and provenance capture are anticipated to help reduce what are now largely manual curation efforts. The total projected cost of the data and paper repository is estimated at $167,000,000 over 15 years of operation, curating close to one million datasets and one million papers. After 15 years and 30 PB of data accumulated and curated, we estimate the cost per gigabyte at $5.56. This $167 million cost is a direct cost in that it does not include federally allowable indirect costs return (ICR).
After 15 years, it is reasonable to assume that some datasets will be compressed and rarely accessed. Others may be deemed no longer valuable, e.g., because they are replaced by more accurate results. Therefore, at some point the data growth in the repository will need to be adjusted by use of strategic preservation.
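The headline figures in the abstract can be roughly reproduced with a few lines of arithmetic. The sketch below is illustrative only and is not taken from the report: the function names are hypothetical, decimal units (1 PB = 1,000,000 GB) are an assumption, and the curation cost is modeled as a simple 3% compound decline from $150, which may differ from the schedule the authors actually used.

```python
# Figures quoted in the abstract.
PAPERS_PER_YEAR = 64_340        # papers per year listing NSF as funder
AVG_DATASET_GB = 32             # average dataset size, one dataset per paper
YEARS = 15                      # planning horizon
TOTAL_COST_USD = 167_000_000    # total projected direct cost
ROUNDED_TOTAL_PB = 30           # rounded 15-year volume used in the abstract

# Assumption (not stated in the abstract): decimal units.
GB_PER_PB = 1_000_000


def yearly_volume_pb(papers=PAPERS_PER_YEAR, dataset_gb=AVG_DATASET_GB):
    """Data added per year, in petabytes."""
    return papers * dataset_gb / GB_PER_PB


def curation_cost_per_dataset(year, start_usd=150.0, annual_decrease=0.03):
    """Hypothetical sliding scale: year 0 costs start_usd, then a
    compound 3% decrease each following year."""
    return start_usd * (1 - annual_decrease) ** year


yearly_pb = yearly_volume_pb()                # about 2.06 PB per year
total_pb = yearly_pb * YEARS                  # about 30.9 PB before rounding
total_datasets = PAPERS_PER_YEAR * YEARS      # 965,100 datasets
cost_per_gb = TOTAL_COST_USD / (ROUNDED_TOTAL_PB * GB_PER_PB)
final_year_curation = curation_cost_per_dataset(YEARS - 1)

if __name__ == "__main__":
    print(f"Yearly volume:       {yearly_pb:.2f} PB")
    print(f"15-year volume:      {total_pb:.1f} PB")
    print(f"Datasets curated:    {total_datasets:,}")
    print(f"Cost per GB:         ${cost_per_gb:.2f}")
    print(f"Year-15 curation:    ${final_year_curation:.2f} per dataset")
```

Under these assumptions the yearly volume comes to about 2.06 PB, the 15-year dataset count to 965,100 ("close to one million"), and the per-dataset curation cost falls to a little under $100 in the final year. Dividing the rounded $167 million by the rounded 30 PB gives a cost per gigabyte within a cent of the abstract's $5.56; the small difference presumably comes from rounding in the quoted totals.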
Item type: | Technical report |
---|---|
Keywords: | repositories, cost modeling, data management, data preservation |
Subjects: | F. Management. > FF. Funding. H. Information sources, supports, channels. > HL. Databases and database Networking. H. Information sources, supports, channels. > HS. Repositories. |
Depositing user: | Stacy Konkiel |
Date deposited: | 24 Aug 2013 18:38 |
Last modified: | 02 Oct 2014 12:27 |
URI: | http://hdl.handle.net/10760/20017 |