Large Scale Research Data Archiving: Training for an Inconvenient Technology
dc.contributor.author | Calhoun, Patrick | |
dc.contributor.author | Zimmerman, Brett | |
dc.contributor.author | Neeman, Henry | |
dc.date.accessioned | 2016-09-07T00:01:55Z | |
dc.date.available | 2016-09-07T00:01:55Z | |
dc.date.issued | 2016-08-03 | |
dc.description | Introduction: How physical storage is structured, and how it is used, can vary substantially across scales, because of both pricing concerns and technological aspects. At the smallest scales – for example, handhelds such as mobile phones and tablets – pricing is affordable (typically under US$1 per GB, with maximum sizes typically well under 1 TB), and use mechanisms and administration are convenient and intuitive (for example, push a Micro SD card into a slot in the handheld, and the operating system automatically recognizes it and puts it into service). By contrast, at the largest scales (from several TB to many PB and soon EB), storage can either be reasonably convenient to use but expensive (for example, large scale enterprise-class disk systems, which can be comparable in purchase price per GB to small scale but are much more expensive to operate), or reasonably affordable but inconvenient to use (for example, magnetic tape). At the same time, research datasets are increasingly subject to requirements or needs not only to be retained over several to many years, but also to be made accessible to relevant communities external to the data owners, typically at no more than the incremental cost of creating and transferring a copy. For example, in 2013, the US Office of Science and Technology Policy released a memorandum [1] calling on every US federal research funding agency with a research funding budget over USD $100,000,000 to prepare a public access plan. In 2015, the US National Science Foundation (NSF) released its Public Access Plan [2], which stated: NSF requires applicants for funding to prepare a [Data Management Plan] . . . [which] may address . . . [p]olicies for access and sharing . . .. All data resulting from [NSF-funded] research . . . should be deposited at the appropriate repository . . .. NSF's data-sharing policy states: "Investigators are expected to share with other researchers, at no more than incremental cost and within a reasonable time, the primary data . . . created or gathered in the course of work under NSF grants. Grantees are expected to encourage and facilitate such sharing". . . . NSF requires applicants . . . to address archiving and preservation . . . Strategies for providing long-term storage and preservation will be a requirement for any future NSF-designated repository system, whether for data or publication. . . . However, in an era of increasingly open access to massive data collections, some storage technologies and some extant business models for large scale, long term (over 10 year) storage of "cold" data, including enterprise disk or tape systems and metered cloud providers, are not universally viable under current research funding approaches.
This typically is because (i) the cost of storage is too high to be practical, and/or (ii) the file owners are obligated to continue paying substantial recurring charges even after the relevant research funding has ended. Among the key issues are: (1) the cost of storing large datasets (2) over the long term, while making the datasets (3) not only accessible to the owner (4) but also discoverable and accessible by third parties as appropriate, (5) and being able to use shorter term funding such as a 2–5 year research grant, (6) with minimal recurring costs, (7) encompassing multiple copies to improve resiliency (8) at minimal cost per TB per copy per year. Under these constraints, the following storage strategies are extremely challenging: (a) funding a disk system refresh after end-of-life (5–7 years) is very difficult; (b) enterprise disk is generally too expensive per TB per year; (c) buying disk drives in a centrally-managed disk array gets too little lifetime for some disk drives, because the usable lifetime of the disk drives typically ends at the end-of-life of the disk array, so disk drive purchases late in the life of the disk array have even higher cost per TB per year; (d) metered cloud storage can be unsustainable beyond the lifetime of the relevant project, because it can be difficult to justify expending funds from later grants on irrelevant datasets from earlier grants; (e) collections of standalone disk drives (for example, USB disk drives) are undiscoverable, inaccessible, cumbersome to manage at scale (tens of TB to many PB), and don't last long enough; (f) buying a tape library per research team is impractical due to high fixed costs (5–8 figures per medium-to-large tape library; see footnote 1 below). Large scale tape archives, by contrast, have the following advantages: (i) low incremental price per unit (other than fixed costs, tape costs substantially less per TB per year than even USB disk drives [3,4]); (ii) longevity (10 years or more); (iii) accessibility; (iv) discoverability (via metadata catalogs); (v) media (tape cartridges) can be paid for entirely up front, with zero recurring costs for 10+ years. Disadvantages of large scale tape archives include: (i) long latency (wait time) before any individual file can be read (30–120 s for tape, vs 1–10 milliseconds for disk), so tape is best for "cold" archiving of files that are expected to be accessed infrequently; (ii) high fixed costs, typically six or seven figures for a tape library with hundreds of tape cartridges [5–7]. Thus, tape may be impractical at the research group scale, but can substantially reduce costs to researchers at institutional and national scales. Note that discoverability – whether on a tape archive or a disk system – depends first on physical access (for example, via the Internet) to the contents of the storage system. Metadata and related information describing the contents of files on such a storage resource can be crucial for users who need to search for such content (as well as for provenance, reproducibility and other purposes), but only come into play once physical accessibility is resolved. (Issues relating to metadata are outside the scope of this article.) At the University of Oklahoma (OU), the OU Supercomputing Center for Education and Research (OSCER), a division of OU Information Technology, has been using a very successful business model [8] that effectively addresses these concerns for an institutional-scale resource.
This business model is based on three funding sources: (1) grant: an NSF Major Research Instrumentation (MRI) grant (OCI-1039829, "Acquisition of Extensible Petascale Storage for Data Intensive Research," USD $792,925, 10/1/2010 – 9/30/2014, PI H. Neeman) funds hardware, software and the first several years of warranty/maintenance/support; (2) institutional commitment by OU provides space, power, cooling and labor, as well as maintenance after the initial warranty period; (3) researchers buy their own media, typically but not exclusively via their own grants. (Footnote 1: For example, on February 16, 2016, an IBM TS4500 tape library with 730 tape cartridge slots and 2 tape drives, driven by 5 Lenovo x3650 M5 servers, an IBM Storwize V3700 disk array, a pair of IBM SAN24B-4 Express Fibre Channel switches, IBM's General Parallel File System software and IBM's Linear Tape File System Enterprise Edition software, with only a single year of support, had a Manufacturer's Suggested Retail Price (MSRP, also known as list price) of over USD $450,000; the same configuration except with 12 tape drives and 9970 tape cartridge slots had an MSRP of over USD $1,000,000 [5–7].) Thus, researchers' cost per TB per copy per year is significantly less than that of USB disk drives, because of both lower purchase costs (see above) and longer and more predictable media lifetimes [9,10] (see the illustrative cost sketch below). Unfortunately, because of constraints of both budget and technology, the use of OU's storage archive is neither straightforward nor convenient. In particular, the technology choices (informed by budget constraints) compel inconvenient usage mechanisms, which in turn require targeted tailoring of user training. Effective training regarding proper use is crucial, and this training must be both brief and intuitive, in order to reduce violations of appropriate practices and policies, while minimizing the amount of time devoted to this training by both users and operations staff. | en_US |
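The cost argument above turns on incremental media cost per TB per copy per year once an institution absorbs the fixed costs (tape library, drives, servers, support). The minimal Python sketch below shows how such a comparison can be computed; all prices, capacities and lifetimes in it are hypothetical placeholder values chosen for illustration, not figures taken from the article or its references.

# Minimal sketch of the "cost per TB per copy per year" comparison discussed
# above. All numeric values below are hypothetical placeholders, NOT figures
# from the article or its references.

def media_cost_per_tb_per_year(media_price_usd: float,
                               media_capacity_tb: float,
                               lifetime_years: float) -> float:
    """Incremental media cost per TB per year for a single copy, assuming
    the media is paid for entirely up front and the fixed infrastructure
    costs (library, drives, servers, support) are covered separately,
    e.g. by the institution."""
    return media_price_usd / (media_capacity_tb * lifetime_years)

if __name__ == "__main__":
    copies = 2  # multiple copies improve resiliency

    # Hypothetical tape cartridge: assumed price, capacity and lifetime.
    tape = media_cost_per_tb_per_year(media_price_usd=30.0,
                                      media_capacity_tb=6.0,
                                      lifetime_years=10.0)
    # Hypothetical USB disk drive: assumed price, capacity and lifetime.
    usb = media_cost_per_tb_per_year(media_price_usd=120.0,
                                     media_capacity_tb=4.0,
                                     lifetime_years=4.0)

    print(f"Tape (assumed figures):     ${tape:.2f}/TB/year per copy, "
          f"${tape * copies:.2f}/TB/year for {copies} copies")
    print(f"USB disk (assumed figures): ${usb:.2f}/TB/year per copy, "
          f"${usb * copies:.2f}/TB/year for {copies} copies")

Under these assumed figures the tape media cost works out to well under a dollar per TB per year, while the USB disk cost is several dollars per TB per year, which is the shape of the comparison the article describes; the actual numbers depend on current media prices and lifetimes [3,4,9,10].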
dc.description.abstract | ABSTRACT: At small scales, storage is straightforward to afford and to use, but at large scales – from several Terabytes (TB) to many Petabytes (PB) and soon Exabytes (EB) – tradeoffs must be made between cost and convenience, and training for use of such resources needs to take such inconveniences into account. A large scale, long term (over 10 year) institutional research data storage archive is described, focusing on both hardware and software. The technology choices give rise to inconveniences, which in turn not only lead to a crucial requirement for training on the proper use of the archive, but also inform the specifics of that training, as does each individual use case. | en_US |
dc.identifier.citation | S.P. Calhoun, et al., Large scale research data archiving: Training for an inconvenient technology, J. Comput. Sci. (2016), http://dx.doi.org/10.1016/j.jocs.2016.07.005 | en_US |
dc.identifier.doi | http://dx.doi.org/10.1016/j.jocs.2016.07.005 | en_US |
dc.identifier.uri | http://hdl.handle.net/11244/45038 | |
dc.language | en | en_US |
dc.subject | Computer Science. | en_US |
dc.title | Large Scale Research Data Archiving: Training for an Inconvenient Technology | en_US |
ou.group | Other | en_US |
Files
Original bundle
- Name: jocs_article_training_inconvenient_technology_published_20160803.pdf
- Size: 542.69 KB
- Format: Adobe Portable Document Format
- Description: Main Article
License bundle
- Name: license.txt
- Size: 1.72 KB
- Format: Item-specific license agreed upon to submission
- Description: