Data Management Plans

Background Information

Creation of a data management plan is a “best practice” for research projects that involve the collection or dissemination of data. Data management plans help to ensure that data collected by a project have the integrity, quality, and provenance needed to support the proposed research, and that data necessary for external replication of research findings will be available to the research community.

In addition, many organizations sponsoring research, including many federal agencies and non-profit foundations, require a formal data management plan. The National Science Foundation (NSF), for example, requires a data management plan with every proposal.

Data management plans are not one-size-fits-all. An appropriate data management plan should take into consideration, early in the data life cycle, the size and complexity of the data to be collected or assembled, the likely audience for reuse of the data, sponsor requirements, and general legal and ethical requirements (e.g., that data be shared in a way that preserves the confidentiality of subject information).

The following pages outline recommended elements for consideration in most data management plans. Also included is a template for a data management plan, which is tailored to NSF requirements and is appropriate for data that are relatively small in size and complexity (such that a separate budget for data processing is not required) and that can be disseminated in a form that is not legally or ethically encumbered. The outline and sample plan are based on a comparison of data management checklists produced by funders, prominent data archives, and library associations; a review of sample data management plans from funders and data-archiving organizations; and Library of Congress preservation format recommendations.

Checklist for Data Management Plan

Here is a generic checklist to consider as you write your NSF data management plan:

  1. Data Description: What data will be generated? How will you create the data? (simulated, observed, experimental, software, physical collections)
  2. Existing Data: Will you be using existing data? What is the relationship between the data you are collecting and existing data?
  3. Audience: Who will potentially use the data?
  4. Access and Sharing: How will data files be shared? How will others access them?
  5. Formats: What data formats will you be creating?
  6. Metadata and Documentation: What documentation will you provide to describe the data? What metadata formats and standards will you use?
  7. Storage, backup, replication, versioning: Are the data files backed up regularly? Are there replicas in different locations? Are older versions of the data kept?
  8. Security: Are the system and storage that will be used secure?
  9. Budget: Any costs for preparing the data? Costs for storage and long-term access?
  10. Privacy, Intellectual Property: Does the data contain private or confidential information? Any copyrights?
  11. Archiving, Preservation, Long-term Access: What plans do you have to archive the data and other research products? Will it have long-term accessibility?
  12. Adherence: How will you check for adherence to this plan?

Template for Data Management Plan

[ TEMPLATE FOR: NSF-FUNDED PROJECT / UNRESTRICTED DATA / Harvard Dataverse ]

[ Note: This DMP describes how the project, using the Harvard Dataverse, will conform to NSF policy on the dissemination and sharing of research results, including the requirement to “share with other researchers, at no more than incremental cost and within a reasonable time, the primary data, samples, physical collections and other supporting materials created or gathered.” In addition, check for specific directorate/program requirements. ]

1. Data description

[ Briefly describe nature & scale of data {simulated, observed, experimental information; samples; publications; physical collections; software; models} generated or collected. ]

2. Existing Data [ if applicable ]

[Briefly describe existing data relevant to the project; added value/justification of new data collection/generation; and plans for integration with existing data]

3. Audience

[ Briefly describe potential secondary users; scope and scale of use]

4. Access and Sharing

All data collected or generated will be deposited in the Harvard Dataverse. The Harvard Dataverse is a public repository hosted and maintained by Harvard University Information Technology (HUIT). The Harvard Dataverse facilitates data access by providing descriptive and variable/question-level search; topical browsing; data extraction and reformatting; and online analysis.

All data will be deposited at least 90 days prior to the expiration of the award. Such data may be embargoed until the publication of research based on the data or until one year after the expiration of the award, whichever is sooner. Users will be required to agree to click-through terms that prohibit unlawful uses and intentional violations of privacy and that require attribution. Use of the data will otherwise be unrestricted and free of charge.
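For illustration, deposits can also be made programmatically through the Dataverse native API. The following is a minimal sketch, not a prescribed procedure; the API token, collection alias, and metadata values are hypothetical placeholders.

    import requests

    BASE_URL = "https://dataverse.harvard.edu"
    API_TOKEN = "xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx"  # per-user token (placeholder)
    COLLECTION = "my-project"  # hypothetical collection (dataverse) alias

    # Citation metadata block; Dataverse requires at least a title, author,
    # contact, description, and subject to create a dataset.
    dataset = {
        "datasetVersion": {
            "metadataBlocks": {
                "citation": {
                    "fields": [
                        {"typeName": "title", "typeClass": "primitive",
                         "multiple": False, "value": "Example Survey Data"},
                        {"typeName": "author", "typeClass": "compound", "multiple": True,
                         "value": [{"authorName": {
                             "typeName": "authorName", "typeClass": "primitive",
                             "multiple": False, "value": "Doe, Jane"}}]},
                        {"typeName": "datasetContact", "typeClass": "compound",
                         "multiple": True,
                         "value": [{"datasetContactEmail": {
                             "typeName": "datasetContactEmail", "typeClass": "primitive",
                             "multiple": False, "value": "jdoe@example.edu"}}]},
                        {"typeName": "dsDescription", "typeClass": "compound",
                         "multiple": True,
                         "value": [{"dsDescriptionValue": {
                             "typeName": "dsDescriptionValue", "typeClass": "primitive",
                             "multiple": False, "value": "Wave 1 survey responses."}}]},
                        {"typeName": "subject", "typeClass": "controlledVocabulary",
                         "multiple": True, "value": ["Social Sciences"]},
                    ]
                }
            }
        }
    }

    # Create the dataset (as a draft) in the target collection.
    resp = requests.post(
        f"{BASE_URL}/api/dataverses/{COLLECTION}/datasets",
        headers={"X-Dataverse-key": API_TOKEN},
        json=dataset,
    )
    resp.raise_for_status()
    print("Created dataset:", resp.json()["data"]["persistentId"])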

5. Formats

Immediately after collection, quantitative data will be converted to [ SELECT ALL THAT APPLY: Stata, SPSS, R, Excel, CSV ] formats. These formats are fully supported by the Harvard Dataverse, which will perform archival format migration, metadata extraction, and validity checks. Deposit in these formats will also enable online analysis; variable-level search; data extraction and reformatting; and other enhanced access capabilities. Documentation will be deposited in PDF/A or plain-text formats to ensure long-term accessibility; any accompanying sound (as WAV), video, or images will be deposited separately from the documentation as JPEG 2000 files (with lossless compression) or uncompressed TIFF files.
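As an illustration of such a conversion, the short Python sketch below writes a raw tabular file out in several of these formats; the file names are hypothetical, and the SPSS step assumes the third-party pyreadstat package.

    import pandas as pd
    import pyreadstat  # third-party package; needed only for the SPSS (.sav) step

    # Read the raw collection file (hypothetical name) and re-save it in
    # Dataverse-supported formats immediately after collection.
    df = pd.read_csv("survey_wave1_raw.csv")

    df.to_csv("survey_wave1.csv", index=False)           # CSV
    df.to_stata("survey_wave1.dta", write_index=False)   # Stata
    df.to_excel("survey_wave1.xlsx", index=False)        # Excel (requires openpyxl)
    pyreadstat.write_sav(df, "survey_wave1.sav")         # SPSS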

6. Documentation, Metadata and Bibliographic Information

The project will create documentation detailing the sources, coding, and editing of all data, in sufficient detail to enable another researcher to replicate the data from original sources, along with descriptive metadata for each dataset, including a title, author, description, descriptive keywords, and file descriptions. The project will also record bibliographic information for any publication by the project based on the data.

The Dataverse application’s “templating” feature will be used to keep information consistent across datasets. The Dataverse repository automatically generates persistent identifiers and Universal Numeric Fingerprints (UNFs) for datasets; extracts and indexes variable descriptions, missing-value codes, and labels; creates variable-level summary statistics; and facilitates open distribution of metadata in a variety of standard formats (DataCite, DDI 2.5, Dublin Core, VOResource, and ISA-Tab) and over standard protocols (OAI-PMH, SWORD).

[ If applicable, briefly describe additional metadata/documentation to be provided; standards used; treatment of field notes and collection records; and quality assurance procedures for all of these ]
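As one illustration of the metadata export paths described above, a dataset’s Dublin Core record can be harvested over OAI-PMH with a standard GetRecord request; in the sketch below, the persistent identifier is a hypothetical placeholder.

    import requests

    # Fetch one dataset's Dublin Core (oai_dc) record from the Dataverse
    # OAI-PMH endpoint. The identifier below is a hypothetical placeholder.
    resp = requests.get(
        "https://dataverse.harvard.edu/oai",
        params={
            "verb": "GetRecord",
            "metadataPrefix": "oai_dc",
            "identifier": "doi:10.7910/DVN/XXXXXX",
        },
    )
    resp.raise_for_status()
    print(resp.text)  # OAI-PMH XML envelope containing the oai_dc record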

7. Storage, backup, replication, and versioning

The Dataverse repository provides automatic version (revision) control over all deposited materials, and no versions of deposited material are destroyed except where such destruction is legally required. All systems providing online storage for the Dataverse are contained in a physically secured facility that is continually monitored. System backups are made daily. [ For social science data: ] Replicas of the data are held by independent archives as part of the Data-PASS archival partnership and are regularly updated and validated using the LOCKSS system.
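In addition to the repository-side safeguards above, depositors can spot-check their own working copies against the repository’s records. The sketch below compares a local file’s MD5 checksum, of the kind Dataverse records for each deposited file, against a recorded value; both the file name and the recorded checksum are placeholders.

    import hashlib

    # MD5 checksum recorded by Dataverse for the deposited file (placeholder).
    RECORDED_MD5 = "d41d8cd98f00b204e9800998ecf8427e"

    def md5_of(path: str) -> str:
        # Compute the MD5 checksum of a file, reading in 1 MiB chunks.
        h = hashlib.md5()
        with open(path, "rb") as f:
            for chunk in iter(lambda: f.read(1 << 20), b""):
                h.update(chunk)
        return h.hexdigest()

    assert md5_of("survey_wave1.csv") == RECORDED_MD5, "fixity check failed"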

8. Security

The Harvard Dataverse complies with Harvard University requirements for good computer use practices. Harvard University has developed extensive technical and administrative procedures to ensure consistent and systematic information security. “Good practice” requirements include system security requirements (e.g., idle session timeouts, disabling of generic accounts, inhibiting password guessing); operational requirements (e.g., breach reporting, patching, password complexity, logging); and regular auditing and review. The full Harvard University security policy can be found at http://security.harvard.edu/.

9. Budget

The cost of preparing data and documentation will be borne by the project, and is already reflected in the personnel costs included in the current budget. The incremental cost of permanent archiving activities will be borne by the Harvard Dataverse.

[ IF the data require storage over 5 GB, cannot be ingested using the acceptable formats above, require extensive documentation, or are unusually complex in structure, include: Staff time has been allocated in the proposed budget to cover the costs of preparing data and documentation for archiving for [describe complexities and management]. Harvard has estimated that its additional cost to permanently archive the data is [insert dollar amount, to be agreed upon with the Dataverse Project team at Harvard]. This fee also appears in the budget for this application. ]

10. Privacy, Intellectual Property, Other Legal Requirements

Information collected can be released without privacy restrictions because [ it does not constitute private information about identified human subjects; informed consent for full public release of the data will be obtained; the data will be anonymized using an IRB-approved protocol prior to the conduct of analysis ]. The data will not be encumbered with intellectual property rights (including copyright, database rights, license restrictions, trade secrets, patents, or trademarks) by any party (including the investigators, the investigators’ institutions, and data providers), nor are the data subject to any additional legal requirements. Depositing with the Harvard Dataverse does not require a transfer of copyright, but instead grants permission for the Harvard Dataverse to re-disseminate the data and to transform the data as necessary for preservation and access.

11. Archiving, Preservation, Long-term Access

The Harvard Dataverse commits to good archival practice, including independent, geospatially distributed replication; a succession plan for holdings; and regular content migration. Should the archiving entity be unable to perform, transfer agreements with the Data-PASS partnership ensure the continued preservation of the data by partner institutions. All data in this dataset will also be made available for replication by any party under a Creative Commons Attribution license, using the LOCKSS protocol, which is fully supported by the Dataverse application.

12. Adherence

[If not the PI, briefly describe who/which project role is responsible for managing data for the project]

Adherence to this plan will be checked by the PI at least ninety days prior to the expiration of the award. Adherence checks will include review of the Harvard Dataverse content; the number of datasets released; the availability for each dataset of subsettable, preservation-friendly data formats (possibly embargoed, but listed); the availability of public documentation; and the correctness of each data citation, including a UNF integrity check.
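Parts of this check can be scripted. As a minimal illustration (the collection alias below is a hypothetical placeholder), the sketch lists the project’s released datasets through the Dataverse search API and prints each dataset’s file count and version-level UNF:

    import requests

    BASE_URL = "https://dataverse.harvard.edu"
    COLLECTION = "my-project"  # hypothetical collection (dataverse) alias

    # List the collection's published datasets via the search API.
    search = requests.get(
        f"{BASE_URL}/api/search",
        params={"q": "*", "type": "dataset", "subtree": COLLECTION},
    )
    search.raise_for_status()

    # For each dataset, fetch its metadata and report its files and UNF.
    for item in search.json()["data"]["items"]:
        doi = item["global_id"]
        ds = requests.get(
            f"{BASE_URL}/api/datasets/:persistentId/",
            params={"persistentId": doi},
        )
        ds.raise_for_status()
        version = ds.json()["data"]["latestVersion"]
        print(doi, "-", len(version["files"]), "files, UNF:", version.get("UNF"))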