Data Curation

Data curation is concerned with advancing access to trustworthy and reusable data resources. DataLab researchers are actively investigating how to build rich, functional collections of digital data for research communities in the sciences and social sciences and how to improve access to open data for the public. Their work contributes to sustaining the long-term value of open data resources and global progress toward shared cyberinfrastructure.

Current Projects

Open Data Literacy

Open Data Literacy is improving accessibility and use of open data through partnerships with public sector institutions. Action research projects are helping organizations make their data open and usable by the public. New curriculum and outreach are preparing information professionals to lead open data initiatives.

ODL is funded by a grant from the Institute of Museum and Library Services, Laura Bush 21st Century Librarians Program. Grant number: 67-5285.

Research Reproducibility and Data Reuse in Earth System Science

Earth System Science (ESS) is concerned with the physical, chemical, biological and human interactions that determine the future of our planet and the destiny of humankind. ESS research requires an intersection of disciplinary methods and results and relies on heterogeneous, ever-growing data collections. The interdisciplinary nature of ESS poses a number of data challenges related to the need for valid integration and reuse of data and research reproducibility. This survey study investigates the perceptions, experiences, and practices of ESS researchers to benchmark the current state of reproducible research and data reuse. The results will inform how to develop open data systems and services for ESS. There are high expectations that open data can support transparency, rigor, and innovation in science and accelerate the pace of new discovery. Our work addresses the particular problems of optimizing open data for application and integration across disciplines. The survey results will provide a baseline to inform further work on the data infrastructure needed to sustain data quality and foster the data access and integration necessary for robust, interdisciplinary, and reproducible ESS research.

Site-Based Data Curation

The Site-Based Data Curation project (SBDC) is developing a framework for the curation of research data generated at scientifically significant research sites. The framework is based on geobiology conducted at Yellowstone National Park, as an exemplar site producing data with long-term value. Yellowstone is a tremendously important and rich site for data collection in geobiology, drawing scientists investigating research questions ranging from the origin of life on Earth to the search for life on other planets. Modern research in the earth sciences increasingly depends on the development of systematic accounts of the interactions of physical, chemical and biological phenomena and the integration of diverse measurements and observations. Making data accessible and functional for these purposes will depend on 1) principled curation practices early in the data lifecycle and 2) curating cohesive and usable sets of data for transfer to repositories. The SBDC framework is also an important step forward in evolving the professional work of curation, and the inter-institutional relationships that are essential in the emerging ecology of scientific data curation. 

This work was funded by IMLS National Leadership Grant LG‐06‐12‐0706‐12.

This project is conceptualizing a US Research Software Sustainability Institute (URSSI) that will validate and address various classes of concerns impacting all software development and maintenance projects across all of NSF. URSSI conceptualization includes workshops and a widely distributed survey that will engage important stakeholder communities to learn about the software they produce and use, and the ways they contemplate sustaining it, following the paths blazed by other successful software institutes. The workshops, survey, and community management approach allow the conceptualization project to iteratively build on existing, extensive understanding of the challenges for sustainable software and its developers. The project also addresses how URSSI could formalize, diversify, and improve the pipeline under which students enter universities, learn about and contribute to software, then graduate to full-time positions where they make use of their software skills, to increase the diversity of those entering research software development and to retain diversity over their university careers.

The Qualitative Data Repository and Hypothesis have partnered to develop a new approach to achieving transparency in qualitative and multi-method research: Annotation for Transparent Inquiry (ATI). ATI builds on “active citation,” an earlier approach pioneered by Andrew Moravcsik. Using ATI empowers social scientists to develop “data supplements” that can be linked directly to digital publications on multiple platforms. An ATI Data Supplement includes two sections: a “Data Overview” and a set of digital annotations (potentially linked to underlying data sources). The annotations elucidate the data and analysis on which the publication is based. ATI employs “open annotation,” which allows for the generation, sharing, and discovery of digital annotations across the web (Sanderson et al. 2017).

The Qualitative Data Repository (QDR) curates, stores, preserves, publishes, and enables the download of digital data generated through qualitative and multi-method research in the social sciences. The repository develops and disseminates guidance for managing, sharing, citing, and reusing qualitative data, and contributes to the generation of common standards for doing so. QDR’s overarching goals are to make sharing qualitative data customary in the social sciences, to broaden access to social science data, and to strengthen qualitative and multi-method research.

Weber is the Technical Director of QDR.