
Data Management

Explanatory Documentation

Documentation is written material that clarifies the meaning, creation, or use of datasets and data-related code. Some documentation is written for expert users, while some is written for newcomers; some documentation is written for machines, like the markup-based metadata that makes digital objects discoverable on the web.

Documentation can add immense value to a dataset, and we recommend creating and uploading basic documentation for all data that you share in a repository.

READMEs

A README is a simple plain-text document (.txt format) that provides basic information about a dataset or other file. It is traditionally named “README.txt” and placed in the same folder as the file(s) it describes. The README is the most basic form of documentation, and every project should have at least one.
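As a minimal sketch of the convention described above, the following Python function writes a README.txt into the same folder as the files it describes. The function name `write_readme` and the section headings in the template are illustrative assumptions, not a required standard; adapt them to your project or to your repository's recommendations.

```python
from pathlib import Path

# Hypothetical section headings; adapt them to your own project.
README_TEMPLATE = """\
Title: {title}
Creator: {creator}
Date created: {date}

Files in this folder:
{file_list}

Notes:
{notes}
"""


def write_readme(folder, title, creator, date, notes=""):
    """Write README.txt into `folder`, listing the other files found there."""
    folder = Path(folder)
    # List every file in the folder except the README itself.
    names = sorted(
        p.name for p in folder.iterdir()
        if p.is_file() and p.name != "README.txt"
    )
    file_list = "\n".join(f"  - {name}" for name in names) or "  (none yet)"
    text = README_TEMPLATE.format(
        title=title,
        creator=creator,
        date=date,
        file_list=file_list,
        notes=notes or "  (none)",
    )
    readme_path = folder / "README.txt"
    readme_path.write_text(text, encoding="utf-8")
    return readme_path
```

The generated file is only a starting point: the descriptions of what each file contains and how it was created still need to be written by hand.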

Codebooks

Codebooks describe the contents, structure, layout, and variable definitions for a data collection. They are usually text documents (Word files, PDFs, or .txt files). Codebooks are frequently but not exclusively used for survey and interview data.

  • ICPSR’s codebook guide
  • Example: Duncan, Pamela W., Rosamond, Wayne D., and Bushnell, Cheryl D. Comprehensive Post-Acute Stroke Services (COMPASS) Study, North Carolina, 2016-2018. Inter-university Consortium for Political and Social Research [distributor], 2021-10-07. https://doi.org/10.3886/ICPSR38185.v1
    • Access the PI Codebook by clicking the "Download" drop-down button, then choosing "Documentation Only" and unzipping the downloaded folder

Data dictionaries

Data dictionaries also describe the contents, structure, layout, and variable definitions for a data collection, but are used in a wider range of contexts than codebooks. They are frequently formatted as a data table or spreadsheet.

  • OSF’s How to make a data dictionary
  • USGS’ guide to data dictionaries (meant for geology but very useful, even for the health sciences!)
  • Example: the NIH All of Us Research Hub provides information for individual data fields in its controlled-access curated datasets through data dictionaries linked on its Data Access Tiers page
    • View the data dictionaries as Google spreadsheets by scrolling down to the "Data Dictionary" section and choosing either "Registered Tier Data Dictionary" or "Controlled Tier Data Dictionary"
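Because data dictionaries are often formatted as tables, a first draft can be generated from the data file itself. The sketch below assumes the data is a CSV with a header row; the function names and the output columns ("variable", "type", "example", "description") are illustrative choices, not a standard. The type guess is deliberately crude, and the description column is left blank for a person to fill in.

```python
import csv
import io


def _guess_type(values):
    """Crudely classify a column as integer, decimal, text, or unknown."""
    if not values:
        return "unknown"
    try:
        for v in values:
            int(v)
        return "integer"
    except ValueError:
        pass
    try:
        for v in values:
            float(v)
        return "decimal"
    except ValueError:
        return "text"


def draft_data_dictionary(csv_text):
    """Return one entry per column of a CSV that has a header row."""
    reader = csv.DictReader(io.StringIO(csv_text))
    rows = list(reader)
    dictionary = []
    for name in reader.fieldnames or []:
        # Ignore blank cells when guessing the type and picking an example.
        values = [row[name] for row in rows if row[name] not in ("", None)]
        dictionary.append({
            "variable": name,
            "type": _guess_type(values),
            "example": values[0] if values else "",
            "description": "",  # meaning and units must be written by a person
        })
    return dictionary


# Made-up sample data, for illustration only.
sample = "participant_id,age,weight_kg,site\n1,34,70.5,Durham\n2,51,82,Raleigh\n"
for entry in draft_data_dictionary(sample):
    print(entry["variable"], entry["type"], entry["example"])
```

The entries can then be saved as a spreadsheet or CSV and completed with definitions, units, and allowed values.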

Commented code

If you write scripts, someone else may want to tweak your code or fix a bug. Make it easy for them by commenting your code, that is, adding short notes that explain what each part does and why. This is especially important if you’re using an unusual library or other dependency, or if you created a non-obvious workaround.
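As an illustration, here is a short Python function whose comments explain a non-obvious workaround. The scenario is hypothetical: it assumes an instrument that records a failed measurement as -999 rather than leaving the field blank.

```python
from statistics import mean

# Hypothetical convention: the instrument that produced these readings
# records a failed measurement as -999 instead of leaving the field blank.
MISSING_SENTINEL = -999


def mean_reading(readings):
    """Return the mean of the valid readings, or None if there are none."""
    # Workaround: drop the sentinel before averaging. Without this step,
    # a single failed measurement would drag the mean far below zero.
    valid = [r for r in readings if r != MISSING_SENTINEL]
    # statistics.mean raises StatisticsError on an empty list, so the
    # "no valid data" case is handled explicitly.
    if not valid:
        return None
    return mean(valid)
```

Note that the comments explain why each step exists, which is the part a later reader cannot recover from the code alone.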

Metadata

Metadata is the information that describes a file, like its title, the name of its creator, and the subjects it relates to. When you share data in a repository, metadata is what allows other people to find your files by searching or browsing.

Typically, a repository will ask you to provide specific pieces of metadata when you upload your files. A small portion of that metadata will be mandatory, usually basic information like your name and the title of the dataset. Other metadata will be optional, like subject keywords or the grant numbers of any awards that funded your research. This optional metadata can be very useful for helping other people discover your work, so fill it out if you can.

Quick tips for better metadata:

  • Sign up for an ORCID iD, a unique identifier that distinguishes you from every other researcher, including those with a similar name. When you upload your files to a repository, you can usually provide your ORCID iD along with your full name.
  • Try to provide 3 or more keywords (also called “subject headings”) for your data. Many repositories don’t search the full text of the files they store, so those keywords are important for the discovery of your data. Think about what you would search for if you were looking for similar data: pathogens? Diseases? Organs and tissues?
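The tips above can be turned into a quick self-check before you upload. The sketch below is not any repository's actual schema: the field names ("title", "creator", "orcid", "keywords") and the choice of required fields are assumptions for illustration. The ORCID check only tests the printed form (four groups of four characters, where the last character may be X), not whether the identifier belongs to you.

```python
import re

# Hypothetical required fields; each repository defines its own.
REQUIRED_FIELDS = ("title", "creator")

# Printed form of an ORCID iD: 0000-0000-0000-0000, last character may be X.
ORCID_PATTERN = re.compile(r"^\d{4}-\d{4}-\d{4}-\d{3}[\dX]$")


def metadata_warnings(record):
    """Return a list of human-readable problems found in a metadata record."""
    warnings = []
    for field in REQUIRED_FIELDS:
        if not record.get(field):
            warnings.append(f"missing required field: {field}")
    orcid = record.get("orcid", "")
    if not orcid:
        warnings.append("no ORCID iD provided")
    elif not ORCID_PATTERN.match(orcid):
        warnings.append("ORCID iD is not in 0000-0000-0000-0000 form")
    if len(record.get("keywords", [])) < 3:
        warnings.append("fewer than 3 keywords")
    return warnings


# Made-up record for illustration; the ORCID iD shown is the fictional
# example identifier that ORCID uses in its own documentation.
record = {
    "title": "Example stroke recovery survey",
    "creator": "A. Researcher",
    "orcid": "0000-0002-1825-0097",
    "keywords": ["stroke", "rehabilitation", "patient outcomes"],
}
print(metadata_warnings(record))
```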

Common Data Elements (CDEs)

Have you ever tried to integrate two datasets and been frustrated by the lack of standardization for simple, common variables like employment status, medication name, or age at death? By using Common Data Elements as your variables, you can reference pre-defined machine-readable variables that save time and make computational analysis easier. According to the NIH Common Data Elements (CDE) Repository, "a Common Data Element (CDE) is a standardized, precisely defined question, paired with a set of allowable responses, used systematically across different sites, studies, or clinical trials to ensure consistent data collection."
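To make the quoted definition concrete, the sketch below models a CDE as a question paired with a fixed set of allowable responses, and uses it to flag values that would need cleaning before two datasets are combined. The element shown is hypothetical and is not taken from the NIH CDE Repository; the class and function names are likewise illustrative.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CommonDataElement:
    """A precisely defined question paired with its allowable responses."""
    name: str
    question: str
    allowed_responses: tuple

    def validate(self, response):
        """Return True if `response` is one of the allowable responses."""
        return response in self.allowed_responses


# Hypothetical element for illustration; not an NIH-endorsed CDE.
EMPLOYMENT_STATUS = CommonDataElement(
    name="employment_status",
    question="What is your current employment status?",
    allowed_responses=("Employed", "Unemployed", "Retired", "Student", "Unknown"),
)


def invalid_responses(element, responses):
    """Return the responses that fall outside the element's allowed set."""
    return [r for r in responses if not element.validate(r)]


# Free-text answers like the second one are what make datasets hard to merge.
collected = ["Employed", "employed full time", "Retired"]
print(invalid_responses(EMPLOYMENT_STATUS, collected))
```

When every site records the variable using the same allowed set, this kind of check passes without any manual recoding.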

The NIH Common Data Elements (CDE) Repository has two collections to browse: a collection of 125 NIH-endorsed CDEs that largely relate to patient status and COVID-19, and a much larger collection of cross-disciplinary CDEs submitted by NIH-recognized bodies like institutes and centers. Many disciplines and communities of practice are defining their own CDEs that have not yet been officially adopted by any federal organization. It may be worthwhile to perform a general web search for "common data elements [your discipline]" to see if any publications, white papers, or working groups have recently appeared with proposed CDEs relevant to your research.

Related Resources