Documentation is written material that clarifies the meaning, creation, or use of datasets and data-related code. Some documentation is written for expert users, while some is written for newcomers; some documentation is written for machines, like the markup-based metadata that makes digital objects discoverable on the web.
Documentation can add immense value to a dataset, and we recommend creating and uploading basic documentation for all data that you share in a repository.
A readme is a simple plain-text document (.txt format) that provides basic information about a dataset or other file. Readme files are traditionally named “README.txt” and placed in the same folder as the file(s) they describe. The README is the most basic form of documentation, and every project should have at least one.
Codebooks describe the contents, structure, layout, and variable definitions for a data collection. They are usually text documents (Word, PDF, or .txt format). Codebooks are frequently, but not exclusively, used for survey and interview data.
Data dictionaries also describe the contents, structure, layout, and variable definitions for a data collection, but are used in a wider range of contexts than codebooks. They are frequently formatted as a data table or spreadsheet.
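As a rough illustration of a data dictionary formatted as a table, the sketch below builds one row per column of a small CSV file. The column names, the description text, and the output fields ("variable", "description", "example", "missing_count") are hypothetical choices made for this example, not a required layout.

```python
import csv
import io


def build_data_dictionary(csv_text, descriptions):
    """Return one data-dictionary row (a dict) per column in the CSV text."""
    reader = csv.DictReader(io.StringIO(csv_text))
    rows = list(reader)
    dictionary = []
    for name in reader.fieldnames or []:
        # Collect the non-empty values so we can show an example and count gaps.
        values = [row[name] for row in rows if row[name] != ""]
        dictionary.append({
            "variable": name,
            # Descriptions have to be written by a person; flag any that are missing.
            "description": descriptions.get(name, "TODO: describe"),
            "example": values[0] if values else "",
            "missing_count": len(rows) - len(values),
        })
    return dictionary


# Hypothetical sample data, used only to demonstrate the function.
sample = (
    "participant_id,age_years,employment_status\n"
    "001,34,employed\n"
    "002,,student\n"
)
data_dictionary = build_data_dictionary(
    sample, {"age_years": "Age in whole years at enrollment"}
)
for entry in data_dictionary:
    print(entry)
```

The resulting rows could be written to a spreadsheet or CSV file and shared alongside the dataset, with the "TODO" descriptions filled in by hand.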
If you write scripts, someone else may want to tweak your code or fix a bug. Make it easy for them by adding comments that explain what your code does and why. This is especially important if you’re using an unusual library or other dependency, or have created a non-obvious workaround.
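A short sketch of the kind of comments that help a later reader is shown below. The function name and the mixed-date-format scenario are invented for illustration; the point is that the comment explains why the workaround exists, not just what the code does.

```python
from datetime import datetime


def parse_visit_date(raw):
    """Parse a visit date string and return a datetime.date.

    Raises ValueError if the string matches none of the known formats.
    """
    # Workaround (hypothetical scenario): some export files write dates as
    # DD/MM/YYYY while others use ISO 8601 (YYYY-MM-DD). We try ISO first
    # because it is unambiguous, then fall back to the day-first format.
    for fmt in ("%Y-%m-%d", "%d/%m/%Y"):
        try:
            return datetime.strptime(raw.strip(), fmt).date()
        except ValueError:
            continue
    raise ValueError(f"Unrecognized date format: {raw!r}")


print(parse_visit_date("2021-03-04"))
print(parse_visit_date(" 04/03/2021 "))
```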
Metadata is descriptive information about a file, like its title, the name of its creator, and the subjects it’s related to. When you share data in a repository, metadata is what allows other people to find your files by searching or browsing.
Typically, a repository will ask you to provide specific pieces of metadata when you upload your files. A small portion of that metadata will be mandatory, usually basic information like your name and the title of the dataset. Other metadata will be optional, like subject keywords or the grant numbers of any awards that funded your research. Optional metadata can be very useful for helping other people discover your work, so fill it out if you can.
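To make the mandatory/optional distinction concrete, here is a minimal sketch that checks a metadata record before upload. The field names ("title", "creator", "keywords", "grant_numbers") and the sample values are assumptions made for this example; each repository defines its own fields and rules.

```python
import json

# Hypothetical field lists; a real repository publishes its own requirements.
REQUIRED_FIELDS = ("title", "creator")
OPTIONAL_FIELDS = ("keywords", "grant_numbers")


def check_metadata(record):
    """Return (missing_required, missing_optional) field names for a record."""
    missing_required = [f for f in REQUIRED_FIELDS if not record.get(f)]
    missing_optional = [f for f in OPTIONAL_FIELDS if not record.get(f)]
    return missing_required, missing_optional


# Placeholder record used only for demonstration.
record = {
    "title": "Example survey dataset",
    "creator": "Example Researcher",
    "keywords": ["survey", "employment"],
}
missing_required, missing_optional = check_metadata(record)

print(json.dumps(record, indent=2))
print("Missing required fields:", missing_required)
print("Optional fields worth adding:", missing_optional)
```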
Quick tips for better metadata:
Have you ever tried to integrate two datasets and been frustrated by the lack of standardization for simple, common variables like employment status, medication name, or age at death? By using Common Data Elements as your variables, you can reference pre-defined machine-readable variables that save time and make computational analysis easier. According to the NIH Common Data Elements (CDE) Repository, "a Common Data Element (CDE) is a standardized, precisely defined question, paired with a set of allowable responses, used systematically across different sites, studies, or clinical trials to ensure consistent data collection."
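The idea of a question paired with a set of allowable responses can be sketched in a few lines. The element below is a made-up stand-in, not an actual entry from the NIH CDE Repository, and the response values are hypothetical.

```python
# Hypothetical element for illustration only; not a real NIH CDE.
EMPLOYMENT_STATUS = {
    "question": "What is your current employment status?",
    "allowable_responses": ("employed", "unemployed", "student", "retired"),
}


def invalid_responses(element, responses):
    """Return the responses that are not in the element's allowable set."""
    allowed = set(element["allowable_responses"])
    return [r for r in responses if r not in allowed]


# Two sites collecting the same variable; site B used free text for one answer.
site_a = ["employed", "student"]
site_b = ["Employed full-time", "retired"]

problems = invalid_responses(EMPLOYMENT_STATUS, site_a + site_b)
print(problems)
```

When every site records answers from the same allowable set, combining the datasets needs no manual recoding; any value the check flags marks a place where the datasets would not integrate cleanly.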
The NIH Common Data Elements (CDE) Repository has two collections to browse: a collection of 125 NIH-endorsed CDEs that largely relate to patient status and COVID-19, and a much larger collection of cross-disciplinary CDEs submitted by NIH-recognized bodies like institutes and centers. Many disciplines and communities of practice are defining their own CDEs that have not yet been officially adopted by any federal organization. It may be worthwhile to perform a general web search for "common data elements [your discipline]" to see if any publications, white papers, or working groups have recently appeared with proposed CDEs relevant to your research.