The life cycle of data, from generation/collection to archival

Activity 2 round robin

American Journal of Botany

Journal policy

The American Journal of Botany thus requires all authors to archive the data, code, and any other information integral to the published research but not contained within the paper itself. This policy also applies to custom software described in the paper. Whenever possible, the scripts and other artefacts used to generate the analyses presented in the paper should also be publicly archived. Data must be archived by the time of publication.

Implementation

Paper 1

The authors provide the character matrix and character states for all the features they measure and the taxa evaluated. These data sets were included directly in the paper as appendices. They described the methods but did not include a script or a more detailed protocol.

Paper 2

The authors provide a link to GenBank for all the sequences. However, this link is no longer available. They described the methods, but did not include a script or more detailed protocol.

Paper 3

The authors provided a link to the University of Stirling repository, where the data collected during both experiments are available. The data are presented as .csv files. An R script with information about the analyses and figure generation is also provided in the repository. However, the script lacks descriptive commentary on the steps followed.

Paper 4

The raw data for these analyses were not provided in the paper. However, a Dryad repository was created to store the results from the pAUC analysis. No analysis scripts were provided.

Paper 5

A matrix containing anatomical data for 147 species (out of the 275 mentioned in the paper) was made available on GitHub. The phylogeny was not provided. Scripts for nearly all of the analyses, adapted from another paper, were also published on GitHub.

Ecology Letters

Journal policy

Data and code are important products of scientific enterprise, and they must be preserved and remain accessible in future decades.

For manuscripts that depend on new or existing data, and/or on code written by the authors, Ecology Letters requires that this material be supplied and accessible to editors and reviewers at the time of submission, and permanently archived in an accessible repository before publication.

Ecology Letters requires that the raw data (or subset of existing data) used to generate the results in the paper are archived in public repositories such as: Dryad, Figshare, Hal, Zenodo, NERC Environmental Data Service, OSF, US federal agency repositories, Environmental Data Initiative (EDI), DataONE, or a similar repository which assigns permanent unique DOIs.

Computer code used to produce the figures and conduct analyses or simulations must also be archived in a public repository (e.g., Zenodo, Figshare). Code should not be uploaded with your submission as a supporting document. All code must be annotated so readers can understand what each segment or function does.
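A short sketch of what that kind of annotation can look like in practice. This is Python with invented measurement values (`stem_heights_cm` and the summary statistics are purely illustrative, not from any of the papers discussed); the point is that each step carries a comment saying what it does and why:

```python
# A sketch of the kind of annotation the policy asks for: each step of a
# hypothetical analysis is labeled with what it does.
import statistics

# Step 1: load the measurements. Here they are hard-coded for
# illustration; a real script would read the archived .csv file.
stem_heights_cm = [12.4, 9.8, 11.1, 10.5]

# Step 2: summarize the response variable reported in the paper
# (mean and sample standard deviation).
mean_height = statistics.mean(stem_heights_cm)
sd_height = statistics.stdev(stem_heights_cm)

# Step 3: report the summary used in the results section.
print(f"mean = {mean_height:.2f} cm, sd = {sd_height:.2f} cm")
```

Even comments this brief would address the gaps noted above for the repositories whose scripts "lack descriptive commentary on the steps followed."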

Note that Ecology Letters also maintains a board of roughly a dozen Data Editors.

Implementation

Paper 1

Full data set available on Figshare; analysis code also on Figshare.

Paper 2

Full data set and code available on Zenodo.

Paper 3

Full data set on Zenodo. Code for the microbiome DNA sequencing (BLAST) and perhaps the network construction is present; the rest is likely missing. It is hard to be certain, since there are a large number of files and no clear documentation.

Paper 4

Full dataset and code available on Zenodo.

Paper 5

Full dataset and code available on Zenodo.

Ecology and Evolution

Journal policy

Open data means sharing data, and the idea is made more meaningful by the term “FAIR” (i.e., data that are findable, accessible, interoperable, and reusable). Being more open with data means you can analyze other researchers’ findings and reuse them to inform new findings, and the research landscape becomes more efficient and accountable.

You will benefit from making your research data openly available too. More transparent data helps reviewers see how researchers went from data to analysis and provide nuanced feedback based on their own understanding of the data. When researchers make their data open, they are showing the world they are working transparently and reproducibly, building a strong reputation that will help them throughout their professional life, encouraging reuse and citation.

All research- and synthesis-based articles must include a Data Availability Statement, whether or not the data used in the article are shared.

Implementation

Paper 1

Data and code available on Figshare

Paper 2

Data and code available on Dryad

Paper 3

Data available at https://srtm.csi.cgiar.org/srtmdata/ and code available at https://github.com/DipeshDFRS/Snow_leopard. The R code appears to contain the analyses.

Paper 4

Data and code available in a GitHub repository

  • However, the repository README states “Status: in review”, so it is unclear whether this is the final published version.

Paper 5

Raw data on Figshare; analyses in supplement.

Journal of Ecology

Journal policy

Data are important products of the scientific enterprise, and they should be preserved and usable for decades in the future. The British Ecological Society thus requires, as a condition for publication, that all data supporting the results in papers published in its journals are archived in an appropriate public archive offering open access and guaranteed preservation. For theoretical papers the underlying model code must be archived. […] The archived data must allow each result in the published paper to be recreated and the analyses reported in the paper to be replicated in full to support the conclusions made. Authors are welcome to archive more than this, but not less.

Implementation

Paper 1

Data archived in EDI (Environmental Data Initiative) data portal; no analysis code provided

Paper 2

Data archived in Dryad; no analysis code provided

Paper 3

Data archived in Dryad; no analysis code provided

Paper 4

Plant data in Northeastern University Data Portal; Microbial sequence data in NIH/NCBI Data Portal; no analysis code provided

Paper 5

Data archived in Dryad; no analysis code provided

Summary

State of the field 10 years ago:

We surveyed 100 datasets associated with nonmolecular studies in journals that commonly publish ecological and evolutionary research and have a strong public data archiving (PDA) policy. Out of these datasets, 56% were incomplete, and 64% were archived in a way that partially or entirely prevented reuse.

Does anyone want to apply their methodology to the last 10 years?

Principles guiding data archival

“FAIR” Data principles

“FAIR Principles put specific emphasis on enhancing the ability of machines to automatically find and use the data, in addition to supporting its reuse by individuals”

“This article describes four foundational principles – Findability, Accessibility, Interoperability, and Reusability – that serve to guide data producers and publishers as they navigate around these obstacles, thereby helping to maximize the added-value gained by contemporary, formal scholarly digital publishing.

Importantly, it is our intent that the principles apply not only to ‘data’ in the conventional sense, but also to the algorithms, tools, and workflows that led to that data.”

FAIR guiding principles

To be Findable:

F1. (meta)data are assigned a globally unique and persistent identifier

F2. data are described with rich metadata (defined by R1 below)

F3. metadata clearly and explicitly include the identifier of the data it describes

F4. (meta)data are registered or indexed in a searchable resource

To be Accessible:

A1. (meta)data are retrievable by their identifier using a standardized communications protocol

A1.1 the protocol is open, free, and universally implementable

A1.2 the protocol allows for an authentication and authorization procedure, where necessary

A2. metadata are accessible, even when the data are no longer available

To be Interoperable:

I1. (meta)data use a formal, accessible, shared, and broadly applicable language for knowledge representation.

I2. (meta)data use vocabularies that follow FAIR principles

I3. (meta)data include qualified references to other (meta)data

To be Reusable:

R1. meta(data) are richly described with a plurality of accurate and relevant attributes

R1.1. (meta)data are released with a clear and accessible data usage license

R1.2. (meta)data are associated with detailed provenance

R1.3. (meta)data meet domain-relevant community standards

These high-level FAIR Guiding Principles precede implementation choices, and do not suggest any specific technology, standard, or implementation-solution; moreover, the Principles are not, themselves, a standard or a specification.

  • Another way to think about it: the outcome isn’t necessarily a “FAIR” or “not FAIR” archive, but rather a more or less “FAIR” one

FAIR principles are commonly adopted

  • Many journal policies (e.g. Ecology and Evolution) explicitly point to these
  • NSF’s Data Management Plans encourage defining plans relative to FAIR principles

Findability

To be Findable:

F1. (meta)data are assigned a globally unique and persistent identifier

F2. data are described with rich metadata (defined by R1 below)

F3. metadata clearly and explicitly include the identifier of the data it describes

F4. (meta)data are registered or indexed in a searchable resource

  • What is a “searchable resource”?
  • How do you assign a globally unique and persistent identifier?
  • What counts as “sufficient” metadata?

Databases for archiving

From our Activity 2 submissions, we saw that ecologists and evolutionary biologists frequently upload their data (including code) in a variety of places – the paper’s supplement/appendix, Dryad, Figshare, Zenodo, institutional repositories, GitHub, etc.

  • Not all of these are equal.
  • Following the FAIR principles, the dataset should be assigned a globally unique and persistent identifier.
    • Github and paper’s supplementary materials don’t achieve this – and they shouldn’t, e.g. the owner can delete a repository any time. Persistence is not guaranteed.
    • Each dataset should be assigned a DOI (Digital Object Identifier)
    • Each DOI is a unique link, which across the whole of the internet points only to this one place.
    • DOIs are automatically generated, and are persistent, meaning that anything assigned a DOI is more or less “permanently” available at that address.
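To make the “unique link” idea concrete: every DOI has the shape `10.<registrant>/<suffix>`, and prefixing it with https://doi.org/ gives the persistent address that redirects to wherever the dataset currently lives. The sketch below (the Dryad-style DOI suffix is invented for illustration, and the pattern is a simplified check rather than the full specification) validates that shape and builds the resolver URL:

```python
import re

# A DOI has the form "10.<registrant>/<suffix>".
# This pattern is a simplified sanity check, not the full DOI grammar.
DOI_PATTERN = re.compile(r"^10\.\d{4,9}/\S+$")

def resolver_url(doi: str) -> str:
    """Return the persistent https://doi.org/ address for a DOI string."""
    if not DOI_PATTERN.match(doi):
        raise ValueError(f"not a valid DOI: {doi!r}")
    return f"https://doi.org/{doi}"

# A hypothetical Dryad-style DOI, used purely for illustration:
print(resolver_url("10.5061/dryad.example123"))
# → https://doi.org/10.5061/dryad.example123
```

The key property is the indirection: the doi.org address never changes, even if the repository reorganizes its own URLs – which is exactly what a GitHub link cannot promise.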

Databases for archiving

  • Within ecology and evolution, Dryad, Figshare, and Zenodo are commonly used archival repositories (for non-sequence data).

    • Institutional repositories, Open Science Framework, etc. are also common
  • For certain types of data (e.g. long sequence reads, individual barcode sequences, protein sequences, or 3D structures of proteins), there are established databases that you should become familiar with.

  • If you use any of these, you’re doing great.

  • If you are worried about which to use, look to your journal to understand the norms in your field.

  • Don’t use GitHub (or GitLab) as an “archive”

Accessibility

Data, once archived, should be easy to access

  • If you are using one of the archival databases, this isn’t a concern.

Interoperable

  • Humans (and computers) should be able to exchange and interpret each other’s data.
  • Use file formats that work across all computers and operating systems and are freely available
    • e.g. use .csv files instead of .xls files for spreadsheets
  • Store data in reasonable units – if there are field standards, use those; otherwise, include clear metadata
  • For “big data” in ecology, consider using the Ecological Metadata Language (EML) to document your work.
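A small sketch of the interoperable-format point: Python’s standard `csv` module writes a plain-text table that any software on any platform can read, and encoding the units directly in the column names makes the file partly self-describing. The measurement values and column names here are invented for illustration:

```python
import csv
import io

# Hypothetical field measurements; column names carry the units
# (cm, ISO dates) so the file documents itself without a proprietary format.
rows = [
    {"site": "A1", "date_iso": "2023-06-01", "stem_height_cm": 12.4},
    {"site": "A2", "date_iso": "2023-06-01", "stem_height_cm": 9.8},
]

# Write to an in-memory buffer here; a real script would open a .csv file.
buf = io.StringIO()
writer = csv.DictWriter(buf, fieldnames=["site", "date_iso", "stem_height_cm"])
writer.writeheader()
writer.writerows(rows)

print(buf.getvalue())
```

Unlike an .xls file, the result needs no particular vendor’s software to open, which is what “interoperable” means in practice.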

Reusable

Practical reusability

  • For someone to reuse your data, they need to know its provenance
  • i.e., how the data were generated in the first place: who, how, when, why, where, etc.
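One lightweight way to record that who/how/when/why/where information is a small machine-readable file deposited next to the data. The sketch below writes such a record as JSON; every field value is invented, and for real projects a community standard such as EML is the better choice:

```python
import json

# A minimal, hypothetical provenance record. Real projects should follow
# a community metadata standard (e.g. the Ecological Metadata Language).
provenance = {
    "who": "J. Doe, Example University",
    "when": "2023-06-01/2023-08-31",
    "where": "Field site A (coordinates withheld for a sensitive species)",
    "how": "Stem height measured with a ruler to the nearest 0.1 cm",
    "why": "Baseline survey for a drought-response experiment",
}

# Serialize to human- and machine-readable JSON, suitable for archiving
# alongside the .csv data files.
print(json.dumps(provenance, indent=2))
```

Even this minimal record answers the questions a reuser asks first, and because it is plain JSON it stays as interoperable as the data it describes.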