This vignette describes version 1.0 of the ROCK project file format. ROCK project files have the extension .ROCKproject and are ZIP archives. They contain two things:
Files containing the data, ideally in a deliberately designed set of sub-directories to facilitate tracing the data through different stages of processing and analysis;
Files containing settings and directives for applications and processing of the data.
The former are raw data files and ROCK files. ROCK files are plain text files with the .rock extension.
The latter are YAML files. Of these, the only required one is the _ROCKproject.yml file. This file must always be a regular YAML file that contains a map with key _ROCKproject. This map in turn must contain maps with keys project, codebook, sources, and workflow.
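The two requirements above (a ZIP archive that holds _ROCKproject.yml, and a top-level _ROCKproject map with four keys) can be checked with a few lines of code. The following is a minimal sketch using only the Python standard library. The function names are hypothetical, the file is assumed to sit at the root of the archive, and the YAML is assumed to have already been parsed into a dictionary by a YAML library of your choice.

```python
import io
import zipfile

PROJECT_FILE = "_ROCKproject.yml"
REQUIRED_KEYS = ("project", "codebook", "sources", "workflow")


def has_project_file(archive) -> bool:
    """Return True if the ZIP archive lists _ROCKproject.yml.

    Assumption: the file is stored at the root of the archive.
    """
    with zipfile.ZipFile(archive) as zf:
        return PROJECT_FILE in zf.namelist()


def missing_keys(parsed: dict) -> list:
    """Return the required keys that are absent from the parsed YAML.

    Only presence is checked, because a key such as codebook may
    legitimately hold ~ (NULL).
    """
    root = parsed.get("_ROCKproject")
    if not isinstance(root, dict):
        return ["_ROCKproject"]
    return [key for key in REQUIRED_KEYS if key not in root]


# Build a tiny in-memory archive to demonstrate the archive check.
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as zf:
    zf.writestr(PROJECT_FILE, "_ROCKproject: {}\n")

print(has_project_file(buffer))
print(missing_keys({"_ROCKproject": {"project": {}, "codebook": None}}))
```

Checking for key presence rather than for a nested map is a deliberate choice here, so that a codebook set to ~ still passes.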
The project map contains project metadata: the project’s title, its authors, optional (but strongly recommended!) author identifiers in authorIds, the project’s version, the version of the ROCK standard used in the project (with key ROCK_version), the version of the ROCK project file format (with key ROCK_project_version), the date the project was created (with key date_created), and the date the project was last modified (with key date_modified).
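A program reading the project map may want to validate these fields. The sketch below is one hedged way to do so in Python. It uses the version and ORCID patterns given in the comments of the example file at the end of this vignette. The function name check_project and the shape of its return value are assumptions made for illustration only.

```python
import re

# Patterns taken from the comments in the example _ROCKproject.yml file.
VERSION_RE = re.compile(r"[0-9]+(\.[0-9]+)*")
ORCID_RE = re.compile(r"([0-9]{4}-){3}[0-9]{4}")

VERSION_KEYS = ("version", "ROCK_version", "ROCK_project_version")


def check_project(project: dict) -> list:
    """Return the names of fields that are missing or malformed."""
    problems = []
    for key in VERSION_KEYS:
        value = project.get(key)
        # YAML may parse an unquoted 1 as an integer, so compare as text.
        if value is None or not VERSION_RE.fullmatch(str(value)):
            problems.append(key)
    # authorIds is optional; treat a missing or NULL value as empty.
    for index, author in enumerate(project.get("authorIds") or []):
        orcid = author.get("orcid")
        if orcid is not None and not ORCID_RE.fullmatch(str(orcid)):
            problems.append(f"authorIds[{index}].orcid")
    return problems


example = {
    "title": "The Alice Study",
    "version": "1.1",
    "ROCK_version": 1,
    "ROCK_project_version": 1,
    "authorIds": [{"display_name": "Talea Cornelius",
                   "orcid": "0000-0001-7181-0981"}],
}
print(check_project(example))
```

An empty list means no problems were found under these assumed rules.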
The codebook map contains the project’s codebook, either embedded or by linking to it. The codebook key can also have value ~ (NULL) if no codebook information is specified (or if the codebook is embedded in the ROCK files). Valid keys within the codebook map are urcid, embedded, and local. The urcid key can store the project’s Unique ROCK Codebook Identifier (i.e. its URCID) as a URL to a ROCK codebook in spreadsheet (.xlsx or .ods) or YAML (.yml or .rock) format.
The sources map specifies where the project’s data resides. This is specified in terms of regular expressions. The first valid key is extension, which is not a regular expression but can be used to conveniently specify that files with a given extension must be imported. This is used if regex is ~ (NULL, i.e. unspecified). However, if a value is specified for regex, a program importing a ROCK project should ignore whatever is specified for extension. The value stored in the dirsToIncludeRegex key should be a regular expression indicating which directories contain the data (i.e. the ROCK files forming the project). The recursive key can be true or false and indicates whether all subdirectories of matched directories should be imported too. The dirsToExcludeRegex regular expression can be used to ignore directories. In addition, if filesToIncludeRegex is specified, only files matching that regular expression should be imported; and if filesToExcludeRegex is specified, files matching that regular expression should be ignored.
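The selection rules above can be sketched as a small filter over relative file paths. This is an illustration under several assumptions that the format description does not settle: directory patterns are matched against the full directory path (with trailing slash), file patterns are searched in the file name only, and a missing recursive key is treated as false. The function names are hypothetical.

```python
import re


def ancestors(path):
    """List every directory above a file, e.g. 'a/b/f' -> ['a/', 'a/b/']."""
    parts = path.split("/")[:-1]
    return ["/".join(parts[:i + 1]) + "/" for i in range(len(parts))]


def in_included_dir(dirs, pattern, recursive):
    """Check whether a file sits in (or, if recursive, below) a matched directory."""
    if pattern is None:
        return True
    candidates = dirs if recursive else dirs[-1:]
    return any(re.fullmatch(pattern, d) for d in candidates)


def select_sources(paths, sources):
    """Return the paths selected by a parsed `sources` map."""
    file_regex = sources.get("regex")
    if file_regex is None:
        # extension is only used when regex is unspecified.
        file_regex = re.escape(sources.get("extension") or "") + "$"
    include_dirs = sources.get("dirsToIncludeRegex")
    exclude_dirs = sources.get("dirsToExcludeRegex")
    include_files = sources.get("filesToIncludeRegex")
    exclude_files = sources.get("filesToExcludeRegex")
    recursive = bool(sources.get("recursive", False))

    selected = []
    for path in paths:
        dirs = ancestors(path)
        name = path.rsplit("/", 1)[-1]
        if not in_included_dir(dirs, include_dirs, recursive):
            continue
        if exclude_dirs and any(re.fullmatch(exclude_dirs, d) for d in dirs):
            continue
        if not re.search(file_regex, name):
            continue
        if include_files and not re.search(include_files, name):
            continue
        if exclude_files and re.search(exclude_files, name):
            continue
        selected.append(path)
    return selected


sources = {"extension": ".rock", "regex": None,
           "dirsToIncludeRegex": "data/", "recursive": True}
paths = ["data/010---raw-sources/a.rock", "data/notes.txt",
         "other/b.rock", "data/c.rock"]
print(select_sources(paths, sources))
```

With recursive set to true, both .rock files under data/ are selected; with it set to false, only data/c.rock would remain under these assumed matching rules.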
Finally, the workflow map describes the workflow and data management template used in this project. It consists of a pipeline and actions.
The pipeline is a sequence of stages, each with an identifier (in key stage); the directory containing files in that stage (in key dirName; note that this is a single directory name, not a regular expression!); and a sequence of one or more next stages (with key nextStages). Each element in nextStages has a nextStageId key and an actionId key. The nextStageId specifies to which stage files transfer (i.e. are saved) when the action with the corresponding actionId is executed.
These actions are stored in a sequence where each element has an actionId; a language key specifying the programming language the action is written in; one or more dependencies (typically packages that need to be loaded in that programming environment before the script can be executed); and a script section specifying the commands to run to execute that action.
In this script, two placeholders can be used: {currentStage::dirName} will be replaced with the contents of dirName for the current stage; and {nextStage::dirName} will be replaced with the contents of dirName for the next stage. The latter part of these expressions (dirName in both of these examples) can be replaced by other keys specified in each stage to allow setting parameters in the pipeline specification.
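The placeholder mechanism can be illustrated with a short substitution routine. This is a sketch, not the reference implementation: the function name fill_placeholders is hypothetical, the values are inserted verbatim without quoting, and a placeholder that names a key missing from the stage raises a KeyError.

```python
import re

# Matches {currentStage::someKey} and {nextStage::someKey}.
PLACEHOLDER = re.compile(
    r"\{(currentStage|nextStage)::([A-Za-z_][A-Za-z0-9_]*)\}"
)


def fill_placeholders(script, current_stage, next_stage):
    """Replace stage placeholders in an action script.

    current_stage and next_stage are the parsed stage maps, so any key
    they define (not only dirName) can be referenced.
    """
    stages = {"currentStage": current_stage, "nextStage": next_stage}

    def replace(match):
        which, key = match.groups()
        return str(stages[which][key])

    return PLACEHOLDER.sub(replace, script)


raw = {"stage": "raw", "dirName": "data/010---raw-sources"}
uids = {"stage": "uids", "dirName": "data/030---sources-with-uids"}
script = "input = {currentStage::dirName}, output = {nextStage::dirName}"
print(fill_placeholders(script, raw, uids))
```

Because the key name is captured by the pattern rather than hard-coded, a stage-specific parameter stored under another key can be referenced in the same way.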
An example of a _ROCKproject.yml file is included below.
_ROCKproject:
  project:
    title: "The Alice Study" # Any character string
    authors: "Author names as string" # Any character string
    authorIds:
      -
        display_name: "Talea Cornelius" # Any character string
        orcid: "0000-0001-7181-0981" # Any character string matching ^([0-9]{4}-){3}[0-9]{4}$
        shorcid: "ip6b381" # Any character string matching ^[0-9a-zA-Z]+$
      -
        display_name: "Gjalt-Jorn Peters" # Any character string
        orcid: "0000-0002-0336-9589" # Any character string matching ^([0-9]{4}-){3}[0-9]{4}$
        shorcid: "it36ll9" # Any character string matching ^[0-9a-zA-Z]+$
    version: "1.1" # Anything matching regex [0-9]+(\\.[0-9]+)*
    ROCK_version: 1 # Anything matching regex [0-9]+(\\.[0-9]+)*
    ROCK_project_version: 1 # Anything matching regex [0-9]+(\\.[0-9]+)*
    date_created: "2023-03-01 20:03:51 UTC" # Anything matching that date format, preferably converted to UTC timezone
    date_modified: "2023-03-08 20:03:51 UTC" # Anything matching that date format, preferably converted to UTC timezone
  codebook:
    urcid: ""
    embedded: ~
    local: ""
  sources:
    extension: ".rock" # Any valid extension
    regex: ~ # Any regex or ~
    dirsToIncludeRegex: data/ # Any regex or ~
    recursive: true # true or false
    dirsToExcludeRegex: ~ # Any regex or ~
    filesToIncludeRegex: ~ # Any regex or ~
    filesToExcludeRegex: ~ # Any regex or ~
  workflow:
    pipeline:
      -
        stage: raw # Anything matching regex [a-zA-Z][a-zA-Z0-9_]*
        dirName: "data/010---raw-sources" # Any valid directory name, using a forward slash as separator
        nextStages:
          -
            nextStageId: clean # A different stage identifier or ~
            actionId: cleanSource
          -
            nextStageId: uids # A different stage identifier or ~
            actionId: addUIDs
      -
        stage: clean # Anything matching regex [a-zA-Z][a-zA-Z0-9_]*
        dirName: "data/020---cleaned-sources" # Any valid directory name, using a forward slash as separator
        nextStages:
          -
            nextStageId: uids # A different stage identifier or ~
            actionId: addUIDs
      -
        stage: uids # Anything matching regex [a-zA-Z][a-zA-Z0-9_]*
        dirName: "data/030---sources-with-uids" # Any valid directory name, using a forward slash as separator
        nextStage: coded # A different stage identifier or ~
      -
        stage: coded # Anything matching regex [a-zA-Z][a-zA-Z0-9_]*
        dirName: "data/040---coded-sources" # Any valid directory name, using a forward slash as separator
        nextStage: masked # A different stage identifier or ~
      -
        stage: masked # Anything matching regex [a-zA-Z][a-zA-Z0-9_]*
        dirName: "data/090---masked-sources" # Any valid directory name, using a forward slash as separator
        nextStage: ~ # A different stage identifier or ~
    actions:
      -
        actionId: addUIDs # String, referenced from the stages
        language: R # Language, has to be matched to interpreter
        dependencies: rock # Dependencies to be loaded before running the script
        script: | # Literal block style string
          rock::prepend_ids_to_sources(
            input = {currentStage::dirName},
            output = {nextStage::dirName}
          );
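Once such a file has been parsed, the pipeline can be flattened into a list of transitions, which makes it easy to verify that every referenced action is defined. The sketch below assumes the parsed pipeline and actions are plain Python lists of dictionaries and uses the nextStageId key described in the text. It also accepts the single nextStage key that the later stages of the example use, which it treats as a transition without an action; that reading is an assumption. The function names are hypothetical.

```python
def transitions(pipeline):
    """Return (stage, actionId, nextStageId) triples for a parsed pipeline."""
    result = []
    for stage in pipeline:
        for step in stage.get("nextStages") or []:
            result.append(
                (stage["stage"], step.get("actionId"), step.get("nextStageId"))
            )
        # Assumption: a bare nextStage key is a transition without an action.
        if stage.get("nextStage") is not None:
            result.append((stage["stage"], None, stage["nextStage"]))
    return result


def undefined_actions(pipeline, actions):
    """Return actionIds used in the pipeline but absent from actions."""
    defined = {action["actionId"] for action in actions}
    used = {a for _, a, _ in transitions(pipeline) if a is not None}
    return sorted(used - defined)


pipeline = [
    {"stage": "raw", "dirName": "data/010---raw-sources",
     "nextStages": [{"nextStageId": "clean", "actionId": "cleanSource"},
                    {"nextStageId": "uids", "actionId": "addUIDs"}]},
    {"stage": "uids", "dirName": "data/030---sources-with-uids",
     "nextStage": "coded"},
    {"stage": "masked", "dirName": "data/090---masked-sources",
     "nextStage": None},
]
actions = [{"actionId": "addUIDs", "language": "R"}]

print(transitions(pipeline))
print(undefined_actions(pipeline, actions))
```

Applied to the example above, a check of this kind would report cleanSource, which the raw stage references but which has no entry under actions.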