Title: | dbGaP Checkup |
---|---|
Description: | Contains functions that check for formatting of the Subject Phenotype data set and data dictionary as specified by the National Center for Biotechnology Information (NCBI) Database of Genotypes and Phenotypes (dbGaP) <https://www.ncbi.nlm.nih.gov/gap/docs/submissionguide/>. |
Authors: | Lacey W. Heinsberg [aut, cre], Daniel E. Weeks [aut], University of Pittsburgh [cph] |
Maintainer: | Lacey W. Heinsberg <[email protected]> |
License: | GPL-2 |
Version: | 1.1.0 |
Built: | 2025-01-21 05:26:49 UTC |
Source: | https://github.com/lwheinsberg/dbgapcheckup |
This function adds additional fields required by this package including variable type (TYPE), minimum value (MIN), and maximum value (MAX).
add_missing_fields(DD.dict, DS.data)
DD.dict |
Data dictionary. |
DS.data |
Data set. |
Even though MIN, MAX, and TYPE are not required by dbGaP, our package was created to use these variables in a series of other checks and awareness functions (e.g., render_report, values_check). MIN/MAX columns will be added as empty columns because dbGaP instructions state that MIN and MAX should be the "logical" MIN/MAX for the data, not necessarily the observed MIN/MAX, which would be study and variable specific. TYPE will be inferred from the data set and the data dictionary VALUES columns. Note, however, that if the VALUES columns are not set up correctly, this function cannot properly infer the data TYPE from the data set and data dictionary.
A data frame containing the updated data dictionary with missing fields added in, or NULL if any required pre-checks fail.
# Example
data(ExampleD)
DD.dict.updated <- add_missing_fields(DD.dict.D, DS.data.D)
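As a rough illustration of the behavior described for add_missing_fields, the base R sketch below adds empty MIN and MAX columns and makes a crude TYPE guess from the data alone. The helper name `sketch_add_fields` and the toy objects are hypothetical, and the real function also consults the VALUES columns, so this is an approximation under those assumptions and not the package's implementation.

```r
# Hypothetical helper for illustration; NOT the package's add_missing_fields().
sketch_add_fields <- function(dict, data) {
  # Add empty MIN/MAX columns only when they are absent.
  for (field in c("MIN", "MAX")) {
    if (!field %in% names(dict)) dict[[field]] <- NA
  }
  # Crude TYPE guess from the data column alone (assumption: VALUES is ignored here).
  if (!"TYPE" %in% names(dict)) {
    dict$TYPE <- vapply(dict$VARNAME, function(v) {
      x <- data[[v]]
      if (is.character(x)) {
        "string"
      } else if (all(x == round(x), na.rm = TRUE)) {
        "integer"
      } else {
        "decimal"
      }
    }, character(1), USE.NAMES = FALSE)
  }
  dict
}

# Toy dictionary and data set (made up for this sketch).
toy_dict <- data.frame(VARNAME = c("SUBJECT_ID", "HEIGHT", "SITE"),
                       VARDESC = c("Subject ID", "Height", "Study site"),
                       stringsAsFactors = FALSE)
toy_data <- data.frame(SUBJECT_ID = 1:3,
                       HEIGHT = c(160.5, 172.1, 181),
                       SITE = c("A", "B", "A"),
                       stringsAsFactors = FALSE)
toy_updated <- sketch_add_fields(toy_dict, toy_data)
print(toy_updated)
```

MIN and MAX are left empty on purpose, since the logical limits have to come from study knowledge and not from the observed values.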
This function generates a user-readable report of the checks run by the complete_check function.
check_report(DD.dict, DS.data, non.NA.missing.codes = NA, compact = TRUE)
DD.dict |
Data dictionary. |
DS.data |
Data set. |
non.NA.missing.codes |
A user-defined vector of numerical missing value codes (e.g., -9999). |
compact |
When TRUE, the function prints a compact report, listing information from only the non-passed checks. |
Tibble, returned invisibly, containing the following information for each check: (1) Time (Time stamp); (2) Name (Name of the function); (3) Status (Passed/Failed); (4) Message (A copy of the message the function printed out); (5) Information (More detailed information about the potential errors identified).
# Example 1: Incorrectly showing as pass check on first attempt
data(ExampleB)
report <- check_report(DD.dict.B, DS.data.B)
# Addition of missing value codes calls attention to error
# at missing_value_check
report <- check_report(DD.dict.B, DS.data.B, non.NA.missing.codes=c(-4444, -9999))
# Example 2: Several fail checks or not attempted
data(ExampleC)
report <- check_report(DD.dict.C, DS.data.C, non.NA.missing.codes=c(-4444, -9999))
# Note you can also run report using compact=FALSE
report <- check_report(DD.dict.C, DS.data.C, non.NA.missing.codes=c(-4444, -9999), compact = FALSE)
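To make the idea of a compact report concrete, here is a minimal base R sketch that keeps only the non-passed rows of a results table. The table `toy_results` is invented for illustration: it only mimics the Time, Name, Status, and Message columns described above, and its messages are made up instead of being copied from the package.

```r
# Toy results table shaped like the described return value (contents are made up).
toy_results <- data.frame(
  Time = rep(format(Sys.time()), 3),
  Name = c("field_check", "name_check", "missing_value_check"),
  Status = c("Passed", "Failed", "Passed"),
  Message = c("Passed (illustrative message)",
              "Failed (illustrative message)",
              "Passed (illustrative message)"),
  stringsAsFactors = FALSE
)

# A compact view lists information from only the non-passed checks.
non_passed <- toy_results[toy_results$Status != "Passed", ]
print(non_passed[, c("Name", "Status", "Message")])
```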
This function runs a full workflow check including field_check, pkg_field_check, dimension_check, name_check, id_check, row_check, NA_check, type_check, values_check, integer_check, decimal_check, misc_format_check, description_check, minmax_check, and missing_value_check.
complete_check( DD_dict, DS_data, non.NA.missing.codes = NA, reorder.dict = FALSE, name.correct = FALSE )
DD_dict |
Data dictionary. |
DS_data |
Data set. |
non.NA.missing.codes |
A user-defined vector of encoded, numerical (i.e., non-NA) missing value codes (e.g., -9999). |
reorder.dict |
When TRUE, and only if the names in the data and the data dictionary match perfectly but are in the wrong order, the function will reorder the rows of the dictionary to match the columns of the data. Use with caution: we recommend first running the function with the default (FALSE) to understand potential errors. |
name.correct |
When TRUE, if name mismatches are identified, the function will rename the variables in the data set to match the data dictionary. Use with caution: we recommend first running the function with the default (FALSE) to identify order/dimension mismatches (vs. name mismatches). |
Tibble containing the following information for each check: (1) Time (time stamp); (2) Name (name of the function); (3) Status (Passed/Failed/Warning); (4) Message (A copy of the message the function printed out); (5) Information (More detailed information about the potential errors identified).
# Example 1
# Note in this example, the missing value codes are not defined,
# so the last check ('missing_value_check') doesn't know
# to check for encoded values
data(ExampleB)
complete_check(DD.dict.B, DS.data.B)
# Rerun check after defining missing value codes
complete_check(DD.dict.B, DS.data.B, non.NA.missing.codes=c(-9999, -4444))
# Example 2
data(ExampleA)
complete_check(DD.dict.A, DS.data.A, non.NA.missing.codes=c(-9999, -4444))
# Example 3
data(ExampleD)
results <- complete_check(DD.dict.D, DS.data.D, non.NA.missing.codes=c(-9999, -4444))
# View output in greater detail
results$Message[2] # Recommend using add_missing_fields
results$Information$pkg_field_check.Info # We see that MIN, MAX, and TYPE are all missing
# Use the add_missing_fields function to add in data
DD.dict.updated <- add_missing_fields(DD.dict.D, DS.data.D)
# Be sure to call in the new version of the dictionary (DD.dict.updated)
complete_check(DD.dict.updated, DS.data.D)
This function generates an awareness report in HTML format, and optionally opens it in the web browser.
create_awareness_report( DD.dict, DS.data, non.NA.missing.codes = NA, threshold = 95, output.path = tempdir(), open.html = TRUE, fn.stem = "AwarenessReport" )
DD.dict |
Data dictionary. |
DS.data |
Data set. |
non.NA.missing.codes |
A user-defined vector of numerical missing value codes (e.g., -9999). |
threshold |
Threshold for missingness of concern (as a percent). |
output.path |
Path to the folder in which to create the HTML report document. |
open.html |
If TRUE, open the HTML report document in the web browser. |
fn.stem |
File name stem. |
Full path to the HTML report document.
data(ExampleB)
create_awareness_report(DD.dict.B, DS.data.B, non.NA.missing.codes=c(-9999), output.path= tempdir(), open.html = FALSE)
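The threshold argument is easiest to read with numbers in hand. The base R sketch below computes the percent missingness per variable, counting both NA and user-defined missing value codes, and flags variables at or above the threshold. The toy data are made up, and whether the package compares with `>=` or `>` is an assumption here.

```r
# Toy data set (made up); -9999 plays the role of an encoded missing value.
toy_data <- data.frame(
  SUBJECT_ID = 1:5,
  BMI = c(22.1, -9999, NA, 30.4, 27.8),
  RARE_LAB = c(NA, -9999, NA, -9999, NA)
)
missing_codes <- c(-9999)
threshold <- 95  # percent, matching the default shown in the usage

# Percent of values per variable that are NA or an encoded missing code.
pct_missing <- vapply(toy_data, function(x) {
  100 * mean(is.na(x) | x %in% missing_codes)
}, numeric(1))

# Assumption: "of concern" means at or above the threshold.
flagged <- names(pct_missing)[pct_missing >= threshold]
print(pct_missing)
print(flagged)
```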
This function calls eval_function to generate a textual and graphical report of the selected variables in HTML format, and optionally opens it in the web browser.
create_report( DD.dict, DS.data, sex.split = FALSE, sex.name = NULL, start = 1, end = 1, non.NA.missing.codes = NA, output.path = tempdir(), open.html = TRUE, fn.stem = "Report" )
DD.dict |
Data dictionary. |
DS.data |
Data set. |
sex.split |
When TRUE, split reports by the field named as defined by the sex.name variable. |
sex.name |
Character string specifying the name of the sex field. |
start |
Starting index of the first selected trait. |
end |
Ending index of the last selected trait. |
non.NA.missing.codes |
A user-defined vector of numerical missing value codes (e.g., -9999). |
output.path |
Path to the folder in which to create the HTML report document. |
open.html |
If TRUE, open the HTML report document in the web browser. |
fn.stem |
File name stem. |
Full path to the HTML report document.
data(ExampleB)
create_report(DD.dict.B, DS.data.B, sex.split=TRUE, sex.name= "SEX", start = 3, end = 7, non.NA.missing.codes=c(-9999,-4444), output.path= tempdir(), open.html = FALSE)
This function calls eval_function to generate a textual and graphical report of the selected variables.
dat_function( DS.dataset, DD.dictionary, sex.split = FALSE, sex.name = NULL, DS.dataset.na )
DS.dataset |
Data set. |
DD.dictionary |
Data dictionary. |
sex.split |
When TRUE, split reports by the field named by the sex.name string. |
sex.name |
Character string giving the name of the sex field. |
DS.dataset.na |
Data set with missing values set to NA. |
Invisible NULL, called for its side effects.
This function calls eval_function to generate a textual and graphical report of the selected variables.
dat_function_selected( dataset, dictionary, sex.split = FALSE, sex.name = NULL, start = 1, end = 1, dataset.na, h.level = 2 )
dataset |
Data set. |
dictionary |
Data dictionary. |
sex.split |
When TRUE, split reports by the field named 'Sex'. |
sex.name |
Character string giving the name of the sex field. |
start |
Starting index of the first selected trait. |
end |
Ending index of the last selected trait. |
dataset.na |
Data set with missing values set to NA. |
h.level |
Header level for pandoc function. |
Invisible NULL, called for its side effects.
This function searches for variables that appear to be incorrectly listed as TYPE decimal.
decimal_check(DD.dict, DS.data, verbose = TRUE)
DD.dict |
Data dictionary. |
DS.data |
Data set. |
verbose |
When TRUE, the function prints the Message out, as well as a list of variables that may be incorrectly labeled as TYPE decimal. |
Tibble, returned invisibly, containing: (1) Time (Time stamp); (2) Name (Name of the function); (3) Status (Passed/Failed); (4) Message (A copy of the message the function printed out); (5) Information (Names of variables that are listed as TYPE decimal, but do not appear to be decimals).
# Example 1: Fail check
data(ExampleF)
decimal_check(DD.dict.F, DS.data.F)
print(decimal_check(DD.dict.F, DS.data.F, verbose=FALSE))
# Example 2: Required pre-check fails
data(ExampleE)
decimal_check(DD.dict.E, DS.data.E)
print(decimal_check(DD.dict.E, DS.data.E, verbose=FALSE))
# Example 3: Pass check
data(ExampleA)
decimal_check(DD.dict.A, DS.data.A)
print(decimal_check(DD.dict.A, DS.data.A, verbose=FALSE))
This function checks that there is a unique description for every variable in the data dictionary (VARDESC column).
description_check(DD.dict, verbose = TRUE)
DD.dict |
Data dictionary. |
verbose |
When TRUE, the function prints the Message out, as well as a list of the variables that are missing a |
Tibble, returned invisibly, containing: (1) Time (Time stamp); (2) Name (Name of the function); (3) Status (Passed/Failed); (4) Message (A copy of the message the function printed out); (5) Information (Names of the variables with missing or duplicated descriptions).
# Example 1: Fail check
data(ExampleG)
description_check(DD.dict.G)
print(description_check(DD.dict.G, verbose=FALSE))
# Example 2: Pass check
data(ExampleA)
description_check(DD.dict.A)
print(description_check(DD.dict.A, verbose=FALSE))
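The rule being checked (every variable has a description, and no description is shared) can be sketched in a few lines of base R. The toy dictionary below is made up, and this is only an approximation of the idea, not the logic description_check actually runs.

```r
# Toy dictionary (made up): one duplicated description and one missing description.
toy_dict <- data.frame(
  VARNAME = c("AGE", "HEIGHT_V1", "HEIGHT_V2", "WEIGHT"),
  VARDESC = c("Age in years", "Standing height", "Standing height", NA),
  stringsAsFactors = FALSE
)

desc <- toy_dict$VARDESC
# Missing: NA or blank after trimming whitespace.
missing_desc <- is.na(desc) | trimws(desc) == ""
# Duplicated: flag every copy of a repeated description, not just the later ones.
dup_desc <- !missing_desc & (duplicated(desc) | duplicated(desc, fromLast = TRUE))

flagged <- toy_dict$VARNAME[missing_desc | dup_desc]
print(flagged)
```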
This awareness function helps you search the data dictionary for a specific term; intended for use as an investigative aid to supplement other checks in this package.
dictionary_search( DD.dict, search.term = c("blood pressure"), search.column = c("VARDESC") )
DD.dict |
Data dictionary. |
search.term |
Search term. |
search.column |
Column of the data dictionary to search. |
Tibble containing the dictionary rows in which the search term was detected in the specified column, or an error message if the search column could not be found.
# Successful search
data(ExampleB)
dictionary_search(DD.dict.B, search.term=c("skinfold"), search.column=c("VARDESC"))
# Attempted search in wrong column
dictionary_search(DD.dict.B, search.term=c("skinfold"), search.column=c("VARIABLE_DESCRIPTION"))
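For readers who want to see the search idea without loading the package, here is a small base R sketch. The function `sketch_search` and the toy dictionary are hypothetical; case-insensitive matching and the wording of the error message are assumptions, since the package's exact behavior is not stated here.

```r
# Hypothetical stand-in for illustration; NOT the package's dictionary_search().
sketch_search <- function(dict, term, column = "VARDESC") {
  if (!column %in% names(dict)) {
    # Assumption: return a message instead of stopping when the column is absent.
    return(paste0("Column '", column, "' not found in the data dictionary."))
  }
  # Assumption: case-insensitive match; NA entries never match.
  hits <- grepl(term, dict[[column]], ignore.case = TRUE)
  dict[hits, , drop = FALSE]
}

# Toy dictionary (made up).
toy_dict <- data.frame(
  VARNAME = c("SUBJECT_ID", "SBP", "TRICEPS"),
  VARDESC = c("Subject identifier", "Systolic blood pressure",
              "Triceps skinfold thickness"),
  stringsAsFactors = FALSE
)

found <- sketch_search(toy_dict, "skinfold")
not_found <- sketch_search(toy_dict, "skinfold", column = "VARIABLE_DESCRIPTION")
print(found)
print(not_found)
```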
This function checks that the number of variables matches between the data set and the data dictionary.
dimension_check(DD.dict, DS.data, verbose = TRUE)
DD.dict |
Data dictionary. |
DS.data |
Data set. |
verbose |
When TRUE, the function prints the Message out, as well as the number of variables in the data set and data dictionary. |
Tibble, returned invisibly, containing: (1) Time (Time stamp); (2) Name (Name of the function); (3) Status (Passed/Failed); (4) Message (A copy of the message the function printed out); (5) Information (number of variables in the data and dictionary and names of mismatched variables if applicable).
# Example 1: Fail check
data(ExampleG)
dimension_check(DD.dict.G, DS.data.G)
print(dimension_check(DD.dict=DD.dict.G, DS.data=DS.data.G,verbose=FALSE))
# Example 2: Pass check
data(ExampleA)
dimension_check(DD.dict.A, DS.data.A)
print(dimension_check(DD.dict.A, DS.data.A,verbose=FALSE))
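The comparison behind this check can be sketched in base R: the dictionary has one row per variable, the data set has one column per variable, and `setdiff` names whatever is unmatched. The toy objects are made up and the sketch only illustrates the idea.

```r
# Toy dictionary with three variables and a data set with only two (made up).
toy_dict <- data.frame(VARNAME = c("SUBJECT_ID", "AGE", "SEX"),
                       stringsAsFactors = FALSE)
toy_data <- data.frame(SUBJECT_ID = 1:2, AGE = c(34, 51))

n_dict <- nrow(toy_dict)   # variables listed in the dictionary
n_data <- ncol(toy_data)   # variables present in the data set

# Names present on one side only.
only_in_dict <- setdiff(toy_dict$VARNAME, names(toy_data))
only_in_data <- setdiff(names(toy_data), toy_dict$VARNAME)

cat("Dictionary:", n_dict, "variables; data set:", n_data, "variables\n")
print(only_in_dict)
print(only_in_data)
```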
This function checks for duplicate VALUES column names in the data dictionary.
dup_values(DD.dict)
DD.dict |
Data dictionary. |
Logical, TRUE if only one VALUES column is detected.
This function generates a textual and graphical report of the selected variables.
eval_function( dataset, dictionary, sex.split = FALSE, sex.name = NULL, dataset.na, h.level = 2 )
dataset |
Data set. |
dictionary |
Data dictionary. |
sex.split |
When TRUE, split reports by the field named 'Sex'. |
sex.name |
Name of the Sex field. |
dataset.na |
Data set with missing values set to NA. |
h.level |
Header level for pandoc function. |
Invisible NULL, called for its side effects.
Example data set and data dictionary with no errors.
data(ExampleA)
R data file that contains two objects:
Data dictionary
Data set
DD.path <- system.file("extdata", "3b_SSM_DD_Example1.xlsx", package = "dbGaPCheckup", mustWork=TRUE)
DD.dict.A <- readxl::read_xlsx(DD.path)
DS.path <- system.file("extdata", "DS_Example.txt", package = "dbGaPCheckup", mustWork=TRUE)
DS.data.A <- read.table(DS.path, header=TRUE, sep="\t", quote="", as.is = TRUE)
save(DD.dict.A, DS.data.A, file = "ExampleA.rda")
Example data set and data dictionary with intentional errors.
data(ExampleB)
R data file that contains two objects:
Data dictionary
Data set
DD.path <- system.file("extdata", "3b_SSM_DD_Example1b.xlsx", package = "dbGaPCheckup", mustWork=TRUE)
DD.dict.B <- readxl::read_xlsx(DD.path)
DS.path <- system.file("extdata", "DS_Example1b.txt", package = "dbGaPCheckup", mustWork=TRUE)
DS.data.B <- read.table(DS.path, header=TRUE, sep="\t", quote="", as.is = TRUE)
save(DD.dict.B, DS.data.B, file = "ExampleB.rda")
Example data set and data dictionary with intentional errors.
data(ExampleC)
R data file that contains two objects:
Data dictionary
Data set
DD.path <- system.file("extdata", "3b_SSM_DD_Example2d.xlsx", package = "dbGaPCheckup", mustWork=TRUE)
DD.dict.C <- readxl::read_xlsx(DD.path)
DS.path <- system.file("extdata", "DS_Example1b.txt", package = "dbGaPCheckup", mustWork=TRUE)
DS.data.C <- read.table(DS.path, header=TRUE, sep="\t", quote="", as.is = TRUE)
save(DD.dict.C, DS.data.C, file = "ExampleC.rda")
Example data set and data dictionary with intentional errors.
data(ExampleD)
R data file that contains two objects:
Data dictionary
Data set
path <- system.file("extdata", "3b_SSM_DD_Example2f.xlsx", package = "dbGaPCheckup", mustWork=TRUE)
DD.dict.D <- readxl::read_xlsx(path)
DS.path <- system.file("extdata", "DS_Example.txt", package = "dbGaPCheckup", mustWork=TRUE)
DS.data.D <- read.table(DS.path, header=TRUE, sep="\t", quote="", as.is = TRUE)
save(DD.dict.D, DS.data.D, file = "ExampleD.rda")
Example data set and data dictionary with intentional errors.
data(ExampleE)
R data file that contains two objects:
Data dictionary
Data set
DD.path <- system.file("extdata", "3b_SSM_DD_Example2b.xlsx", package = "dbGaPCheckup", mustWork=TRUE)
DD.dict.E <- readxl::read_xlsx(DD.path)
DS.path <- system.file("extdata", "DS_Example2.txt", package = "dbGaPCheckup", mustWork=TRUE)
DS.data.E <- read.table(DS.path, header=TRUE, sep="\t", quote="", as.is = TRUE)
save(DD.dict.E, DS.data.E, file = "ExampleE.rda")
Example data set and data dictionary with intentional errors.
data(ExampleF)
R data file that contains two objects:
Data dictionary
Data set
DD.path <- system.file("extdata", "3b_SSM_DD_Example4.xlsx", package = "dbGaPCheckup", mustWork=TRUE)
DD.dict.F <- readxl::read_xlsx(DD.path)
DS.path <- system.file("extdata", "DS_Example3d.txt", package = "dbGaPCheckup", mustWork=TRUE)
DS.data.F <- read.table(DS.path, header=TRUE, sep="\t", quote="", as.is = TRUE)
save(DD.dict.F, DS.data.F, file = "ExampleF.rda")
Example data set and data dictionary with intentional errors.
data(ExampleG)
R data file that contains two objects:
Data dictionary
Data set
DD.path <- system.file("extdata", "3b_SSM_DD_Example2.xlsx", package = "dbGaPCheckup", mustWork=TRUE)
DD.dict.G <- readxl::read_xlsx(DD.path)
DS.path <- system.file("extdata", "DS_Example.txt", package = "dbGaPCheckup", mustWork=TRUE)
DS.data.G <- read.table(DS.path, header=TRUE, sep="\t", quote="", as.is = TRUE)
save(DD.dict.G, DS.data.G, file = "ExampleG.rda")
Example data set and data dictionary with intentional errors.
data(ExampleH)
R data file that contains two objects:
Data dictionary
Data set
DD.path <- system.file("extdata", "3b_SSM_DD_Example1.xlsx", package = "dbGaPCheckup", mustWork=TRUE)
DD.dict.H <- readxl::read_xlsx(DD.path)
DS.path <- system.file("extdata", "DS_Example3c.txt", package = "dbGaPCheckup", mustWork=TRUE)
DS.data.H <- read.table(DS.path, header=TRUE, sep="\t", quote="", as.is = TRUE)
save(DD.dict.H, DS.data.H, file = "ExampleH.rda")
Example data set and data dictionary with intentional errors.
data(ExampleI)
R data file that contains two objects:
Data dictionary
Data set
DD.path <- system.file("extdata", "3b_SSM_DD_Example2c.xlsx", package = "dbGaPCheckup", mustWork=TRUE)
DD.dict.I <- readxl::read_xlsx(DD.path)
DS.path <- system.file("extdata", "DS_Example2c.txt", package = "dbGaPCheckup", mustWork=TRUE)
DS.data.I <- read.table(DS.path, header=TRUE, sep="\t", quote="", as.is = TRUE)
save(DD.dict.I, DS.data.I, file = "ExampleI.rda")
Example data set and data dictionary with intentional errors.
data(ExampleJ)
R data file that contains two objects:
Data dictionary
Data set
DD.path <- system.file("extdata", "3b_SSM_DD_Example2d.xlsx", package = "dbGaPCheckup", mustWork=TRUE)
DD.dict.J <- readxl::read_xlsx(DD.path)
DS.path <- system.file("extdata", "DS_Example2.txt", package = "dbGaPCheckup", mustWork=TRUE)
DS.data.J <- read.table(DS.path, header=TRUE, sep="\t", quote="", as.is = TRUE)
save(DD.dict.J, DS.data.J, file = "ExampleJ.rda")
Example data set and data dictionary with intentional errors.
data(ExampleK)
R data file that contains two objects:
Data dictionary
Data set
DD.path <- system.file("extdata", "3b_SSM_DD_Example2d.xlsx", package = "dbGaPCheckup", mustWork=TRUE)
DD.dict.K <- readxl::read_xlsx(DD.path)
DS.path <- system.file("extdata", "DS_Example2b.txt", package = "dbGaPCheckup", mustWork=TRUE)
DS.data.K <- read.table(DS.path, header=TRUE, sep="\t", quote="", as.is = TRUE)
save(DD.dict.K, DS.data.K, file = "ExampleK.rda")
Example data set and data dictionary with intentional errors.
data(ExampleL)
R data file that contains two objects:
Data dictionary
Data set
DD.path <- system.file("extdata", "3b_SSM_DD_Example2b.xlsx", package = "dbGaPCheckup", mustWork=TRUE)
DD.dict.L <- readxl::read_xlsx(DD.path)
DS.path <- system.file("extdata", "DS_Example2c.txt", package = "dbGaPCheckup", mustWork=TRUE)
DS.data.L <- read.table(DS.path, header=TRUE, sep="\t", quote="", as.is = TRUE)
save(DD.dict.L, DS.data.L, file = "ExampleL.rda")
Example data set and data dictionary with intentional errors.
data(ExampleM)
R data file that contains two objects:
Data dictionary
Data set
DD.path <- system.file("extdata", "3b_SSM_DD_Example2b.xlsx", package = "dbGaPCheckup", mustWork=TRUE)
DD.dict.M <- readxl::read_xlsx(DD.path)
DS.path <- system.file("extdata", "DS_Example.txt", package = "dbGaPCheckup", mustWork=TRUE)
DS.data.M <- read.table(DS.path, header=TRUE, sep="\t", quote="", as.is = TRUE)
save(DD.dict.M, DS.data.M, file = "ExampleM.rda")
Example data set and data dictionary with intentional errors.
data(ExampleN)
R data file that contains two objects:
Data dictionary
Data set
DD.path <- system.file("extdata", "3b_SSM_DD_Example2e.xlsx", package = "dbGaPCheckup", mustWork=TRUE)
DD.dict.N <- readxl::read_xlsx(DD.path)
DS.path <- system.file("extdata", "DS_Example.txt", package = "dbGaPCheckup", mustWork=TRUE)
DS.data.N <- read.table(DS.path, header=TRUE, sep="\t", quote="", as.is = TRUE)
save(DD.dict.N, DS.data.N, file = "ExampleN.rda")
Example data set with intentional errors.
data(ExampleO)
R data file that contains a single object:
Data set
DS.path <- system.file("extdata", "DS_Example3.txt", package = "dbGaPCheckup", mustWork=TRUE)
DS.data.O <- read.table(DS.path, header=TRUE, sep="\t", quote="", as.is = TRUE)
save(DS.data.O, file = "ExampleO.rda")
Example data set with intentional errors.
data(ExampleP)
R data file that contains a single object:
Data set
DS.path <- system.file("extdata", "DS_Example3b.txt", package = "dbGaPCheckup", mustWork=TRUE)
DS.data.P <- read.table(DS.path, header=TRUE, sep="\t", quote="", as.is = TRUE)
save(DS.data.P, file = "ExampleP.rda")
Example data set and data dictionary with no errors.
data(ExampleQ)
R data file that contains two objects:
Data dictionary
Data set
DD.path <- system.file("extdata", "3b_SSM_DD_Example5.xlsx", package = "dbGaPCheckup", mustWork=TRUE)
DD.dict.Q <- readxl::read_xlsx(DD.path)
DS.path <- system.file("extdata", "DS_Example5.txt", package = "dbGaPCheckup", mustWork=TRUE) ### FIX THIS
DS.data.Q <- read.table(DS.path, header=TRUE, sep="\t", quote="", as.is = TRUE)
save(DD.dict.Q, DS.data.Q, file = "ExampleQ.rda")
Example data set and data dictionary with no errors.
data(ExampleR)
R data file that contains two objects:
Data dictionary
Data set
library(tidyverse)
DD.dict.R <- DD.dict.A
DS.data.R <- DS.data.A
# Change SUBJECT_ID to a string
DS.data.R$SUBJECT_ID <- paste0("A",DS.data.R$SUBJECT_ID)
DD.dict.R$TYPE[DD.dict.R$VARNAME=="SUBJECT_ID"] <- "string"
# Change HX_DEPRESSION to a string
DS.data.R <- DS.data.R %>% mutate(HX_DEPRESSION = recode(HX_DEPRESSION, '0' = 'no','1'='yes','-9999' = '-9999'))
DD.dict.R$TYPE[DD.dict.R$VARNAME=="HX_DEPRESSION"] <- "string"
DD.dict.R$VALUES[DD.dict.R$VARNAME=="HX_DEPRESSION"] <- "-9999=missing value"
# Set the extra VALUES column names to blank
DD.dict.R$`...18`[DD.dict.R$VARNAME=="HX_DEPRESSION"] <- NA
DD.dict.R$`...19`[DD.dict.R$VARNAME=="HX_DEPRESSION"] <- NA
nval <- which(names(DD.dict.R) == "VALUES")
names(DD.dict.R)[(nval + 1):ncol(DD.dict.R)] <- ""
save(DD.dict.R, DS.data.R, file="ExampleR.rda")
Example data set and data dictionary with intentional errors.
data(ExampleS)
R data file that contains two objects:
Data dictionary
Data set
DS.path <- system.file("extdata", "DS_Example6.txt", package = "dbGaPCheckup", mustWork=TRUE)
DS.data.S <- read.table(DS.path, header=TRUE, sep="\t", quote="")
DD.path <- system.file("extdata", "DD_Example5b.xlsx", package = "dbGaPCheckup", mustWork=TRUE)
DD.dict.S1 <- readxl::read_xlsx(DD.path)
DD.dict.S <- reorder_dictionary(DD.dict.S1, DS.data.S)
save(DD.dict.S, DS.data.S, file = "ExampleS.rda")
This function checks for the dbGaP required fields: variable name (VARNAME), variable description (VARDESC), units (UNITS), and variable value and meaning (VALUES).
field_check(DD.dict, verbose = TRUE)
DD.dict |
Data dictionary. |
verbose |
When TRUE, the function prints the Message out, as well as a list of the fields not found in the data dictionary. |
Tibble, returned invisibly, containing: (1) Time (Time stamp); (2) Name (Name of the function); (3) Status (Passed/Failed); (4) Message (A copy of the message the function printed out); (5) Information (Named vector of TRUE/FALSE values alerting user if checks passed (TRUE) or failed (FALSE) for VARNAME, VARDESC, UNITS, and VALUES).
data(ExampleA)
field_check(DD.dict.A)
print(field_check(DD.dict.A, verbose=FALSE))
This function checks that the first column of the data set is the primary ID for each participant labeled as SUBJECT_ID, that values contain no illegal characters or padded zeros, and that each participant has an ID.
id_check(DS.data, verbose = TRUE)
DS.data |
Data set. |
verbose |
When TRUE, the function prints the Message out, as well as more detailed diagnostic information. |
Subject IDs should be integer or string values. Integers should not have zero padding. IDs should not have spaces. Specifically, only the following characters can be included in the ID: English letters, Arabic numerals, period (.), hyphen (-), underscore (_), at symbol (@), and the pound sign (#). All IDs should be filled in (i.e., no missing IDs are allowed).
Tibble, returned invisibly, containing: (1) Time (Time stamp); (2) Name (Name of the function); (3) Status (Passed/Failed); (4) Message (A copy of the message the function printed out); (5) Information (Detailed information about the four ID checks that were performed).
# Example 1: Fail check, 'SUBJECT_ID' not present
data(ExampleO)
id_check(DS.data.O)
print(id_check(DS.data.O, verbose=FALSE))
# Example 2: Fail check, 'SUBJECT_ID' includes illegal spaces
data(ExampleP)
id_check(DS.data.P)
results <- id_check(DS.data.P)
results$Information[[1]]$details
print(id_check(DS.data.P, verbose=FALSE))
# Example 3: Pass check
data(ExampleA)
id_check(DS.data.A)
print(id_check(DS.data.A, verbose=FALSE))
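The ID rules listed in the details can be expressed as simple pattern tests. The base R sketch below is one possible reading of those rules: `sketch_id_rules` is a hypothetical helper, the regular expressions are assumptions derived from the text, and the package's own checks may differ in detail.

```r
# Hypothetical helper for illustration; NOT the package's id_check().
sketch_id_rules <- function(ids) {
  ids_chr <- as.character(ids)
  data.frame(
    id = ids_chr,
    # Every participant needs an ID: not NA and not an empty string.
    filled = !is.na(ids_chr) & nzchar(ids_chr),
    # Only letters, digits, period, underscore, at sign, pound sign, and hyphen.
    legal = grepl("^[A-Za-z0-9._@#-]+$", ids_chr),
    # Assumption: zero padding means an all-digit ID with a leading zero.
    zero_padded = grepl("^0[0-9]+$", ids_chr),
    stringsAsFactors = FALSE
  )
}

id_report <- sketch_id_rules(c("A001", "0042", "ID 7", NA, "x@y#1"))
print(id_report)
```

In this toy input, "0042" is flagged as zero padded, "ID 7" is illegal because of the space, and the NA entry fails the filled test.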
This utility function reorders the data set so that SUBJECT_ID comes first.
id_first_data(DS.data)
DS.data |
Data set. |
SUBJECT_ID is required to be the first column of the data set and first variable listed in the data dictionary.
Updated data set with SUBJECT_ID as first column.
data(ExampleQ)
head(DS.data.Q)
DS.data.updated <- id_first_data(DS.data.Q)
head(DS.data.updated)
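Moving SUBJECT_ID to the front is a plain column reorder. The base R sketch below shows the idea on a made-up data frame; it is an illustration of the requirement and not necessarily how id_first_data is written.

```r
# Toy data set (made up) with SUBJECT_ID in the second column.
toy <- data.frame(AGE = c(34, 51), SUBJECT_ID = c(1, 2), SEX = c(1, 2))

# Put SUBJECT_ID first and keep the remaining columns in their original order.
reordered <- toy[, c("SUBJECT_ID", setdiff(names(toy), "SUBJECT_ID"))]
print(names(reordered))
```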
This utility function reorders the data dictionary so that SUBJECT_ID comes first.
id_first_dict(DD.dict)
DD.dict |
Data dictionary. |
SUBJECT_ID is required to be the first column of the data set and first variable listed in the data dictionary.
Updated data dictionary with SUBJECT_ID as first variable.
data(ExampleQ)
head(DD.dict.Q)
DD.dict.updated <- id_first_dict(DD.dict.Q)
head(DD.dict.updated)
This function checks for integer values.
int_check(data)
data |
Number or vector of numbers. |
Logical, TRUE if all non-missing entries in the input vector are integers.
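A test of this kind can be written in a couple of lines of base R. The helper below, `sketch_int_check`, is a hypothetical stand-in that follows the return value description (TRUE when every non-missing entry is a whole number); it is not taken from the package source.

```r
# Hypothetical stand-in for illustration; NOT the package's int_check().
sketch_int_check <- function(x) {
  x <- x[!is.na(x)]    # ignore missing entries
  all(x == round(x))   # TRUE only if every remaining value is a whole number
}

whole <- sketch_int_check(c(1, 2, NA, 4))
fractional <- sketch_int_check(c(1, 2.5, 3))
print(c(whole = whole, fractional = fractional))
```

Note that a numeric column stored as double (for example 3.0) still counts as integer-valued under this test, which is usually what is wanted when data are read from text files.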
This function searches for variables that appear to be incorrectly listed as TYPE integer.
integer_check(DD.dict, DS.data, verbose = TRUE)
DD.dict |
Data dictionary. |
DS.data |
Data set. |
verbose |
When TRUE, the function prints the Message out, as well as a list of variables that may be incorrectly labeled as TYPE integer. |
Tibble, returned invisibly, containing: (1) Time (Time stamp); (2) Name (Name of the function); (3) Status (Passed/Failed); (4) Message (A copy of the message the function printed out); (5) Information (Names of variables that are listed as TYPE integer, but do not appear to be integers).
# Example 1: Fail check
data(ExampleH)
integer_check(DD.dict.H, DS.data.H)
print(integer_check(DD.dict.H, DS.data.H, verbose=FALSE))
# Example 2: Pass check
data(ExampleA)
integer_check(DD.dict.A, DS.data.A)
print(integer_check(DD.dict.A, DS.data.A, verbose=FALSE))
data(ExampleR)
integer_check(DD.dict.R, DS.data.R)
print(integer_check(DD.dict.R, DS.data.R, verbose=FALSE))
Using the information in the data dictionary, this function adds non-missing information from the data dictionary as attributes to the data.
label_data(DD.dict, DS.data, non.NA.missing.codes = NA)
DD.dict |
Data dictionary. |
DS.data |
Data set. |
non.NA.missing.codes |
A user-defined vector of numerical missing value codes (e.g., -9999). |
A tibble containing the labelled data set, with the data dictionary information embedded as attributes and variables labelled using Haven SPSS conventions.
data(ExampleB)
DS_labelled_data <- label_data(DD.dict.B, DS.data.B, non.NA.missing.codes=c(-9999))
labelled::var_label(DS_labelled_data$SEX)
labelled::val_labels(DS_labelled_data$SEX)
attributes(DS_labelled_data$SEX)
labelled::na_values(DS_labelled_data$HX_DEPRESSION)
This function flags variables that have values exceeding the MIN or MAX listed in the data dictionary.
minmax_check(DD.dict, DS.data, verbose = TRUE, non.NA.missing.codes = NA)
DD.dict |
Data dictionary. |
DS.data |
Data set. |
verbose |
When TRUE, the function prints the Message out, as well as a list of variables that violate the listed MIN or MAX. |
non.NA.missing.codes |
A user-defined vector of numerical missing value codes (e.g., -9999). |
Tibble, returned invisibly, containing: (1) Time (Time stamp); (2) Name (Name of the function); (3) Status (Passed/Failed); (4) Message (A copy of the message the function printed out); (5) Information (A sorted list of unique values that are either less than the MIN value or greater than the MAX value).
# Example 1
# Fail check (incorrectly flagging NA value codes -9999
# and -4444 as outside of the min max range)
data(ExampleA)
minmax_check(DD.dict.A, DS.data.A)
# View out of range values:
details <- minmax_check(DD.dict.A, DS.data.A)$Information
details[[1]]$OutOfRangeValues
# Attempt 2, specifying -9999 and -4444 as missing value
# codes so check works correctly
minmax_check(DD.dict.A, DS.data.A, non.NA.missing.codes=c(-9999, -4444))
# Example 2
data(ExampleI)
minmax_check(DD.dict.I, DS.data.I, non.NA.missing.codes=c(-9999, -4444))
# View out of range values:
details <- minmax_check(DD.dict.I, DS.data.I, non.NA.missing.codes=c(-9999, -4444))$Information
details[[1]]$OutOfRangeValues
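Example 1 shows why missing value codes matter for a range check: unless they are declared, codes such as -9999 look like out-of-range values. The base-R sketch below reproduces that effect on a single made-up vector. `out_of_range` is a hypothetical helper, and the MIN/MAX of 18 and 100 are illustrative values, not taken from the package's example data.

```r
# Hypothetical helper (assumption, not the package's minmax_check):
# return the sorted unique values that fall outside [min, max],
# ignoring NA and any user-declared missing value codes.
out_of_range <- function(x, min, max, missing_codes = NA) {
  keep <- !is.na(x) & !(x %in% missing_codes)
  vals <- x[keep]
  sort(unique(vals[vals < min | vals > max]))
}

age <- c(25, 40, -9999, 130, NA, -4444)

# Without declaring the codes, -9999 and -4444 are flagged along with 130
out_of_range(age, min = 18, max = 100)

# With the codes declared, only the genuine outlier remains
out_of_range(age, min = 18, max = 100, missing_codes = c(-9999, -4444))
```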
This function checks miscellaneous dbGaP formatting requirements to ensure (1) no empty variable names; (2) no duplicate variable names; (3) variable names do not contain "dbgap"; (4) there are no duplicate column names in the dictionary; and (5) column names falling after the VALUES column are unnamed.
misc_format_check(DD.dict, DS.data, verbose = TRUE)
DD.dict |
Data dictionary. |
DS.data |
Data set. |
verbose |
When TRUE, the function prints the Message out, as well as more detailed information about which formatting checks failed. |
Note that this check will return a WARNING for Check #5 depending on how the data set is read into R. Depending on the method used, R will automatically fill in column names after VALUES with "...col_number". This is allowed by the package, but it is NOT allowed by dbGaP, so please use caution if you write out a data set after making adjustments directly in R.
Tibble, returned invisibly, containing: (1) Time (time stamp); (2) Name (name of the function); (3) Status (Passed/Failed); (4) Message (A copy of the message the function printed out); (5) Information (Names of variables that fail one of these checks).
# Example 1: Fail check
data(ExampleJ)
misc_format_check(DD.dict.J, DS.data.J)
print(misc_format_check(DD.dict.J, DS.data.J, verbose=FALSE))
# Example 2: Pass check
data(ExampleA)
misc_format_check(DD.dict.A, DS.data.A)
print(misc_format_check(DD.dict.A, DS.data.A, verbose=FALSE))
This function flags variables that have non-encoded missing value codes.
missing_value_check(DD.dict, DS.data, verbose = TRUE, non.NA.missing.codes = NA)
DD.dict |
Data dictionary. |
DS.data |
Data set. |
verbose |
When TRUE, the function prints the Message out, as well as a list of variables that have non-encoded missing values. |
non.NA.missing.codes |
A user-defined vector of numerical missing value codes (e.g., -9999). |
Tibble, returned invisibly, containing: (1) Time (Time stamp); (2) Name (Name of the function); (3) Status (Passed/Failed); (4) Message (A copy of the message the function printed out); (5) Information (A list of variables where a missing value code is not properly encoded).
data(ExampleB)
missing_value_check(DD.dict.B, DS.data.B, non.NA.missing.codes = c(-9999,-4444))
data(ExampleS)
missing_value_check(DD.dict.S, DS.data.S, non.NA.missing.codes = c(-9999,-4444))
This awareness function summarizes the amount of missingness in the data set.
missingness_summary(DS.data, non.NA.missing.codes = NA, threshold = 95)
DS.data |
Data set. |
non.NA.missing.codes |
A user-defined vector of numerical missing value codes (e.g., -9999). |
threshold |
Threshold for missingness of concern (as a percent). |
Tibble containing: (1) Message containing information on the number of variables with a % missingness greater than the threshold; (2) Missingness by variable summary; and (3) Summary of missingness for variables with a missingness level greater than the threshold.
# Correct usage
data(ExampleA)
missingness_summary(DS.data.A, non.NA.missing.codes=c(-4444, -9999))
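The idea of percent missingness with user-defined missing value codes can be sketched in base R as follows. `pct_missing` and the toy data frame are hypothetical and only illustrate the calculation; the package's own summary output has more components than this.

```r
# Hypothetical helper (assumption): percent of entries per column that are
# NA or equal to one of the user-declared missing value codes.
pct_missing <- function(df, missing_codes = NA) {
  vapply(df,
         function(col) 100 * mean(is.na(col) | col %in% missing_codes),
         numeric(1))
}

toy <- data.frame(A = c(1, NA, -9999, 4),
                  B = c(-9999, -9999, -9999, -9999),
                  C = c(1, 2, 3, 4))

pct <- pct_missing(toy, missing_codes = -9999)
pct

# Variables above a 95% threshold (the default threshold in the usage above)
pct[pct > 95]
```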
This function runs a workflow of the minimum number of checks required for a user to run minmax_check; the checks include pkg_field_check, dimension_check, and name_check.
mm_precheck(dict, data)
dict |
Data dictionary. |
data |
Data set. |
Tibble containing the following information for each check: (1) Time (time stamp); (2) Name (name of the function); (3) Status (Passed/Failed); (4) Message (A copy of the message the function printed out); (5) Information (More detailed information about the potential errors identified).
data(ExampleB)
mm_precheck(DD.dict.B, DS.data.B)
This function runs a workflow of the minimum number of checks required for a user to run missing_value_check; the checks include field_check and pkg_field_check.
mv_precheck(dict, data)
dict |
Data dictionary. |
data |
Data set. |
Tibble containing the following information for each check: (1) Time (time stamp); (2) Name (name of the function); (3) Status (Passed/Failed); (4) Message (A copy of the message the function printed out); (5) Information (More detailed information about the potential errors identified).
data(ExampleB)
mv_precheck(DD.dict.B, DS.data.B)
This function checks for NA values in the data set; if NA values are present, it also checks for an NA value=meaning entry in the data dictionary.
NA_check(DD.dict, DS.data, verbose = TRUE)
DD.dict |
Data dictionary. |
DS.data |
Data set. |
verbose |
When TRUE, the function prints the Message out, as well as the number of NA values observed in the data set. |
Tibble, returned invisibly, containing: (1) Time (Time stamp); (2) Name (Name of the function); (3) Status (Passed/Failed); (4) Message (A copy of the message the function printed out); (5) Information (the number of NA values in the data set and information on if NA is a properly encoded value).
# Example 1: Fail check
data(ExampleK)
NA_check(DD.dict.K, DS.data.K)
print(NA_check(DD.dict.K, DS.data.K, verbose=FALSE))
# Example 2: Pass check
data(ExampleA)
NA_check(DD.dict.A, DS.data.A)
print(NA_check(DD.dict.A, DS.data.A, verbose=FALSE))
# Example 3: Pass check (though missing_value_check detects a more specific error)
data(ExampleS)
NA_check(DD.dict.S, DS.data.S)
This function runs a workflow of the minimum number of checks required for a user to run minmax_check; the checks include pkg_field_check, dimension_check, and name_check.
NA_precheck(dict, data)
dict |
Data dictionary. |
data |
Data set. |
Tibble containing the following information for each check: (1) Time (time stamp); (2) Name (name of the function); (3) Status (Passed/Failed); (4) Message (A copy of the message the function printed out); (5) Information (More detailed information about the potential errors identified).
data(ExampleB)
NA_precheck(DD.dict.B, DS.data.B)
This function checks if the variable names match between the data dictionary and the data.
name_check(DD.dict, DS.data, verbose = TRUE)
DD.dict |
Data dictionary. |
DS.data |
Data set. |
verbose |
When TRUE, the function prints the Message out, as well as a list of the non-matching variable names. |
Tibble, returned invisibly, containing: (1) Time (time stamp); (2) Name (name of the function); (3) Status (Passed/Failed); (4) Message (A copy of the message the function printed out); (5) Information (Names of variables that mismatch between the data and data dictionary).
# Example 1: Fail check (name mismatch)
data(ExampleM)
name_check(DD.dict.M, DS.data.M)
DS.data_updated <- name_correct(DD.dict.M, DS.data.M)
name_check(DD.dict.M, DS.data_updated)
# Example 2: Pass check
data(ExampleA)
name_check(DD.dict.A, DS.data.A)
print(name_check(DD.dict.A, DS.data.A, verbose=FALSE))
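A name comparison of this kind comes down to set differences between the dictionary's variable names and the data set's column names. The sketch below uses a hypothetical `compare_names` helper and made-up names; whether the package also treats case differences or ordering exactly this way is an assumption.

```r
# Hypothetical helper (assumption): report names present on only one side,
# and whether the two name vectors agree exactly (including order).
compare_names <- function(dict_names, data_names) {
  list(
    only_in_dict = setdiff(dict_names, data_names),
    only_in_data = setdiff(data_names, dict_names),
    same_order   = identical(dict_names, data_names)
  )
}

res <- compare_names(dict_names = c("SUBJECT_ID", "AGE", "SEX"),
                     data_names = c("SUBJECT_ID", "SEX", "Age"))
res
```

Here "AGE" and "Age" are reported as a mismatch because the comparison is case-sensitive.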
This utility function updates the data set so variable names match those listed in the data dictionary.
name_correct(DD.dict, DS.data)
DD.dict |
Data dictionary. |
DS.data |
Data set. |
Use this function with caution; run name_check first.
Updated data set with variables renamed to match the data dictionary.
data(ExampleM)
name_check(DD.dict.M, DS.data.M)
DS.data_updated <- name_correct(DD.dict.M, DS.data.M)
name_check(DD.dict.M, DS.data_updated)
This function runs a workflow of the minimum number of checks required for a user to run minmax_check; the checks include pkg_field_check, dimension_check, and name_check.
name_precheck(dict, data)
dict |
Data dictionary. |
data |
Data set. |
Tibble containing the following information for each check: (1) Time (time stamp); (2) Name (name of the function); (3) Status (Passed/Failed); (4) Message (A copy of the message the function printed out); (5) Information (More detailed information about the potential errors identified).
data(ExampleB)
name_precheck(DD.dict.B, DS.data.B)
This function checks for additional fields required by this package including variable type (TYPE), minimum value (MIN), and maximum value (MAX).
pkg_field_check(DD.dict, DS.data, verbose = TRUE)
DD.dict |
Data dictionary. |
DS.data |
Data set. |
verbose |
When TRUE, the function prints the Message out, as well as a list of the fields not found in the data dictionary. |
Even though MIN, MAX, and TYPE are not required by dbGaP, our package was created to use these variables in a series of other checks and awareness functions (e.g., render_report, values_check, etc.). If this function fails, the add_missing_fields function can be used.
Tibble, returned invisibly, containing: (1) Time (Time stamp); (2) Name (Name of the function); (3) Status (Passed/Failed); (4) Message (A copy of the message the function printed out); (5) Information (Named vector of TRUE/FALSE values alerting user if checks passed (TRUE) or failed (FALSE) for TYPE, MIN, and MAX).
# Example 1: Fail check
data(ExampleD)
pkg_field_check(DD.dict.D, DS.data.D)
# Use the add_missing_fields function to add in data
DD.dict.updated <- add_missing_fields(DD.dict.D, DS.data.D)
# Be sure to call in the new version of the dictionary (DD.dict.updated)
pkg_field_check(DD.dict.updated, DS.data.D)
# Example 2: Pass check
data(ExampleA)
pkg_field_check(DD.dict.A, DS.data.A)
print(pkg_field_check(DD.dict.A, DS.data.A, verbose=FALSE))
This utility function reorders the data set to match the data dictionary.
reorder_data(DD.dict, DS.data)
DD.dict |
Data dictionary. |
DS.data |
Data set. |
Updated data set with variables reordered to match the data dictionary.
data(ExampleN)
name_check(DD.dict.N, DS.data.N)
DS.data_updated <- reorder_data(DD.dict.N, DS.data.N)
name_check(DD.dict.N, DS.data_updated)
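Reordering the data set to follow the dictionary amounts to indexing the columns by the dictionary's VARNAME column. A minimal base-R sketch, using a hypothetical `reorder_to_dict` helper and toy inputs, and assuming both sides contain the same set of names:

```r
# Hypothetical helper (assumption, not the package's reorder_data):
# put the data columns in the order given by dict$VARNAME.
reorder_to_dict <- function(dict, data) {
  # Only a reordering: the two sets of names must already agree
  stopifnot(setequal(dict$VARNAME, names(data)))
  data[, dict$VARNAME, drop = FALSE]
}

dict <- data.frame(VARNAME = c("SUBJECT_ID", "AGE", "SEX"),
                   VARDESC = c("Participant ID", "Age in years", "Sex"),
                   stringsAsFactors = FALSE)
data <- data.frame(SEX = c(0, 1),
                   SUBJECT_ID = c("S1", "S2"),
                   AGE = c(34, 51),
                   stringsAsFactors = FALSE)

names(reorder_to_dict(dict, data))
```

The reverse direction (reordering dictionary rows to follow the data) could be sketched the same way with `dict[match(names(data), dict$VARNAME), ]`.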
This utility function reorders the data dictionary to match the data set.
reorder_dictionary(DD.dict, DS.data)
DD.dict |
Data dictionary. |
DS.data |
Data set. |
Updated data dictionary with variables reordered to match the data set.
data(ExampleN)
name_check(DD.dict.N, DS.data.N)
DD.dict_updated <- reorder_dictionary(DD.dict.N, DS.data.N)
name_check(DD.dict_updated, DS.data.N)
This function checks for empty or duplicate rows in the data set and data dictionary.
row_check(DD.dict, DS.data, verbose = TRUE)
DD.dict |
Data dictionary. |
DS.data |
Data set. |
verbose |
When TRUE, the function prints the Message out, as well as the row numbers of any problematic rows. |
Tibble, returned invisibly, containing: (1) Time (Time stamp); (2) Name (Name of the function); (3) Status (Passed/Failed); (4) Message (A copy of the message the function printed out); (5) Information (A list of problematic row and participant ID numbers).
# Example 1: Fail check
data(ExampleK)
row_check(DD.dict.K, DS.data.K)
print(row_check(DD.dict.K, DS.data.K, verbose=FALSE))
# Example 2: Pass check
data(ExampleC)
row_check(DD.dict.C, DS.data.C)
print(row_check(DD.dict.C, DS.data.C, verbose=FALSE))
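Detecting empty and duplicated rows can be sketched in base R as below. `find_problem_rows` is a hypothetical helper, and treating a row as "empty" when every cell is NA or a blank string is an assumption made for this illustration.

```r
# Hypothetical helper (assumption): row numbers of empty and duplicated rows.
find_problem_rows <- function(df) {
  # One logical vector per column: TRUE where the cell is NA or blank
  blank_by_col <- lapply(df, function(col) is.na(col) | trimws(as.character(col)) == "")
  # A row is empty only if it is blank in every column
  empty <- Reduce(`&`, blank_by_col)
  list(empty = which(empty), duplicated = which(duplicated(df)))
}

toy <- data.frame(SUBJECT_ID = c("S1", "S2", NA, "S2"),
                  AGE = c(34, 51, NA, 51),
                  stringsAsFactors = FALSE)

res <- find_problem_rows(toy)
res$empty       # row 3 is entirely NA
res$duplicated  # row 4 repeats row 2
```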
This function checks for the dbGaP required fields variable name (VARNAME) and variable description (VARDESC) as a pre-check embedded in name_check.
short_field_check(DD.dict, verbose = TRUE)
DD.dict |
Data dictionary. |
verbose |
When TRUE, the function prints the Message out, as well as a list of the fields not found in the data dictionary. |
Tibble, returned invisibly, containing: (1) Time (Time stamp); (2) Name (Name of the function); (3) Status (Passed/Failed); (4) Message (A copy of the message the function printed out); (5) Information (Named vector of TRUE/FALSE values alerting user if checks passed (TRUE) or failed (FALSE) for VARNAME, VARDESC, UNITS, and VALUE).
data(ExampleA)
short_field_check(DD.dict.A)
This function runs a workflow of the minimum number of checks required for a user to run dbGaPCheckup_required_field_check; the checks include dbGaP_required_field_check, dimension_check, and name_check.
short_precheck(dict, data)
dict |
Data dictionary. |
data |
Data set. |
Tibble containing the following information for each check: (1) Time (time stamp); (2) Name (name of the function); (3) Status (Passed/Failed); (4) Message (A copy of the message the function printed out); (5) Information (More detailed information about the potential errors identified).
data(ExampleB)
short_precheck(DD.dict.B, DS.data.B)
This function runs a workflow of the minimum number of checks required for a user to run dbGaPCheckup_required_field_check; the checks include dbGaP_required_field_check, dimension_check, and name_check.
super_short_precheck(dict, data)
dict |
Data dictionary. |
data |
Data set. |
Tibble containing the following information for each check: (1) Time (time stamp); (2) Name (name of the function); (3) Status (Passed/Failed); (4) Message (A copy of the message the function printed out); (5) Information (More detailed information about the potential errors identified).
# Example 1: Pass check
data(ExampleB)
super_short_precheck(DD.dict.B, DS.data.B)
If a TYPE field exists, this function checks for any TYPE entries that aren't allowable per dbGaP instructions.
type_check(DD.dict, verbose = TRUE)
DD.dict |
Data dictionary. |
verbose |
When TRUE, the function prints the Message out, as well as more detailed diagnostic information. |
Allowable entries in TYPE column include: integer; decimal; encoded value; or string. For mixed values, list all types present using commas to separate (e.g., integer, encoded value).
Tibble, returned invisibly, containing: (1) Time (Time stamp); (2) Name (Name of the function); (3) Status (Passed/Failed); (4) Message (A copy of the message the function printed out); (5) Information (List of illegal TYPE entries).
data(ExampleB)
type_check(DD.dict.B)
print(type_check(DD.dict.B, verbose=FALSE))
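The allowable TYPE entries listed above, including comma-separated mixed types, can be checked with a few lines of base R. `illegal_types` is a hypothetical helper; exact, case-sensitive matching after trimming whitespace is an assumption of this sketch and may differ from the package's rules.

```r
# Allowable TYPE entries as listed in this documentation
allowed_types <- c("integer", "decimal", "encoded value", "string")

# Hypothetical helper (assumption): return the TYPE entries in which at least
# one comma-separated part is not an allowable type.
illegal_types <- function(type_col) {
  parts <- lapply(strsplit(type_col, ","), trimws)
  bad <- vapply(parts, function(p) !all(p %in% allowed_types), logical(1))
  unique(type_col[bad])
}

illegal_types(c("integer", "integer, encoded value", "numeric", "string"))
```

In this toy input only "numeric" is returned, since "integer, encoded value" splits into two allowable types.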
This function generates a value-meaning table by parsing the VALUES fields.
value_meaning_table(DD.dict)
DD.dict |
Data dictionary. |
A data frame with the columns VARNAME, TYPE, VALUE, MEANING.
data(ExampleB)
head(value_meaning_table(DD.dict.B))
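Parsing a VALUE=MEANING entry means splitting at the first equals sign, so that a meaning containing its own "=" stays intact. The sketch below shows this for a character vector of entries; `parse_value_meaning` is a hypothetical helper and does not reproduce the VARNAME and TYPE columns of the package's table.

```r
# Hypothetical helper (assumption): split each entry at the FIRST "=" only.
parse_value_meaning <- function(entry) {
  pos <- regexpr("=", entry, fixed = TRUE)  # position of the first "="
  data.frame(
    VALUE   = substr(entry, 1, pos - 1),
    MEANING = substr(entry, pos + 1, nchar(entry)),
    stringsAsFactors = FALSE
  )
}

vm <- parse_value_meaning(c("0=No", "1=Yes", "-9999=Missing: reason = unknown"))
vm
```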
This function checks for consistent usage of encoded values and missing value codes between the data dictionary and the data itself.
value_missing_table(DD.dict, DS.data, non.NA.missing.codes = NA)
DD.dict |
Data dictionary. |
DS.data |
Data set. |
non.NA.missing.codes |
A user-defined vector of numerical missing value codes (e.g., -9999). |
For each variable, there are three sets of possible values: the set D of all unique values observed in the data, the set V of all values explicitly encoded in the VALUES columns of the data dictionary, and the set M of missing value codes defined by the user via the non.NA.missing.codes argument.
This function examines various intersections of these three sets, providing awareness checks to the user about possible issues of concern. While ideally all defined values in set V should be observed in the data (i.e., in set D), it is not necessarily an error if one is not. This function checks for:
(A) In Set M and Not in Set D: If the user defines a missing value code that is not present in the data.
(B) In Set V and Not in Set D: If a VALUES entry defines an encoded code value, but that code value is not present in the data.
(C) In Set M and Not in Set V: If the user defines a missing value code that is not defined in a VALUES entry.
(D) In Set M and Set D but Not in Set V: If a defined global missing value code is present in the data for a given variable, but that variable does not have a corresponding VALUES entry.
(E) In Set V, Not in Set M, and Not in Set D, i.e., (Set V not in M) not in D: If a VALUES entry is not defined as a missing value code AND is not detected in the data.
A list, returned invisibly, with two components:
"report": Tibble containing: (1) Name (Name of the function) and (2) Information (Details of all potential flagged variables).
"tb": Tibble with detailed information used to construct the Information.
data(ExampleB)
value_missing_table(DD.dict.B, DS.data.B, non.NA.missing.codes = c(-9999))
print(value_missing_table(DD.dict.B, DS.data.B, non.NA.missing.codes = c(-9999)))
results <- value_missing_table(DD.dict.B, DS.data.B, non.NA.missing.codes = c(-9999))
results$report$Information$details
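The five checks (A) through (E) described for this function map directly onto base-R set operations. The sketch below works through them for one made-up variable; the values in `set_D`, `set_V`, and `set_M` are illustrative only, and this is not the package's implementation.

```r
# Toy sets for a single variable (made-up values)
set_D <- c(0, 1, -9999)      # unique values observed in the data
set_V <- c(0, 1, 2, -4444)   # values encoded in the VALUES columns
set_M <- c(-9999, -4444)     # user-defined missing value codes

checks <- list(
  A = setdiff(set_M, set_D),                    # in M, not in D
  B = setdiff(set_V, set_D),                    # in V, not in D
  C = setdiff(set_M, set_V),                    # in M, not in V
  D = setdiff(intersect(set_M, set_D), set_V),  # in M and D, not in V
  E = setdiff(setdiff(set_V, set_M), set_D)     # in V but not M, and not in D
)
checks
```

For these toy sets, -9999 appears under (C) and (D) because it is used in the data without a VALUES entry, while the encoded value 2 appears under (B) and (E) because it is never observed.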
This function checks for potential errors in the VALUES columns by ensuring (1) required format of VALUE=MEANING (e.g., 0=Yes or 1=No); (2) no leading/trailing spaces near the equals sign; (3) all variables of TYPE encoded have VALUES entries; and (4) all variables with VALUES entries are listed as TYPE encoded.
values_check(DD.dict, verbose = TRUE)
DD.dict |
Data dictionary. |
verbose |
When TRUE, the function prints the Message out, as well as a list of variables that fail one of the values checks. |
Tibble, returned invisibly, containing: (1) Time (Time stamp); (2) Name (Name of the function); (3) Status (Passed/Failed); (4) Message (A copy of the message the function printed out); (5) Information (Details of which checks passed/failed for which value=meaning instances).
# Example 1: Fail check
data(ExampleE)
values_check(DD.dict.E)
print(values_check(DD.dict.E, verbose=FALSE))
# Example 2: Pass check
data(ExampleA)
values_check(DD.dict.A)
print(values_check(DD.dict.A, verbose=FALSE))
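Checks (1) and (2) above, the VALUE=MEANING format and spaces around the equals sign, can be approximated with simple pattern matching. `check_values_format` is a hypothetical helper, and the two patterns are an assumption about what counts as a violation; checks (3) and (4) need the TYPE column and are not covered here.

```r
# Hypothetical helper (assumption): flag entries with no "=" and entries
# with whitespace directly before or after an "=".
check_values_format <- function(entry) {
  data.frame(
    entry             = entry,
    has_equals        = grepl("=", entry, fixed = TRUE),
    space_near_equals = grepl("[[:space:]]=|=[[:space:]]", entry),
    stringsAsFactors  = FALSE
  )
}

fmt <- check_values_format(c("0=No", "1 = Yes", "2 Maybe"))
fmt
```

In this toy input, "1 = Yes" fails the spacing check and "2 Maybe" fails the format check.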
This function runs a workflow of the minimum number of checks required for a user to run values_check; the checks include field_check and type_check.
values_precheck(dict)
dict |
Data dictionary. |
Tibble containing the following information for each check: (1) Time (time stamp); (2) Name (name of the function); (3) Status (Passed/Failed); (4) Message (A copy of the message the function printed out); (5) Information (More detailed information about the potential errors identified).
data(ExampleB)
values_precheck(DD.dict.B)