
FAIR Dataspaces Quality Assurance and Validation

What is it?

The FAIR Data Quality Assurance and Workflows demonstrator uses the workflow engine provided by the source code hosting platform GitLab to analyze, transform and verify research data artifacts. Research data is assumed to be collected in the form of CSV files by an individual researcher or a group of researchers making use of the "social coding" paradigm to maintain their research data.

Maintainers:

How do I use it?

We intend for this library to be used in concert with GitLab's CI/CD. The library comes pre-installed in a Docker container. Once a .gitlab-ci.yml file is added to the repository, each pipeline run will look for data files and schema files, performing various validation and data checks in order to compile a report. This report is viewable at the following link:

https://fair-ds.pages.rwth-aachen.de/{Repository Name} (The exact link can be found in the repository under "Settings" -> "Pages")

The results of the validation and data checks are also viewable in the GitLab interface under a given pipeline, under the "Tests" tab.

Whenever a new commit is pushed to the repository, the entire pipeline will run again. (This behavior can be changed by editing the .gitlab-ci.yml file.)

We have provided a repository containing a basic version of the .gitlab-ci.yml file required to run this code on your own data here.
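
For orientation, such a .gitlab-ci.yml might look roughly like the sketch below. The image name and command are placeholders, not the real values; please take the actual file from the example repository.

# Hypothetical sketch only -- the image and command below are placeholders;
# use the example repository's .gitlab-ci.yml for the real values.
stages:
  - validate
  - deploy

validate:
  stage: validate
  image: registry.example.com/fair-ds/demonstrator:latest  # placeholder image
  script:
    - run-demonstrator                                     # placeholder command
  artifacts:
    paths:
      - out

pages:
  stage: deploy
  script:
    - cp -r out public   # GitLab Pages serves the "public" artifact as the report
  artifacts:
    paths:
      - public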

Setup

At a minimum, the demonstrator expects to see one data file in the data folder. The user may optionally provide additional files, as fits their data and use case. For details about the additional files which may be provided, please see the "Example Files" section below.

Locally

Data files and schemas can be stored directly in this repository under the "data" and "schemas" directories. The data must be in either CSV or Parquet format and the schemas must be in JSON format. These files may be organized in their respective folders in any structure desired.
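
For example, using the file names from the examples later in this document, a repository might be laid out like this:

.gitlab-ci.yml
settings.json
data/
    file_1.csv
    file_2.csv
schemas/
    schemas.json
    schema_1.json
    schema_2.json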

S3 Bucket

The data and schema files may be located in an S3 bucket, provided that the repository has the keys needed to access that data. In order for the demonstrator to use these keys, you must add them to the repository under Settings -> CI/CD Settings -> Variables. The demonstrator looks for two specific variables at run time:

  • AWS_ACCESS_KEY_ID
  • AWS_SECRET_ACCESS_KEY
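
These are the standard AWS credential variables, and most AWS client libraries read them from the environment automatically. As a minimal illustration with the boto3 Python package (not the demonstrator's actual code; the bucket name is hypothetical):

# boto3 picks up AWS_ACCESS_KEY_ID / AWS_SECRET_ACCESS_KEY from the
# environment, so no credentials need to be committed to the repository.
import boto3

s3 = boto3.client("s3")
response = s3.list_objects_v2(Bucket="my-research-data")  # hypothetical bucket
for obj in response.get("Contents", []):
    print(obj["Key"])  # e.g. data/file_1.csv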

Expected Behavior

- I provide no schemas for any of my datasets

We will infer a schema for each data file present and save it to the "schemas" folder. These schemas will be limited to feature names and data types. You may then edit them, following the frictionless standard, if you would like them to include other rules for validation. A sketch of this inference step follows these scenarios.

- I provide schema files for one or more of my datasets

We will attempt to assign the schemas to the correct data file(s) in the data folder. In the event none of the schemas applies to a given file, we will infer a basic schema from the dataset as though none had been provided.

- I provide schema files and a schema assignment file

We will validate that the schemas assigned to each dataset are valid. Any data sets which are assigned to a schema that does not apply will be given a new generic schema file.
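
The inference in the first scenario follows the frictionless standard used throughout this project. As a minimal sketch using the frictionless Python package (illustrative only, assuming frictionless v5; not the demonstrator's exact code):

# Infer a basic schema (column names and data types) for a data file and
# save it to the "schemas" folder -- a sketch, not the tool's implementation.
from frictionless import Schema

schema = Schema.describe("data/file_1.csv")
schema.to_json("schemas/file_1.json")

You may then extend the saved schema with constraints, as in the Schema File example below.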

Example Files

Settings File - settings.json (optional)

Placed in the root folder. Any settings in this file will override the defaults.

{
    "project_name": "My Project",
    "file_delimiter": ",",
    "project_folder": ".",
    "project_source": "local",
    "s3_bucket_name": null,
    "input": {
        "data_folder": "data",
        "schema_folder": "schemas",
        "schemas_file": "schemas.json"
    },
    "output": {
        "out_folder": "out",
        "log_folder": "logs",
        "package_file": "package.json"
    },
    "geodata": {},
    "crs": "EPSG:4326",
    "shape_file": "naturalearth_lowres",
    "shape_file_column": "name",
    "cache_file_size_threshold": 3e7
}

The "geodata" field maps a data file to the columns holding its coordinates, for example:

"geodata": {
    "data/my_file.csv": {
        "latitude": "latitude_column_name",
        "longitude": "longitude_column_name"
    }
}

Schema Assignment File (optional)

Placed in the "schemas" folder. Default filename "schemas.json"; this can be changed in settings.json. This file will be used to assign specific schemas to data files.

If a schema does not validate on an assigned data file, we will raise an error and no validation or quality checks will be performed on that data file.

If a data file is present but no schema is assigned to it in this file, we will attempt to identify which schema is most likely to apply. If no schema applies, we will generate a schema based on the file.

{
    "schemas/schema_1.json": [
        "data/file_1.csv",
        "data/file_2.csv"
    ],
    "schemas/schema_2.json": [
        "data/file_1.csv"
    ]
}
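
Note that in this example "data/file_1.csv" is assigned to two schemas; as described under "Error Messages" below, only the first assignment is honored and any later references to the file are ignored.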

Schema File (optional)

Placed in the "schemas" folder. Should adhere to the frictionless data standard. If no schema file is provided for a data file, we will infer a basic schema for each new format encountered (just column names and data types).

{
    "fields": [
        {
            "name": "index",
            "type": "integer",
        },
        {
            "name": "char",
            "type": "string",
            "constraints": {
                "required": True,
                "enum": ["X", "Y", "Z"],
                "minLength": 1,
                "maxLength": 1
            }
        },
        {
            "name": "str",
            "type": "string",
            "constraints": {
                "pattern": "[0-9A-F]{16}"
            }
        },
        {
            "name": "bool",
            "type": "boolean"
        },
        {
            "name": "date",
            "type": "date",
            "constraints": {
                "unique": True
            }
        },
        {
            "name": "float",
            "type": "number",
            "constraints": {
                "minimum": 0,
                "maximum": 1
            }
        },
        {
            "name": "int",
            "type": "integer",
            "constraints": {
                "minimum": 0
            }
        }
    ],
    "missingValues": [
        "NaN",
        "-",
        "",
        " "
    ],
    "primaryKey": "index"
}
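
To illustrate how such a schema is applied to a data file, here is a minimal sketch using the frictionless Python package (assuming frictionless v5; an illustration, not the demonstrator's exact code):

# Validate a data file against a schema file -- illustrative sketch only.
from frictionless import Resource, Schema

resource = Resource("data/file_1.csv",
                    schema=Schema.from_descriptor("schemas/schema_1.json"))
report = resource.validate()
print(report.valid)  # True when no schema violations were found
for task in report.tasks:
    for error in task.errors:
        print(error.message)  # itemized schema violations, as in the report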

Error Messages

A visual overview of the demonstrator process and the error messages which may arise can be found here.

Schema Inference

Load Schemas JSON

Success Messages ✔️

  • "Found Schemas JSON: {file path}"
    • We have found a valid schemas file in the schemas folder.

Warning Messages ❔

  • "No Schemas JSON File present: {file path}"
    • There was no schemas file present. If there are schema files present, we will attempt to assign them to the correct data files; however, if a given schema does not broadly validate on a particular file (same column names in the same order & valid data types), then that data file will be given a generic schema.

Error Messages ❌

  • "Invalid Schemas JSON File: {file path}"
    • We have found a schemas file, but it appears to be malformed in some way. Check that the file is in correct JSON format (see the Schema Assignment File section above for an example).

Validate Schema Files

Success Messages ✔️

  • "No Issues in Schemas JSON File: {file path}"
    • There were no issues identified when checking the files that make up the schema assignment file.
  • "All Files in {file path} are present in {file path}"
    • All of the files present in the schema assignment file are also present in the data folder.

Warning Messages ❔

  • "Missing Schema in {file path}"
    • One of the schema files listed in the schema assignment file does not exist. Any files that were associated with this schema will have their schemas inferred.
  • "Missing Data File: {file path}"
    • One of the data files listed in the schema assignment file does not exist. The file path may be incorrect, or the file may be missing. We do not currently make any attempt to locate these files.

Error Messages ❌

  • "Failed to validate file {file path} with {file path}"
    • One of the files assigned to a schema in the schema assignment file does not pass basic validation. Check that the schema has the same columns in the same order as in the data file.
  • "Validation Error: {file path} on {file path}"
    • A list of any validation errors noted when trying to apply the schema file to a data file. Note that a single error can compound here; for example, a missing column results in type errors for every column to the right.
  • "File Assigned to Multiple Schemas: {file path}"
    • A data file has been assigned to two or more schemas. Any additional references to this data file in the schema assignment file will be ignored.
  • "File "{file path}" could not be located"
    • A file specified in the schema assignment file could not be located.

Identify Schema

Success Messages ✔️

  • "Found Matching Schema for {file path}: {file path}"
    • One of the available schemas validated successfully against the given file. Note that this may be either a user-provided schema or a schema inferred from another file.

Warning Messages ❔

  • "No Schema Files in {file path} match {file path}"
    • None of the schemas currently in the schema folder match the given file.
  • "No Schema matches {file path}: Creating generic schema {file_path} from {file path} for this file"
    • As we could not find a valid schema, we generate a new schema file that applies to this file generically. No constraints are included in this schema; we only take the column names and generic data types.

Error Messages ❌ (There are no ERROR messages for this step at this time)

Schema Inference

Success Messages ✔️

  • "Saving Package File to {file path}"
    • We have created a frictionless package and saved the file to the indicated location. This file is used in later steps to keep track of which schemas go to which data files.

Warning Messages ❔ (There are no WARNING messages for this step at this time)

Error Messages ❌ (There are no ERROR messages for this step at this time)

CSV Schema Validation

Success Messages ✔️

  • "{file path}|Schema Validated ({file path})"
    • The schema associated with this file has been applied and there are no identified violations of the schema.

Warning Messages ❔ (There are no WARNING messages for this step at this time)

Error Messages ❌

  • "{file path}: Schema Violation Line {code}"
    • Itemized list of schema violations noted when trying to apply this schema to the file.

Data Quality Check

Success Messages ✔️

  • "{filename}: No Quality Issues Noted"
    • No data quality issues were noted in this file.

Warning Messages ❔

  • "Column "{column name}": No Values"
    • The indicated column contains no non-null values.
  • "Column "{column name}": >95% missing Values"
    • Almost all values in the indicated column are missing: more than 95% are null.
  • "Column "{column name}": Only contains a single value"
    • The indicated column contains only a single value.
  • "Column "{column name}": No Variation"
    • The indicated numerical column has a variance of 0.
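
For illustration, checks along these lines could be written as follows (a minimal pandas sketch, not the demonstrator's actual implementation):

# Reproduce the quality warnings above for one file -- illustrative only.
import pandas as pd

df = pd.read_csv("data/file_1.csv")  # path reuses the earlier examples
for col in df.columns:
    non_null = df[col].dropna()
    if non_null.empty:
        print(f'Column "{col}": No Values')
    elif df[col].isna().mean() > 0.95:
        print(f'Column "{col}": >95% missing Values')
    elif non_null.nunique() == 1:
        # one distinct value; for a numeric column this also means zero variance
        print(f'Column "{col}": Only contains a single value')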

Error Messages ❌ (There are no ERROR messages for this step at this time)