The KH populator package is a unified solution for harvesting remote metadata and importing it into the Knowledge Hub (KH). It shall consist of a set of harvesting pipelines developed by different members of the NFDI4Earth developer group.
## 1.1 Requirements overview
1. **Conformance to the NFDI4Earth data model**: The output of the pipelines conforms to the NFDI4Earth (meta) [data model](https://drive.google.com/file/d/1cWuEtz7kqKZ5nKfYYjgEV8M0TZuBnJzU/view?usp=share_link) (link will change in the future)
2. **Schedulable pipelines**: Pipelines can be scheduled to run with a job scheduler
3. **Independent pipelines**: Pipelines can be run independently of each other
4. **No duplicates**: Pipelines can be run repeatedly without creating duplicate data of any kind in the KH
5. **No data loss by overwriting**: Pipelines don't overwrite data in the KH that has been added by other pipelines or users
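To make requirements 2 and 3 concrete, a minimal sketch of how the package could expose each pipeline as an independently runnable CLI command is shown below. The pipeline name, registry, and `kh_populator` program name are illustrative assumptions, not the actual implementation:

```python
import argparse

# Hypothetical pipeline registry; the real pipelines live in
# kh_populator/pipelines and are assumed here as simple callables.
def harvest_example_source() -> str:
    return "harvested example source"

PIPELINES = {"example_source": harvest_example_source}

def main(argv=None) -> str:
    parser = argparse.ArgumentParser(prog="kh_populator")
    parser.add_argument("pipeline", choices=sorted(PIPELINES))
    args = parser.parse_args(argv)
    # Each pipeline runs on its own, so a job scheduler (e.g. cron)
    # can invoke any single pipeline on its own schedule.
    return PIPELINES[args.pipeline]()

if __name__ == "__main__":
    print(main())
```

Because every pipeline is triggered by its own command, a scheduler entry such as a cron line can run one pipeline without touching the others.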
## 1.2 Quality goals
The top four quality goals for this software project are:
...
...
A Python package is being developed which provides CLI commands to trigger individual pipelines.
* `kh_populator_domain` contains modules with domain-specific functions for harvesting and transforming external (meta)data sources - these functions should be called from the respective pipeline
* `kh_populator_logic` contains utility functions that may be required in different domain modules
* `kh_populator_model` contains classes which reflect the data model of the Knowledge Hub. For each individual information resource type that is being collected, a Python class must exist in `kh_populator_model` that defines the resource type's specific properties as Python instance variables, together with the (de)serialization between the Python class and RDF
| Requirement | How it is achieved |
| --- | --- |
| **Conformance to the NFDI4Earth data model** | Model ... |
| **Schedulable pipelines** | TODO |
| **Independent pipelines** | Every pipeline script (in `kh_populator/pipelines`) runs independently |
| **No duplicates** | After harvesting a resource object, every script must check whether this object already exists in the KH (preferably via a query based on a PID of the object, otherwise using, for example, the homepage URL, ...). Only if no such object exists is a new one created; otherwise the current object is fetched from the KH and updated |