Update home authored by Jonas Grieb
The KH populator package represents a unified solution to harvest remote metadata and import it into the Knowledge Hub (KH). It shall consist of a set of harvesting pipelines developed by different members of the NFDI4Earth developer group.
## 1.1 Requirements overview
1. **Conformance to the NFDI4Earth data model**: The output of the pipelines conforms to the NFDI4Earth (meta) [data model](https://drive.google.com/file/d/1cWuEtz7kqKZ5nKfYYjgEV8M0TZuBnJzU/view?usp=share_link) (link will change in the future)
2. **Schedulable pipelines**: Pipelines can be scheduled to run with a job scheduler
3. **Independent pipelines**: Pipelines can be run independently of each other
4. **No duplicates**: Pipelines can be run repeatedly without creating duplicate data of any kind in the KH
5. **No data loss by overwriting**: Pipelines do not overwrite data in the KH that has been added by other pipelines or users
## 1.2 Quality goals
The top four quality goals for this software project are:
A Python package is being developed which provides CLI commands to trigger individual pipelines:
* `kh_populator_domain` contains modules with domain-specific functions for harvesting and transforming external (meta)data sources; these functions should be called from the respective pipeline
* `kh_populator_logic` contains utility functions that may be required across different domain modules
* `kh_populator_model` contains classes which reflect the data model of the Knowledge Hub. For each information resource type that is collected, a Python class must exist in `kh_populator_model` that defines the specific properties of this resource type as instance variables, together with the (de)serialization between the Python class and RDF
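The `kh_populator_model` contract above can be sketched as follows. This is an illustrative example only: the class name `LearningResource`, its properties, and the hand-written Turtle serialization are assumptions; the actual package may well use an RDF library such as rdflib for (de)serialization.

```python
from dataclasses import dataclass


@dataclass
class LearningResource:
    """Hypothetical KH resource type; its properties are instance variables."""
    uri: str
    title: str
    homepage: str

    def to_turtle(self) -> str:
        """Serialize the instance to an RDF snippet in Turtle syntax."""
        return (
            f"<{self.uri}> a schema:LearningResource ;\n"
            f'    dcterms:title "{self.title}" ;\n'
            f"    schema:url <{self.homepage}> .\n"
        )
```

A real implementation would also define the reverse direction (parsing RDF back into the Python class), which is omitted here.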
# 5. Building Block View
![image](uploads/3002dbecd71d9e4d9d20110ff3c0516e/image.png)
Solutions mapped to architecture requirements:
**Requirement** | **Solution**
--------------- | ------------
**Conformance to the NFDI4Earth data model** | Model ...
**Schedulable pipelines** | TODO
**Independent pipelines** | Every pipeline script (in `kh_populator/pipelines`) runs independently
**No duplicates** | After harvesting a resource object, every script must check whether this object already exists in the KH (preferably via a query based on a PID of the object, otherwise using, for example, the homepage URL, ...). Only if no matching object exists is a new one created; otherwise the current object is fetched from the KH and updated
**No data loss by overwriting** | TODO...
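The "no duplicates" check from the table can be sketched as a small upsert helper. The KH client interface used here (`find_by_pid`, `find_by_homepage`, `create`, `update`) is an assumption made purely for illustration; only the PID-first lookup order is taken from the requirement above.

```python
def upsert_resource(kh, resource: dict):
    """Create the resource in the KH only if no matching object exists;
    otherwise fetch and update the existing object.

    `kh` is a hypothetical Knowledge Hub client; `resource` is a dict of
    harvested metadata.
    """
    existing = None
    if resource.get("pid"):
        # Preferred lookup: a persistent identifier of the object
        existing = kh.find_by_pid(resource["pid"])
    if existing is None and resource.get("homepage"):
        # Fallback lookup: e.g. the homepage URL
        existing = kh.find_by_homepage(resource["homepage"])
    if existing is None:
        return kh.create(resource)   # no match: create a new object
    existing.update(resource)        # match: merge harvested fields
    return kh.update(existing)       # write the updated object back
```

Running the same pipeline twice with this helper updates the existing object instead of creating a duplicate, which is exactly the repeatability the requirement demands.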