The KH populator package provides a unified solution for harvesting remote metadata and importing it into the Knowledge Hub (KH). It shall consist of a set of harvesting pipelines developed by different members of the NFDI4Earth developer group.
## 1.1 Requirements overview
- The output of the pipelines conforms to the NFDI4Earth [(meta)data model](https://drive.google.com/file/d/1cWuEtz7kqKZ5nKfYYjgEV8M0TZuBnJzU/view?usp=share_link) (link will change in the future)
- Pipelines can be scheduled to run with a job scheduler
- Pipelines can be run independently from each other
- Pipelines can be run repeatedly without creating duplicate data of any kind in the KH (one possible approach is sketched after this list)
- Pipelines don't overwrite data in the KH that has been added by other pipelines or users
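One possible way to satisfy the duplicate-free re-run requirement is to derive deterministic subject URIs from stable source identifiers, so that a repeated harvest writes to the same resource instead of creating a new one. The following is a minimal sketch of that idea; the base URI and helper function are illustrative assumptions, not part of the actual package.

```python
# A minimal sketch, not the actual kh_populator implementation: map each
# stable source identifier to one deterministic KH subject URI so that
# re-running a pipeline updates the same resource instead of duplicating it.
import hashlib

from rdflib import URIRef

KH_BASE = "https://example.org/kh/resource/"  # placeholder base URI


def deterministic_uri(source_id: str) -> URIRef:
    """Return the same subject URI for the same source identifier on every run."""
    digest = hashlib.sha256(source_id.encode("utf-8")).hexdigest()[:16]
    return URIRef(KH_BASE + digest)
```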
## 1.2 Quality goals
The top four quality goals for this software project are:
- Functional suitability
- Reliability
- Maintainability
- Compatibility
## 1.3 Stakeholders
- NFDI4Earth developers
- To a lesser degree: partners who collect metadata for the KH or provide a system from which metadata is harvested into the KH
# 2. Constraints
**Constraint** | **Explanation**
-------------- | ---------------
RDF output | Output of the harvesting must always be RDF so that it can be added to the KH (a minimal sketch follows the table)
Python | The general programming language for the pipelines is Python (the language preferred by the NFDI4Earth developers)
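Taken together, the two constraints mean that every pipeline ultimately produces RDF from Python code. A minimal sketch with `rdflib` could look as follows; the namespace and properties are placeholders, not the actual KH vocabulary.

```python
# Placeholder namespace and properties; the real vocabulary is defined by
# the NFDI4Earth (meta)data model.
from rdflib import Graph, Literal, Namespace, RDF, URIRef
from rdflib.namespace import RDFS

EX = Namespace("https://example.org/kh/")

graph = Graph()
repo = URIRef(EX["repository/example"])
graph.add((repo, RDF.type, EX.Repository))
graph.add((repo, RDFS.label, Literal("Example Repository")))

print(graph.serialize(format="turtle"))  # RDF output ready for the KH
```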
# 3. Context and Scope
The KH populator harvests metadata about different information resource types as defined by the KH (meta)data model (TODO: link), e.g. repositories, research organizations, datasets, ...
The data model of the KH populator must therefore be aligned exactly with the KH data model, and the output must pass validation against it.
The input consists of the different systems that get harvested. We refer to them as **source systems**. These systems usually provide an open, well-documented API via the Internet for harvesting. They can be SPARQL endpoints, specific REST APIs, or interfaces following standardized protocols for (meta)data exchange, e.g. OAI-PMH, OGC WCS/WMS, ...
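As an illustration of what harvesting from such a source system could look like, the following sketch fetches the first page of an OAI-PMH `ListRecords` response with `requests` and the standard library XML parser. The endpoint URL is a placeholder, and resumption-token paging is omitted for brevity.

```python
# A minimal OAI-PMH harvesting sketch; the endpoint is a placeholder and
# resumptionToken paging (needed for large result sets) is omitted.
import xml.etree.ElementTree as ET

import requests

NAMESPACES = {
    "oai": "http://www.openarchives.org/OAI/2.0/",
    "dc": "http://purl.org/dc/elements/1.1/",
}


def list_records(endpoint: str, metadata_prefix: str = "oai_dc"):
    """Yield (identifier, title) pairs from the first ListRecords page."""
    params = {"verb": "ListRecords", "metadataPrefix": metadata_prefix}
    response = requests.get(endpoint, params=params, timeout=30)
    response.raise_for_status()
    root = ET.fromstring(response.content)
    for record in root.findall(".//oai:record", NAMESPACES):
        identifier = record.findtext(".//oai:identifier", default="", namespaces=NAMESPACES)
        title = record.findtext(".//dc:title", default="", namespaces=NAMESPACES)
        yield identifier, title
```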
Additionally, the KH populator must provide the option to import metadata from TSV tables of manually collected metadata. The goal is to move to a completely web-based harvesting approach, but at least during the piloting phase of the KH, manually collected data plays an important role.
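Such a TSV import could be as simple as reading the table with the standard library and handing each row to the same transformation code the web-based pipelines use. The sketch below assumes each row maps cleanly to column/value pairs; the actual column names depend on the collected tables.

```python
# A minimal TSV import sketch; the exact columns of the manually collected
# tables are not fixed here, rows are returned as generic dicts.
import csv


def read_manual_metadata(path: str):
    """Yield one cleaned dict per row of a manually collected TSV table."""
    with open(path, newline="", encoding="utf-8") as handle:
        for row in csv.DictReader(handle, delimiter="\t"):
            yield {key: value.strip() for key, value in row.items() if key}
```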
# 4. Solution strategy
A Python package is being developed which provides CLI commands to trigger individual harvesting pipelines. The package is divided into subpackages to represent the logical structure:
- `kh_populator` is the main package; it contains the code that triggers the harvesting pipelines
- `kh_populator_domain` contains modules with domain-specific functions for harvesting and transforming external (meta)data sources; these functions should be called from the respective pipeline
- `kh_populator_logic` contains utility functions that may be required in different domain modules
- `kh_populator_model` contains classes that reflect the data model of the Knowledge Hub. For each information resource type that is collected, a Python class must exist in `kh_populator_model` that defines the specific properties of this resource type as Python instance variables as well as the (de)serialization from the Python class to RDF (a minimal sketch follows this list)
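A class in `kh_populator_model` might look like the following sketch; the `Repository` type, its properties, and the namespace are illustrative assumptions rather than the actual KH data model.

```python
# An illustrative kh_populator_model class; the resource type, properties,
# and namespace are placeholders for what the KH (meta)data model defines.
from rdflib import Graph, Literal, Namespace, RDF, URIRef
from rdflib.namespace import RDFS

EX = Namespace("https://example.org/kh/")  # placeholder namespace


class Repository:
    """One harvested repository; its properties are plain instance variables."""

    def __init__(self, uri: str, name: str, homepage: str):
        self.uri = URIRef(uri)
        self.name = name
        self.homepage = homepage

    def to_graph(self) -> Graph:
        """Serialize this instance into RDF that can be added to the KH."""
        graph = Graph()
        graph.add((self.uri, RDF.type, EX.Repository))
        graph.add((self.uri, RDFS.label, Literal(self.name)))
        graph.add((self.uri, EX.homepage, URIRef(self.homepage)))
        return graph
```

A pipeline would instantiate such a class from harvested values and merge the returned graph into its overall output.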