Example - GUI version
=====================

This example illustrates the usage of the GUI version of SHIRE based on provided datasets. 
In the following, the generation of a training dataset and prediction dataset 
is shown step by step. The generated input datasets are then used for creating
a landslide susceptibility map. 

**This example is designed for illustration purposes only! The produced map is not
intended for any analysis purpose. Caution is advised.**

The example illustrates how to use SHIRE for a binary assessment of the susceptibility to the occurrence of shallow
landslides in Switzerland.

Preliminary considerations
--------------------------
The first decision to make is whether to use the GUI version or the Plain version.
The following example shows the GUI version; the Plain version
is introduced in :doc:`example-plain`.


Datasets
---------

| All necessary datasets can be found in the Gitlab repository in the examples folder.

| **Geospatial datasets**:
| *European Union's Copernicus Land Monitoring Service information:*
| Imperviousness Density 2018 (https://doi.org/10.2909/3bf542bd-eebd-4d73-b53c-a0243f2ed862)
| Dominant Leaf Type 2018 (https://doi.org/10.2909/7b28d3c1-b363-4579-9141-bdd09d073fd8)
| CORINE Land Cover 2018 (https://doi.org/10.2909/960998c1-1870-4e82-8051-6485205ebbac)

| All datasets were preprocessed: Imperviousness Density 2018 and Dominant Leaf Type 2018 were merged from smaller tiles and then stored in a netCDF4 file.
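
One possible way to perform such a merge and store the result as netCDF4 (the file and variable names below are hypothetical, not the ones used to prepare the example datasets) is via rioxarray:

.. code-block:: python

   import rioxarray
   from rioxarray.merge import merge_arrays

   # merge two hypothetical raster tiles into a single array
   tiles = [rioxarray.open_rasterio(f) for f in ["tile_1.tif", "tile_2.tif"]]
   merged = merge_arrays(tiles)

   # store the merged raster as a netCDF4 file
   merged.to_dataset(name="imperviousness").to_netcdf("imperviousness_2018.nc")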

| **Landslide database**:
| Data source: Hangmuren-Datenbank, Swiss Federal Institute for Forest, Snow and Landscape Research WSL, research unit Mountain Hydrology and Mass Movements (status: October 2024).
| The spatial coordinates of the landslide locations were transformed into WGS84 coordinates using QGIS.
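
The same transformation can also be scripted; a minimal sketch using pyproj, assuming the original coordinates are given in the Swiss LV95 system (EPSG:2056):

.. code-block:: python

   from pyproj import Transformer

   # transform Swiss LV95 easting/northing into WGS84 longitude/latitude
   transformer = Transformer.from_crs("EPSG:2056", "EPSG:4326", always_xy=True)
   lon, lat = transformer.transform(2600000.0, 1200000.0)  # example coordinates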

| **Absence locations database**:
| Randomly sampled locations outside of a buffer zone around the entries in the landslide database. The database contains more absence locations than will be integrated into the example. This is intentional:
  both landslide and absence locations are removed during the training dataset generation process if one of their features
  contains a no-data value. Having spare absence locations available allows SHIRE to keep the number of absence locations
  as intended by the user.
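
A rough sketch of the sampling idea (not the actual script used; the file name, buffer size and bounding box below are assumptions):

.. code-block:: python

   import numpy as np
   import geopandas as gpd
   from shapely.geometry import Point

   # buffer zone around the known landslide locations (hypothetical file)
   landslides = gpd.read_file("landslides.gpkg")
   buffer_zone = landslides.buffer(0.01).unary_union  # buffer in degrees

   # randomly sample candidate points and keep those outside the buffer
   rng = np.random.default_rng(42)
   candidates = [Point(x, y) for x, y in zip(rng.uniform(5.9, 10.5, 2000),
                                             rng.uniform(45.8, 47.8, 2000))]
   absences = [p for p in candidates if not buffer_zone.contains(p)]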

| **Metadata files**:
| keys_to_include_examples.csv
| data_summary_examples.csv

Launching SHIRE
---------------

It is recommended to launch the GUI version of SHIRE from the command line:

.. code-block:: console

   (venv) $ python shire.py

.. figure:: _static/images/intro.png
   :scale: 80%
   :align: center

   Then the window shown on the left opens. In this window, all basic settings are defined
   that are relevant for training and prediction dataset generation as well as for model training and map generation.
   As described in the user manual in the git repository, the desired option(s) need to be ticked at the top of the window.
   
   Under **General settings**, you must provide:

   - The desired resolution for the final susceptibility map. In this example, we want to generate a map with 100 m resolution.
   - The general no-data value to indicate locations where susceptibility assessment is not possible; here, -999 is chosen.
   - The coordinate reference system (CRS), which is important metadata. Coordinates in the geospatial datasets and in the landslide and absence locations databases are given in WGS84 coordinates.
   - The random seed to make the process reproducible, here 42.

   **Save the settings for later use** can be ticked if the process needs to be repeated several times, or if you want to save settings for later comparison. When rerunning the process, you can use the **Import settings** button to introduce only changes to the mask.
   The settings will be saved in a pickle file called *settings.pkl* in the folder that you specify when you click
   **Submit** to proceed to the next step. Depending on the option(s) you ticked at the top, a different window opens.
   Please be aware that SHIRE sticks to a specific order if several steps are initialized at once.

.. raw:: html

   <div style="clear: both;"></div>
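
The saved settings can be inspected later; a small, hypothetical sketch (the exact contents of the pickle file depend on SHIRE's internals):

.. code-block:: python

   import pickle

   # load the settings saved when clicking Submit
   with open("settings.pkl", "rb") as f:
       settings = pickle.load(f)

   print(settings)  # e.g. resolution, no-data value, CRS, random seed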
   
Training dataset generation
---------------------------

.. figure:: _static/images/training.png
   :scale: 70%
   :align: center

   If a training dataset shall be generated, the window above opens.
   If you click on the buttons given under **Provide path to:**,
   a window opens to manually navigate to the individual files. Please
   refer to the user manual for the required structure of *data_summary.csv*
   and of the landslide and absence locations databases. You can also check the
   datasets provided for this example.
   
   The *keys_to_include.csv* file contains the feature keys specified in *data_summary.csv*
   and defines which of the available datasets shall actually be used in training dataset generation.
   For this example, all three datasets are used.
   
   Choose the directory in which you want to save the generated training dataset.
   The dataset is automatically named *training.csv*. Beware that existing files with the same name are overwritten!
   
   Under **Number of absence locations in the training dataset**, provide the number of absence locations you want
   to integrate into your training dataset. If the number of absence locations provided equals the number of entries in the
   landslide inventory, SHIRE assumes that this is intentional and keeps the 1:1 ratio: if landslide
   instances need to be removed due to missing geospatial information, the total number of absence locations is reduced accordingly.
   Here, the landslide inventory contains 762 entries; therefore, 762 absence locations shall also be integrated into the training dataset.
   
   Tick **One-hot encoding of categorical variables** if the categorical features in the training dataset shall be
   one-hot encoded. This means that a separate column is introduced into the training dataset for each class, following the naming convention
   *<feature name>_<class number>*. As SHIRE currently only supports numeric raster data, this is sufficient.
   In this example, the box is not ticked, which means that the training dataset is ordinal encoded.
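
   As a minimal, hypothetical illustration (not SHIRE's internal code), the
   difference between the two encodings can be shown with pandas:

   .. code-block:: python

      import pandas as pd

      # hypothetical categorical feature, e.g. a land cover class per location
      df = pd.DataFrame({"landcover": [1, 2, 2, 3]})

      # ordinal encoding: one column with integer class labels
      ordinal = df["landcover"].astype("category").cat.codes

      # one-hot encoding: one column per class, named landcover_1,
      # landcover_2 and landcover_3 (<feature name>_<class number>)
      one_hot = pd.get_dummies(df["landcover"], prefix="landcover")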
   
   Furthermore, the naming information of the landslide inventory and the absence locations database needs to be provided.
   In the provided exemplary landslide inventory, the longitude values are contained in a column called *X*, the
   latitude values in *Y*, and, as it is a Swiss dataset, the ID column is called *Ereignis-Nr*. The absence locations
   database contains variables called *Longitude* and *Latitude* which hold the respective coordinates.
   
   **Save the settings for later use** can be ticked if the process needs to be repeated several times, or if you want to save settings for later comparison. When rerunning the process, you can use the **Import settings** button to introduce only changes to the mask.
   The settings will be saved in a pickle file called *settings_train.pkl* in the folder that you specify when you click
   **Submit**. Clicking **Submit** either starts the training dataset generation, if that is the only intention of the run, or opens the next settings window if you want to proceed with prediction dataset generation.

   Under **Choose from:**, three different options are available. For this example, we assume that we want to generate a training dataset from scratch
   and that there is no existing training dataset that we want to supplement or reduce. Therefore, we choose
   **Generate training dataset from scratch**. There are three different compilation strategies implemented in SHIRE.
   Here we choose the simplest and most cost- and time-effective option, **No interpolation**. With this option, the geospatial
   datasets are not interpolated to the final resolution of the map before the properties of the landslide sites and
   absence locations are extracted. After pressing **Submit**, when only a training dataset is generated in this run, a separate window
   opens that provides a progress report.
   
   *Adding feature(s) to an existing training dataset:*
   The *keys_to_include.csv* file then only contains the feature key(s), as given in *data_summary.csv*,
   that shall be added to the existing training dataset. The **Path to directory for storing the training dataset**
   then needs to lead to the directory of the existing training dataset, which needs to be called *training.csv*.
   Under **Choose from:**, **Add feature(s) to existing training dataset** now needs to be chosen. It is recommended to use the same
   **Compilation using:** option as for the other features in the existing training dataset.
   
   *Deleting feature(s) from an existing training dataset:*
   Similarly to adding features, for deleting them the *keys_to_include.csv* file only contains the
   feature keys, as given in *data_summary.csv*, that shall be removed from an existing training dataset.
   The **Path to directory for storing the training dataset** then needs to lead to the directory of the existing training dataset, which needs to be called *training.csv*.
   Under **Choose from:**, **Delete feature(s) from existing training dataset** now needs to be chosen.
   
   *Choosing One-hot encoding instead of ordinal encoding:*
   Tick **One-hot encoding of categorical variables** for one-hot encoding of the categorical variables in the training dataset
   instead of the default ordinal encoding.
   
   *Choosing a different compilation approach:*
   In contrast to the **No interpolation** option chosen in this example, the geospatial datasets can also be interpolated before
   extracting the geospatial characteristics of the absence locations and landslide sites, in two different ways.
   When choosing **Interpolation**, all geospatial datasets are cropped to the minimum spatial extent that contains all landslide sites and absence locations.
   Then the cropped datasets are each interpolated to the same coordinate grid with the same spatial resolution as the final map.
   Finally, the values are extracted and introduced as features into the training dataset.
   This can be quite cost- and time-intensive, depending on the size of the area in which landslides and absence locations
   were collected. Therefore, **Clustering** can be used instead: the absence locations and landslide sites are spatially clustered, and
   for each cluster the original dataset is cropped and individually interpolated to the desired resolution
   before the feature values are extracted. As the spatial extents of the clusters are much smaller, this requires less computational
   power and time for large areas.
   The interpolation when choosing **Interpolation** is automatically performed in one of three different ways, depending
   on the original size of the geospatial dataset and its size after interpolation. For details, please see the associated
   publications referenced in the git repository.
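
   As a rough, hypothetical sketch of the idea behind the interpolation-based
   compilation (not SHIRE's actual implementation), a raster can be interpolated
   onto the map grid and then evaluated at the point locations:

   .. code-block:: python

      import numpy as np
      from scipy.interpolate import RegularGridInterpolator

      # placeholder raster on its native grid (grids and values are assumptions)
      lat = np.linspace(45.8, 47.8, 200)
      lon = np.linspace(5.9, 10.5, 400)
      raster = np.random.rand(lat.size, lon.size)

      # landslide/absence locations as (lat, lon) pairs, hypothetical values
      points = np.array([[46.5, 7.4], [47.0, 8.3]])

      # interpolate the raster and extract the feature value per location
      interp = RegularGridInterpolator((lat, lon), raster)
      feature_values = interp(points)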
   
   

.. raw:: html

   <div style="clear: both;"></div>


Prediction dataset generation
-----------------------------

.. figure:: _static/images/prediction.png
   :scale: 70%
   :align: center
   
   If a prediction dataset shall be generated, the window above opens.
   If you click on the buttons given under **Path to summary of geospatial data** and **Features include**,
   a window opens to manually navigate to the *data_summary.csv* and *keys_to_include.csv* files. Provide the **Path to directory
   for storing the prediction dataset** similarly to how it was done for the training dataset.
   
   Tick **One-hot encoding of categorical variables** if the categorical features in the prediction dataset shall be
   one-hot encoded. Careful: make sure that this decision is consistent with the training dataset.
   As we are using ordinal encoding in this example, the box is not ticked.
   
   Then provide the bounds of the area of interest: the top two lines give the east and west coordinates and the bottom
   two the north and south coordinates. Here, the extent of Switzerland was chosen in accordance with the test case
   outlined above.
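
   As a small sketch of the resulting regular prediction grid (the bounds and
   step size below are rough assumptions, not SHIRE's exact values):

   .. code-block:: python

      import numpy as np

      west, east = 5.9, 10.5     # approximate longitudinal bounds of Switzerland
      south, north = 45.8, 47.8  # approximate latitudinal bounds
      step = 0.001               # roughly 100 m in latitude

      lons = np.arange(west, east, step)
      lats = np.arange(south, north, step)

      # one susceptibility prediction is made per (lat, lon) grid cell
      lon_grid, lat_grid = np.meshgrid(lons, lats)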
   
   **Save the settings for later use** can be ticked if the process needs to be repeated several times, or if you want to save settings for later comparison. When rerunning the process, you can use the **Import settings** button to introduce only changes to the mask.
   The settings will be saved in a pickle file called *settings_pred.pkl* in the folder that you specify when you click
   **Submit**. Clicking **Submit** either starts the prediction dataset generation, if that is the only intention of the run, or opens the next settings window if you want to proceed with model training and map generation.
   
   The **Submit** button only appears after choosing one of the options given under **Choose from:**, similar to
   the process described for training dataset generation. Here, as we want to generate a prediction dataset from scratch,
   we choose the first option.
   
   For more information on **Delete feature(s) from existing prediction dataset** and **Add feature(s) to existing prediction dataset**
   see the same options described above for the training dataset generation. The process is identical.
   

.. raw:: html

   <div style="clear: both;"></div>

Susceptibility map generation
-----------------------------

.. figure:: _static/images/mapping.png
   :scale: 70%
   :align: center

   Under **Path to training and prediction dataset**, the locations of the training and prediction datasets
   on the local machine or an external hard drive need to be chosen manually in separate windows. Similarly to previous steps, under **Where
   do you want the models to be stored**, choose the directory in which a folder shall be created that in the end contains
   the mapping results. The **Folder name** can then be specified. The mask distinguishes between a model to save and a
   model to load. This is because model training and mapping are separate processes which are conducted independently. If
   you conduct both at the same time, both fields need to contain the same folder name. However, if you are mapping using
   a pretrained model, or only train without predicting, you can just provide the respective information.
   
   Not all features contained in the training and prediction dataset might be needed for model training and mapping.
   It is possible to drop features from the training and prediction dataset under **Features not to consider**. Here, we
   don't want to remove any features from the training dataset; however, the prediction dataset still contains the spatial coordinates
   within the area of interest for which the individual predictions will be made. As this information should not be part of
   the mapping, it needs to be removed. The feature names need to be provided comma-separated, without any spaces.
   Then we also need to provide the **Name of the label column in the training dataset**.
   
   Finally, the Random Forest needs to be defined in terms of the number of trees, the depth of the trees and the evaluation criterion.
   For more information, see the documentation of scikit-learn's `Random Forest Classifier <https://scikit-learn.org/dev/modules/generated/sklearn.ensemble.RandomForestClassifier.html>`_.
   Provide the **Size of the test dataset (0...1)** as well. For the values chosen in this example, refer to the image above.
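
   As a hedged sketch of the underlying scikit-learn workflow (SHIRE's exact
   parameter handling may differ; the column names and parameter values below
   are hypothetical):

   .. code-block:: python

      import pandas as pd
      from sklearn.ensemble import RandomForestClassifier
      from sklearn.model_selection import train_test_split

      train = pd.read_csv("training.csv")

      # drop the label and any features not to consider, e.g. coordinates
      X = train.drop(columns=["label", "xcoord", "ycoord"])
      y = train["label"]

      # size of the test dataset given as a fraction in (0...1)
      X_train, X_test, y_train, y_test = train_test_split(
          X, y, test_size=0.25, random_state=42)

      model = RandomForestClassifier(
          n_estimators=100,  # number of trees
          max_depth=20,      # depth of the trees
          criterion="gini",  # evaluation criterion
          random_state=42)
      model.fit(X_train, y_train)
      print(model.score(X_test, y_test))  # accuracy on the test dataset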
   
   **Save the settings for later use** can be ticked if the process needs to be repeated several times, or if you want to save settings for later comparison. When rerunning the process, you can use the **Import settings** button to introduce only changes to the mask.
   The settings will be saved in a pickle file called *settings_map.pkl* in the folder that you specify when you click
   **Submit**. Clicking **Submit** either starts the mapping process, if that is the only intention of the run, or launches training and/or prediction dataset
   generation if several processes were initialized at the same time.
   
   Before clicking **Submit**, it is necessary to choose **What do you want to do?**. It is possible to only train, only map, or
   do both at the same time. When **Mapping** is ticked, it is possible to **Predict in parallel** to speed up map generation.


.. raw:: html

   <div style="clear: both;"></div>

Final map, output files and validation information
--------------------------------------------------

.. figure:: _static/images/results.png
   :scale: 50%
   :align: center
   
   
   The figure above shows the susceptibility map as returned by SHIRE (left) and the susceptibility map combined with the Swiss boundaries (right). The yellow areas are predicted as susceptible to landslide occurrence
   and blue shows stable areas. In the top right corner of the image on the left, the value is -999, i.e., the no-data value set in the initial GUI. This shows that no prediction of the landslide
   susceptibility was possible in this area. The reason is that the geospatial datasets have no information available in this area either.

   Each of the previously described steps has its own input files, which have been discussed above and are described in the user manual.
   When checking the folders of the training and prediction datasets as well as the folder where training and prediction results are stored, it can be seen that
   several new files were created.

   **Beware!** The files produced in each run also depend on the chosen options, e.g., regarding the compilation strategy of the training dataset.

   Most of the files are intended to support transparency and reusability.

   **Pickle files:**
   The pickle files created after the initialization of each step contain the settings chosen for the run. This provides documentation, transparency and reproducibility.

   **Training and prediction dataset generation:**
   Of course, the main products of these steps are the training dataset as a CSV file and the prediction dataset as a netCDF4 file.
   Interpolation of the geospatial datasets, either for training or for prediction dataset generation, additionally results in a pickle and a netCDF4 file called
   *data_combined_<training/prediction>_<resolution>.<nc/pkl>*. The netCDF4 file contains the interpolated geospatial datasets and can be used for quality checking the interpolation result.
   The pickle file contains the interpolation information.
	  
   **Model training and map generation:**
   Inside the model folder, in this example called *Switzerland_Map*, there are several files (a small inspection sketch follows after this list):

	- *prediction.nc* contains the binary prediction result with the value 1 indicating landslide susceptibility, 0 no landslide susceptibility and -999 no prediction possible
	- *prediction_results.csv* contains the prediction result for each individual location within the area of interest before it was reshaped into the final map
	- *pos_prediction_results.csv* contains only the locations with landslide susceptibility predicted
	- *neg_prediction_results.csv* contains only the locations without predicted susceptibility to landslide occurrence
	- *saved_model.pkl* contains the trained Random Forest model
	- *model_params.pkl* contains metadata and model quality information
	- *feature_importance.csv* contains the feature importance ranking as determined by the Random Forest algorithm
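
   To take a first look at the mapping output, the netCDF file can be opened
   with xarray; the variable name inside *prediction.nc* is an assumption here:

   .. code-block:: python

      import xarray as xr

      pred = xr.open_dataset("Switzerland_Map/prediction.nc")
      result = pred["Result"]  # hypothetical variable name

      # mask the no-data value before counting susceptible cells
      valid = result.where(result != -999)
      print(int((valid == 1).sum()))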