Example - GUI version
=====================
This example illustrates the use of the GUI version of SHIRE based on the provided datasets.
In the following, the generation of a training dataset and a prediction dataset
is shown step by step. The generated input datasets are then used to create
a landslide susceptibility map.
**This example is designed for illustration purposes only! The produced map is not
intended for any analysis purpose. Caution is advised.**
The example illustrates how to use SHIRE for a binary assessment of the susceptibility to the occurrence of shallow
landslides in Switzerland.
Preliminary considerations
--------------------------
The first decision to be made is whether the GUI version or the Plain version
should be used. The following example shows the GUI version. The Plain version
is introduced in :doc:`example-plain`.
Datasets
---------
| All necessary datasets can be found in the GitLab repository in the examples folder.
| **Geospatial datasets**:
| *European Union's Copernicus Land Monitoring Service information:*
| Imperviousness Density 2018 (https://doi.org/10.2909/3bf542bd-eebd-4d73-b53c-a0243f2ed862)
| Dominant Leaf Type 2018 (https://doi.org/10.2909/7b28d3c1-b363-4579-9141-bdd09d073fd8)
| CORINE Land Cover 2018 (https://doi.org/10.2909/960998c1-1870-4e82-8051-6485205ebbac)
| All datasets were edited. Imperviousness Density 2018 and Dominant Leaf Type 2018 were merged from smaller tiles and then stored in a netCDF4 file.
| **Landslide database**:
| Data source: Hangmuren-Datenbank, Eidg. Forschungsanstalt WSL, Forschungseinheit Gebirgshydrologie & Massenbewegungen (status: October 2024).
| The spatial coordinates of the landslide locations were transformed into WGS84 coordinates using QGIS.
| **Absence locations database**:
| Randomly sampled locations outside of a buffer zone around the entries in the landslide database. The database contains more absence locations
  than will be integrated into the example. This is intentional, as both landslide and absence locations are removed during the training dataset
  generation process if one of their features contains a no-data value. Having additional absence locations available allows SHIRE to integrate
  the number of absence locations intended by the user.
| **Metadata files**:
| keys_to_include_examples.csv
| data_summary_examples.csv
Launching SHIRE
---------------
It is recommended to launch the GUI version of SHIRE from the command line:

.. code-block:: console

   (venv) $ python shire.py
.. figure:: _static/images/intro.png
   :scale: 80%
   :align: center
Then the window shown above opens. In this window, all basic settings are defined
that are relevant for training and prediction dataset generation as well as for model training and map generation.
As described in the user manual in the git repository, the desired option(s) need to be ticked at the top of the window.
Under **General settings**, you must provide:

- The desired resolution of the final susceptibility map. In this example we want to generate a map with 100 m resolution.
- The general no-data value indicating locations where a susceptibility assessment is not possible; here -999 is chosen.
- The coordinate reference system (CRS), which is important metadata. Coordinates in the geospatial datasets and in the landslide and absence locations databases are given in WGS84 coordinates.
- The random seed that makes the process reproducible, here 42.

**Save the settings for later use** can be ticked if the process needs to be repeated several times, or if you want to save the settings for later comparison. When rerunning the process, you can use the **Import settings** button and introduce only the changes to the mask.
The settings will be saved in a pickle file called *settings.pkl* in the folder that you specify when you click
**Submit** to proceed to the next step. Depending on the option(s) you ticked at the top, a different window opens.
Please be aware that SHIRE sticks to a specific order if several steps are initialized at once.
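If you want to double-check what was stored, the settings pickle can be inspected directly in Python. This is only a minimal sketch; the exact structure of the saved object depends on the options ticked and is an assumption here, not a documented interface.

.. code-block:: python

   import pickle

   # Inspect the settings saved by the GUI. The structure of the stored
   # object depends on the ticked options and may differ between runs.
   with open("settings.pkl", "rb") as f:
       settings = pickle.load(f)
   print(settings)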
.. raw:: html

   <div style="clear: both;"></div>
Training dataset generation
---------------------------
.. figure:: _static/images/training.png
   :scale: 70%
   :align: center
If a training dataset is to be generated, the window above opens.
If you click on the buttons given under **Provide path to:**,
a window opens to manually navigate to the individual files. Please
refer to the user manual for the necessary structure of the *dataset_summary.csv*
and the landslide and absence locations databases. You can also check the
datasets provided for this example.
The *keys_to_include.csv* file contains the feature keys specified in *data_summary.csv*
and defines which of the available datasets shall actually be used in the training dataset generation.
For this example, all three datasets are used.
Choose the directory where the generated training dataset is to be saved.
The dataset is automatically named *training.csv*. Beware that existing files with the same name are overwritten!
Under **Number of absence locations in the training dataset**, provide the number of absence locations you want
to integrate into your training dataset. If the number of absence locations provided equals the number of entries in the
landslide inventory, SHIRE assumes that this is intentional and keeps the 1:1 ratio even if some landslide
instances need to be removed due to missing geospatial information; the total number of absence locations is then reduced accordingly.
Here, the landslide inventory contains 762 entries; therefore, 762 absence locations shall also be integrated into the training dataset.
Tick **One-hot encoding of categorical variables** if the categorical features in the training dataset shall be
one-hot encoded. This means that a separate column is introduced into the training dataset for each class, following the naming convention
*<feature name>_<class number>* (see the sketch below). As currently only numeric raster data are supported by SHIRE, this is sufficient.
In this example, the box is not ticked, which means that the training dataset is ordinal encoded.
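For illustration only, the naming convention corresponds to what pandas produces with ``get_dummies``; SHIRE performs the encoding internally, and *landcover* is a made-up feature name, not one of the example datasets.

.. code-block:: python

   import pandas as pd

   # Illustration of the <feature name>_<class number> convention only;
   # SHIRE performs the encoding internally. "landcover" is a made-up
   # categorical feature with classes 1, 2 and 3.
   df = pd.DataFrame({"landcover": [1, 2, 3, 1]})
   print(pd.get_dummies(df, columns=["landcover"]))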
Furthermore, the naming information of the landslide inventory and absence locations database needs to be provided.
In the provided exemplary landslide inventory, the longitude values are contained in a column called *X*, the
latitude values in *Y*, and, as it is a Swiss dataset, the ID column is called *Ereignis-Nr*. The absence locations
database contains variables called *Longitude* and *Latitude* which contain the respective coordinates.
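A quick way to confirm that the databases expose the column names entered in the mask is to load them with pandas. The file names below are placeholders for the example datasets, not fixed SHIRE conventions.

.. code-block:: python

   import pandas as pd

   # File names are placeholders; adjust them to the example datasets.
   landslides = pd.read_csv("landslide_database.csv")
   absences = pd.read_csv("absence_locations.csv")
   assert {"X", "Y", "Ereignis-Nr"} <= set(landslides.columns)
   assert {"Longitude", "Latitude"} <= set(absences.columns)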
**Save the settings for later use** can be ticked if the process needs to be repeated several times, or if you want to save the settings for later comparison. When rerunning the process, you can use the **Import settings** button and introduce only the changes to the mask.
The settings will be saved in a pickle file called *settings_train.pkl* in the folder that you specify when you click
**Submit**. Submitting either starts the training dataset generation, if that is the only intention of the run, or opens the next settings window if you want to proceed with prediction dataset generation.
Under **Choose from:**, three different options are available. For this example, we assume that we want to generate a training dataset from scratch
and that there is no existing training dataset that we want to supplement or reduce. Therefore, we choose
**Generate training dataset from scratch**. There are three different compilation strategies implemented in SHIRE.
Here we choose the simplest and most cost- and time-effective option, **No interpolation**. With this option, the geospatial
datasets are not interpolated to the final resolution of the map before the properties of the landslide sites and
absence locations are extracted. After pressing **Submit**, if generating a training dataset is the only task of this run, a separate window
opens that provides a progress report.
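Once the run has finished, it can be worth taking a quick look at the generated file before moving on. A minimal sketch; the columns depend on the chosen features and naming:

.. code-block:: python

   import pandas as pd

   # Quick inspection of the generated training dataset.
   train = pd.read_csv("training.csv")
   print(train.shape)
   print(train.head())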
*Adding feature(s) to an existing training dataset:*
The *keys_to_include.csv* file then contains only the feature key, as listed in *data_summary.csv*,
that shall be added to the existing training dataset. The **Path to directory for storing the training dataset**
then needs to lead to the directory of the existing training dataset, which needs to be called *training.csv*.
Under **Choose from:**, **Add feature(s) to existing training dataset** now needs to be chosen. It is recommended to use the same
**Compilation using:** setting as for the other features in the existing training dataset.
*Deleting feature(s) from an existing training dataset:*
Similarly to adding features, for deleting them the *keys_to_include.csv* file contains only the
feature keys, as given in *data_summary.csv*, that shall be removed from the existing training dataset.
The **Path to directory for storing the training dataset** then needs to lead to the directory of the existing training dataset, which needs to be called *training.csv*.
Under **Choose from:**, **Delete feature(s) from existing training dataset** now needs to be chosen.
*Choosing One-hot encoding instead of ordinal encoding:*
Tick **One-hot encoding of categorical variables** for one-hot encoding of the categorical variables in the training dataset
instead of the default ordinal encoding.
*Choosing a different compilation approach:*
In contrast to the **No interpolation** option chosen in this example, the geospatial datasets can also be interpolated before
extracting the geospatial characteristics of the absence locations and landslide sites, in two different ways.
When choosing **Interpolation**, all geospatial datasets are cropped to the minimum spatial extent that contains all landslide sites and absence locations.
Then the cropped datasets are each interpolated to the same coordinate grid with the same spatial resolution as the final map.
Finally, the values are extracted and introduced as features into the training dataset.
This can be quite cost- and time-intensive depending on the size of the area in which landslides and absence locations
were collected. Therefore, **Clustering** can be used instead. The absence locations and landslide sites are spatially clustered, and
for each cluster the original dataset is cropped and individually interpolated to the desired resolution
before the feature values are extracted. As the spatial extents of the clusters are much smaller, this requires less computational
power and time for large areas (see the toy sketch below).
The interpolation when choosing **Interpolation** is automatically performed in one of three different ways, depending
on the original size of the geospatial dataset and its size after interpolation. For details, please see the associated
publications referenced in the git repository.
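To make the clustering idea tangible, the toy sketch below spatially clusters random point locations and reports the bounding box of each cluster. This is emphatically not SHIRE's internal implementation; the coordinates are made up and only roughly cover Switzerland.

.. code-block:: python

   import numpy as np
   from sklearn.cluster import KMeans

   # Toy illustration of the clustering idea, not SHIRE's implementation:
   # group point locations spatially so that each cluster's (much smaller)
   # bounding box can be cropped and interpolated separately.
   rng = np.random.default_rng(42)
   points = rng.uniform([5.9, 45.8], [10.5, 47.8], size=(100, 2))  # lon, lat
   labels = KMeans(n_clusters=4, random_state=42).fit_predict(points)
   for k in range(4):
       box = points[labels == k]
       print(f"cluster {k}: lon {box[:, 0].min():.2f}..{box[:, 0].max():.2f}, "
             f"lat {box[:, 1].min():.2f}..{box[:, 1].max():.2f}")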
.. raw:: html

   <div style="clear: both;"></div>
Prediction dataset generation
-----------------------------
.. figure:: _static/images/prediction.png
   :scale: 70%
   :align: center
If a prediction dataset is to be generated, the window above opens.
If you click on the buttons given under **Path to summary of geospatial data** and **Features include**,
a window opens to manually navigate to the *data_summary.csv* and *keys_to_include.csv* files. Provide the **Path to directory
for storing the prediction dataset** similarly to how it was done for the training dataset.
Tick **One-hot encoding of categorical variables** if the categorical features in the prediction dataset shall be
one-hot encoded. Careful: make sure that this decision is consistent with the training dataset.
As we are using ordinal encoding in this example, the box is not ticked.
Then provide the bounds of the area of interest: the top two lines give the east and west coordinates and the bottom
two the north and south coordinates. Here, the extents of Switzerland were chosen in accordance with the test case outlined
above.
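As an optional sanity check, one can verify that the chosen bounds enclose all inventory locations. The file name and extents below are placeholders for this example.

.. code-block:: python

   import pandas as pd

   # File name and extents are placeholders for this example.
   inv = pd.read_csv("landslide_database.csv")
   west, east = 5.9, 10.5
   south, north = 45.8, 47.8
   assert inv["X"].between(west, east).all()    # longitudes
   assert inv["Y"].between(south, north).all()  # latitudes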
**Save the settings for later use** can be ticked if the process needs to be repeated several times, or if you want to save the settings for later comparison. When rerunning the process, you can use the **Import settings** button and introduce only the changes to the mask.
The settings will be saved in a pickle file called *settings_pred.pkl* in the folder that you specify when you click
**Submit**. Submitting either starts the prediction dataset generation, if that is the only intention of the run, or opens the next settings window if you want to proceed with model training and map generation.
The **Submit** button only appears after choosing one of the options given under **Choose from:**, similar to
the process described for the training dataset generation. Here, as we want to generate a prediction dataset from scratch,
we choose the first option.
For more information on **Delete feature(s) from existing prediction dataset** and **Add feature(s) to existing prediction dataset**
see the same options described above for the training dataset generation. The process is identical.
.. raw:: html

   <div style="clear: both;"></div>
Susceptibility map generation
-----------------------------
.. figure:: _static/images/mapping.png
   :scale: 70%
   :align: center
Under **Path to training and prediction dataset**, the locations of the training and prediction
datasets on the local machine or an external hard drive need to be chosen manually in separate windows. Similarly to previous steps, under **Where
do you want the models to be stored**, choose the directory in which a folder shall be created that in the end contains
the mapping results. The **Folder name** can then be specified. The mask distinguishes between the model to save and the
model to load. This is because model training and mapping are separate processes which can be conducted independently. If
you conduct both at the same time, both fields need to contain the same folder name. However, if you are mapping using
a pretrained model, or only training without predicting, you can provide just the respective information.
Not all features that are contained in the training and prediction dataset might be needed for model training and mapping.
It is possible to drop features from the training and prediction dataset under **Features not to consider**. Here, we
do not want to remove any features from the training dataset; however, the prediction dataset still contains the spatial coordinates
within the area of interest for which an individual prediction will be made. As this information should not be part of
the mapping, it needs to be removed. The feature names need to be provided comma-separated, without any spaces.
Then we also need to provide the **Name of the label column in the training dataset**.
Finally, the Random Forest needs to be defined regarding the number of trees, the depth of the trees, and the evaluation criterion
(a conceptual sketch is given below).
For more information, see the documentation of scikit-learn's `Random Forest Classifier <https://scikit-learn.org/dev/modules/generated/sklearn.ensemble.RandomForestClassifier.html>`_.
Provide the **Size of the test dataset (0...1)** as well. For the values chosen in this example, refer to the image above.
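Conceptually, the mask configures something equivalent to the scikit-learn calls below. SHIRE runs the training internally; the hyperparameter values and the label column name shown here are placeholders, not recommendations.

.. code-block:: python

   import pandas as pd
   from sklearn.ensemble import RandomForestClassifier
   from sklearn.model_selection import train_test_split

   # Conceptual sketch only; SHIRE performs this internally.
   # Hyperparameters and the label column name are placeholders.
   train = pd.read_csv("training.csv")
   X = train.drop(columns=["label"])
   y = train["label"]
   X_train, X_test, y_train, y_test = train_test_split(
       X, y, test_size=0.25, random_state=42)
   model = RandomForestClassifier(
       n_estimators=100, max_depth=20, criterion="gini", random_state=42)
   model.fit(X_train, y_train)
   print("Test accuracy:", model.score(X_test, y_test))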
**Save the settings for later use** can be ticked if the process needs to be repeated several times, or if you want to save the settings for later comparison. When rerunning the process, you can use the **Import settings** button and introduce only the changes to the mask.
The settings will be saved in a pickle file called *settings_map.pkl* in the folder that you specify when you click
**Submit**. Submitting either starts the mapping process, if that is the only intention of the run, or launches training and/or prediction dataset
generation if several processes were initialized at the same time.
Before clicking **Submit**, it is necessary to choose **What do you want to do?**. It is possible to only train, to only map, or to
do both at the same time. When **Mapping** is ticked, it is possible to **Predict in parallel** to speed up the map generation.
.. raw:: html

   <div style="clear: both;"></div>
Final map, output files and validation information
--------------------------------------------------
.. figure:: _static/images/results.png
   :scale: 50%
   :align: center
The figure above shows the susceptibility map as returned by SHIRE (left) and the susceptibility map combined with the Swiss boundaries (right). The yellow areas are predicted as susceptible to landslide occurrence,
and blue shows stable areas. In the top right corner of the image on the left, the value is -999, i.e., the no-data value set in the initial GUI. This shows that in this area no prediction of the landslide
susceptibility was possible. The reason for this is that the geospatial datasets contain no information for this area either.
Each of the previously described steps has its own input files, which have been discussed and are described in the user manual.
When checking the folders of the training and prediction datasets as well as the folder where the training and prediction results are stored, it can be seen that
several new files were created.
**Beware!** The files produced in each run also depend on the chosen options, e.g., regarding the compilation strategy of the training dataset.
Most of the files are intended to support transparency and reusability.
**Pickle files:**
The pickle files created after the initialization of each step contain the settings chosen for the run. This provides documentation, transparency, and reproducibility.
**Training and prediction dataset generation:**
Of course, the main products of these steps are the training dataset as a csv file and the prediction dataset as a netCDF4 file.
Interpolation of the geospatial datasets, either for training or for prediction dataset generation, results in the generation of a pickle and a netCDF4 file called
*data_combined_<training/prediction>_<resolution>.<nc/pkl>*. The netCDF4 file contains the interpolated geospatial datasets and can be used for quality checking the interpolation result.
The pickle file contains the interpolation information.
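The netCDF4 file can be opened with xarray for such a visual check; the file name below assumes a training run at 100 m resolution and may differ for your run.

.. code-block:: python

   import xarray as xr

   # File name assumed for a training run at 100 m resolution.
   ds = xr.open_dataset("data_combined_training_100.nc")
   print(ds)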
**Model training and map generation:**
Inside the model folder, in this example called *Switzerland_Map*, several files are created (a short inspection sketch follows the list):

- *prediction.nc* contains the binary prediction result, with the value 1 indicating landslide susceptibility, 0 no landslide susceptibility, and -999 that no prediction was possible
- *prediction_results.csv* contains the prediction result for each individual location within the area of interest, before it was reshaped into the final map
- *pos_prediction_results.csv* contains only the locations predicted as susceptible to landslides
- *neg_prediction_results.csv* contains only the locations predicted as not susceptible to landslide occurrence
- *saved_model.pkl* contains the trained Random Forest model
- *model_params.pkl* contains metadata and model quality information
- *feature_importance.csv* contains the feature importance ranking as determined by the Random Forest algorithm
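For a first look at the results, the files can be opened as sketched below. The variable names inside *prediction.nc* are not documented here, so print the dataset once to check them.

.. code-block:: python

   import pandas as pd
   import xarray as xr

   # Inspect the mapping results; variable and column names may differ.
   ds = xr.open_dataset("Switzerland_Map/prediction.nc")
   print(ds)

   importance = pd.read_csv("Switzerland_Map/feature_importance.csv")
   print(importance.head())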