Handling high predictor dimensionality in slope- unit-based landslide susceptibility models through
LASSO-penalized Generalized Linear Model
Item Type Article
Authors Camilo, Daniela Castro;Lombardo, Luigi;Mai, Paul Martin;Dou, Jie;Huser, Raphaël
Citation Camilo DC, Lombardo L, Mai PM, Dou J, Huser R (2017) Handling high predictor dimensionality in slope-unit-based landslide susceptibility models through LASSO-penalized Generalized Linear Model. Environmental Modelling & Software 97: 145–156.
Available: http://dx.doi.org/10.1016/j.envsoft.2017.08.003.
Eprint version Post-print
DOI 10.1016/j.envsoft.2017.08.003
Publisher Elsevier BV
Journal Environmental Modelling & Software
Rights NOTICE: this is the author’s version of a work that was accepted for publication in Environmental Modelling & Software. Changes resulting from the publishing process, such as peer review, editing, corrections, structural formatting, and other quality control mechanisms may not be reflected in this document.
Changes may have been made to this work since it was submitted for publication. A definitive version was subsequently published in Environmental Modelling & Software, [97, , (2017-08-30)] DOI:
10.1016/j.envsoft.2017.08.003 . © 2017. This manuscript version is made available under the CC-BY-NC-ND 4.0 license http://
creativecommons.org/licenses/by-nc-nd/4.0/
Download date 2023-12-01 17:26:52
Link to Item http://hdl.handle.net/10754/626008
Castro Camilo et al. 2017 - Environmental Modelling and Software - doi={10.1016/j.envsoft.2017.08.003}
Handling high predictor dimensionality in slope-unit-based landslide susceptibility models through LASSO-penalized Generalized Linear Model
Daniela Castro Camiloa, Luigi Lombardoa,b,c,∗, P. Martin Maib, Jie Doud, Rapha¨el Husera
aCEMSE Division, King Abdullah University of Science and Technology, Thuwal, Jeddah, Saudi Arabia.
bPSE Division, King Abdullah University of Science and Technology, Thuwal, Jeddah, Saudi Arabia.
cDepartment of Geosciences, T¨ubingen University, T¨ubingen, Germany.
dPublic Works Research Institute, Japan.
Abstract
Grid-based landslide susceptibility models at regional scales are computationally demanding when using a fine grid resolution. Conversely, Slope-Unit (SU) based susceptibility models allows to investigate the same areas offering two main advantages: 1) a smaller computational burden and 2) a more geomorphologically-oriented interpretation. In this contribution, we generate SU-based landslide susceptibility for the Sado Island in Japan. This island is characterized by deep-seated landslides which we assume can only limitedly be explained by the first two statistical moments (mean and variance) of a set of predictors within each slope unit. As a consequence, in a nested experiment, we first analyse the distributions of a set of continuous predictors within each slope unit computing the standard deviation and quantiles from 0.05 to 0.95 with a step of 0.05. These are then used as predictors for landslide susceptibility. In addition, we combine shape indices for polygon features and the normalized extent of each class belonging to the outcropping lithology in a given SU. This procedure significantly enlarges the size of the predictor hyperspace, thus producing a high level of slope-unit characterization. In a second step, we adopt a LASSO-penalized Generalized Linear Model to shrink back the predictor set to a sensible and interpretable number, carrying only the most significant covariates in the models. As a result, we are able to document the geomorphic features (e.g., 95% quantile of Elevation and 5% quantile of Plan Curvature) that primarily control the SU-based susceptibility within the test area while producing high predictive performances. The implementation of the statistical analyses are included in a parallelized R script (LUDARA) which is here made available for the community to replicate analogous experiments.
Keywords: High predictor dimensionality, Slope Unit-based landslide susceptibility, Lasso penalized variable selection, R scripting, Open sourcing
1. Introduction
1
The landslide susceptibility (LS, hereafter) repre-
2
sents the relative spatial probability of landslide occur-
3
rence (Chung and Fabbri,1999;Guzzetti et al.,2006b)
4
that is typically computed through different statistical
5
approaches and over a geographic space which can be
6
discretized into different mapping units (Carrara et al.,
7
1995;Guzzetti et al.,1999).
8
A mapping unit refers to a portion of land surface
9
with analogous geologic and/or geomorphic properties
10
that differ from its surroundings across definable bound-
11
aries (Hansen, 1984). The advancements in LS as-
12
sessment over the last three decades have primarily re-
13
flected developments on the two aforementioned top-
14
ics where: 1) various statistical approaches have been
15
∗Corresponding author
Email address:[email protected](Luigi Lombardo)
tested (Cama et al., 2015; Conoscenti et al., 2016),
16
compared (Felic´ısimo et al., 2013; S¨uzen and Doyu-
17
ran,2004) and reviewed (Aleotti and Chowdhury,1999;
18
Brenning,2005;Van Westen et al.,2006) while 2) dif-
19
ferent mapping units have been adopted (Galli et al.,
20
2008;Van Westen,2000), their respective effects eval-
21
uated (Arnone et al., 2016; Carrara et al., 2008; Van
22
Den Eeckhaut et al., 2009) and broadly summarized
23
into four categories: grid cells (Lombardo et al.,2015),
24
terrain units (Calvello et al., 2013), unique condition
25
units (Chung et al.,1995) and slope units (SU) (Guzzetti
26
and Reichenbach,1994). Each of these mapping units
27
offer advantages and drawbacks which were already de-
28
scribed byCarrara et al. (1995). Landslide automatic
29
recognition (McKean and Roering, 2004; Nichol and
30
Wong,2005;Tarolli et al.,2012), remote sensing (Ab-
31
dulwahid and Pradhan,2016;Lombardo et al.,2016a)
32
and deterministic (An et al.,2016;Iovine et al.,2007)
33
based predictions are also regarded as important topics
34
&
within the geomorphological community for suscepti-
35
bility purposes.
36
Despite the long-term efforts and the rich literature,
37
the adoption of the best mapping unit when assessing
38
the LS is still unclear (Guzzetti et al.,2000). Few ex-
39
amples exist where different mapping units were tested
40
for the same area. Van Den Eeckhaut et al. (2009)
41
compared grid-based and SU-based structures obtaining
42
analogous spatial patterns and performances. In par-
43
ticular, the grid structure had a slightly higher predic-
44
tion skill but a lower robustness. Conversely,Erener and
45
D¨uzg¨un(2012) denoted an increase in predictive perfor-
46
mances when using the SU. The reason behind these
47
results may be suggested by Rotigliano et al. (2012)
48
who highlighted that single cell values are less repre-
49
sentative when considering phenomena involving por-
50
tion or whole slopes, thus making the SU the correct
51
spatial domain to operate upon. Assuming that SU is a
52
reliable mean for partitioning a given area for landslide
53
spatial prediction purposes, we conducted a literature
54
review on the topic. We found that the whole commu-
55
nity operates by summarizing then-pixels contained in a
56
given SU by using just their mean and variance or stan-
57
dard deviation (e.g.,Guzzetti et al.,2006a;Rossi et al.,
58
2010;Zhou et al.,2015) and in fewer cases adding the
59
range (Reichenbach et al.,2014). The distribution of the
60
adopted predictors within each SU has never been inves-
61
tigated more in detail, despite the fact that thousands of
62
pixels may be contained in a SU, making their represen-
63
tation through just two values potentially inadequate.
64
Following this observation, we first hypothesize that
65
the two aforementioned statistical moments within a
66
given slope unit can only offer part of the information
67
for modelling large landslides. To be able to capture
68
potentially non-linear effects, and to better assess the in-
69
fluence of thewholepredictors’ distribution, we repre-
70
sent each continuous predictor by their 5th to 95th per-
71
centiles (computed for each SU) using a 5% quantile
72
step. As a result of this procedure, the size of the pre-
73
dictor hyperspace proportionally increased by one order
74
of magnitude and high multicollinearity among predic-
75
tors was induced. In order to reduce and extract only the
76
most relevant covariates, we secondly applied a LASSO
77
(Least Absolute Shrinkage and Selection Operator) pe-
78
nalized Generalized Linear Model. This is also different
79
from the other contributions in the literature as the com-
80
munity unanimously adopts a stepwise framework for
81
variable selection (Baeza and Corominas,2001;Cama
82
et al.,2016; Capitani et al.,2013; Goetz et al.,2015;
83
S¨uzen and Kaya,2012). The advantage of LASSO is
84
that model selection and parameter estimation are per-
85
formed simultaneously, while the LASSO penalty can
86
be adjusted for optimal prediction.
87
As a result, the main novelty in this paper consists
88
of testing the general assumption that mean and vari-
89
ance are good indicators of the whole distribution of any
90
continuous covariate in the model within a given slope
91
unit. If this assumption holds we expect median or near-
92
median quantiles to emerge from the variable selection
93
process. However, if quantiles significantly different
94
from the median are selected, then the final suscepti-
95
bility might differ from the one generated with the tra-
96
ditional mean values. In this framework, the introduc-
97
tion of an untested variable selection tool also represents
98
a further improvement in the literature as the stepwise
99
selection procedure is already known for many weak-
100
nesses in the statistical community despite its wide use
101
in the geomorphological one.
102
2. Materials and methods
103
2.1. Study area
104
The study area is located in a mountainous region of
105
Sado Island of Niigata Prefecture, in the Japan Sea (Fig-
106
ure1(a)). Sado Island is composed of two parallel, elon-
107
gated ridges approximately in a northeast to southwest
108
direction. These are locally named Osado and Kosado
109
and cover approximately 856km2. The climate is hu-
110
mid, subtropical with warm summers and cold winters.
111
The average temperature is 13.7◦C and the mean annual
112
rainfall is around 1550mm (Yamagishi,2008). The el-
113
evation varies from sea level to 1172m with a mean of
114
333m. The highest point of the island coincides with
115
the peak of Mt. Kimpoku in Osado. The local geol-
116
ogy is primarily represented by Neogene terrestrial and
117
marine volcanic rocks of rhyolitic, dacitic and andesitic
118
nature associated with pyroclastites and intrusives rhy-
119
olites. Most of the coastal slopes are rocky, and some
120
shores are recently formed by semi-consolidated and
121
unconsolidated sand and gravel. Being characterized by
122
mountains, hills upheaved benches and active faults, the
123
study area is prone to landsliding and several damages
124
are reported (Yamagishi,2008). In the Osado area, the
125
primary cause for landslides is represented by rainfall
126
and partly by snow melting.
127
The exposure to high intensity precipitation is mainly
128
concentrated between June-July and November-January
129
as shown in Figure 2 where TRMM rainfall esti-
130
mates (https://pmm.nasa.gov/trmm) are summa-
131
rized over 14 years to highlight seasonal trends. The
132
aforementioned wet intervals coincide with times when
133
displacements have been measured by Ayalew et al.
134
(2005a) for specific landslide cases coinciding with
135
Figure 1: Study area: Geographic location (a), Sado Island (b), detail of the landslide inventory (red polygons) and head scarps (yellow polylines).
strong groundwater level rise. In particular, Ayalew
136
et al. (2005a) highlight the role of heavy rainfall be-
137
tween October, 2000, and January, 2001, for the mo-
138
bilization of Shiidomari landslide located to the East of
139
the Island.
140
2.2. Landslide inventory
141
In this contribution we test our method using a well-
142
studied landslide archive. The archive for Sado Island
143
(Figure 1(b)) provided such conditions as it has been
144
validated over several articles (Ayalew et al.,2005a,b;
145
Dou et al.,2015b,a). The landslide inventory was built
146
by the National Research Institute for Earth Science
147
and Disaster Prevention (NIED) in Japan, being mapped
148
through multiple aerial photographs since the year 2000.
149
The NIED digitized the landslides scars and made them
150
available as polygon shapefiles (Figure 1(c)). 1538
151
landslides are mapped within the island affecting both
152
the ridges with analogous distributions along the rocky
153
coastline whilst a greater number of occurrences char-
154
acterizes the northern area. They account for a to-
155
tal surface of 91.5km2 which represents 11.1% of the
156
whole surface area. From a geomorphometric perspec-
157
tive (Figure 3(a)-(f)) the landslide distribution is pos-
158
itively skewed and heavy-tailed, with the majority of
159
the mass movements having similar characteristics and
160
fewer much more extreme cases due to their size. In
161
term of individual sizes (Figure3(a)), 84% of the land-
162
slides extent for more than 10000m2. This attests for
163
the significant area exposed to failures and indicates the
164
almost unique landslide class considered in the present
165
study.
166
The landslide types are reported as deep-seated, ei-
167
ther rotational or translational slides, and a limited num-
168
ber of the rock falls which were excluded for the present
169
research. The geological control over landslides appears
170
evident when intersecting the landslide scars with the
171
underlying geology (Figure3(g)). Overall, 65% of the
172
landslides occur on volcanic andesite followed by 23%
173
on dacite and 7.6% on sandstone. These lithotypes are
174
severely fractured as the whole island is subjected to ac-
175
tive faulting since the early and middle Miocene when
176
Figure 2: Average rainfall cumulative curves computed between years 2000-2014 (a). Average monthly rainfall computed between years 2000-2014 (b). Average weekly rainfall computed between years 2000-2014 (c). Light blue colors refer to the TRMM pixel falling to the North of the Sado island whereas red represents TRMM data from the south. Lines represent mean data across 14 years while areas are 95% confidence intervals.
Overlapping confidence intervals are shown in a purple color.
the extensional stress due to the Japan Sea opening (Jo-
177
livet et al., 1992) reached the Island producing the
178
graben structure between the Osado and Kosado ridges.
179
These lineaments cuts through Oligocene to Middle
180
Miocene volcanics and volcanoclastic rocks (Ganzawa,
181
1983). Towards the end of the Miocene, the stress
182
regime even switched to compressive in conjunction
183
with the interruption of the sea opening (Jolivet et al.,
184
1991) giving rise to thrusts that further weakened the
185
rock mass. The whole inventory provided by NIED
186
lacks systematic classification of landslide typologies
187
as their sizes hinder investigations for each case. This
188
is actually available for few landslides while the ma-
189
jority is defined as deep-seated without discriminating
190
between rotational or translational mechanisms. Fur-
191
thermore, the inventory lacks the temporal information
192
necessary for evolution assessments in relation to the
193
precipitation regime.
194
2.3. Model building strategy
195
2.3.1. Slope Units and status assignment
196
Among the mapping units, a SU represents the re-
197
gion of space delimited between ridges and valleys un-
198
der the constraint of homogeneous slope aspect and
199
steepness distributions (Carrara et al.,1991;Tian et al.,
200
2010). Thus, it physically describes either the left or
201
right side of a sub-basin of any order into which a wa-
202
tershed is subdivided (Erener and D¨uzg¨un,2012). As
203
a consequence, it naturally reflects the morphodynamic
204
response of a given slope in case of landslide occur-
205
rence (Huabin et al.,2005) making it possible to avoid
206
the shortcomings of low geomorphological represen-
207
tativeness of grid-based susceptibility mapping. This
208
property makes the SU particularly suitable for mod-
209
elling slope instabilities due to large landslides (Be-
210
guer´ıa and Lorente,2007). Therefore, we adopt the SU
211
as a mean to represent the deep-seated instabilities dom-
212
inating the failure mechanisms within the Sado Island.
213
It is worth mentioning that SU has been commonly used
214
for rainfall triggered landslides whereas few contribu-
215
tions can be found on their seismically induced counter-
216
part (e.g., Zhou et al.,2015). The reason for this may
217
be due to the actual size of the latter type spanning over
218
more than one slope unit, thus adding complexity to the
219
mapping unit choice.
220
The SUs (Figure 4) have been computed by us-
221
ing ther.slopeunitssoftware developed at the Geomor-
222
phology Research Group (http://geomorphology.
223
irpi.cnr.it/tools/slope-units) and recently ex-
224
plained in details byAlvioli et al.(2016). This Python
225
code is accessible through GRASS GIS (Neteler and
226
Mitasova, 2013) and allows for automatic delineation
227
of SUs from a Digital Elevation Model (DEM). The
228
code initially uses the r.watershed script (Metz et al.,
229
2011) to obtain a first-order delineation of half-basins.
230
Subsequently, it iteratively tries to subdivide the par-
231
ent half-basins into smaller half-basin children and ac-
232
cepts or refuses the new partition under a rule-based
233
criterion. This criterion comprises two distinct aspects.
234
First, the homogeneity of terrain aspect within a SU is
235
controlled through a parameter defined ascircular vari-
236
ance. Second, the acceptable size of the SU is deter-
237
mined as a function offlow accumulation (FA, Jenson
238
and Domingue,1988)thresholdandminimum SU area.
239
Finally the units below a given extent (orcleansize) are
240
dissolved by automatically merging them to the adjacent
241
units.
242
The final parameterization of r.slopeunits has been
243
Figure 3: Landslide characteristics: area (a), perimeter (b), maximum distance within a polygon (c), perimeter to area ratio (d), maximum distance to area ratio (e) shape Index (f) computed followingForman and Gordon(1986) and landslide to lithotype surfaces ratio (g) whose extent is swhon in the small map and overimposed with red landslides.
chosen among 9 possible combinations of circular vari-
244
ance (0.2, 0.4, 0.6), flow accumulation threshold inm2
245
(100·103, 250·103, 500·103) keeping the minimum
246
area and the cleansize parameters as constant (50·103
247
and 25·103 m2, respectively). Figure SM1 shows the
248
aforementioned combinations, including the selected
249
one which is computed with a FA threshold of 100·103
250
m2and a circular variance of 0.2. The final parameter-
251
ization has been decided on a qualitative level by com-
252
paring the partition to the underlying slope aspect which
253
resulted into 8239 SUs. One of the advantages of adopt-
254
ing slope units is that the high dimensionality of grid-
255
based prediction is summarized from million pixels to
256
thousands SUs thus reducing the associated computa-
257
tional burden.
258
We extract the Landslide Identification Points (LIPs,
259
hereafter) by calculating the centroid of each of the
260
1538 mass movements in the Island. The LIP calcu-
261
lation must follow a geomorphological criterion which
262
for the case of deep-seated landslides is commonly
263
adopted to be the centroid (Davis and Sims, 2013;
264
Oliveira et al.,2015;Xu et al.,2012). However, other
265
Figure 4: Slope-unit subdivision of Sado Island is underlain by the slope aspect. The right sub-panel shows the detail for a small region. The legend is valid for the whole island and the zoomed area.
landslide classes require different representation ap-
266
proaches (Hussin et al.,2016; Lombardo et al.,2014;
267
S¨uzen and Doyuran, 2004). We then intersect the
268
LIPs with the SU and assign them with a positive sta-
269
tus. Therefore, each landslide belongs to one and only
270
one SU. Conversely, every SU that has not crossed a
271
LIP within its boundary has been assigned with a neg-
272
ative status. This operation ensures the creation of a
273
SU-based dichotomous dependent variable expressing
274
landslide presences/absences across the island.
275
2.3.2. Independent variables
276
We represent the predictor distribution within a given
277
SU through quantiles and use them as predictors. In
278
particular, we compute for each SU the p-quantiles
279
with probabilitiesp=0.05,0.10, . . . ,0.95. This should
280
provide a near-complete description of the distribu-
281
tion of each predictor within SUs, but induces high-
282
multicollinearity which we deal with using a LASSO
283
penalty; see Section 2.4 and Supplementary material
284
for details. In terms of predictor choice, we adopt
285
part of the covariates which were used by Dou et al.
286
(2015b,a) to explain the landslide occurrences in the
287
same area. This is done to enable comparisons across
288
different studies as the landslide data is the same. In
289
particular, we adopt the following covariates: i) El-
290
evation (10m DEM), ii and iii) Eastness and North-
291
ness (Leempoel et al., 2015), iv and v) Plan and Pro-
292
file Curvature (Heerdegen and Beran, 1982), vi) Dis-
293
tance to Faults (Pourghasemi et al., 2012), vii) Dis-
294
tance to Geological Boundaries (Dou et al.,2015a), viii)
295
Relative Slope Position (B¨ohner and Selige,2006), xi)
296
Slope (Zevenbergen and Thorne,1987), x) Topographic
297
Wetness Index (Beven and Kirkby(1979), TWI, here-
298
after). Most of the covariates are derived from the
299
DEM obtained from the Geospatial Information Author-
300
ity of Japan, GSI (https://www.hcc.co.jp/work/
301
gismap/youtobetsu/terrain.html). The actual
302
data provider is Hokkaido-chizu Co. Ltd., a private
303
company that reports a 0.1m vertical resolution for the
304
whole Japan. Furthermore, we have included the out-
305
cropping lithology (scale 1:200,000) produced by the
306
GeoNavi of the Geological Survey of Japan (GSJ). For
307
the latter, we compute the proportion between each
308
lithotype outcropping extent within a given SU and the
309
full extent of the SU itself. This procedure differs
310
from the common practice of getting the predominant
311
lithotype only within a SU (e.g.Tian et al.,2010) and
312
provides a transformation from the original categorical
313
variable to its continuous counterpart expressed through
314
the lithotype to SU surface-ratio. Finally, we have also
315
computed two morphometric indices: the maximum
316
possible distance in a given SU divided by the square
317
root of the area within a given SU (maxD/√ A, here-
318
after) and the perimeter divided by the square root of
319
the area within a given SU (P/√
A, hereafter), which
320
are measures of roundness of each SU. These indices
321
are computed using the Polygon Shape Indices Module
322
available in SAGA GIS (Conrad et al.,2015) Overall,
323
from an initial set of 13 predictors we generated 273 of
324
them. This number is equal to 13 covariates times 20
325
(19 quantiles+standard deviation) plus 2 morphomet-
326
ric indices and 11 out of 12 lithology to SU ratios. We
327
excluded the ratio of the most representative lithotype in
328
the island, as it can be expressed as a linear combination
329
of the other lithotype ratios, and this avoids producing
330
a rank-deficient design matrix. In the next section, we
331
discuss how to assess the importance of each predictor
332
and how to perform variable selection, in order to get
333
back to a reasonable and interpretable number of signif-
334
icant and relevant predictors. The physical meaning and
335
interpretation of these covariates may be summarised as
336
follows: i) slope orientation and subsequent potential
337
evapotranspiration effects through Eastness and North-
338
ness. These are the continuous counterpart of the As-
339
pect which, being a cyclic number has historically be-
340
ing used as a categorical covariate. Recent contributions
341
(e.g.,Steger et al.,2016) are using those by respectively
342
calculating sin(Aspect) and cos(Aspect) to produce con-
343
tinuous covariates whose domain are bounded between
344
-1 and 1. For Eastness values equal to 1 indicate expo-
345
sition facing East whereas -1 faces West. In analogy,
346
for Northness values equal to 1 are facing North and -
347
1 are exposed to the South. The elevation is typically
348
used asii)a proxy for rainfall intensities while Plan and
349
Profile curvature are interpretable foriii)their influence
350
in directing overland water flows (Ohlmacher, 2007).
351
Distance to Fault indicates iv) pre-existing weakness
352
planes whereas Distance to Geological Boundaries may
353
complement this structural information by indicatingv)
354
contacts between lithotypes with different geotechnical
355
properties. Relative Slope Position represent the contin-
356
uous counterpart to the more general Landform Classifi-
357
cation, by expressingvi)pixel positions along the land-
358
scape with values equal to 0 for lowland areas and val-
359
ues equal to 1 for ridges. Finally, Topographic Wetness
360
Index expressesvii)the tendency of a given pixel to re-
361
tain water and subsequently increase pore water pres-
362
sures.
363
2.4. LASSO-penalized generalized linear model
364
In terms of modelling, our goal is to link the dichoto-
365
mous dependent variable expressing landslide pres-
366
ences/absences with the set of predictors (or indepen-
367
dent variables) described in Section 2.3.2. A natu-
368
ral framework for this is provided by logistic regres-
369
sion models, which are constructed using a function
370
that links the probabilities of the binary responses (i.e.,
371
the probabilities of landslide presence or absence) with
372
a linear function of the predictors. These models are
373
framed in terms of the generalized linear model (GLM)
374
family, and are able to assess the strength of associa-
375
tion between a particular predictor and a landslide oc-
376
currence, whilst adjusting for all the other predictors.
377
Details regarding model formulation and estimation can
378
be found in the Supplementary Material.
379
To make sense of the results obtained from the fit-
380
ted landslide susceptibility model and to avoid statisti-
381
cal complexities, we need to find an interpretable sub-
382
set of predictors that are associated with a landslide
383
event. In other words, we need a method to reduce
384
the large number of highly correlated predictors with-
385
out losing parameter interpretability. Although stepwise
386
variable selection has been a very popular technique to
387
face this issue (seeCama et al.(2017)), several authors
388
(see for instanceCopas and Long(1991);Derksen and
389
Keselman(1992);Harrell(2015)) have reported impor-
390
tant statistical deficiencies related to the correct distri-
391
bution of the test statistics, size of the p-values, bias
392
of estimates and standard errors, among other prob-
393
lems. To avoid these issues, we choose to work with
394
the LASSO (Least Shrinkage and Selection Operator)
395
proposed by Tibshirani (1996), designed to deal with
396
multicollinearity in the estimation of large generalized
397
linear models. The LASSO improves prediction error
398
by appropriately shrinking (or penalizing) large regres-
399
sion coefficients towards zero in order to reduce overfit-
400
ting, while performing simultaneous variable selection
401
and parameter estimation with any generalized linear re-
402
gression model. Details regarding the LASSO penaliza-
403
tion applied to logistic regression can be found in the
404
Supplementary Material.
405
Regarding the practical implementation of these
406
methods, we have structured a code (called LUDARA)
407
to fit and validate our model parallelizing over a certain
408
number of replicates N. Specifically, our code (which
409
can be found in the Supplementary Material) proceeds
410
as follows: i) it randomly extracts 75% of landslide
411
presences in the whole region, as well as the same num-
412
ber of landslide absences according to a balanced sam-
413
pling strategy. The logistic regression model is then fit-
414
ted using a LASSO penalty optimally chosen by cross-
415
validation.ii)the code then stores the predictor IDs and
416
corresponding coefficents. iii)the complementary 25%
417
0.2 0.4 0.6 0.8
0.00.20.40.60.81.0
0.2 0.4 0.6 0.8
Cutoff
0.00.20.40.60.81.0Accuracy
(a)
0.0 0.2 0.4 0.6 0.8 1.0
0.00.20.40.60.81.0
0.0 0.2 0.4 0.6 0.8 1.0 False positive rate
0.00.20.40.60.81.0True positive rate
(b)
Figure 5: Accuracy test (a) and Receiver Operating Characteristic curves (b) for each of the 500 replicates. Red lines represent the corresponding means.
of presences is then used for validation together with
418
an equal number of (randomly chosen) absences; using
419
the validation subset, iv) the Accuracy or (True Posi-
420
tives+True Negatives)/Total Population is computed
421
varying the probability cutoffvalue that separates land-
422
slide from no-landslide cases; v)the Receiver Operat-
423
ing Characterisitic (ROC) curve and the AUC (area un-
424
der the ROC curve) are computed;vi)the whole proce-
425
dure is repeatedN =500 times and therefore produces
426
N different models with their corresponding parameter
427
estimates, ROC curves and AUCs. The predictor im-
428
portance may be assessed by counting the actual selec-
429
tions of the 273 predictors across theNmodels. Mean
430
fitted probabilities, mean ROC curve and mean AUC
431
are obtained by averaging over all theNreplicates;vii)
432
the accuracy is plotted against the varying cutoffs and
433
the maximum accuracy value of the average resulting
434
curves is then adopted to separate stable from unstable
435
landslide conditions. Stepvii)is implemented to quan-
436
titatively determine the best probability cutoff as sev-
437
eral authors (Felic´ısimo et al.,2013;Frattini et al.,2010;
438
Heckmann et al.,2014) have pointed out that the cutoff
439
choice is not unique and may change with the sampling
440
scheme, method or dataset. Details regarding the model
441
performances are given in Sections3and4.
442
3. Results
443
Figure 5(a) shows the results of the accuracy test
444
across 500 replicates where the average maximum ac-
445
curacy of 0.71 has been reached at approximately the
446
standard cutoffof 0.5; this value was fixed for the rest
447
of the analyses for confusion matrices (e.g.Lombardo
448
et al.,2016b). Figure5(b) shows the validation ROC
449
curves across 500 replicates. The corresponding aver-
450
age AUC of 0.78 is associated to a standard deviation
451
of 0.03. This attests for an almost excellent predic-
452
tion skill (Hosmer and Lemeshow,2000) together with
453
a strong stability of the model throughout the cross-
454
validation procedure. Once the quality of the models is
455
confirmed in terms of overall prediction skill and asso-
456
ciated variance, the predictor behavior has been investi-
457
gated to infer the geomorphological reasonability of the
458
results. In particular, due to the combined random data
459
set extraction, high dimensionality of the predictor hy-
460
perspace and the LASSO-based variable selection, the
461
list of significant covariates changes from each repli-
462
cate. In order to get a unique interpretable set of pre-
463
dictors, we have selected the covariates that are present
464
within each of the 500 replicates more than 80% of the
465
time. Figure6plots the percentage of selection for each
466
predictor.
467
The eight covariates above the red line are sorted
468
in order of importance and are listed as follows: i)
469
maxD/sqrtA, ii) Sandstone/SU, iii) Elevation’s 95th per-
470
centile, iv) TWI’s 95th percentile, v) SD of Elevation,
471
vi) Plan Curvature’s 5th percentile, vii) Relative Slope
472
Position’s 5th percentile, viii) Eastness’ 95th percentile.
473
The effects of these covariates on the final susceptibility
474
are assessed analysing the regressionβcoefficient val-
475
●
●
●
●
●
●
●
●
●
●
●
●
●
●●
●
●
●
●
●●
●
●●●
●
●●●●●●
●
●
●●
●●●
●●●
●
●
●
●●●●
●●
●
●
●●●●●●
●
●
●
●
●
●
●
●
●
●
●●
●●●●●
●
●
●●
●●
●
●
●●●●●●●●●●
●
●
●●●●●
●
●
●
●
●●●●●●●●●
●
●
●●●●●●●
●
●
●
●●●●●●●
●
●
●●●●
●
●●●●
●●●●●●●●●
●
●●●●●●●●●●
●●●●●●●
●
●
●
●
●
●
●
●●●●
●
●
●
●
●
●
●●●●●
●
●●
●
●
●●●●●●●●●●
●●●●●●
●
●
●●●
●●●
●●●●●●●●●●
●
0 50 100 150 200
Predictor ID
020406080100Percentage
Figure 6: Variable importance measured by percentage of selection over the 500 replicates. The red horizontal dashed line represents the 80% variable-inclusion threshold.
ues (Cama et al., 2015,2016) together with response
476
plots (Pourghasemi and Rossi, 2016; Schillaci et al.,
477
2017a,b;Tziritis and Lombardo,2016). This is shown
478
in Figure 7 where the distribution of the 500 coeffi-
479
cients is summarized. The mean landslide suscepti-
480
bility (computed as the average susceptibility from the
481
500 replicates) is plotted against each covariate in Fig-
482
ure8. This approach complements the analyses of the
483
coefficients providing a more dynamic perspective of
484
variable-interaction effects on the susceptibility. More-
485
over, the mean and standard deviation of the estimated
486
susceptibility for each SU are mapped in Figure9(a)
487
and Figure 9(b). These two elements allow for visu-
488
alizing the prediction and for completing the evaluation
489
on the model performances. This evaluation consists
490
of computing the model error (Lombardo et al.,2016a;
491
Rossi et al.,2010) where the mean susceptibility value is
492
plotted against twice the standard deviation. Figure10
493
presents the model error in a density plot with an upper
494
bound of the 95% confidence interval coinciding with
495
0.2.
496
4. Discussion
497
We tested our code, LUDARA, within the Sado Is-
498
land where a well-tested inventory of deep-seated land-
499
slides was made available. The model performances
500
were evaluated at different stages. The first stage im-
501
plemented an accuracy test to select the best probability
502
cutofffor separating stable from unstable conditions in
503
the probability spectrum. This was done by maximizing
504
the accuracy as a function of a moving cutoffwindow.
505
The average maximum accuracy across replicates was
506
0.71 at a probability cutoffvalue of 0.5 which is also the
507
commonly adopted threshold. The same threshold was
508
subsequently applied to the rest of the metric calcula-
509
tions (TP, FP, TN, FN). In addition, the ROC curves and
510
corresponding average AUC value of 0.78 confirmed an
511
almost excellent prediction together with low associated
512
variance.
513
The geomorphological reasonability of the model
514
was inferred on those covariates which were selected
515
more than 400 times out of the total 500 replicates. Fig-
516
ure8summarizes their coefficients during the modeling
517
phase while Figure7shows the predictor response over
518
the fitted probabilities for each of the SU in the island.
519
MaxD/sqrt(A) appeared to be the most significant co-
520
variate. This shape index carries the information on the
521
elongation of a given SU which decreases as the ratio
522
increases. The sign of the correspondingβcoefficient is
523
negatively associated with landslide presences (median
524
β=−0.33) indicating that the less elongated the SU the
525
more unstable it would be. This can be interpreted both
526
as a classification effect, as most of the landslides in the
527
area are large, or as a lower shear resistance along the
528
sliding surface against greater unstable volumes. The
529
same relation can be seen on the response plots where
530
the probability rapidly decays as the maxD/sqrt(A) in-
531
creases.
532
The Sandstone to SU areal ratio similarly shows a
533
negative effect on landslide occurrences (medianβ =
534
−0.01) which is confirmed over the response with two
535
evident clusters, the main one being strictly unstable
536
where Sandstone/SU is absent and a second cluster with
537
very low probabilities at high Sandstone proportions. It
538
is worth noting that the Sandstone was shown in Fig-
539
ure3(g) to be the third most representative outcropping
540
lithotype for the landslides. Despite this, the regres-
541
sion βcoefficient was estimated to be negative. This
542
can depend on the limited thickness of these sandstone
543
deposits when compared to the sliding surface of a deep-
544
seated landslide together with the prevalent presence of
545
this covariate along the central graben where the topog-
546
raphy is flat. The Elevation’s 95th percentile has a pos-
547
itive effect on instabilities (medianβ=0.11) which re-
548
sults in a rapid increase in the SU susceptibility. This
549
can be explained by the fact that this covariate repre-
550
sents the highest elevation values within a SU which
551
can also be translated into a greater exposition to rainfall
552
discharges, which in turn destabilize the slope unit.
553
The TWI’s 95th percentile contributes to SU instabil-
554
ity (medianβ=0.01). The corresponding response plot
555
distribution shows a peak at 8 where the maximum in-
556
stability is reached. This distribution can be interpreted
557
as the destabilizing effect of water infiltration where the
558
landslide susceptibility is high while the decay to sta-
559