Handling high predictor dimensionality in slope-unit-based landslide susceptibility models through LASSO-penalized Generalized Linear Model


Item Type Article

Authors Camilo, Daniela Castro;Lombardo, Luigi;Mai, Paul Martin;Dou, Jie;Huser, Raphaël

Citation Camilo DC, Lombardo L, Mai PM, Dou J, Huser R (2017) Handling high predictor dimensionality in slope-unit-based landslide susceptibility models through LASSO-penalized Generalized Linear Model. Environmental Modelling & Software 97: 145–156.

Available: http://dx.doi.org/10.1016/j.envsoft.2017.08.003.

Eprint version Post-print

DOI 10.1016/j.envsoft.2017.08.003

Publisher Elsevier BV

Journal Environmental Modelling & Software

Rights NOTICE: this is the author’s version of a work that was accepted for publication in Environmental Modelling & Software. Changes resulting from the publishing process, such as peer review, editing, corrections, structural formatting, and other quality control mechanisms may not be reflected in this document.

Changes may have been made to this work since it was submitted for publication. A definitive version was subsequently published in Environmental Modelling & Software, [97, , (2017-08-30)] DOI: 10.1016/j.envsoft.2017.08.003. © 2017. This manuscript version is made available under the CC-BY-NC-ND 4.0 license http://creativecommons.org/licenses/by-nc-nd/4.0/

Download date 2023-12-01 17:26:52


Link to Item http://hdl.handle.net/10754/626008


Castro Camilo et al. 2017 - Environmental Modelling and Software - doi={10.1016/j.envsoft.2017.08.003}

Handling high predictor dimensionality in slope-unit-based landslide susceptibility models through LASSO-penalized Generalized Linear Model

Daniela Castro Camilo^a, Luigi Lombardo^a,b,c,∗, P. Martin Mai^b, Jie Dou^d, Raphaël Huser^a

^a CEMSE Division, King Abdullah University of Science and Technology, Thuwal, Jeddah, Saudi Arabia.

^b PSE Division, King Abdullah University of Science and Technology, Thuwal, Jeddah, Saudi Arabia.

^c Department of Geosciences, Tübingen University, Tübingen, Germany.

^d Public Works Research Institute, Japan.

Abstract

Grid-based landslide susceptibility models at regional scales are computationally demanding when using a fine grid resolution. Conversely, Slope-Unit (SU) based susceptibility models allow the same areas to be investigated while offering two main advantages: 1) a smaller computational burden and 2) a more geomorphologically-oriented interpretation. In this contribution, we generate SU-based landslide susceptibility for Sado Island in Japan. This island is characterized by deep-seated landslides which we assume can only partially be explained by the first two statistical moments (mean and variance) of a set of predictors within each slope unit. As a consequence, in a nested experiment, we first analyse the distributions of a set of continuous predictors within each slope unit by computing the standard deviation and the quantiles from 0.05 to 0.95 with a step of 0.05. These are then used as predictors for landslide susceptibility. In addition, we combine shape indices for polygon features and the normalized extent of each class belonging to the outcropping lithology in a given SU. This procedure significantly enlarges the size of the predictor hyperspace, thus producing a high level of slope-unit characterization. In a second step, we adopt a LASSO-penalized Generalized Linear Model to shrink the predictor set back to a sensible and interpretable number, carrying only the most significant covariates in the models. As a result, we are able to document the geomorphic features (e.g., 95% quantile of Elevation and 5% quantile of Plan Curvature) that primarily control the SU-based susceptibility within the test area while producing high predictive performances. The implementation of the statistical analyses is included in a parallelized R script (LUDARA), which is made available here for the community to replicate analogous experiments.

Keywords: High predictor dimensionality, Slope Unit-based landslide susceptibility, Lasso penalized variable selection, R scripting, Open sourcing

1. Introduction

Landslide susceptibility (LS, hereafter) represents the relative spatial probability of landslide occurrence (Chung and Fabbri, 1999; Guzzetti et al., 2006b) that is typically computed through different statistical approaches and over a geographic space which can be discretized into different mapping units (Carrara et al., 1995; Guzzetti et al., 1999).

A mapping unit refers to a portion of land surface with analogous geologic and/or geomorphic properties that differ from its surroundings across definable boundaries (Hansen, 1984). The advancements in LS assessment over the last three decades have primarily reflected developments on the two aforementioned topics where: 1) various statistical approaches have been tested (Cama et al., 2015; Conoscenti et al., 2016), compared (Felicísimo et al., 2013; Süzen and Doyuran, 2004) and reviewed (Aleotti and Chowdhury, 1999; Brenning, 2005; Van Westen et al., 2006) while 2) different mapping units have been adopted (Galli et al., 2008; Van Westen, 2000), their respective effects evaluated (Arnone et al., 2016; Carrara et al., 2008; Van Den Eeckhaut et al., 2009) and broadly summarized into four categories: grid cells (Lombardo et al., 2015), terrain units (Calvello et al., 2013), unique condition units (Chung et al., 1995) and slope units (SU) (Guzzetti and Reichenbach, 1994). Each of these mapping units offers advantages and drawbacks which were already described by Carrara et al. (1995). Landslide automatic recognition (McKean and Roering, 2004; Nichol and Wong, 2005; Tarolli et al., 2012), remote sensing (Abdulwahid and Pradhan, 2016; Lombardo et al., 2016a) and deterministic (An et al., 2016; Iovine et al., 2007) based predictions are also regarded as important topics within the geomorphological community for susceptibility purposes.

∗Corresponding author. Email address: [email protected] (Luigi Lombardo)

Despite the long-term efforts and the rich literature, the choice of the best mapping unit when assessing the LS is still unclear (Guzzetti et al., 2000). Few examples exist where different mapping units were tested for the same area. Van Den Eeckhaut et al. (2009) compared grid-based and SU-based structures, obtaining analogous spatial patterns and performances. In particular, the grid structure had a slightly higher prediction skill but a lower robustness. Conversely, Erener and Düzgün (2012) reported an increase in predictive performance when using the SU. The reason behind these results may be suggested by Rotigliano et al. (2012), who highlighted that single cell values are less representative when considering phenomena involving portions of slopes or whole slopes, thus making the SU the correct spatial domain to operate upon. Assuming that the SU is a reliable means of partitioning a given area for landslide spatial prediction purposes, we conducted a literature review on the topic. We found that the whole community operates by summarizing the n pixels contained in a given SU by using just their mean and variance or standard deviation (e.g., Guzzetti et al., 2006a; Rossi et al., 2010; Zhou et al., 2015) and in fewer cases adding the range (Reichenbach et al., 2014). The distribution of the adopted predictors within each SU has never been investigated in more detail, despite the fact that thousands of pixels may be contained in a SU, making their representation through just two values potentially inadequate.

Following this observation, we first hypothesize that the two aforementioned statistical moments within a given slope unit can only offer part of the information for modelling large landslides. To be able to capture potentially non-linear effects, and to better assess the influence of the whole distribution of each predictor, we represent each continuous predictor by its 5th to 95th percentiles (computed for each SU) using a 5% quantile step. As a result of this procedure, the size of the predictor hyperspace increased by one order of magnitude and high multicollinearity among predictors was induced. In order to reduce the predictor set and extract only the most relevant covariates, we then applied a LASSO (Least Absolute Shrinkage and Selection Operator) penalized Generalized Linear Model. This also differs from the other contributions in the literature, as the community unanimously adopts a stepwise framework for variable selection (Baeza and Corominas, 2001; Cama et al., 2016; Capitani et al., 2013; Goetz et al., 2015; Süzen and Kaya, 2012). The advantage of LASSO is that model selection and parameter estimation are performed simultaneously, while the LASSO penalty can be adjusted for optimal prediction.

As a result, the main novelty in this paper consists of testing the general assumption that mean and variance are good indicators of the whole distribution of any continuous covariate in the model within a given slope unit. If this assumption holds, we expect median or near-median quantiles to emerge from the variable selection process. However, if quantiles significantly different from the median are selected, then the final susceptibility might differ from the one generated with the traditional mean values. In this framework, the introduction of a previously untested variable selection tool also represents a further improvement in the literature, as the stepwise selection procedure is already known for many weaknesses in the statistical community despite its wide use in the geomorphological one.

2. Materials and methods


2.1. Study area

The study area is located in a mountainous region of Sado Island, Niigata Prefecture, in the Japan Sea (Figure 1(a)). Sado Island is composed of two parallel ridges, elongated approximately in a northeast to southwest direction. These are locally named Osado and Kosado and cover approximately 856 km². The climate is humid subtropical, with warm summers and cold winters. The average temperature is 13.7 °C and the mean annual rainfall is around 1550 mm (Yamagishi, 2008). The elevation varies from sea level to 1172 m with a mean of 333 m. The highest point of the island coincides with the peak of Mt. Kimpoku in Osado. The local geology is primarily represented by Neogene terrestrial and marine volcanic rocks of rhyolitic, dacitic and andesitic nature associated with pyroclastites and intrusive rhyolites. Most of the coastal slopes are rocky, and some shores are recently formed by semi-consolidated and unconsolidated sand and gravel. Being characterized by mountains, hills, upheaved benches and active faults, the study area is prone to landsliding and considerable damage has been reported (Yamagishi, 2008). In the Osado area, the primary cause of landslides is rainfall and, to a lesser extent, snowmelt.

The exposure to high intensity precipitation is mainly concentrated between June-July and November-January, as shown in Figure 2, where TRMM rainfall estimates (https://pmm.nasa.gov/trmm) are summarized over 14 years to highlight seasonal trends. The aforementioned wet intervals coincide with times when displacements were measured by Ayalew et al. (2005a) for specific landslide cases, together with a strong groundwater level rise. In particular, Ayalew et al. (2005a) highlight the role of heavy rainfall between October 2000 and January 2001 in the mobilization of the Shiidomari landslide, located to the east of the island.

Figure 1: Study area: Geographic location (a), Sado Island (b), detail of the landslide inventory (red polygons) and head scarps (yellow polylines) (c).

2.2. Landslide inventory

In this contribution we test our method using a well-studied landslide archive. The archive for Sado Island (Figure 1(b)) meets this requirement, as it has been validated over several articles (Ayalew et al., 2005a,b; Dou et al., 2015b,a). The landslide inventory was built by the National Research Institute for Earth Science and Disaster Prevention (NIED) in Japan, and has been mapped through multiple aerial photographs since the year 2000. The NIED digitized the landslide scars and made them available as polygon shapefiles (Figure 1(c)). A total of 1538 landslides are mapped within the island, affecting both ridges with analogous distributions along the rocky coastline, whilst a greater number of occurrences characterizes the northern area. They account for a total surface of 91.5 km², which represents 11.1% of the whole surface area. From a geomorphometric perspective (Figure 3(a)-(f)), the landslide distribution is positively skewed and heavy-tailed, with the majority of the mass movements having similar characteristics and fewer, much more extreme cases due to their size. In terms of individual sizes (Figure 3(a)), 84% of the landslides extend over more than 10000 m². This attests to the significant area exposed to failures and indicates the almost unique landslide class considered in the present study.

The landslide types are reported as deep-seated, either rotational or translational slides, and a limited number of rock falls, which were excluded from the present research. The geological control over landslides appears evident when intersecting the landslide scars with the underlying geology (Figure 3(g)). Overall, 65% of the landslides occur on volcanic andesite, followed by 23% on dacite and 7.6% on sandstone. These lithotypes are severely fractured, as the whole island has been subjected to active faulting since the early and middle Miocene, when the extensional stress due to the Japan Sea opening (Jolivet et al., 1992) reached the island, producing the graben structure between the Osado and Kosado ridges. These lineaments cut through Oligocene to Middle Miocene volcanic and volcaniclastic rocks (Ganzawa, 1983). Towards the end of the Miocene, the stress regime even switched to compressive in conjunction with the interruption of the sea opening (Jolivet et al., 1991), giving rise to thrusts that further weakened the rock mass. The whole inventory provided by NIED lacks a systematic classification of landslide typologies, as their sizes hinder investigations of each case. Such a classification is actually available for few landslides, while the majority is defined as deep-seated without discriminating between rotational or translational mechanisms. Furthermore, the inventory lacks the temporal information necessary for evolution assessments in relation to the precipitation regime.

Figure 2: Average rainfall cumulative curves computed between years 2000-2014 (a). Average monthly rainfall computed between years 2000-2014 (b). Average weekly rainfall computed between years 2000-2014 (c). Light blue colors refer to the TRMM pixel falling to the North of Sado Island whereas red represents TRMM data from the south. Lines represent mean data across 14 years while areas are 95% confidence intervals. Overlapping confidence intervals are shown in a purple color.

2.3. Model building strategy


2.3.1. Slope Units and status assignment

Among the mapping units, a SU represents the region of space delimited between ridges and valleys under the constraint of homogeneous slope aspect and steepness distributions (Carrara et al., 1991; Tian et al., 2010). Thus, it physically describes either the left or right side of a sub-basin of any order into which a watershed is subdivided (Erener and Düzgün, 2012). As a consequence, it naturally reflects the morphodynamic response of a given slope in case of landslide occurrence (Huabin et al., 2005), making it possible to avoid the shortcomings of low geomorphological representativeness of grid-based susceptibility mapping. This property makes the SU particularly suitable for modelling slope instabilities due to large landslides (Beguería and Lorente, 2007). Therefore, we adopt the SU as a means to represent the deep-seated instabilities dominating the failure mechanisms within Sado Island. It is worth mentioning that the SU has been commonly used for rainfall-triggered landslides, whereas few contributions can be found on their seismically induced counterpart (e.g., Zhou et al., 2015). The reason for this may be the actual size of the latter type, which can span more than one slope unit, thus adding complexity to the mapping unit choice.

The SUs (Figure 4) have been computed by using the r.slopeunits software developed at the Geomorphology Research Group (http://geomorphology.irpi.cnr.it/tools/slope-units) and recently explained in detail by Alvioli et al. (2016). This Python code is accessible through GRASS GIS (Neteler and Mitasova, 2013) and allows for automatic delineation of SUs from a Digital Elevation Model (DEM). The code initially uses the r.watershed script (Metz et al., 2011) to obtain a first-order delineation of half-basins. Subsequently, it iteratively tries to subdivide the parent half-basins into smaller half-basin children and accepts or refuses the new partition under a rule-based criterion. This criterion comprises two distinct aspects. First, the homogeneity of terrain aspect within a SU is controlled through a parameter defined as circular variance. Second, the acceptable size of the SU is determined as a function of flow accumulation (FA, Jenson and Domingue, 1988) threshold and minimum SU area. Finally, the units below a given extent (or cleansize) are dissolved by automatically merging them into the adjacent units.
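The circular variance that controls aspect homogeneity can be illustrated with the standard directional-statistics definition, one minus the mean resultant length of the aspect unit vectors. The following is a minimal sketch under that assumption; the function name and the sample aspects are hypothetical, and the exact computation inside r.slopeunits may differ.

```python
import math

def circular_variance(aspects_deg):
    """Circular variance 1 - R of aspect angles given in degrees.

    Returns a value in [0, 1]: 0 when all cells face the same direction,
    close to 1 when directions are spread evenly around the compass.
    """
    if not aspects_deg:
        raise ValueError("at least one aspect value is required")
    n = len(aspects_deg)
    # Mean of the unit vectors pointing along each aspect direction.
    mean_cos = sum(math.cos(math.radians(a)) for a in aspects_deg) / n
    mean_sin = sum(math.sin(math.radians(a)) for a in aspects_deg) / n
    resultant_length = math.hypot(mean_cos, mean_sin)
    return 1.0 - resultant_length

# Hypothetical aspect samples (degrees clockwise from north).
homogeneous = [88.0, 90.0, 92.0, 91.0, 89.0]
scattered = [0.0, 90.0, 180.0, 270.0]
print(circular_variance(homogeneous))  # small value, near 0
print(circular_variance(scattered))    # near 1
```

Assuming that a candidate unit is accepted when its circular variance falls below the chosen threshold, the first sample would pass a threshold of 0.2 and the second would not.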

The final parameterization of r.slopeunits has been chosen among 9 possible combinations of circular variance (0.2, 0.4, 0.6) and flow accumulation threshold in m² (100·10³, 250·10³, 500·10³), keeping the minimum area and the cleansize parameters constant (50·10³ and 25·10³ m², respectively). Figure SM1 shows the aforementioned combinations, including the selected one, which is computed with a FA threshold of 100·10³ m² and a circular variance of 0.2. The final parameterization has been decided on a qualitative level by comparing the partition to the underlying slope aspect, and resulted in 8239 SUs. One of the advantages of adopting slope units is that the high dimensionality of grid-based prediction is reduced from millions of pixels to thousands of SUs, thus reducing the associated computational burden.

Figure 3: Landslide characteristics: area (a), perimeter (b), maximum distance within a polygon (c), perimeter to area ratio (d), maximum distance to area ratio (e), shape index (f) computed following Forman and Gordon (1986) and landslide to lithotype surface ratio (g), whose extent is shown in the small map and overlaid with the landslides in red.

We extract the Landslide Identification Points (LIPs, hereafter) by calculating the centroid of each of the 1538 mass movements in the island. The LIP calculation must follow a geomorphological criterion, which for the case of deep-seated landslides is commonly taken to be the centroid (Davis and Sims, 2013; Oliveira et al., 2015; Xu et al., 2012). However, other landslide classes require different representation approaches (Hussin et al., 2016; Lombardo et al., 2014; Süzen and Doyuran, 2004). We then intersect the LIPs with the SUs and assign a positive status to each SU that contains at least one LIP. Therefore, each landslide belongs to one and only one SU. Conversely, every SU that does not contain a LIP within its boundary has been assigned a negative status. This operation ensures the creation of a SU-based dichotomous dependent variable expressing landslide presences/absences across the island.

Figure 4: Slope-unit subdivision of Sado Island, underlain by the slope aspect. The right sub-panel shows the detail for a small region. The legend is valid for the whole island and the zoomed area.
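The status assignment can be sketched as a centroid computation followed by a point-in-polygon test. The example below is only an illustration on toy geometry: the function names, the slope-unit IDs and the coordinates are hypothetical, and an actual workflow would normally rely on GIS software rather than on hand-written geometry routines.

```python
def polygon_centroid(vertices):
    """Area-weighted centroid of a simple polygon given as (x, y) vertices."""
    area2 = 0.0  # twice the signed area
    cx = cy = 0.0
    n = len(vertices)
    for i in range(n):
        x0, y0 = vertices[i]
        x1, y1 = vertices[(i + 1) % n]
        cross = x0 * y1 - x1 * y0
        area2 += cross
        cx += (x0 + x1) * cross
        cy += (y0 + y1) * cross
    if area2 == 0.0:
        raise ValueError("degenerate polygon with zero area")
    return cx / (3.0 * area2), cy / (3.0 * area2)

def point_in_polygon(point, vertices):
    """Ray-casting test: True if the point lies inside the polygon."""
    x, y = point
    inside = False
    n = len(vertices)
    for i in range(n):
        x0, y0 = vertices[i]
        x1, y1 = vertices[(i + 1) % n]
        # Count edges crossed by a horizontal ray going right from the point.
        if (y0 > y) != (y1 > y):
            x_cross = x0 + (y - y0) * (x1 - x0) / (y1 - y0)
            if x < x_cross:
                inside = not inside
    return inside

def assign_status(slope_units, landslides):
    """Map each slope-unit ID to 1 (contains a LIP) or 0 (no LIP)."""
    lips = [polygon_centroid(poly) for poly in landslides]
    return {
        su_id: int(any(point_in_polygon(lip, su_poly) for lip in lips))
        for su_id, su_poly in slope_units.items()
    }

# Hypothetical toy geometry: two square slope units and one landslide polygon.
slope_units = {
    "SU_1": [(0, 0), (10, 0), (10, 10), (0, 10)],
    "SU_2": [(10, 0), (20, 0), (20, 10), (10, 10)],
}
landslides = [[(2, 2), (6, 2), (6, 6), (2, 6)]]
status = assign_status(slope_units, landslides)
print(status)
```

Because each landslide is reduced to a single point, it can contribute to the positive status of one slope unit only, which matches the one-to-one assignment described above.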

2.3.2. Independent variables

We represent the predictor distribution within a given SU through quantiles and use them as predictors. In particular, we compute for each SU the p-quantiles with probabilities p = 0.05, 0.10, ..., 0.95. This should provide a near-complete description of the distribution of each predictor within SUs, but induces high multicollinearity, which we deal with using a LASSO penalty; see Section 2.4 and the Supplementary Material for details. In terms of predictor choice, we adopt part of the covariates which were used by Dou et al. (2015b,a) to explain the landslide occurrences in the same area. This is done to enable comparisons across different studies, as the landslide data are the same. In particular, we adopt the following covariates: i) Elevation (10 m DEM), ii and iii) Eastness and Northness (Leempoel et al., 2015), iv and v) Plan and Profile Curvature (Heerdegen and Beran, 1982), vi) Distance to Faults (Pourghasemi et al., 2012), vii) Distance to Geological Boundaries (Dou et al., 2015a), viii) Relative Slope Position (Böhner and Selige, 2006), ix) Slope (Zevenbergen and Thorne, 1987), x) Topographic Wetness Index (Beven and Kirkby (1979), TWI, hereafter). Most of the covariates are derived from the DEM obtained from the Geospatial Information Authority of Japan, GSI (https://www.hcc.co.jp/work/gismap/youtobetsu/terrain.html). The actual data provider is Hokkaido-chizu Co. Ltd., a private company that reports a 0.1 m vertical resolution for the whole of Japan. Furthermore, we have included the outcropping lithology (scale 1:200,000) produced by the GeoNavi of the Geological Survey of Japan (GSJ). For the latter, we compute the proportion between each lithotype outcropping extent within a given SU and the full extent of the SU itself. This procedure differs from the common practice of retaining only the predominant lithotype within a SU (e.g., Tian et al., 2010) and provides a transformation from the original categorical variable to its continuous counterpart, expressed through the lithotype to SU surface ratio. Finally, we have also computed two morphometric indices: the maximum possible distance in a given SU divided by the square root of the area of that SU (maxD/√A, hereafter) and the perimeter divided by the square root of the area of that SU (P/√A, hereafter), which are measures of roundness of each SU. These indices are computed using the Polygon Shape Indices module available in SAGA GIS (Conrad et al., 2015). Overall, from an initial set of 13 predictors we generated 273 of them. This number is equal to 13 covariates times 20 (19 quantiles + standard deviation) plus 2 morphometric indices and 11 out of 12 lithology to SU ratios. We excluded the ratio of the most representative lithotype in the island, as it can be expressed as a linear combination of the other lithotype ratios, and this avoids producing a rank-deficient design matrix. In the next section, we discuss how to assess the importance of each predictor and how to perform variable selection, in order to get back to a reasonable and interpretable number of significant and relevant predictors.

The physical meaning and interpretation of these covariates may be summarised as follows: i) slope orientation and subsequent potential evapotranspiration effects through Eastness and Northness. These are the continuous counterparts of the Aspect which, being a cyclic quantity, has historically been used as a categorical covariate. Recent contributions (e.g., Steger et al., 2016) use them by respectively calculating sin(Aspect) and cos(Aspect) to produce continuous covariates whose domains are bounded between -1 and 1. For Eastness, values equal to 1 indicate slopes facing East whereas values equal to -1 indicate slopes facing West. By analogy, for Northness, values equal to 1 indicate slopes facing North and values equal to -1 indicate slopes facing South. The elevation is typically used as ii) a proxy for rainfall intensities, while Plan and Profile Curvature are interpretable for iii) their influence in directing overland water flows (Ohlmacher, 2007). Distance to Faults indicates iv) pre-existing weakness planes, whereas Distance to Geological Boundaries may complement this structural information by indicating v) contacts between lithotypes with different geotechnical properties. Relative Slope Position represents the continuous counterpart to the more general Landform Classification, by expressing vi) pixel positions along the landscape with values equal to 0 for lowland areas and values equal to 1 for ridges. Finally, the Topographic Wetness Index expresses vii) the tendency of a given pixel to retain water and subsequently increase pore water pressures.
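As an illustration of the per-SU summaries described in this section, the sketch below computes the 19 quantiles and the standard deviation of one covariate for one slope unit, plus the sin/cos transformation of the aspect. The function names, dictionary keys and pixel values are hypothetical, and the quantile interpolation rule (`method="inclusive"`) is an assumption, since the text does not state which quantile definition is used.

```python
import math
import statistics

def summarize_slope_unit(pixel_values):
    """Return the 19 quantiles (0.05 to 0.95, step 0.05) and the standard deviation."""
    # n=20 yields 19 cut points at probabilities 0.05, 0.10, ..., 0.95.
    cuts = statistics.quantiles(pixel_values, n=20, method="inclusive")
    summary = {f"q{5 * (i + 1):02d}": q for i, q in enumerate(cuts)}
    summary["sd"] = statistics.stdev(pixel_values)
    return summary

def eastness_northness(aspect_deg):
    """Continuous aspect components: sin(Aspect) and cos(Aspect)."""
    a = math.radians(aspect_deg)
    return math.sin(a), math.cos(a)

# Hypothetical elevation values (m) for the pixels of one slope unit.
elevation = [float(v) for v in range(100, 201)]  # 101 pixels: 100, 101, ..., 200
features = summarize_slope_unit(elevation)
print(len(features))  # 19 quantiles + 1 standard deviation
print(features["q05"], features["q50"], features["q95"])
print(eastness_northness(90.0))  # an east-facing pixel
```

Repeating this summary for every continuous covariate and every slope unit yields 20 columns per covariate in the design matrix.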

2.4. LASSO-penalized generalized linear model

In terms of modelling, our goal is to link the dichotomous dependent variable expressing landslide presences/absences with the set of predictors (or independent variables) described in Section 2.3.2. A natural framework for this is provided by logistic regression models, which are constructed using a function that links the probabilities of the binary responses (i.e., the probabilities of landslide presence or absence) with a linear function of the predictors. These models belong to the generalized linear model (GLM) family, and are able to assess the strength of association between a particular predictor and a landslide occurrence, whilst adjusting for all the other predictors. Details regarding model formulation and estimation can be found in the Supplementary Material.
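For reference, the logistic regression described above can be written in its standard form. The notation below is assumed here for illustration and may differ from the one used in the Supplementary Material: \pi_i is the landslide probability in slope unit i, Y_i its presence/absence status, x_{ij} the value of the j-th of p predictors and \beta_j the corresponding coefficient.

```latex
\[
  \operatorname{logit}(\pi_i) = \log\frac{\pi_i}{1-\pi_i}
  = \beta_0 + \sum_{j=1}^{p} \beta_j x_{ij},
  \qquad
  \pi_i = \Pr(Y_i = 1 \mid x_{i1},\dots,x_{ip}).
\]
```

Under this form, each \beta_j is the change in the log-odds of landslide presence per unit increase of predictor j, with all other predictors held fixed.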

To make sense of the results obtained from the fitted landslide susceptibility model and to avoid statistical complexities, we need to find an interpretable subset of predictors that are associated with a landslide event. In other words, we need a method to reduce the large number of highly correlated predictors without losing parameter interpretability. Although stepwise variable selection has been a very popular technique to address this issue (see Cama et al. (2017)), several authors (see for instance Copas and Long (1991); Derksen and Keselman (1992); Harrell (2015)) have reported important statistical deficiencies related to the correct distribution of the test statistics, size of the p-values, bias of estimates and standard errors, among other problems. To avoid these issues, we choose to work with the LASSO (Least Absolute Shrinkage and Selection Operator) proposed by Tibshirani (1996), designed to deal with multicollinearity in the estimation of large generalized linear models. The LASSO improves prediction error by appropriately shrinking (or penalizing) large regression coefficients towards zero in order to reduce overfitting, while performing simultaneous variable selection and parameter estimation with any generalized linear regression model. Details regarding the LASSO penalization applied to logistic regression can be found in the Supplementary Material.
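In its usual textbook form, the LASSO estimate for logistic regression minimizes the negative log-likelihood plus an L1 penalty on the coefficients. The expression below is a sketch in assumed notation (y_i for the observed status of slope unit i out of n, x_{ij} for its j-th of p predictors, \lambda \ge 0 for the penalty weight); the Supplementary Material may use a different parameterization, and some implementations scale the log-likelihood by 1/n.

```latex
\[
  (\hat\beta_0, \hat{\boldsymbol\beta})
  = \operatorname*{arg\,min}_{\beta_0,\,\boldsymbol\beta}
    \Bigl\{ -\ell(\beta_0, \boldsymbol\beta)
            + \lambda \sum_{j=1}^{p} |\beta_j| \Bigr\},
  \qquad
  \ell(\beta_0, \boldsymbol\beta)
  = \sum_{i=1}^{n} \bigl[ y_i \eta_i - \log\bigl(1 + e^{\eta_i}\bigr) \bigr],
  \quad
  \eta_i = \beta_0 + \sum_{j=1}^{p} \beta_j x_{ij}.
\]
```

Larger values of \lambda set more coefficients exactly to zero, which is how variable selection and estimation happen in a single step.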

Regarding the practical implementation of these methods, we have structured a code (called LUDARA) to fit and validate our model, parallelizing over a certain number of replicates N. Specifically, our code (which can be found in the Supplementary Material) proceeds as follows: i) it randomly extracts 75% of landslide presences in the whole region, as well as the same number of landslide absences according to a balanced sampling strategy. The logistic regression model is then fitted using a LASSO penalty optimally chosen by cross-validation; ii) the code then stores the predictor IDs and corresponding coefficients; iii) the complementary 25% of presences is then used for validation together with an equal number of (randomly chosen) absences; using the validation subset, iv) the Accuracy, or (True Positives + True Negatives)/Total Population, is computed varying the probability cutoff value that separates landslide from no-landslide cases; v) the Receiver Operating Characteristic (ROC) curve and the AUC (area under the ROC curve) are computed; vi) the whole procedure is repeated N = 500 times and therefore produces N different models with their corresponding parameter estimates, ROC curves and AUCs. The predictor importance may be assessed by counting the actual selections of the 273 predictors across the N models. Mean fitted probabilities, mean ROC curve and mean AUC are obtained by averaging over all the N replicates; vii) the accuracy is plotted against the varying cutoffs and the cutoff at which the average accuracy curve reaches its maximum is then adopted to separate stable from unstable conditions. Step vii) is implemented to quantitatively determine the best probability cutoff, as several authors (Felicísimo et al., 2013; Frattini et al., 2010; Heckmann et al., 2014) have pointed out that the cutoff choice is not unique and may change with the sampling scheme, method or dataset. Details regarding the model performances are given in Sections 3 and 4.

Figure 5: Accuracy test (a) and Receiver Operating Characteristic curves (b) for each of the 500 replicates. Red lines represent the corresponding means. Panel (a) plots Accuracy against Cutoff; panel (b) plots True positive rate against False positive rate.
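The sampling and validation steps of one replicate can be sketched as follows. This is not the LUDARA code, which is written in R; it is an illustrative sketch that leaves out the model fitting itself. All function names and the toy probabilities are hypothetical, and drawing disjoint absences for training and validation is an assumption, since the text only states that the absences are randomly chosen.

```python
import random

def balanced_split(presence_ids, absence_ids, train_fraction=0.75, rng=None):
    """Split presences 75/25 and pair each part with as many random absences."""
    rng = rng or random.Random()
    presences = list(presence_ids)
    rng.shuffle(presences)
    n_train = round(train_fraction * len(presences))
    train_pres, valid_pres = presences[:n_train], presences[n_train:]
    # Assumption: training and validation absences are drawn without overlap.
    absences = rng.sample(list(absence_ids), len(presences))
    train_abs, valid_abs = absences[:n_train], absences[n_train:]
    return (train_pres, train_abs), (valid_pres, valid_abs)

def accuracy_at(cutoff, probs, labels):
    """(True Positives + True Negatives) / Total Population at a given cutoff."""
    correct = sum((p >= cutoff) == bool(y) for p, y in zip(probs, labels))
    return correct / len(labels)

def auc(probs, labels):
    """AUC as the probability that a presence is ranked above an absence."""
    pos = [p for p, y in zip(probs, labels) if y == 1]
    neg = [p for p, y in zip(probs, labels) if y == 0]
    wins = sum((pp > pn) + 0.5 * (pp == pn) for pp in pos for pn in neg)
    return wins / (len(pos) * len(neg))

# Hypothetical validation probabilities and observed statuses.
probs = [0.9, 0.8, 0.6, 0.4, 0.7, 0.3, 0.2, 0.1]
labels = [1, 1, 1, 1, 0, 0, 0, 0]
cutoffs = [c / 10 for c in range(1, 10)]
best_cutoff = max(cutoffs, key=lambda c: accuracy_at(c, probs, labels))
print(best_cutoff, accuracy_at(best_cutoff, probs, labels), auc(probs, labels))
```

In a full replicate, the probabilities would come from the LASSO-penalized logistic model fitted on the training part, and the accuracy curves would be averaged over all replicates before picking the cutoff.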

3. Results

Figure 5(a) shows the results of the accuracy test across the 500 replicates. The average maximum accuracy of 0.71 is reached at approximately the standard cutoff of 0.5; this value was therefore fixed for the confusion matrices in the rest of the analyses (e.g. Lombardo et al., 2016b). Figure 5(b) shows the validation ROC curves across the 500 replicates. The corresponding average AUC of 0.78 is associated with a standard deviation of 0.03. This attests to an almost excellent prediction skill (Hosmer and Lemeshow, 2000) together with a strong stability of the model throughout the cross-validation procedure. Once the quality of the models was confirmed in terms of overall prediction skill and associated variance, the predictor behavior was investigated to infer the geomorphological reasonability of the results. In particular, due to the combined effect of the random data set extraction, the high dimensionality of the predictor hyperspace and the LASSO-based variable selection, the list of significant covariates changes from one replicate to the next. In order to get a unique interpretable set of predictors, we retained the covariates that are selected in more than 80% of the 500 replicates. Figure 6 plots the percentage of selection for each predictor.
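The 80% selection rule can be illustrated with a short Python sketch. It assumes that each replicate yields one coefficient per predictor and that a predictor counts as selected when its coefficient is nonzero, which is the usual reading of LASSO selection; the function names and the toy coefficients are hypothetical and much smaller than the 500 replicates and 273 predictors used in the paper.

```python
# Illustrative sketch only; names and toy data are hypothetical.

def selection_percentages(coefs_per_replicate):
    """Percentage of replicates in which each predictor has a nonzero coefficient."""
    n_rep = len(coefs_per_replicate)
    n_pred = len(coefs_per_replicate[0])
    return [
        100.0 * sum(1 for coefs in coefs_per_replicate if coefs[j] != 0.0) / n_rep
        for j in range(n_pred)
    ]

def stable_predictors(percentages, threshold=80.0):
    """Indices of predictors selected in MORE than `threshold` percent of replicates."""
    return [j for j, pct in enumerate(percentages) if pct > threshold]

# Toy example: 5 replicates (rows) and 4 predictors (columns).
coefs = [
    [-0.30, 0.00, 0.12, 0.00],
    [-0.35, 0.00, 0.10, 0.02],
    [-0.33, 0.01, 0.11, 0.00],
    [-0.31, 0.00, 0.00, 0.00],
    [-0.36, 0.00, 0.09, 0.00],
]
pct = selection_percentages(coefs)
kept = stable_predictors(pct)
print(pct, kept)
```

Note that the third toy predictor is selected in exactly 80% of the replicates and is therefore dropped, since the rule requires strictly more than 80%.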

Figure 6: Variable importance measured by percentage of selection over the 500 replicates. The red horizontal dashed line represents the 80% variable-inclusion threshold. [Plot: x-axis: Predictor ID; y-axis: Percentage (0 to 100).]

The eight covariates above the red line are listed in order of importance: i) maxD/sqrtA, ii) Sandstone/SU, iii) Elevation’s 95th percentile, iv) TWI’s 95th percentile, v) SD of Elevation, vi) Plan Curvature’s 5th percentile, vii) Relative Slope Position’s 5th percentile, viii) Eastness’ 95th percentile. The effects of these covariates on the final susceptibility are assessed by analysing the regression β coefficient values (Cama et al., 2015, 2016) together with response plots (Pourghasemi and Rossi, 2016; Schillaci et al., 2017a,b; Tziritis and Lombardo, 2016). The distribution of the 500 coefficients is summarized in Figure 7. The mean landslide susceptibility (computed as the average susceptibility from the 500 replicates) is plotted against each covariate in Figure 8. This approach complements the analysis of the coefficients by providing a more dynamic perspective of variable-interaction effects on the susceptibility. Moreover, the mean and standard deviation of the estimated susceptibility for each SU are mapped in Figure 9(a) and Figure 9(b), respectively. These two maps allow for visualizing the prediction and for completing the evaluation of the model performances. This evaluation consists of computing the model error (Lombardo et al., 2016a; Rossi et al., 2010), where the mean susceptibility value is plotted against twice the standard deviation. Figure 10 presents the model error in a density plot, with the upper bound of the 95% confidence interval coinciding with 0.2.
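The model-error computation described above, mean susceptibility against twice the standard deviation for each SU, can be sketched as follows. This is a minimal illustration under stated assumptions: the function name `model_error` and the toy probabilities are hypothetical, and the sample standard deviation (n − 1 denominator) is used because the text does not specify which estimator was adopted.

```python
# Illustrative sketch only; names and toy data are hypothetical.
from statistics import mean, stdev

def model_error(probs_per_replicate):
    """For each slope unit, return (mean susceptibility, 2 * sample standard deviation)
    computed across replicates. Rows are replicates, columns are slope units."""
    n_units = len(probs_per_replicate[0])
    result = []
    for u in range(n_units):
        values = [rep[u] for rep in probs_per_replicate]
        result.append((mean(values), 2.0 * stdev(values)))
    return result

# Toy example: 3 replicates (rows) and 2 slope units (columns).
toy_probs = [
    [0.2, 0.8],
    [0.3, 0.8],
    [0.4, 0.8],
]
errors = model_error(toy_probs)
for unit, (m, two_sd) in enumerate(errors):
    print(unit, round(m, 3), round(two_sd, 3))
```

Plotting the first element of each pair against the second gives the kind of mean-versus-spread diagram that the text refers to as the model error.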

4. Discussion

We tested our code, LUDARA, on Sado Island, where a well-tested inventory of deep-seated landslides was made available. The model performances were evaluated at different stages. The first stage implemented an accuracy test to select the best probability cutoff for separating stable from unstable conditions in the probability spectrum. This was done by maximizing the accuracy as a function of a moving cutoff. The average maximum accuracy across replicates was 0.71 at a probability cutoff value of 0.5, which is also the commonly adopted threshold. The same threshold was subsequently applied to the rest of the metric calculations (TP, FP, TN, FN). In addition, the ROC curves and the corresponding average AUC value of 0.78 confirmed an almost excellent prediction together with low associated variance.

The geomorphological reasonability of the model was inferred from those covariates which were selected more than 400 times out of the total 500 replicates. Figure 7 summarizes their coefficients during the modeling phase while Figure 8 shows the predictor response over the fitted probabilities for each of the SUs in the island. MaxD/sqrt(A) appeared to be the most significant covariate. This shape index carries information on the elongation of a given SU, which increases with the ratio. The sign of the corresponding β coefficient is negative (median β = −0.33), indicating that the less elongated the SU, the more unstable it would be. This can be interpreted either as a classification effect, as most of the landslides in the area are large, or as a lower shear resistance along the sliding surface against greater unstable volumes. The same relation can be seen in the response plots, where the probability rapidly decays as maxD/sqrt(A) increases.

The Sandstone to SU areal ratio similarly shows a negative effect on landslide occurrences (median β = −0.01), which is confirmed by the response plot with two evident clusters: the main one is strictly unstable, where Sandstone is absent from the SU, and the second has very low probabilities at high Sandstone proportions. It is worth noting that Sandstone was shown in Figure 3(g) to be the third most representative outcropping lithotype for the landslides. Despite this, the regression β coefficient was estimated to be negative. This may depend on the limited thickness of these sandstone deposits when compared to the sliding surface of a deep-seated landslide, together with the prevalent presence of this covariate along the central graben, where the topography is flat. The Elevation’s 95th percentile has a positive effect on instabilities (median β = 0.11), which results in a rapid increase in the SU susceptibility. This can be explained by the fact that this covariate represents the highest elevation values within a SU, which can also be translated into a greater exposure to rainfall discharges, which in turn destabilize the slope unit.
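To make the sign and size of the median coefficients reported above more tangible, they can be converted into odds ratios. The sketch below assumes a logit link, so that a one-unit increase in a predictor multiplies the odds of landslide presence by exp(β); the "unit" is whatever scale the predictors had when the model was fitted, which is not restated here. The function names and the dictionary labels are hypothetical.

```python
# Illustrative sketch only, assuming a logit link; names are hypothetical.
import math

def odds_ratio(beta, delta=1.0):
    """Multiplicative change in the odds for an increase of `delta` in the predictor."""
    return math.exp(beta * delta)

def shifted_probability(p0, beta, delta=1.0):
    """Probability after increasing one predictor by `delta`, starting from a
    baseline probability p0 and holding all other predictors fixed."""
    odds = p0 / (1.0 - p0) * odds_ratio(beta, delta)
    return odds / (1.0 + odds)

# Median coefficients quoted in the text.
median_betas = {
    "maxD/sqrt(A)": -0.33,
    "Sandstone/SU": -0.01,
    "Elevation 95th percentile": 0.11,
}
ratios = {name: odds_ratio(b) for name, b in median_betas.items()}
for name, r in ratios.items():
    print(f"{name}: odds ratio = {r:.3f}")
```

For example, with β = −0.33 the odds ratio is about 0.72, so a one-unit increase in maxD/sqrt(A) lowers the odds of instability by roughly 28%, all else being equal.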

The TWI’s 95th percentile contributes to SU instability (median β = 0.01). The corresponding response plot distribution shows a peak at 8, where the maximum instability is reached. This distribution can be interpreted as the destabilizing effect of water infiltration where the

landslide susceptibility is high while the decay to sta-
