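Attribution-NonCommercial-NoDerivs 2.0 Korea
Users are free to copy, distribute, transmit, display, perform, and broadcast this work, provided that they follow the conditions below:
Attribution. You must attribute the work to the original author.
NonCommercial. You may not use this work for commercial purposes.
NoDerivs. You may not alter, transform, or build upon this work.
For any reuse or distribution, you must make clear to others the license terms applied to this work.
These conditions do not apply if you obtain separate permission from the copyright holder.
Your rights as a user under copyright law are not affected by the above. This is a human-readable summary of the Legal Code.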
Master of Science Thesis
A Study on a Semiparametric Method to Point Source Modelling and its Application
August 2016
Department of Statistics, Graduate School of Natural Sciences, Seoul National University
Hoseung Song
Advisor: Byeong U. Park
Submitted as a thesis for the degree of Master of Science
June 2016
Department of Statistics, Graduate School of Natural Sciences, Seoul National University
Hoseung Song
The Master of Science thesis of Hoseung Song is approved.
June 2016
Chair: Yongdai Kim (seal)
Vice Chair: Byeong U. Park (seal)
Member: Hee-Seok Oh (seal)
A Study on a Semiparametric Method to Point Source Modelling and its Application
by
Hoseung Song
A Thesis
submitted in fulfillment of the requirement for the degree of
Master of Science in
Statistics
The Department of Statistics College of Natural Sciences
Seoul National University August, 2016
Abstract
Diggle (1990) analysed point source problems and proposed independent Poisson process models for disease cases and controls, each with its own intensity function. Diggle and Rowlingson (1994) then reformulated this bivariate Poisson process model as a non-linear binary regression model by conditioning on the case and control locations. This conditional approach extends naturally to models with several point sources and explanatory variables. More recently, Rodrigues, Diggle and Assuncao (2009) proposed a semiparametric formulation as a generalized additive model (GAM) for analysing point source effects in epidemiology and criminology. This thesis reviews the semiparametric approach to point source modelling and its applications. We also apply the method to asthma data and compare the results with the parametric analysis published in 1994.
Keywords: Point Source, Poisson Process, Generalized Additive Model, Semiparametric Method, Spatial Statistics
Student Number: 2014-22359
Contents
1 Introduction
2 The Conditional Approach to Point Source Model
2.1 Poisson Process Model for a Point Source
2.2 Conditional Approach
3 Semiparametric Approach
4 Application
4.1 Larynx and Lung Cancers in Lancashire, UK
4.2 CCTV in Belo, Brazil
5 Simulation
6 Conclusion
List of Figures
4.1.1 Location of Larynx and Lung Cancers, and an Incinerator in South Lancashire
4.1.2 Results
4.2.1 Location of street robberies in Belo
4.2.2 Ratio between the proportion of crimes after and before the installation of cameras
4.2.3 Estimates in two crowded streets
4.2.4 Estimates in two less-crowded streets
5.0.1 Location of asthmatic in North Derbyshire, UK
5.0.2 Semiparametric estimate with source 1
5.0.3 Semiparametric estimate with source 2
5.0.4 Semiparametric estimate with source 3
5.0.5 Modelling with each variable cases
Chapter 1
Introduction
There have been many attempts to assess whether risk is elevated around one or more point sources that pose a potential hazard, especially in epidemiology and criminology. For example, in Lancashire, UK, the occurrence of larynx and lung cancers has been studied as a function of distance from an old incinerator. Similarly, the spatial distribution of crimes in the city of Belo Horizonte, Brazil, has been studied in relation to the locations of CCTV cameras.
Our starting point is the method proposed by Diggle (1990), who modelled the locations of disease cases and controls as Poisson processes with intensity functions $\lambda(x)$ and $\lambda_0(x)$, respectively. Diggle and Rowlingson (1994) converted this bivariate Poisson process model into a non-linear binary regression model by conditioning on the case and control locations, which avoids the difficulty caused by the uncertainty in a kernel estimate. Furthermore, Rodrigues, Diggle and Assuncao (2009) formulated the problem semiparametrically as a generalized additive model (GAM), which allows a more elaborate model and can describe interactions between explanatory variables and point sources.
The purpose of this thesis is to introduce and study the semiparametric approach to point source modelling and its applications in detail. In addition, we apply the method to another data set and compare the results with those of the earlier parametric approach.
The remainder of this thesis is organized as follows. Chapter 2 reviews the conditional approach to a point source model, starting from a Poisson process model for the elevation in risk near a point source. Chapter 3 describes the semiparametric method for point source modelling. Chapter 4 presents two applications of point source problems: the lung and larynx cancer data from Lancashire, UK, and the Belo Horizonte crime data. In Chapter 5, we present a further application, using asthma data from North Derbyshire, UK, to validate the semiparametric method and compare its results with those of the parametric approach. We discuss and summarize our results in Chapter 6.
Chapter 2
The Conditional Approach to Point Source Model
2.1 Poisson Process Model for a Point Source
Diggle (1990) suggested a Poisson process model for how risk varies with distance from a point source. Suppose that the point source is located at $x_0$ and that $n$ events are observed at locations $x_i \in A$, $i = 1, \ldots, n$. Diggle proposed that these events form a Poisson point process with intensity function $\lambda(x)$ given by
\[ \lambda(x) = \rho\,\lambda_0(x)\,f(x - x_0 \mid \theta) \tag{2.1.1} \]
where
$\lambda_0(x)$ = intensity function of the population at risk,
$f(x - x_0 \mid \theta)$ = spatial variation in risk that is associated with the point source $x_0$,
$\rho$ = scaling factor relating to the overall prevalence of the disease.
Diggle (1990) also suggested the following form for $f(\cdot)$ as a function of distance from the point source:
\[ f(x - x_0 \mid \theta) = 1 + \theta_1 \exp\left\{-\left(\frac{\lVert x - x_0 \rVert}{\theta_2}\right)^2\right\} \]
where
$\lVert \cdot \rVert$ = distance,
$\theta_1$ = elevation in risk at the point source,
$\theta_2$ = rate of decay in risk with increasing distance from the source.
The negative exponent makes $f$ fall from $1 + \theta_1$ at the source towards 1 far away from it.
In addition to this model, Diggle and Rowlingson (1994) suggested a model that includes explanatory variables $z_j(x)$, $j = 1, \ldots, p$, and accommodates multiple point sources $x_{0i}$, $i = 1, \ldots, r$. Assuming log-linear adjustments for the explanatory variables and multiplicative effects for the separate sources, they proposed a multiple-source version of the function $f(\cdot)$ given by
\[ f(x) = \prod_{i=1}^{r} g_i(\lVert x - x_{0i} \rVert)\, \exp\left\{\sum_{j=1}^{p} \phi_j z_j(x)\right\} \tag{2.1.2} \]
where
$g_i$ = functions that model the point source effects,
$\phi_j$ = regression parameters.
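As an illustration of (2.1.1) and (2.1.2), the following is a minimal Python sketch of the risk function. It assumes that every $g_i$ takes the single-source form of Diggle (1990), with a negative exponent so that the effect decays with distance. The function names, argument layout, and example values are hypothetical and are not taken from the thesis.

```python
import numpy as np


def source_effect(dist, theta1, theta2):
    """Single-source effect 1 + theta1 * exp(-(dist / theta2)^2)."""
    return 1.0 + theta1 * np.exp(-(dist / theta2) ** 2)


def risk_surface(x, sources, thetas, z=None, phi=None):
    """Multiplicative multi-source risk f(x) in the spirit of (2.1.2).

    x       : (n, 2) array of locations
    sources : sequence of r source locations, each of length 2
    thetas  : sequence of r pairs (theta1, theta2), one per source
    z, phi  : optional (n, p) covariate matrix and (p,) coefficients
    """
    x = np.atleast_2d(np.asarray(x, dtype=float))
    f = np.ones(len(x))
    for x0, (theta1, theta2) in zip(sources, thetas):
        dist = np.linalg.norm(x - np.asarray(x0, dtype=float), axis=1)
        f *= source_effect(dist, theta1, theta2)   # product over sources
    if z is not None:
        f *= np.exp(np.asarray(z) @ np.asarray(phi))  # log-linear covariates
    return f


# Hypothetical example: one source at the origin, two locations.
locations = np.array([[0.0, 0.0], [100.0, 0.0]])
f_values = risk_surface(locations, sources=[(0.0, 0.0)], thetas=[(2.0, 1.0)])
```

At the source itself the sketch returns $1 + \theta_1$, and far from every source it returns values close to 1, which is the no-effect baseline.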
2.2 Conditional Approach
To fit the model, Diggle (1990) replaced the unknown $\lambda_0(x)$ with a kernel estimate constructed from $m$ control locations $x_i \in A$, $i = n+1, \ldots, n+m$. However, because this method ignores the uncertainty in the kernel estimate, the standard asymptotic results of likelihood theory cannot be relied upon. To avoid this difficulty, Diggle and Rowlingson (1994) devised a method that conditions on the case and control locations. Under the Poisson process model, the labels of the $n+m$ case and control events then follow an independent Bernoulli model:
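• Conditional on the $n+m$ locations $x_i$, the case or control labels are independent Bernoulli random variables, and the probability that an event at $x$ is a case is
\[ p(x) = \frac{\rho f(x - x_0 \mid \theta)}{1 + \rho f(x - x_0 \mid \theta)} \]
• This follows because cases have intensity $\rho\lambda_0(x) f(x - x_0 \mid \theta)$ and controls have intensity $\lambda_0(x)$, so that
\[ p(x) = \frac{\lambda_0(x)\rho f(x - x_0 \mid \theta)}{\lambda_0(x) + \lambda_0(x)\rho f(x - x_0 \mid \theta)} = \frac{\rho f(x - x_0 \mid \theta)}{1 + \rho f(x - x_0 \mid \theta)} \]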
As a result, $\lambda_0(x)$ cancels, and likelihood inference for $\rho$ and $\theta$ becomes straightforward. The log-likelihood for $\rho$ and $\theta$ is
\[ L(\rho, \theta) = \sum_{i=1}^{n} \log\{p(x_i)\} + \sum_{i=n+1}^{n+m} \log\{1 - p(x_i)\} \]
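The conditional log-likelihood can be maximized numerically. The sketch below is one possible way to do this in Python for the single-source model, assuming the Diggle (1990) form of $f$ with a decaying exponential. The log-scale parametrization (which forces $\rho$, $\theta_1$ and $\theta_2$ to be positive), the Nelder-Mead optimizer, and the simulated data are our own illustrative choices, not the procedure used in the cited papers.

```python
import numpy as np
from scipy.optimize import minimize


def neg_log_lik(params, dist, is_case):
    """Negative of L(rho, theta) for the single-source model.

    params holds (log rho, log theta1, log theta2); working on the log
    scale is an assumption of this sketch that keeps all three positive.
    """
    rho, theta1, theta2 = np.exp(params)
    f = 1.0 + theta1 * np.exp(-(dist / theta2) ** 2)
    eta = np.log(rho * f)                # logit{p(x)}
    log1pexp = np.logaddexp(0.0, eta)    # log(1 + exp(eta))
    # Cases contribute eta - log1pexp, controls contribute -log1pexp.
    return -(np.sum(is_case * eta) - np.sum(log1pexp))


# Simulated distances and labels (hypothetical values, not thesis data).
rng = np.random.default_rng(0)
dist = rng.uniform(0.0, 5.0, size=4000)
f_true = 1.0 + 3.0 * np.exp(-(dist / 1.0) ** 2)
p_true = 0.2 * f_true / (1.0 + 0.2 * f_true)
is_case = rng.binomial(1, p_true)

start = np.zeros(3)
fit = minimize(neg_log_lik, start, args=(dist, is_case), method="Nelder-Mead")
rho_hat, theta1_hat, theta2_hat = np.exp(fit.x)
```

Only the distances and the case or control labels enter the function, which reflects the fact that $\lambda_0(x)$ has dropped out of the likelihood.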
Chapter 3
Semiparametric Approach
In their semiparametric method, Rodrigues, Diggle and Assuncao (2009) use not only the conditional approach but also a generalized additive model for $p(x)$. For model (2.1.1), with one point source and no explanatory variables, they represent $\operatorname{logit}\{p(x)\} = \log\{\rho f(\lVert x - x_0 \rVert)\}$ as a nonparametric function $s(\cdot)$, specified as a penalized regression spline
\[ s(x) = \beta_0 + \beta_1 x + \sum_{k=1}^{K} u_k \lvert x - w_k \rvert^3 \]
where $w_1 < w_2 < \cdots < w_K$ are knots.
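To make the spline form concrete, the sketch below builds the basis $1, x, \lvert x - w_k \rvert^3$ and fits it by penalized least squares. Two simplifications are assumed and should be read as such: the response is continuous rather than binary (the model in this chapter is logistic and would need an iteratively reweighted version of the same step), and the penalty is a plain ridge penalty on the $u_k$, because the text does not state the exact penalty used. All names are hypothetical.

```python
import numpy as np


def cubic_radial_basis(x, knots):
    """Design matrix with columns 1, x, |x - w_1|^3, ..., |x - w_K|^3."""
    x = np.asarray(x, dtype=float)
    knots = np.asarray(knots, dtype=float)
    radial = np.abs(x[:, None] - knots[None, :]) ** 3
    return np.column_stack([np.ones_like(x), x, radial])


def fit_penalized_spline(x, y, knots, lam):
    """Minimize ||y - B c||^2 + lam * sum(u_k^2) over c = (beta0, beta1, u).

    Only the knot coefficients u_k are penalized (an assumed penalty).
    """
    B = cubic_radial_basis(x, knots)
    D = np.diag([0.0, 0.0] + [1.0] * len(knots))
    return np.linalg.solve(B.T @ B + lam * D, B.T @ np.asarray(y, dtype=float))


# Hypothetical example: a noisy decaying curve on [0, 1] with 5 knots.
rng = np.random.default_rng(1)
x_obs = np.linspace(0.0, 1.0, 50)
y_obs = np.exp(-3.0 * x_obs) + rng.normal(scale=0.05, size=x_obs.size)
knots = np.linspace(0.1, 0.9, 5)
coef = fit_penalized_spline(x_obs, y_obs, knots, lam=1.0)
s_hat = cubic_radial_basis(x_obs, knots) @ coef
```

A larger `lam` shrinks the $u_k$ towards zero, so the fitted curve approaches the straight line $\beta_0 + \beta_1 x$.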
For the extended model (2.1.2), with multiple sources and explanatory variables, the corresponding representation is
\[ \operatorname{logit}\{p(x)\} = \log\{\rho f(x)\} = \log(\rho) + s(\lVert x - x_{0,1} \rVert, \ldots, \lVert x - x_{0,r} \rVert) + \sum_{j=1}^{p} h_j\{z_j(x)\} \tag{3.0.1} \]
where $s$ is now a surface in $\mathbb{R}^r$ and the $h_j(\cdot)$ are smooth, but otherwise arbitrary, regression functions.
In practice this model is of limited use, because $s$ is a function of $r$ variables and can be fitted for at most two or three sources. However, if the source effects combine multiplicatively, so that $s$ decomposes into a sum of univariate functions, and the covariate effects are log-linear, model (3.0.1) simplifies to
\[ \operatorname{logit}\{p(x)\} = \log\{\rho f(x)\} = \log(\rho) + \sum_{i=1}^{r} s_i(\lVert x - x_{0,i} \rVert) + \sum_{j=1}^{p} \phi_j z_j(x) \tag{3.0.2} \]
Models (3.0.1) and (3.0.2) are also helpful for choosing and checking a parametric form for the source-related part of the model.
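The simplification can be checked directly from (2.1.2). As a sketch, write $s_i = \log g_i$ (a label introduced here for convenience; it matches the definition used for the camera example in Section 4.2). Taking logarithms of $\rho f(x)$ then gives

\[
\begin{aligned}
\operatorname{logit}\{p(x)\} &= \log\{\rho f(x)\} \\
&= \log(\rho) + \log \prod_{i=1}^{r} g_i(\lVert x - x_{0,i} \rVert) + \log \exp\left\{\sum_{j=1}^{p} \phi_j z_j(x)\right\} \\
&= \log(\rho) + \sum_{i=1}^{r} \log g_i(\lVert x - x_{0,i} \rVert) + \sum_{j=1}^{p} \phi_j z_j(x) \\
&= \log(\rho) + \sum_{i=1}^{r} s_i(\lVert x - x_{0,i} \rVert) + \sum_{j=1}^{p} \phi_j z_j(x).
\end{aligned}
\]

Each $s_i$ depends on a single distance, so it can be estimated with a univariate spline however many sources there are.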
Chapter 4
Application
4.1 Larynx and Lung Cancers in Lancashire, UK
This example concerns cases of lung and larynx cancer during the years 1973-1983 in South Lancashire, UK, shown in Figure 4.1.1. During this period there were 58 cases of larynx cancer, treated as cases, and 978 cases of lung cancer, treated as controls, and the analysis concerns the relative risk of larynx and lung cancers as a function of distance from a disused incinerator. The objective is to use the new semiparametric method to compare a nonparametric estimate of the point source effect with the parametric estimate reported in 1994.
In Figure 4.1.2, the horizontal line $y = n/m$ is the estimated prevalence when $f(x) = 1$, that is, when there is no point source effect. Both estimates show that the risk of cancer increases as the distance from the source (the incinerator) decreases. The parametric estimate also decays faster with increasing distance than the nonparametric estimate. In other words, the parametric estimate implies a stronger point source effect than the nonparametric estimate.
Moreover, because the parametric estimate lies within the 95% confidence limits of the nonparametric estimate, the nonparametric fit can serve as a check on the adequacy of a parametric model.
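The level of the horizontal reference line follows from the counts quoted above. Under $f(x) = 1$ the case probability is the constant $\rho/(1 + \rho)$, whose maximum likelihood estimate is $n/(n+m)$, so $\hat\rho = n/m$. A short Python check of this arithmetic (variable names are ours):

```python
# Counts quoted in Section 4.1: larynx cancers (cases), lung cancers (controls).
n_cases, n_controls = 58, 978

# With no point source effect, rho is estimated by n / m.
rho_null = n_cases / n_controls

# The corresponding constant case probability rho / (1 + rho) = n / (n + m).
p_null = rho_null / (1.0 + rho_null)
```

Here `rho_null` is about 0.059 and `p_null` is about 0.056, so values of the fitted curve above 0.059 indicate elevated risk relative to the no-effect baseline.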
4.2 CCTV in Belo, Brazil
This example analyses changes in the crime rate following the installation of CCTV cameras in Belo Horizonte, Brazil. Sixty cameras were installed on the city's streets in December 2004, as shown in Figure 4.2.1 (the numbers give the locations and indices of the cameras). The analysis treats the 12280 crimes recorded from January 1, 2002 to December 12, 2004 as controls and the 4329 crimes recorded from December 13, 2004 to December 31, 2006 as cases.
There are two ways to model the effect of the cameras. The first is to assume that $f(x)$ in model (2.1.1) is influenced by all cameras. However, this has the weakness of ignoring the geographical layout of the camera effects. The second is to assume that $f(x)$ is influenced only by the nearest camera. This option produces a discontinuity in $f$ at any point that is equidistant from two cameras with different effects, which is unrealistic, and Figure 4.2.2 shows that the cameras do have different effects.
We therefore allow each camera its own effect and combine the two approaches, so that $f(x)$ reflects the effects of the two cameras nearest to $x$:
\[ f(x) = g_{i(x)}(\lVert x - x_{0,i(x)} \rVert)\; g_{j(x)}(\lVert x - x_{0,j(x)} \rVert) \]
where, for each $x$, $i(x)$ and $j(x)$ identify the cameras at either end of the segment (street) to which $x$ belongs.
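The control and case locations are considered independent random samples with densities
\[ \lambda_c(x) = \frac{\lambda_0(x)}{\int \lambda_0(x)\,dx} \quad \text{and} \quad \frac{f(x)\lambda_c(x)}{\int f(x)\lambda_c(x)\,dx}, \]
respectively. Here $\int f(x)\lambda_c(x)\,dx = E[f(x)]$, where the expectation is taken with respect to the density $\lambda_c(x)$, and $\rho$ no longer appears. The case density can therefore be written as $\lambda_c(x)\,f(x)/E[f(x)]$, and the quantity of interest becomes $f(x)/E[f(x)]$.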
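Writing $i$ and $j$ for $i(x)$ and $j(x)$, the semiparametric model is
\[ \operatorname{logit}[p(x)] = \log\left\{\frac{n}{m}\,\frac{f(x)}{E[f(x)]}\right\} = \log\left(\frac{n}{m}\right) + \alpha + w_i(x)\,s_i(\lVert x - x_{0,i} \rVert) + w_j(x)\,s_j(\lVert x - x_{0,j} \rVert) \]
where $w_i(x)$ and $w_j(x)$ are weights, $\alpha = -\log\{E[f(x)]\}$, and $s_i(\lVert x - x_{0,i} \rVert) = \log\{g_i(\lVert x - x_{0,i} \rVert)\}$ is the source-related part, modelled by penalized regression splines of the form given in Chapter 3, with 10 knots.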
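The role of the offset $\log(n/m)$ can be illustrated with the counts quoted in this section. The sketch below evaluates the linear predictor for given values of $\alpha$, the weights, and the two spline terms. The weight and spline values are hypothetical, since the thesis does not specify how the weights are computed, and the function name is ours.

```python
import numpy as np

# Counts quoted in Section 4.2: crimes after (cases) and before (controls).
N_CASES, N_CONTROLS = 4329, 12280
OFFSET = np.log(N_CASES / N_CONTROLS)   # log(n / m)


def case_probability(alpha, w_i, s_i, w_j, s_j):
    """p(x) from logit p = log(n/m) + alpha + w_i * s_i + w_j * s_j."""
    eta = OFFSET + alpha + w_i * s_i + w_j * s_j
    return 1.0 / (1.0 + np.exp(-eta))


# No camera effect: s_i = s_j = 0 and alpha = 0 give p = n / (n + m).
p_baseline = case_probability(alpha=0.0, w_i=0.5, s_i=0.0, w_j=0.5, s_j=0.0)

# A negative s_i (fewer crimes near camera i) lowers p below the baseline.
p_reduced = case_probability(alpha=0.0, w_i=0.5, s_i=-0.4, w_j=0.5, s_j=0.0)
```

With no source effect the probability equals the overall proportion of post-installation crimes, so departures of the fitted $s_i$ from zero are read relative to that proportion.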
Figure 4.2.3 shows the estimates for two streets in Belo Horizonte with heavy pedestrian traffic. Relative to the baseline $f(x) = 1$, there is no strong evidence that the installation of CCTV reduced the crime rate, but locations with cameras do show fewer crimes, and the number of crimes increases with distance from a camera.
Figure 4.2.4 shows the analysis of two less crowded streets. There are fewer events than in the previous result. However, in Figure 4.2.4(a), cameras 36, 37 and 38 show high estimates, because these cameras interact with the crowded streets shown in Figure 4.2.3.
Brown (1995) argued that it is difficult to demonstrate a camera effect in crowded areas. Our analysis agrees with this view, since it shows no significant effect of CCTV in reducing crime on crowded streets.
Figure 4.1.1: Location of Larynx and Lung Cancers, and an Incinerator in South Lancashire
Figure 4.1.2: Results
Figure 4.2.1: Location of street robberies in Belo
Figure 4.2.2: Ratio between the proportion of crimes after and before the installation of cameras
Figure 4.2.3: Estimates in two crowded streets
Figure 4.2.4: Estimates in two less-crowded streets
Chapter 5
Simulation
We now apply the semiparametric method to a new data set, the asthma data. This data set records the incidence of asthma in children in North Derbyshire, UK, in 1992. It was studied by Diggle and Rowlingson (1994) to investigate the relationship between asthma and proximity to three putative pollution sources (a coking works, a chemical plant, and a waste treatment centre). Children who suffered from asthma are regarded as cases, while the other children included in the study are considered controls.
Figure 5.0.1 shows the point pattern, including the locations of the pollution sources and the boundary of the region. There are many asthma cases around source 1 and source 3 (more around source 1 than around source 3) but fewer around source 2. The objective is to obtain nonparametric estimates of the point source effects and to compare them with the parametric estimates obtained in 1994.
Figure 5.0.2 shows that the risk of asthma increases with proximity to source 1. By contrast, Figures 5.0.3 and 5.0.4 show the risk of asthma increasing with distance from sources 2 and 3. The estimate for source 3 appears unreliable, because Figure 5.0.1 shows many asthma cases around source 3.
Figure 5.0.5 reproduces the results of the parametric analysis by Diggle and Rowlingson (1994). It shows the maximized log-likelihood for various sub-models that include several point sources and covariates. In this thesis, we ignore the other covariates and focus only on the effects of the three point sources.
There is significant evidence of an association with source 1, but little evidence of an association with sources 2 and 3.
When we added either or both of sources 2 and 3 to the model including source 1, the log-likelihood increased only negligibly. These results are not shown in the figure.
As in the 1994 study, the effect of source 1 is significant, and our result is reasonable in that many asthma cases are located around source 1. The earlier study found no strong evidence for sources 2 and 3, and our results likewise provide little evidence for them. Indeed, the semiparametric estimate for source 3 appears inaccurate. We therefore conclude that the semiparametric approach based on a generalized additive model agrees with the parametric method of the previous study.
Figure 5.0.1: Location of asthmatic in North Derbyshire, UK
Figure 5.0.2: Semiparametric estimate with source 1
Figure 5.0.3: Semiparametric estimate with source 2
Figure 5.0.4: Semiparametric estimate with source 3
Figure 5.0.5: Modelling with each variable cases
Chapter 6
Conclusion
In this thesis, we have reviewed the semiparametric approach to point source modelling based on a generalized additive model, in particular a semiparametric method for a point process model of point source intervention. Rodrigues, Diggle and Assuncao (2009) recast the conditional approach of 1994 as a generalized additive model. To understand this approach, we studied two examples: the effect of an incinerator in Lancashire, UK, and the effect of CCTV in Belo Horizonte, Brazil. The first example shows how the method can be used to validate or develop a parametric model for a point source effect. The second example shows the advantage of the semiparametric method when there are many point source effects.
Furthermore, we applied the semiparametric approach to another data set, the asthma data studied by Diggle and Rowlingson in 1994. We compared the results with those obtained by the parametric method, and our results point to similar conclusions to the previous study. With this semiparametric approach we can build a more elaborate model, describe the interaction between point sources and explanatory variables, and assess the combined effect of several point sources.
References
Rodrigues, A., Diggle, P.J. and Assuncao, R. (2009), “Semiparametric approach to point source modelling in epidemiology and criminology”, Journal of the Royal Statistical Society, Series C, 59(3), 533–542.
Diggle, P.J. and Rowlingson, B.S. (1994), “A Conditional Approach to Point Process Modelling of Elevated Risk”, Journal of the Royal Statistical Society, Series A, 157(3), 433–440.
Bivand, R.S., Pebesma, E.J. and Gomez-Rubio, V. (2008), “Applied Spatial Data Analysis with R”, Springer, New York.
Wood, S. (2006), “Generalized Additive Models: An Introduction with R”, Chapman and Hall/CRC.
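Abstract (in Korean)
Many methods have been proposed for the spatio-temporal analysis of the effects of point sources. This thesis examines a semiparametric approach to analysing point source effects in epidemiology and criminology. In particular, it reviews the semiparametric method that applies a generalized additive model to the bivariate Poisson process model after conditioning, and it examines real applications of the method. Furthermore, the semiparametric method is applied to new data on asthma, and its results are compared with those of the previously published parametric analysis in order to assess its validity. These methods allow inference about the spatio-temporal effects of point sources and should be of considerable help in identifying the causes of the phenomena studied.
Keywords: Point Source, Poisson Process Model, Semiparametric Approach, Conditional Approach, Generalized Additive Model, Spatial Statistics
Student Number: 2014-22359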