(1)

Unit VI

Reliability

Reliability is defined as a measure of the success with which the system conforms to some authoritative specification of its behavior…

When the behavior deviates from that which is

(2)

Basic Concept

• The reliability can be divided into two parts.

– Application Dependent.

– Application Independent.

• The application-independent specification of reliability consists in requiring that transactions maintain the atomicity, durability, serializability & isolation properties.

• The application-dependent part consists of requiring that transactions fulfill the general system’s specifications.

• We emphasize two aspects of reliability:

– Correctness.

(3)

• Example :‐ Consider the DD consisting of two sites 1 & 2 … site 1 (the coordinator) to site 2.

There are two possible strategies to handle the problem.

– The first satisfies the correctness requirement by keeping X2 locked until the failure is repaired.

– The second maximizes the availability at the risk of

(4)

Following are the problems when we try to design a reliable distributed database system.

• Commitment of transaction :‐ If we use the 2‐Phase commitment protocol, we lose availability.

We can use different protocols which allow a transaction to terminate properly even in the presence of failures. These are called termination protocols.

• Multiple copies of data & robustness of concurrency control :‐

• Determining the state of the network :‐

• Detection & resolution of inconsistencies :‐

• Checkpoints & Cold restart :‐

(5)

Nonblocking Commitment Protocols

• A commitment protocol is called blocking if the occurrence of some kinds of failures forces some of the participating sites to wait until the failure is repaired.

• A transaction which cannot be terminated at a site is called pending at this site.

• The 2‐Phase commitment protocol is blocking if the coordinator fails & some participant has at the same time declared itself ready to commit.
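The blocking condition in the last bullet can be written as a small predicate. This is only a sketch under simplifying assumptions: the names `State` and `is_pending` are hypothetical, and the possibility of learning the outcome from another participant is ignored.

```python
from enum import Enum


class State(Enum):
    """Participant states (labels are illustrative assumptions)."""
    INITIAL = "I"
    READY = "R"
    ABORTED = "A"
    COMMITTED = "C"


def is_pending(participant_state: State, coordinator_up: bool) -> bool:
    """Return True if the transaction is pending (blocked) at this participant.

    A participant that has not yet declared itself ready can still abort
    unilaterally, and one that has committed or aborted has already
    terminated. Only a ready participant whose coordinator has failed
    must wait until the failure is repaired.
    """
    return participant_state is State.READY and not coordinator_up
```

For instance, `is_pending(State.READY, coordinator_up=False)` is true, while a participant still in `State.INITIAL` is never pending in this model.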

(6)

[Figure: state diagrams of the coordinator and of a participant, with states I, U, R, A & C and transitions labeled with the messages PM, RM, AAM, ACM & CM, plus the unilateral transitions ua & tm.]

(7)

Notes Messages

= Transitions which are due to an exchange of messages.

= Unilateral transitions (unilateral abort or timeout).

α / β = α is the incoming message or local condition,

(8)

• If a state diagram of this kind is used for analyzing reliability aspects of a protocol, care must be taken in assuming that transitions from one state to another are atomic.

• For example, consider a transition from state X to state Y with input I & output O.

• The following behavior is assumed.

1. The input message I is received.

(9)

Nonblocking Commitment Protocols with Site Failures

• We are interested in designing a termination protocol for the 2‐Phase Commitment protocol which allows the transaction to be terminated at all operational sites, when a failure of the coordinator site occurs.

• This is possible only in these two cases

1. At least one of the participants has received the command.

(10)

The 3‐phase commitment protocol

(11)

[Figure: State diagram for the 3-Phase-Commitment Protocol]

New States PC = Prepared-to-Commit

(12)

• This new protocol eliminates the blocking problem of the 2-phase-commitment protocol because

1. If one of the operational participants has received the command and the command was ABORT, then the operational participants can abort the transaction

2. If one of the operational participants has received the command and the command was ENTER-PREPARED-STATE, then all the operational participants can commit the transaction

3. If none of the operational participants has received the ENTER-PREPARED-STATE command, we have the case which cannot be terminated for a 2PC

(13)

Termination protocols for 3‐phase‐commitment

• The design of termination protocols is based on the following property.

• If at least one operational participant has not entered the Prepared‐to‐Commit state, then the transaction can be safely aborted.

• If at least one operational participant has entered the Prepared‐to‐Commit state, then the transaction can be safely committed.

• Since the above conditions are not mutually exclusive, in several cases the termination protocol can decide whether to commit or abort.
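The property above can be illustrated with a short sketch. It assumes that no operational participant has already committed or aborted (those cases are decided directly), and the function name `safe_decisions` is hypothetical.

```python
def safe_decisions(entered_prepared_to_commit):
    """Return the set of decisions a termination protocol may safely take.

    `entered_prepared_to_commit` holds one boolean per operational
    participant: True if that participant has entered the
    Prepared-to-Commit state, False otherwise.
    """
    flags = list(entered_prepared_to_commit)
    decisions = set()
    # At least one participant has NOT entered Prepared-to-Commit.
    if any(not flag for flag in flags):
        decisions.add("ABORT")
    # At least one participant HAS entered Prepared-to-Commit.
    if any(flags):
        decisions.add("COMMIT")
    return decisions
```

When the participants are split between the two situations, both decisions are returned, which is the case where the termination protocol is free to choose.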

(14)

• The simplest termination protocol is the centralized, nonprogressive protocol.

• First a new coordinator is elected by the operational participants.

• The new coordinator behaves as follows.

1. If the new coordinator is in the Prepared‐to‐Commit state, it issues to all operational participants the command to enter this state as well; when it has received all the OK messages, it issues the COMMIT command

2. If the new coordinator is in the commit state, i.e., it has committed the transaction, it issues the COMMIT command to all the participants

3. If the new coordinator is in the abort state, it issues the ABORT command to all the participants

4. Otherwise, the new coordinator orders all participants to go back to a state previous to the Prepared‐to‐Commit, and after it has

(15)

• This protocol is similar to the 3‐Phase‐Commitment protocol.

• In case of failure of the new coordinator, the same termination protocol can be reentered by the remaining operational sites by electing a new coordinator.

• Disadvantage :‐ It is nonprogressive.

• There are several ways in which a new coordinator can be selected.

(16)

Restart Protocols for 3‐Phase‐Commitment

• A restart protocol is executed by a site when it recovers from a failure.

• In the case of 2‐Phase‐Commitment, the restart protocol requires accessing remote recovery information, if the participant failed while it was in ready state.

• With 3‐Phase‐Commitment & termination protocol, the restart procedure will have to access remote recovery information if the participant has completed the first phase, independently of whether it has reached the prepared‐to‐commit state or not, because at restart it is not

(17)

Commitment Protocols & Network Partitions

Existence of nonblocking protocols for partitions

• The problem of the existence of nonblocking protocols in case of partition can be addressed by considering a different problem: the existence of protocols which allow independent recovery in case of site failures.

• Suppose that we can build the protocol such that if one site, say site2, fails, then

1. The other site, site1, terminates the transaction

2. Site2 at restart terminates the transaction correctly without requiring any additional information from site1

(18)

• The modified protocol is based on the following assumptions:

1. A site discovers that another site is down by not receiving a required message within a given timeout

2. A message can be lost only because of a site failure

3. Each site receives a message, changes state, and sends the required answer as an atomic

(19)

Protocols which can deal with partitions

Primary approach:

• If the 2PC protocol is used together with a primary site approach, then it is possible to terminate all the transactions of the group of the primary site, if and only if the coordinators of all pending transactions belong to this group

(20)

Majority approach and quorum‐based protocols

The basic rules of a quorum‐based protocol are:

1. Each site i has associated with it a number of votes Vi, Vi being a positive integer.

2. Let V indicate the sum of the votes of all sites of the

(21)

(22)

• A centralized termination protocol for the quorum‐based 3PC has the following structure:

1. A new coordinator is elected

2. The coordinator collects state information and acts according to the following rules:

a. If at least one site has committed (aborted), send a COMMIT (ABORT) command to the other sites

b. If the number of votes of sites which have reached the prepared‐to‐commit state is greater than or equal to Vc, send a COMMIT command.

c. If the number of votes of sites in the prepare‐to‐abort state reaches the abort quorum, send an ABORT

(23)

d. If the number of votes of sites which have reached the prepare‐to‐commit state plus the number of votes of uncertain sites is greater than or equal to Vc, send a PREPARE‐TO‐COMMIT command to uncertain sites and wait for condition 2b to occur

e. If the number of votes of sites which have reached the prepare‐to‐abort state plus the number of votes of uncertain sites is greater than or equal to Va, send a PREPARE‐TO‐ABORT command and wait for condition 2c to occur
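Rules a to e can be read as an ordered decision procedure. The following is a minimal sketch under stated assumptions: the state labels ("C", "A", "PC", "PA", "U"), the function name `termination_action` and the final "BLOCK" result (returned when no rule applies) are illustrative choices, not part of the rules themselves.

```python
def termination_action(sites, vc, va):
    """Pick the new coordinator's action from the collected state information.

    `sites` is a list of (votes, state) pairs for the reachable sites, with
    state in {"C", "A", "PC", "PA", "U"}: committed, aborted,
    prepared-to-commit, prepared-to-abort, uncertain.
    `vc` and `va` are the commit and abort quorums (Vc and Va).
    """
    def votes_in(state):
        return sum(votes for votes, s in sites if s == state)

    states = {s for _, s in sites}
    # Rule a: some site has already committed (aborted).
    if "C" in states:
        return "COMMIT"
    if "A" in states:
        return "ABORT"
    pc, pa, uncertain = votes_in("PC"), votes_in("PA"), votes_in("U")
    # Rule b: prepared-to-commit votes reach the commit quorum.
    if pc >= vc:
        return "COMMIT"
    # Rule c: prepared-to-abort votes reach the abort quorum.
    if pa >= va:
        return "ABORT"
    # Rule d: the commit quorum is reachable with the uncertain sites.
    if pc + uncertain >= vc:
        return "PREPARE-TO-COMMIT"
    # Rule e: the abort quorum is reachable with the uncertain sites.
    if pa + uncertain >= va:
        return "PREPARE-TO-ABORT"
    # Assumption: no rule applies, so this group cannot terminate yet.
    return "BLOCK"
```

After a PREPARE‐TO‐COMMIT or PREPARE‐TO‐ABORT result, the coordinator would collect the new states and call the function again, which corresponds to waiting for condition 2b or 2c.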

(24)

Reliability & Concurrency Control

• Suppose that there is a failure.

• How can we maximize the number of transactions which are executed during this failure by the operational part of the system?

(25)

Nonredundant Databases

• If the database is nonredundant, then it is very simple to determine which transactions can be executed.

(26)

Redundant Databases

• There are two reasons to have redundancy

– To increase locality of reads.

– To increase availability & reliability of system.

• We have seen three main approaches to concurrency control based on 2‐PL

– Write‐locks‐all

– Majority locking

– Primary copy locking.

(27)

Example :‐ Consider a distributed database consisting of

(28)

     Group 1   Group 2   Group 3
A)   1         2, 3      ‐‐‐
B)   2         1, 3      ‐‐‐
C)   3         1, 2      ‐‐‐
D)   1         2         3

(29)

(30)

• Write‐locks‐all.

• Weighted majority locking.

(31)

Determining a Consistent View of the Network

• There are two aspects for this.

– Monitoring the state of the network.

– Propagating new state information to all sites consistently.

• We can use timeouts in the algorithm to discover if a site is down.

• But the use of timeouts may lead to an inconsistent view of the network.

(32)

• We assume that a generalized networkwide mechanism is built such that all higher‐level programs are provided with the following facilities.

1. There is at each site a state table containing an entry for each site. The entry can be up or down.

2. Any program can set a “watch” on any site, so that it receives an interrupt when a site changes state.

• A site considers up only those sites with which it can communicate, therefore all crashed sites and all sites which belong to a different group in case of partitions are considered down.

• We will consider separately the problem of monitoring & propagating state information

(33)

Monitoring the State of the Network

• Generally the basic mechanism for deciding whether a site is up or down is to request a message from it & wait for a timeout.

• Let us call the requesting site the controller & the other site the controlled site.

• In a monitoring algorithm, instead of having the controller request messages from the controlled site, it is easier to have the controlled site send an I‐AM‐UP message periodically to the controller.

• Using this mechanism for detecting whether a site is up or down, the problem consists of assigning controllers to

(34)

• A possible solution is to assign a circular ordering to the sites and to assign to each site the function of controller of its predecessor.

• In the absence of failures, each site periodically sends an I‐AM‐UP message to its successor & checks that the I‐AM‐UP message from its predecessor arrives in time.

• If the I‐AM‐UP message from the predecessor does not arrive in time, then the controller assumes that the controlled site has failed, updates the state table & broadcasts the updated state table to all other sites.

• If the predecessor of the site is down then the site has

(35)

Sites:    . . .   K-3   K-2    K-1    K   . . .
States:           UP    DOWN   DOWN   UP
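The circular monitoring scheme can be sketched as a lookup over the state table. The sketch assumes, as the figure above suggests, that a site whose predecessors are down becomes the controller of its nearest operational predecessor; the function name `controlled_site` and the list-based state table are hypothetical.

```python
def controlled_site(k, states):
    """Return the index of the site that site k has to control.

    `states` is the state table in circular order, with entries "UP" or
    "DOWN". Site k walks backwards along the ring and controls the first
    predecessor that is up. Returns None if no other site is up.
    """
    n = len(states)
    for step in range(1, n):
        candidate = (k - step) % n  # wrap around the circular ordering
        if states[candidate] == "UP":
            return candidate
    return None
```

With the state table `["UP", "DOWN", "DOWN", "UP"]` standing for sites K-3 to K, site K (index 3) skips the two failed sites and controls site K-3 (index 0).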

(36)

Broadcasting a New State

• Each time the monitor function detects a state change, it broadcasts the new state table so that all sites of the same group have the same state table.

• Since this function could be activated by several sites in parallel, some mechanism is needed to control interference.

• A possible mechanism is to attach a globally

(37)

Detection & Resolution of Inconsistency

• When a partition of the network occurs, transactions should be run at most in one group of sites if we want to preserve consistency of the database.

• But in some applications transactions are allowed to run in all partitions where there is at least one copy of the necessary data to get more availability.

• When a failure is repaired, one can try to eliminate inconsistency.

• To do this it is necessary first to discover which portions of the data have become inconsistent (Detection of inconsistency) & then to assign to these portions a value which is most reasonable (Resolution of inconsistency)

(38)

Detection of Inconsistency

• Let us assume that during a partition, transactions have been executed in two or more groups of sites & independent updates may have been performed on different copies of the same fragment.

• The general approach of comparing the contents of copies to check whether they are identical is inefficient & incorrect.

• A correct approach to the detection of inconsistencies can be based on version numbers.

(39)

• During normal operation all copies are master copies & mutually consistent.

• For each copy an Original version number & Current version number are maintained.

• Initially the original version number is set to 0 & the current version number is set to 1.

• Each time an update is performed on the copy, only the current version number is incremented.

• When a partition occurs, the original version number of each isolated copy is set to the value of its current version number.

• The original version number records the current version number of the isolated copies before any “partitioned updates” are performed on it.

• The original version number is not altered until the

(40)

• Example :‐ Let us consider copies x1, x2 & x3 of data item x stored at three different sites.

• Let V1, V2 & V3 be the version numbers.

• Initially all copies are consistently updated.

• Assume that one update is performed, so V1 = (0,2) V2 = (0,2) V3 = (0,2)

• Now a partition occurs separating x3 from the other two copies.

• Let x1 & x2 be the master copies.

(41)

• Suppose that only master copies are updated

V1 = (0,3) V2 = (0,3) V3 = (2,2)

• After repair it is possible to see that x3 has not been modified, since its current & original version numbers are the same.

• In this case, no inconsistency occurred & it is sufficient to perform the updates on x3.

• Now suppose that only x3 is updated during partition V1 = (0,2) V2 = (0,2) V3 = (2,3)

• Since the original version number of x3 is equal to the current version number of x1 & x2, the master copies have not been updated.

• If there are no other copies then we can apply to the master copies the updates of x3
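The comparison of version numbers after repair can be sketched as follows. The names `Version` and `classify` are hypothetical, and the last case (both sides updated, reported as an inconsistency) is an assumption that goes beyond the two cases worked through above.

```python
from dataclasses import dataclass


@dataclass
class Version:
    original: int  # set to the current number when the copy is isolated
    current: int   # incremented on every update of the copy


def classify(master: Version, isolated: Version) -> str:
    """Compare a master copy with a formerly isolated copy after repair."""
    # The isolated copy changed if its current number moved past the
    # value recorded at partition time.
    isolated_changed = isolated.current != isolated.original
    # The master changed if its current number differs from the value
    # the isolated copy recorded when the partition occurred.
    master_changed = master.current != isolated.original
    if not master_changed and not isolated_changed:
        return "consistent"
    if master_changed and not isolated_changed:
        return "apply master updates to isolated copy"
    if isolated_changed and not master_changed:
        return "apply isolated updates to master copies"
    # Assumption: independent updates on both sides are an inconsistency.
    return "inconsistent"
```

Using the numbers from the example, `classify(Version(0, 3), Version(2, 2))` reports that the master updates should be applied to x3, and `classify(Version(0, 2), Version(2, 3))` reports that the updates of x3 should be applied to the master copies.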

(42)

Checkpoints & Cold Restart

• Cold restart is required after some catastrophic failure which has caused the loss of log information on stable storage.

• In DDB, cold restart is difficult because if one site has to establish an earlier state, then all other sites also have to establish an earlier state.

• The recovery process is global, affecting all sites of the database.

• A consistent global state C is characterized by the following properties.

– For each transaction T, C contains the updates performed by all subtransactions of T at any site or it does not contain any of them.

– If a transaction T is contained in C, then all conflicting

(43)

• The simplest way to reconstruct a global consistent state in a DD is to use local dumps, local logs & global checkpoints.

• A global checkpoint is a set of local checkpoints which are performed at all sites of the network & are synchronized by the condition “If a subtransaction of a transaction T is contained in the local checkpoint at some site, then all other subtransactions of T must be contained in the

(44)

• If global checkpoints are available then the reconstruction problem is solved as follows.

• At the failed site the latest local checkpoint which can be considered safe is determined.

• This determines which earlier global state has to be reconstructed.

• Then all other sites are requested to reestablish the local states of the corresponding local checkpoints.
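The three steps above can be sketched in a few lines. The sketch assumes that global checkpoints are identified by increasing integers, so that "latest" means "largest"; the function name `plan_cold_restart` and its parameters are hypothetical.

```python
def plan_cold_restart(safe_checkpoint_ids, sites):
    """Choose the global checkpoint to which every site must be reset.

    `safe_checkpoint_ids` lists the global checkpoints whose local
    checkpoint at the failed site can still be considered safe.
    `sites` lists all sites of the database, the failed one included.
    Returns a mapping from each site to the checkpoint it must reestablish.
    """
    if not safe_checkpoint_ids:
        raise ValueError("no safe local checkpoint at the failed site")
    # Step 1 and 2: the latest safe local checkpoint at the failed site
    # determines the earlier global state to be reconstructed.
    target = max(safe_checkpoint_ids)
    # Step 3: all sites reestablish the corresponding local checkpoint.
    return {site: target for site in sites}
```

For example, if checkpoints 1, 2 and 4 are safe at the failed site, all sites are asked to reestablish their local checkpoint belonging to global checkpoint 4.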

(45)

• The three possible solutions are

1. To find less expensive ways to record global checkpoints, so called loosely synchronized checkpoints. All sites are asked by a coordinator to record a global checkpoint.

2. To avoid building global checkpoints at all, let the recovery procedure take the responsibility of reconstructing a consistent global state at cold restart.

3. To use the 2‐Phase‐Commitment protocol for guaranteeing that the local checkpoints created by

(46)

DDB Administration

• It deals with a variety of activities for development, control, maintenance & testing of the software of database applications.

• An important issue in database administration is the degree of site autonomy.

1. Absence of Local Autonomy :‐ The functions of a global DBA are similar to those of a centralized DBA.

(47)

Catalog Management in DDB

• Catalogs are used for

1. Translating applications :‐ Data referenced by applications at different levels of transparency are mapped to physical data.

2. Optimizing Applications :‐ Data allocation, access methods available at each site & statistical information are required for producing access plans.

(48)

Content of Catalog

1. Global Schema Description

2. Fragmentation Description

3. Allocation Description

4. Mapping to Local Names

5. Access Method Description

6. Statistics on the Database

(49)

Distribution of Catalog

• Catalogs can be allocated in DDB in many different ways. The basic ways are

1. Centralized Catalogs.

2. Fully Replicated Catalogs.

3. Local Catalogs.

(50)

Object Naming & Catalog Management with Site Autonomy

• The major requirement is to allow each local user to create & name his data independently as well as allowing several users to share data. Therefore

– Data definition should be performed locally.

– Different users should be able to give the same name to different data.

(51)

1. Systemwide Names

• The unique name given to each object in the system consists of

1. ID of the user who creates the object.

2. The site of that user.

3. The object name.

4. The birth site of the object.

User_1@BVRIT.STU@HYD

2. Print Names :‐ are shorthand names for
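The four components of a systemwide name can be split apart with a small parser. The sketch assumes the layout user@user_site.object_name@birth_site, inferred from the component list and the example User_1@BVRIT.STU@HYD; the names `SystemwideName` and `parse_systemwide_name` are hypothetical.

```python
from typing import NamedTuple


class SystemwideName(NamedTuple):
    user: str         # ID of the user who created the object
    user_site: str    # site of that user
    object_name: str  # the object name
    birth_site: str   # birth site of the object


def parse_systemwide_name(name: str) -> SystemwideName:
    """Split an assumed 'user@user_site.object_name@birth_site' string."""
    user, rest = name.split("@", 1)            # text before the first '@'
    middle, birth_site = rest.rsplit("@", 1)   # text after the last '@'
    user_site, object_name = middle.split(".", 1)
    return SystemwideName(user, user_site, object_name, birth_site)
```

Applied to the example above, the parser yields user `User_1`, user site `BVRIT`, object name `STU` and birth site `HYD`. A string that does not follow the assumed layout raises a `ValueError`.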

(52)

Authorization & Protection

• Site‐to‐Site Protection.

• User Identification.

• Enforcing Authorization Rules.