• Tidak ada hasil yang ditemukan

PDF Virtualisation for the Masses

N/A
N/A
Protected

Academic year: 2023

Membagikan "PDF Virtualisation for the Masses"

Copied!
37
0
0

Teks penuh

(1)

Virtualisation for the Masses:

Exposing KVM on Android

Will Deacon <[email protected]>

KVM Forum, October 2020

(2)

Introduction

Who am I and what am I hacking on?

० Active upstream kernel developer, co-maintaining the arm64 architecture port, locking, atomics, memory model, TLB, SMMU, ...

० Joined Android Systems Team at Google last year ० Leading the “Protected KVM” project to enable KVM

on Android

Top contributors to KVM/arm64 for 5.9 and 5.10

Lots more to come (seems to be a hot topic)

Disclaimer: very much a work-in-progress!

Upstreaming as we go.

(3)

The state of modern

Android

(4)

Generic Kernel Image (GKI)

● Problem: Separate kernel for each device does not scale and leads to fragmentation:

○ Difficult/expensive to provide updates

○ In-field release upgrades can be impossible

○ Bad for upstream

● GKI aims to maintain subset of kernel ABI within a given Android release and kernel version (e.g.

android11-5.4)

○ GKI branches forked from android-mainline

■ Close to upstream

■ Updated with regular LTS merges

○ Vendors/OEMs can provide modules

https://source.android.com/devices/architecture/kernel/generic-kernel-image

(5)

tl;dr: It’s the Wild West of fragmentation

When present, the hypervisor is treated as part of the device firmware and is typically supplied by the SoC vendor or OEM:

Security enhancements for protecting the kernel

“Mitigations are attack surface, too”

- Jann Horn, Project Zero

Coarse-grained memory partitioning between devices using basic IOMMU-like hardware

Running code outside of Android

Most of the time, there aren’t even any virtual machines!

Virtualisation on Android today

(6)

Security and functionality both lose out

Security

Increased TCB and difficulty/cost in

providing

streamlined updates across devices

Unable to leverage hardware virtualisation capabilities from

within Android

Functionality

(7)

The Armv8 exception model on paper

“bit on the bus”

App

Kernel

Hypervisor

App App

Kernel

App Trusted App

Trusted OS

Trusted Partition Manager / Hypervisor *

Trusted App

Trusted App *

Trusted OS *

Trusted App *

Firmware / Secure monitor EL0

EL1 EL2

EL3

Non-secure (“normal”) world Secure world

* From Arm v8.4A

(8)

The Armv8 exception model sorted by privilege

App

Kernel

Hypervisor

App App

Kernel App

Trusted App

Trusted OS

Trusted Partition Manager / Hypervisor *

Trusted App

Trusted App *

Trusted OS *

Trusted App *

Firmware / Secure monitor EL0

EL1 EL2

EL3

Non-secure worldSecure world

sEL0 sEL1

sEL2 Increasing privilege

(9)

Mapping this to a modern Android system

App

Android Kernel (GKI)

Hypervisor

App App

Kernel App

Trusted App

Trusted OS

Trusted Partition Manager / Hypervisor *

Trusted App

Trusted App *

Trusted OS *

Trusted App *

Firmware / Secure monitor EL0

EL1 EL2

EL3

Secure world

sEL0 sEL1

sEL2 Increasing privilege

Non-secure world

(10)

Mapping this to a modern Android system

App

Android Kernel (GKI)

Hypervisor

App App

Kernel App

Trusted App

Trusted OS

Trusted Partition Manager / Hypervisor *

Trusted App

Trusted App *

Trusted OS *

Trusted App *

Firmware / Secure monitor EL0

EL1 EL2

EL3

Secure world

sEL0 sEL1 sEL2

● Android kernel (GKI)

● Vendor modules

● System and libraries

● Apps

● “Android”

Increasing privilege

Non-secure world

(11)

Mapping this to a modern Android system

App

Android Kernel (GKI)

Hypervisor

App App

Kernel App

Trusted App

Trusted OS

Trusted Partition Manager / Hypervisor *

Trusted App

Trusted App *

Trusted OS *

Trusted App *

Firmware / Secure monitor EL0

EL1 EL2

EL3

Secure world

sEL0 sEL1 sEL2

● DRM, crypto, ...

● Third party OSes

● Opaque blobs

● Per-device integration

Increasing privilege

Non-secure world

(12)

Trusted software

Definitions of trust (verb):

“To feel that something is safe and reliable”

“To expect, hope or suppose”

Android hopes that this software isn’t malicious or compromised.

We need a way to de-privilege third party code and provide a portable environment in which to isolate services from each other and also from the rest of Android.

(13)

Virtualisation to the rescue?

App

Android Kernel (GKI)

Hypervisor

App App

Kernel App

Trusted App

Trusted OS

Trusted Partition Manager / Hypervisor *

Trusted App

Trusted App *

Trusted OS *

Trusted App *

Firmware / Secure monitor EL0

EL1 EL2

EL3

Secure world

sEL0 sEL1 sEL2

What about that hypervisor?

Increasing privilege

Non-secure world

(14)

Virtualisation to the rescue?

App

Android Kernel (GKI)

Hypervisor

App App

Kernel App

Trusted App

Trusted OS

Trusted Partition Manager / Hypervisor *

Trusted App

Trusted App *

Trusted OS *

Trusted App *

Firmware / Secure monitor EL0

EL1 EL2

EL3

Secure world

sEL0 sEL1 sEL2

What about that hypervisor?

We can leverage the GKI effort in order to deploy KVM as the common hypervisor.

Increasing privilege

Non-secure world

(15)

Virtualisation on arm64

(and how it’s used by KVM)

(16)

Availability

Virtualisation checklist

All arm64 Android devices have support for hardware virtualisation and a two-stage MMU.

Isolation Stage-2 page-tables provide memory isolation

Security Move third-party code out of EL2/Trustzone and into non-secure virtual machines

Portability

Implement a common hypervisor in Android enabling new applications and virtual machines that require confidentiality of data and integrity of computation.

(17)

KVM on arm64

० Supported upstream on arm64 since v3.11 ० Host kernel may reside at either EL1 or EL2:

० Threat model places the entire host kernel (and VMM via ioctl()s) into the TCB; host has full access to guest memory.

This is a bit like “inverse Trustzone”

VMM

Host kernel

“World switch”

App App

Guest kernel EL0 App

EL1 EL2

VMM

Host kernel

App App

Guest kernel EL0 App

EL1 EL2

nVHE (v8.0) VHE (v8.1)

Blazingly fast!!1!

(18)

Big problem!

The threat model of Android is not aligned with the

current design of KVM.

(19)

Revisiting nVHE with “Protected KVM”

Android’s security model requires that guest data remains private even if the host kernel has been compromised. Maybe nVHE isn’t so bad after all...

० Extend world-switch code at EL2 to manage stage-2 page-tables and guest state

० Install a stage-2 translation for the host kernel during boot before loading vendor modules ० Message passing between host and VM

० Template bootloader which accepts only signed VM images ० Formal verification techniques to reason about EL2 code Q: Why not run Android in a VM instead?

(20)

AVB

Generic

Kernel Vendor

Modules GKI Modules

Generic Kernel and Hypervisor

Generic Kernel

Protected KVM Protected KVM

1. Initialization 2. Boot 3. Runtime

De-privilege Kernel

Load Modules

Kernel uptime

Legend: Vendor Specific AOSP AVB - Android Verified Boot Verified

Boot

(21)

The taming of EL2

(22)

Executing at EL2

The KVM nVHE EL2 environment is a pretty horrible place: it has its own limited virtual address space and cannot run general kernel code:

० Not preemptible/interruptible and unable to block/schedule ० Can access all of normal memory if mapped

० Very limited device access; typically no console

० Basically just context-switches EL1 and allows host kernel to run functions with elevated privilege

० Tight coupling with host kernel is optimal for KVM’s threat model

Prior to 5.9, Linux offered #define kvm_call_hyp(f, ...) to run kernel functions annotated with __hyp_text at EL2.

(23)

Executing at EL2 (< 5.9)

// C code to run at EL2 (arch/arm64/kvm/hyp/tlb.c) void __hyp_text __kvm_flush_vm_context(void)

{

dsb(ishst);

__tlbi(alle1is);

if (icache_is_vpipt())

asm volatile("ic ialluis");

dsb(ish);

}

// EL2 entry dispatcher kern_hyp_va x0

do_el2_call // Indirect call to arbitrary address!!!

// EL1 hypercall (HVC #0)

#define kvm_call_hyp(f, ...) __kvm_call_hyp(kvm_ksym_ref(f), ##__VA_ARGS__);

// Callsite

kvm_call_hyp(__kvm_flush_vm_context);

(24)

The EL2 object in 5.9/5.10

New threat model needs EL2 code to be self-contained & safe against compromised host kernel:

● Embed EL2 payload using separate ELF sections and symbol prefixing (similar to EFI stub)

● Fixed set of hypercalls rather than arbitrary function pointers

● Prior to de-privilege, host sets static keys and applies alternatives (one way switch)

● Following de-privilege, EL2 object no longer mapped for EL1

“Who needs namespaces when you have underscores?”

$ aarch64-linux-gnu-objdump -t -j .hyp.text arch/arm64/kvm/hyp/nvhe/kvm_nvhe.o SYMBOL TABLE:

0000000000002000 g F .hyp.text 00000000000000b4 __kvm_nvhe___host_exit

00000000000019e0 g F .hyp.text 00000000000000c0 __kvm_nvhe___kvm_tlb_flush_vmid_ipa 0000000000004288 g F .hyp.text 0000000000000048 __kvm_nvhe___vgic_v3_deactivate_traps 0000000000000048 g F .hyp.text 00000000000000f0 __kvm_nvhe___sysreg_save_state_nvhe 00000000000043d0 g F .hyp.text 0000000000000044 __kvm_nvhe___vgic_v3_init_lrs

0000000000001b30 g F .hyp.text 0000000000000034 __kvm_nvhe___kvm_flush_vm_context

Symbol aliases created from “allowlist” of kernel symbols for use at EL2.

(25)

Executing at EL2 (5.9/5.10)

// C code to run at EL2 (arch/arm64/kvm/hyp/nvhe/tlb.c) void __kvm_flush_vm_context(void)

{

[...]

}

// EL2 entry dispatcher (now in C!) switch (func_id) {

case KVM_HOST_SMCCC_FUNC(__kvm_flush_vm_context):

__kvm_flush_vm_context();

break;

// EL1 hypercall (HVC #0)

#define __KVM_HOST_SMCCC_FUNC___kvm_flush_vm_context 2 arm_smccc_1_1_hvc(KVM_HOST_SMCCC_FUNC(f), ##__VA_ARGS__, &res);

#define kvm_call_hyp(f, ...) kvm_call_hyp_nvhe(f, ##__VA_ARGS__);

// Callsite

kvm_call_hyp(__kvm_flush_vm_context);

(26)

Virtual memory at EL2 (without pKVM)

Today, the host kernel is trusted and therefore in control of the hypervisor virtual memory:

● Hypervisor stage-1 mappings created by the host

Hypervisor pages also mapped by the host linear mapping

KVM data structures (e.g struct kvm) mapped directly to EL2

More of the kernel gets mapped in over time! (no hyp_unmap())

● Homebrew per-cpu implementation

Local CPU only

Directly reuses host per-cpu region

● Guest stage-2 page-tables also managed by the host kernel

Blindly installed by EL2 during VM world switch

● All page-tables constrained by host page-table configuration

Makes it trivial for a compromised host kernel to bypass new hypervisor restrictions.

(27)

EL2 MM bootstrap (5.11?)

Allowing the host kernel to manipulate these page-tables breaks the revised security model:

● Page-tables must only be accessible to EL2

○ Stand-alone page-table walker merged for 5.10

http://lkml.kernel.org/r/[email protected]

● Memory allocator needed for page-table pages

Hypervisor carveout donated from the host during boot

Trying to keep things as simple as possible

Host bootstraps EL2 prior to de-privilege

EL2 then transitions off temporary page-tables

https://android-kvm.googlesource.com/linux/+/refs/heads/pkvm

● Instantiate new per-cpu implementation at EL2 (also merged for 5.10)

https://lkml.kernel.org/r/[email protected]

● IOMMUs need to be kept in sync with the stage-2 page-tables

(28)

EL2 MM bootstrap (5.11?)

#define hyp_vmemmap ((struct hyp_page *)__hyp_vmemmap)

struct hyp_pool { nvhe_spinlock_t lock;

struct list_head free_area[HYP_MAX_ORDER + 1];

};

// Buddy allocator for page allocation at EL2

void *hyp_alloc_pages(struct hyp_pool *pool, gfp_t mask, unsigned int order);

// kvm_hyp_setup() [Complete bootstrap and de-privilege host]

kvm_call_hyp_nvhe(__kvm_hyp_setup, base, kern_hyp_va(__va(base)), size, kvm_get_bp_vect_pa(), num_possible_cpus(),

kern_hyp_va(per_cpu_base));

free_hyp_pgds();

// cpu_init_hyp_mode() [Install host vectors with temporary MMU setup]

vector_ptr = kern_hyp_va(kvm_ksym_ref(__kvm_hyp_host_vector));

arm_smccc_1_1_hvc(KVM_HOST_SMCCC_FUNC(__kvm_hyp_init), pgd_ptr, tpidr_el2, hyp_stack_ptr, vector_ptr, &res);

(29)

Stage-2 for the host

० Memory will disappear from the host as it is assigned to a guest

“KVM protected memory extension” from Kirill Shutemov

Inject stage-1 abort or treat as RAZ/WI?

० IOMMU support

Unfortunate reliance on SoC design and sensible hardware

० Kernel self protection?

Looks to the host like the permissions have changed for physical memory

Allow host to change permissions for RW memory it owns?

० VMM will not be able to access guest state (including CPU registers and memory)

Negotiate shared memory regions with guest for virtio

Q: How is a guest initialised to begin with?

(30)

Template bootloader

Requiring VM images to reside in pre-populated carve-outs doesn’t scale…

० … but we also need to ensure that guest payloads haven’t been tampered with by the host

० Small first-stage bootloader installed in carve-out memory during host boot

Exploring bare-metal rust implementation

० Accessible only to EL2 and used as initial entry point for protected guests

Performs signature check on guest payload

Q: Does this really need to be arm64-specific?

(31)

The virtual platform

(32)

Adapt crosvm as the VMM

Reuse the Chrome OS Virtual Machine Monitor:

● Part of ChromeOS and now included in AOSP

● Lots of related talks at KVM Forum!

● Modern codebase written in Rust

● Focus on security and sandboxing

● Many virtio devices implemented

● Cross-architecture (surprisingly important!)

○ https://source.android.com/setup/create/cuttlefish

We provide a fairly basic arm64 virtual platform:

● Fixed memory map

● CPUs onlined via PSCI calls

● Arm architected timer

● RNG/entropy service

PV interrupt controller (rVIC) [Marc Zyngier’s talk]

● What about I/O?

(33)

Just use virtio, stupid!

Virtio is the best thing since sliced bread and we should just use it for everything.

Job done?

० Strong desire to avoid changes to the spec

MMIO traps to the hypervisor

Re-use existing device/driver implementations

० Guest must use crypto (e.g. fs-verity) as host can intercept data due to lack of hardware memory encryption

० No shared-memory device?

○ Virtio assumes guest memory is shared with the host

https://www.kernel.org/doc/html/latest/filesystems/fsverity.html

(34)

Bounce-buffering via shared windows

The Virtio 1.1 specification introduces the VIRTIO_F_ACCESS_PLATFORM reserved feature bit:

“[...] indicates that the device can be used on a platform where device access to data in memory is limited and/or translated.”

When set, causes a Linux guest to use the DMA API for virtio allocations.

० We can force the use of bounce-buffers by passing swiotlb=force ० We then just need to allow the host to access the bounce buffer pages

Expose SHARE/UNSHARE hypercalls to the guest to update host stage-2.

Hook the set_memory_{decrypted,encrypted}() API to share/unshare bounce buffer pages

(35)

Zero-copy transfers using shared memory

Bounce buffers force the copying of all I/O data through a shared window:

० This is fine for many use-cases, but introduces undesirable overhead/incompatibility for others

e.g. Binder shared memory

० Need a handshake to share memory from host to guest

Don’t want the guest to have access to all of host memory

Don’t want the host to silently change guest stage-2

० Arm FF-A specification aims to solve this problem but is fairly heavyweight and not cross-architecture

Q: What should we be using here?

(36)

What’s next?

Still loads to do:

Complete mm bootstrap

● Stabilise user ABIs

● Settle on solution for zero-copy I/O

● Move more guest state up to EL2

● Memory poisoning

● SMC proxying

● Attestation

● Ballooning

● Integration with rest of Android

● Continue upstreaming...

(37)

Questions?

<[email protected]>

Referensi

Dokumen terkait

As the sentiment against the stage was growing, the Corporation of the City, goaded by the puritan moralists, attempted to convert their power of regulating plays into a... power

1053 PROCUREMENT OFFICE POSTING CERTIFICATION This is to certify that the Bulacan State University has posted its FY 2023 APP on Bulacan State University website and can be