Cloud Native Applications Profiling using a Graph Neural Networks Approach

Item Type Conference Paper

Authors Boukhtouta, Amine; Madi, Taous; Pourzandi, Makan; A., Hyame Alameddine

Citation Boukhtouta, A., Madi, T., Pourzandi, M., & A., H. A. (2022). Cloud Native Applications Profiling using a Graph Neural Networks Approach. 2022 IEEE Future Networks World Forum (FNWF).

https://doi.org/10.1109/fnwf55208.2022.00046

Eprint version Post-print

DOI 10.1109/fnwf55208.2022.00046

Publisher IEEE

Rights This is an accepted manuscript version of a paper before final publisher editing and formatting. Archived with thanks to IEEE.

Download date 2023-12-02 20:17:52

Link to Item http://hdl.handle.net/10754/690258


Cloud Native Applications Profiling using a Graph Neural Networks Approach

1st Amine Boukhtouta
GFTL Security Research, Ericsson, Montreal, Canada
[email protected]

2nd Taous Madi
Resilient Computing and Cybersecurity Center, KAUST University, Jeddah, Saudi Arabia
[email protected]

3rd Makan Pourzandi
GFTL Security Research, Ericsson, Montreal, Canada
[email protected]

4th Hyame Alameddine A.
GFTL Security Research, Ericsson, Montreal, Canada
[email protected]

Abstract—The convergence of telecommunication and industry operational networks towards cloud native applications has enabled the idea of integrating protection layers to harden the security posture and management of cloud native deployments. In this paper, we propose a data-driven approach, based on a graph neural network, to support the detection of anomalies in cloud native applications. The essence of the profiling lies in capturing interactions between different perspectives in cloud native applications through a network dependency graph and transforming it into a computational graph neural network. The latter is used to profile different deployed assets such as micro-service types, workloads' namespaces, worker machines, management and orchestration machines, as well as clusters. As a first phase of the profiling, we consider fine-grained profiling of micro-service types with an emphasis on network traffic indicators. These indicators are collected on distributed Kubernetes (K8S) deployment premises. Experimental results show a good trade-off between accuracy and recall for micro-service type profiling (around 96%). In addition, we use prediction entropy scores to infer anomalies in testing data. These scores allow us to segregate benign from anomalous graphs, identifying 19 out of 23 anomalies. Moreover, by using entropy scores, we can conduct a root cause analysis to infer problematic micro-services.

Index Terms—anomaly detection, profiling, cloud native applications, Kubernetes (K8S), Graph Neural Networks (GNNs).

I. INTRODUCTION

As 5G advances toward a thoroughly flexible, technology-enabling Platform-as-a-Service (PaaS), it delivers an outstanding performance package, including throughput and latency, and unlocks value in cloud native architectures. Therefore, operators are shifting telecommunication network operations toward a diverse medley of services, going beyond mobile connectivity, messaging and voice. This shift encompasses a continuous adoption of virtualization, establishing an underpinning cornerstone for flexible and market-timely services while tuning optimal resource usage.

Virtualization is fundamental to meeting network operators' needs for adaptability and orchestration of emerging service demands. The adoption of virtualization has been expanding towards the cloud-native paradigm, where virtual network functions are considered virtual application assets instead of fully-fledged virtual operating systems hosting diverse applications.

Cloud native applications are amenable to distribution, decentralization and flexible points of presence in the cloud ecosystem. Such evolution unlocks a reduction of total cost of ownership through reduced time-to-market, swift innovation and efficient, dynamic enabling of services on demand.

Fulfilling the benefits of virtualization requires the establishment of automatic control and management capabilities. 5G networks are meant to support diverse scenarios and business applications while increasing the complexity of network management. Therefore, there is a requirement to monitor autonomous or automatic behaviors of cloud native applications. The management layer is of paramount importance, as it allows network operators to tune service customization as well as customer personalization. The last feature, namely protection, is coined to deploy different security controls to counter sophisticated attacks driven by external or insider malicious threats in virtual environments. The main target is the ability to prevent, detect and mitigate them. Security is no longer the setup of black-box hardware but a solid axiom of a sustainable notion of trust, needed to maintain system performance, integrate different business models (including critical ones) and fulfill the user experience.

Given these facts, there is a keen interest in integrating intelligent security controls to harden the security posture of cloud native applications. Motivated by standardization efforts like the ETSI NFV Security Management and Monitoring specification [1] and the 3GPP Network Data Analytics Function (NWDAF) [2], we aim to integrate data-driven intelligence through machine learning algorithms to support security in 5G virtual networks. We leverage data analytics to render a security baseline for detecting anomalies. These anomalies are fed to the cloud management layer to initiate a root cause analysis for diverse security problems.

An important aspect of modeling cloud native applications is profiling the interactions of diverse micro-services in generic network deployments, specifically 5G networks. In this paper, we tackle the following research questions:

How to sketch a data-driven approach to improve the security posture of cloud native applications?

How to model the environment in a way that properly captures its dynamically changing and interacting components?

How to use the security baseline to detect disturbances and attacks on cloud native applications?

To answer the aforementioned questions, we consider a neural network approach. The democratization of neural networks and the underlying deep learning is sketched through multi-layer computational networks that handle diverse complex patterns illustrated through structured or unstructured data. Characteristically, the desideratum for graph neural networks (GNNs) [3] arose mainly from their ability to handle heterogeneous and unstructured data. To extract sense out of such unstructured data and derive inference from it, there is a need for a modeling that emulates relationships between data points. GNNs take advantage of the tremendous success of modeling unstructured data through graph data structures, which define relationships between nodes (entities). A node is defined by its intrinsic data point as well as its connections with neighbors. GNNs have many real-life applications, since their architectures parallel the irregularity of data. For the sake of illustration, research threads like social network analysis [4], [5], citation networks and knowledge graphs [6], molecular interactions (e.g., protein-protein interactions) [4], [5] and recommender systems [7] have adopted GNNs. Therefore, the ability of GNNs to handle highly unstructured data is of preponderant significance, introducing an amalgamation of graphs and machine learning that holds high potential in diverse ML/AI related topics.

In this paper, we model interactions between micro-services deployed on a K8S target application through a network graph (a dependencies network graph) illustrating near real-time interactions. The target application integrates a logging functionality in 5G networks. The dependencies network graph is captured periodically and trained through GNNs to profile micro-services, IP addresses, K8S management machines and different worker machines. Taking advantage of GNNs' ability to capture network graph structures, we build computational graphs to characterize the TCP network activities of a K8S cloud native target application. By doing so, we capture profiles of different deployed entities as a security ground truth, where we apply training and validation phases to profile the different types of micro-services. The underlying F1 scores and loss convergence of the GNN models are used to gauge the ability to predict micro-services' types as well as the trade-off between accuracy and recall. Then, we test anomaly detection on the GNN models using a test set consisting of a mix of benign network data and injected network anomalies, where prediction entropy scores are used to infer problematic graphs among benign ones.
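To make the entropy-based criterion concrete, the sketch below computes the Shannon entropy of softmax prediction vectors and averages it over a graph snapshot; the probability values and the averaging scheme are illustrative assumptions, not the paper's exact scoring, but they show why uncertain predictions push anomalous graphs above benign ones.

```python
import math

def prediction_entropy(probs):
    """Shannon entropy (in bits) of a softmax prediction vector."""
    return -sum(p * math.log2(p) for p in probs if p > 0.0)

def graph_entropy_score(node_predictions):
    """Average per-node prediction entropy over a dependency graph snapshot."""
    return sum(prediction_entropy(p) for p in node_predictions) / len(node_predictions)

# Hypothetical predictions: confident (benign-looking) vs. uncertain (anomalous-looking).
benign = [[0.97, 0.01, 0.01, 0.01], [0.95, 0.03, 0.01, 0.01]]
anomalous = [[0.30, 0.25, 0.25, 0.20], [0.40, 0.30, 0.20, 0.10]]

# Anomalous graphs yield higher average entropy and can be thresholded out.
assert graph_entropy_score(benign) < graph_entropy_score(anomalous)
```

Ranking per-node entropies within a flagged graph is what enables the root cause analysis mentioned in the abstract: the micro-services whose predictions are most uncertain are the first suspects.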

For training, validation and testing, we use the different aggregation algorithms defined in GraphSAGE [8], namely Graph Convolution Network (GCN), mean averaging, mean and max pooling, as well as Long Short-Term Memory (LSTM). A comparative analysis is applied to these algorithms.

In summary, our contributions in this paper are as follows:

To the best of our knowledge, this is the first attempt to use GNNs to profile a cloud native application deployed over an orchestration tool, i.e., K8S.

A novel micro-service-aware GNN modeling considering different training aggregation algorithms, namely convolution (GCN), mean averaging (Mean), pooling (Max-pool, Mean-pool) and Long Short-Term Memory (LSTM).

A methodology to infer anomalies out of GNNs’ profiles using classification and convergence metrics.

The rest of the paper is organized as follows: Section II provides relevant background on how GNNs are applied to profile a target cloud native application. Section III puts forward the anomaly scenarios injected into the data. Experimental evaluation results are described in Section IV. Section V discusses key related works, and Section VI concludes the paper along with future research directions.

II. CLOUD NATIVE APPLICATION PROFILING

A. General Approach

In our work, we aim to transform cloud application workloads' inner workings and interactions into intelligence.

A workload describes a set of actions done at the system and/or network levels by any entity acting to fulfill a certain execution context: containers, K8S management and worker machines, and, by extension, distributed clusters. By intelligence, we mean: (1) the ability to turn workloads, seen from different perspectives, into a security baseline, and (2) using it as a reference point to detect misbehavior of cloud native applications. The security baseline is defined by turning running workloads into digital fingerprints, expressed in three forms: a workload (service) as a profile, a type of workload (type of service) as a profile, and a set of workloads (namespaces, worker machines, clusters) as a profile. Profiles can be gathered from different perspectives, including clusters, worker machines (masters and slaves), namespaces and micro-services. These profiles are used as references to detect deviations, such as disturbances, mis-configurations or attacks, commonly referred to as anomalies. Detected anomalies can be ingested by the security posture management layer. If the events are identified, a mitigation through security orchestration can be applied; if they are not, a root cause analysis through Security Information and Event Management (SIEM) can be performed.

To do so, we characterize workloads with a set of attributes based on running system observations, including resource usage, communicating peers and the orchestrator's workload meta-information. As a first attempt in our research, we mainly consider profiling TCP-based connections, characterized by the 10 attributes listed in Table I.

Attribute                          Description
Traffic Intensity                  Sum of incoming and outgoing traffic of a node [bytes/chunk-length]
Degree                             Number of the node's neighbors (communicating peers)
Top 5 intensities (5 attributes)   Traffic intensities on the 5 heaviest edges (ordered) [bytes/chunk-length]
TCP connects                       Number of TCP connects (the node is either client or server in the relation)
TCP accepts                        Number of TCP accepts (the node is either client or server in the relation)
TCP closes                         Number of TCP connection closes (the node is either client or server in the relation)

TABLE I: WORKLOAD ATTRIBUTES
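As a minimal sketch of how the 10 attributes of Table I could be assembled per node, the function below folds flow records into one attribute vector. The flow-record layout `(src, dst, bytes, connects, accepts, closes)` is an illustrative assumption, not the paper's collection format.

```python
from collections import defaultdict

def workload_attributes(node, flows):
    """Build the 10 TCP attributes of Table I for one node.

    flows: list of (src, dst, nbytes, connects, accepts, closes) records.
    Returns [intensity, degree, top-5 intensities..., connects, accepts, closes].
    """
    intensity = 0
    per_peer = defaultdict(int)            # bytes exchanged with each neighbor
    connects = accepts = closes = 0
    for src, dst, nbytes, con, acc, clo in flows:
        if node in (src, dst):
            intensity += nbytes            # incoming + outgoing traffic
            peer = dst if src == node else src
            per_peer[peer] += nbytes
            connects += con
            accepts += acc
            closes += clo
    top5 = sorted(per_peer.values(), reverse=True)[:5]
    top5 += [0] * (5 - len(top5))          # pad when fewer than 5 peers
    return [intensity, len(per_peer), *top5, connects, accepts, closes]

flows = [("A", "B", 100, 1, 0, 1), ("C", "A", 50, 0, 1, 0)]
vec = workload_attributes("A", flows)
assert len(vec) == 10 and vec[0] == 150 and vec[1] == 2
```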

B. Workloads Towards Intelligence

In cloud native applications, different services defined on pods interact to implement a certain business logic. It is useful to lift the workload concept to enclose different levels of granularity. In K8S, considering a micro-service architecture, levels can be mapped to micro-service instance, micro-service type, namespace (including the management one), worker machines and clusters. Clusters consist of worker nodes, to which pods are mapped and controlled through pod labels or affinity rules. If the mapping between pods and worker machines is well-defined, it is meaningful to consider worker machines as workloads too.

C. Cloud Native Workloads Observability

Observability mechanisms in the operating system provide means to collect information about running processes. The latter might encompass system call statistics and resource usage of different kinds. In virtualized systems, it might be possible to collect running workloads' data from the hypervisor layer.

Network communication between workloads can also be inspected with a multitude of tools (SystemTap [9], sysdig [10], etc.). Collected data might encompass communicating peers (IP addresses, ports), protocols, flow summaries, fault statistics, etc. Workload-related metadata defined or maintained within the orchestrator can be accessed via management interfaces. For example, in K8S, such data is accessible via the "K8S API server". Collected data can include labels, container image properties, data denoting the type of the workload/micro-service, the K8S namespace, assigned IP addresses, etc. None of these approaches requires modifying or instrumenting the application/workload code. It is possible to passively monitor the workloads' behavior with the necessary granted permissions.

D. Solution

Given the different points of presence constituting a distributed cloud native application, deployed agents collect workload features as well as dependencies to shape and build dependency graphs. This step gathers a ground truth for the training phase. Dependency graphs can be built for a certain view or perspective (e.g., micro-service instances, types, clusters and worker machines). In addition, building dependencies can either compose sub-graphs into a bigger one or take snapshots of sub-graphs' dependencies collected periodically. Any dependency graph is used to build a neural network as a set of computational graphs that characterizes workload interactions. The trained computational graphs can be stored as graph embeddings, which represent a workload type classifier. The embeddings can be indexed based on the training workload perspective. Figure 1 depicts the solution overview.

A workload dependencies graph G = (V, E) describes interactions between workloads. V is a set of vertices representing workloads, indexed per perspective (services, types of services, namespaces, worker machines or clusters). E is a set of dependencies between workloads (undirected edges). Workloads v_i and v_j are dependent if they interact with each other through system operations or network communications. For instance, a workload from a certain service may need to send TCP or UDP data to a workload in another service; as such, we can build a network dependency between these workloads. As another example, a workload can create a memory pipe from which another workload consumes data; therefore, memory attributes can be considered for each workload. As a last example, a workload can create a file on disk to which another workload writes; as such, input/output file attributes can be considered. Figure 2 depicts a TCP-based workload dependencies graph snapshot in a target cloud native application.
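The construction of G = (V, E) from observed communications can be sketched as follows; the pair-of-identifiers input format is an assumption for illustration, with identifiers styled after the `POD/namespace/name` and `SVC/namespace/name` naming seen in Figure 2.

```python
def build_dependency_graph(tcp_flows):
    """Build an undirected workload dependency graph G = (V, E).

    tcp_flows: iterable of (workload_i, workload_j) pairs observed
    communicating over TCP. Edges are undirected, so both directions
    of a conversation collapse into a single dependency.
    """
    vertices, edges = set(), set()
    for wi, wj in tcp_flows:
        vertices.update((wi, wj))
        if wi != wj:
            edges.add(frozenset((wi, wj)))   # {i, j} == {j, i}
    return vertices, edges

flows = [
    ("POD/spider/eric-log-shipper-sfqds", "SVC/spider/spider-log-transformer"),
    ("SVC/spider/spider-log-transformer", "POD/spider/eric-log-shipper-sfqds"),
]
V, E = build_dependency_graph(flows)
assert len(V) == 2 and len(E) == 1   # both directions yield one undirected edge
```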

The solution uses the GraphSAGE learning approach [8] to create a security baseline for workloads. This approach differs from classical node embedding approaches (e.g., matrix factorization): we leverage workload attributes (please refer to Table I) to learn a generalized embedding function that predicts workload classes (indexed per services, types of services, namespaces, worker machines, clusters). By considering workload features, the solution learns the topological structure of the dependencies that a workload has with its neighboring workloads. Therefore, the learning is twofold: workloads' attributes and the structural dependencies of workloads' neighborhoods. Each workload is characterized by an N-order neighborhood, which defines the depth of the structural dependencies used to build embeddings. Instead of training an embedding vector for each workload, the solution trains a set of aggregator functions to capture feature information from a workload's neighborhood. Each function aggregates information from a search path away from a workload to build a computational graph. The training is done by mapping embeddings to existing classes (i.e., indexed per services, types of services, namespaces, worker machines, clusters).

The core of the solution is to turn the workload view from different perspectives into a computational graph that generates digital fingerprints (computational graphs and underlying embeddings) for targeted workloads. The fingerprints represent a ground truth to profile workloads, which are indexed


Fig. 1. Solution Overview

[Figure 2 renders a TCP-based workload dependencies graph snapshot; its nodes are micro-service pods (POD/namespace/name), services (SVC/namespace/name), worker hosts (WHOST), management hosts (MHOST) and external endpoints (EXT).]

Fig. 2. Workload Dependencies Graph Example

by classes (services, types of services, namespaces, worker machines, clusters). These fingerprints define the security baseline, and deviations from it support the identification of anomalies. Figure 3 depicts the generation of workload A's computational fingerprint. The latter depends on first-order neighbors (i.e., B, C, D) and second-order neighbors (i.e., A, F, E). Once neighbors are collected, the interaction of workloads with respect to workload A is expressed through a neural network representing a computational graph, whose depth is the neighborhood order (2 in the example). The inputs to the first layer are the attribute vectors (e.g., the set of attributes illustrated in Table I) of the highest-order neighborhood (the second-order neighborhood in Figure 3). The layers represent a set of black boxes, known as aggregators (order-invariant functions, e.g., mean averaging). These aggregators compute transition fingerprints, which can be recursively combined until reaching the output aggregation function for a targeted fingerprint (i.e., the fingerprint of workload A).

Assuming the aggregator functions have been trained on information from workload neighborhoods, a set of weight matrices W_k and B_k (where k ∈ {1, …, K}, K being the chosen neighborhood depth) are used to propagate information between different

Fig. 3. Computational Graph Example

layers of the computational graphs. Within an iteration of each computational graph, vertices represent workloads ag- gregate information from their neighbors. Given the case of a workloads’ dependencies graph G= (V, E) and features for all workloads Xv, a feature vector xv represents the initial embedding of a workload v, namely, hv0

. The embedding in a layer k, namely, hvk represents the term that is learnt at a layerk within the computational graph for each workload v.

The generation of embedding of a workload is done based on non-linear activation function σ (e.g., sigmoid), which trans- forms the representations to the next iteration of embedding generation. Stochastic Gradient Descent algorithm is used to train weight matrices Wk and Bk. The former consists of using an aggregation function to aggregate representations of workload neighborhood N(v), whereas the latter is the previous embedding generated for the workload v, namely, hvk−1. An embedding of a workload is a training on aggregation of features collected from neighborhood’s (first term) and previous embedding of the workload itself (second term) [8].

Please refer to Figure 4 for the notation of the terms used in the workloads' embedding generation. The last embedding z_v is a vector representation of a workload and serves as input to the final output prediction layer, which in our case is a multi-classification (Softmax) loss function.
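To make the forward propagation concrete, here is a minimal pure-Python sketch of one layer with a mean aggregator. The scalar weights standing in for the matrices W_k and B_k, and the toy workloads, are illustrative assumptions, not values from the paper:

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def mean_aggregate(vectors):
    """Order-invariant mean aggregator over neighbor embeddings."""
    n = len(vectors)
    return [sum(v[i] for v in vectors) / n for i in range(len(vectors[0]))]

def sage_layer(h_prev, neighbors, w_k, b_k):
    """One GraphSAGE-style layer: h_v^k = sigmoid(w_k * mean(N(v)) + b_k * h_v^{k-1}).
    For simplicity w_k and b_k are scalars here; in the paper they are
    trainable matrices learnt by Stochastic Gradient Descent."""
    h_next = {}
    for v, nbrs in neighbors.items():
        agg = mean_aggregate([h_prev[u] for u in nbrs])   # neighborhood term
        h_next[v] = [sigmoid(w_k * a + b_k * s)            # self term
                     for a, s in zip(agg, h_prev[v])]
    return h_next

# Toy dependency graph: 3 workloads with 2-dimensional attribute vectors.
h0 = {"A": [1.0, 0.0], "B": [0.0, 1.0], "C": [1.0, 1.0]}
nbrs = {"A": ["B", "C"], "B": ["A"], "C": ["A"]}
h1 = sage_layer(h0, nbrs, 1.0, 1.0)  # embeddings at layer k = 1
```

Stacking K such layers yields the depth-K computational graph of Figure 3, with z_v being the output of the last layer.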

Aggregation functions, namely convolution, mean averaging, max-pooling, mean-pooling and LSTM [8], can be used to grasp workload neighborhood behavior. These functions are symmetric (i.e., invariant to permutations of their inputs) while


Fig. 4. Forward Propagation

still being trainable and maintaining high representational capacity. The symmetry property of the aggregation function ensures that our neural network model can be trained and applied to arbitrarily ordered node neighborhood feature sets [8]. These functions operate element-wise on neighborhood embedding vectors, except LSTM, which is applied to random permutations of the workload neighbors' embeddings generated at layer k−1. Data is collected periodically to render dependency graphs. Once the solution builds the dependencies, the computational graphs are generated based on neighborhood embeddings to build predictive models. As a first set of experiments in our research, we build a classifier that predicts workload types (multi-classification). Based on a certain labeling of workloads' initial vectors (Section II-B), we use dependency graphs to profile workloads' types by outputting a probabilistic embedding vector, which represents the prediction of the workloads' types. Figure 5 depicts an example of three types of workloads profiled with a computational graph encompassing three sub-computational graphs for 3 targeted workloads.
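The permutation invariance of the aggregators discussed above can be illustrated with a small sketch. The simplified mean and max-pool functions below are untrained stand-ins for the trainable aggregators of [8]; the neighbor embeddings are made up for illustration:

```python
import itertools

def mean_agg(vectors):
    """Simplified mean aggregator (no trainable weights)."""
    n = len(vectors)
    return tuple(sum(v[i] for v in vectors) / n for i in range(len(vectors[0])))

def max_pool_agg(vectors):
    """Simplified max-pool aggregator: element-wise maximum."""
    return tuple(max(v[i] for v in vectors) for i in range(len(vectors[0])))

# Three neighbor embeddings (exact binary fractions to avoid rounding noise).
neighbors = [(0.25, 0.75), (0.5, 0.25), (0.25, 0.5)]

# Both aggregators yield one and the same output for every ordering of the
# neighbors, which is what makes the model applicable to arbitrarily
# ordered node neighborhood feature sets.
mean_results = {mean_agg(list(p)) for p in itertools.permutations(neighbors)}
max_results = {max_pool_agg(list(p)) for p in itertools.permutations(neighbors)}
```

An LSTM aggregator, by contrast, is not symmetric, which is why [8] feeds it random permutations of the neighbor sequence.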

Fig. 5. Workloads’ Types Profiling

III. GROUND TRUTH AND ANOMALIES SCENARIOS

The data is collected from a long-lived deployment that runs in parallel with customer-like application staging tests, where diverse cloud native common services are upgraded through HELM [11]. Traffic runs continuously on the cluster. The node runs a set of micro-services under test, including a container registry, logging, a documents database mediator, monitoring, key management, Identity and Access Management (IAM), a load-balancer, TCP-based micro-service DNS resolution, and the LDAP, SNMP and SCTP protocols. External traffic is generated through traffic generators to test the different micro-services; the setup runs through K8S management.

Based on network kernel eBPF [12] hooks, we monitor TCP events and compute the different attributes listed in Table I. The data has been collected during 445 minutes; in every minute, a dependency graph is built. In addition, we add dependency graphs simulating the following anomalies:

• First anomaly type: unsolicited port scanning and connections to external IP addresses; in the former, a container instance tries to connect to a certain range of other instantiated containers, whereas in the latter, a container tries to download an external attack utility script.

• Second anomaly type: CoreDNS K8S nodes take-down, where connections between DNS workloads and other workloads are taken down.

• Third anomaly type: a man-in-the-middle, where some containers impersonate the CoreDNS K8S service illicitly.

The total number of dependency graphs is 467, of which 23 are anomalous (8 graphs representing the first anomaly, 7 graphs representing the second anomaly and 8 graphs representing the third anomaly). To infer anomalies, we use a semi-supervised approach, where the average loss values computed during the training and validation phases are used as a reference point to identify potential misfitting of anomalous data. A deviation (noticeable increase) in loss values can indicate the potential presence of anomalous nodes in a graph. In addition, to effectively segregate benign from anomalous graphs, we compute an overall predictions' entropy score. This score is meant to evaluate the quality of predictions with respect to nodes' labels (the micro-service types defined in the ground truth data) in graphs. The entropy score for each node is computed through Algorithm 1.

Algorithm 1: Node Entropy Score

1  Function Compute(label, index, predictions):
2      if label = index then
3          ent = 0
4      else
5          for p in predictions do
6              if p != 0 and index(p) != label then
7                  ent = ent - (p * Log(p))
8              if p != 0 and index(p) = label then
9                  ent = (1 - p) * ent
10     return ent

For each node, if the label corresponds to the right prediction index, the entropy is equal to zero (the service type is well classified). Otherwise, for each prediction that does not correspond to the right label, we increase the entropy by a negative logarithmic prediction value; if a prediction is assigned to the right label, we decrease the entropy by multiplying it with one minus the prediction. The idea of the entropy is to penalize the model for bad predictions and reward it for good predictions. The entropy scores computed for each predicted node can then be used to compute average and standard deviation entropy scores for graphs.
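Algorithm 1 can be sketched as runnable Python. The function name and the example probability vectors are ours, but the penalty and reward terms follow the pseudocode:

```python
import math

def node_entropy(label, pred_index, predictions):
    """Entropy score for one node, following Algorithm 1.

    label       : ground-truth class index of the node (its micro-service type)
    pred_index  : index of the highest prediction (argmax of the Softmax output)
    predictions : probability vector output by the Softmax layer
    """
    if label == pred_index:
        return 0.0  # the service type is well classified
    ent = 0.0
    for i, p in enumerate(predictions):
        if p != 0 and i != label:
            ent -= p * math.log(p)   # penalize mass on wrong classes
        if p != 0 and i == label:
            ent *= (1.0 - p)         # reward mass on the right class
    return ent

# Correctly classified node: entropy is zero.
ok = node_entropy(0, 0, [0.9, 0.1, 0.0])
# Misclassified node: positive entropy driven by wrong-class mass.
bad = node_entropy(2, 0, [0.7, 0.2, 0.1])
```

Averaging these per-node scores over a graph yields the per-graph average and standard deviation used in the experiments.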

IV. EXPERIMENTAL RESULTS

The underlying experiments are done on the collected dataset, where 80% of the data (374 graphs) is used for training and approximately 10% (47 graphs) is used to validate the training; the rest of the data is used for testing. It is important to note that the testing data contains 23 benign graphs and 23 anomalous graphs. The experiments are run on an Ubuntu 20 server supporting TensorFlow 2.5, with 12 virtual CPUs, a GeForce RTX 3090 GPU card and 54 GB of memory. The total number of nodes is 713, distributed through 196 classes (micro-service types).

To train on the dependency graphs, we use the GraphSAGE framework [8], which we modify to support training on multiple graphs instead of a single graph. We train and validate through 10 epochs, with a batch size of 256 nodes. The aggregation is done on a sample of at most 10 nodes in the neighborhood. For each aggregator model, we observed that convergence has been achieved for both training and validation. In addition, there was no significant difference between training and validation convergence loss values, which means there is no over- or under-fitting of the data and the trained models predict well on unseen validation data. We refer to Figure 6 for more details on the convergence of the different models.
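The fixed-size neighborhood sampling (at most 10 neighbors per node) can be sketched as follows; the function and the seeded generator are illustrative, not the exact GraphSAGE implementation:

```python
import random

def sample_neighborhood(neighbors, node, max_sample=10, rng=None):
    """Sample at most max_sample neighbors of a node, as in GraphSAGE
    mini-batch training; this keeps per-node aggregation cost bounded
    even for highly connected workloads."""
    rng = rng or random.Random(0)  # seeded for reproducibility
    nbrs = neighbors[node]
    if len(nbrs) <= max_sample:
        return list(nbrs)
    return rng.sample(nbrs, max_sample)

# A DNS-like workload talking to 25 services gets capped at 10 neighbors.
neighbors = {"dns": [f"svc{i}" for i in range(25)], "svc0": ["dns"]}
sampled = sample_neighborhood(neighbors, "dns")
```

Bounding the sampled neighborhood is what makes the per-batch cost independent of node degree, which matters for hub-like services such as DNS.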

(A) Training Loss Term Convergence

(B) Validation Loss Term Convergence

Fig. 6. Training and Validation Loss Term Convergence

We use macro and micro F1 average scores to evaluate the trade-off between multi-classification precision and recall. The macro F1 average score computes the metric independently for each class and is used to evaluate accuracy in a generic manner, whereas the micro F1 average considers the cardinality of all classes to compute the average metric. We use the latter to evaluate accuracy in the presence of imbalanced classes (micro-service types). Figure 7 depicts the different scores obtained for the training and validation phases. The results illustrate that the pooling techniques clearly outperform the GCN aggregator and slightly outperform the Mean aggregator. For the sake of illustration, the Mean-pool model has macro F1 and micro F1 scores around 96%, which means that GNNs can achieve high predictability of K8S service types. In addition, we observe that for the pooling techniques and LSTM, the macro and micro F1 scores are in the same range, which indicates their ability to predict different micro-service types despite the presence of imbalanced classes.
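The difference between the two averages can be made concrete with a short sketch; the per-class counts below are invented for illustration:

```python
def f1_scores(per_class):
    """per_class: list of (tp, fp, fn) tuples, one per class.
    Returns (macro_f1, micro_f1)."""
    f1s = []
    for tp, fp, fn in per_class:
        denom = 2 * tp + fp + fn
        f1s.append(2 * tp / denom if denom else 0.0)
    macro = sum(f1s) / len(f1s)  # each class weighs equally
    tp = sum(c[0] for c in per_class)
    fp = sum(c[1] for c in per_class)
    fn = sum(c[2] for c in per_class)
    micro = 2 * tp / (2 * tp + fp + fn)  # dominated by large classes
    return macro, micro

# A large, well-predicted class next to a small, poorly-predicted one:
macro, micro = f1_scores([(90, 5, 5), (1, 4, 4)])
```

With imbalanced classes, the micro F1 stays high while the macro F1 exposes the badly predicted minority class, which is why both are reported.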

(A) Macro F1 Average Scores

(B) Micro F1 Average Scores

Fig. 7. Training and Validation F1 Scores

The testing phase encloses two kinds of tests, namely tests on benign graphs and tests on anomalous graphs. The average loss value over all benign graphs is comparable to the values obtained in training and validation for all aggregators. However, for the anomalous graphs the average loss values increase with respect to the values obtained in training, validation and benign-graph testing. For the sake of illustration, in Figure 8 we notice that the average loss value for anomalous graphs increases for all aggregators (e.g., for Mean-pool, an anomalous average loss value of 0.5881 compared to a benign average loss value of 0.5166). This increase means that in part of the tested graphs, the presence of anomalies resulted in the computation of predictions on misfitting data, which impacted the average loss value.

Fig. 8. Testing vs. Training and Validation Loss Values

To infer more information on anomalous graphs, we computed the average and standard deviation of the predictions' entropy scores for benign and anomalous graphs, based on the predictions obtained from the Mean-pool aggregator model. From Figure 9, we notice that anomalous graphs have a higher average entropy compared to benign data. As such, 19 anomalous graphs can be clearly segregated from the normal ones. From the entropy scores, 7 out of 8 anomalous graphs representing scanning activities and downloads (first anomaly type, chunk A) have generated higher entropy scores. Regarding the DNS take-down (second anomaly type, chunk B), 4 out of 7 anomalous graphs have been identified; this is due to the fact that some of the graphs selected for anomaly injection do not have high interaction between services and DNS. Regarding the man-in-the-middle (third anomaly type, chunk C), 8 out of 8 anomalous graphs have been identified; these graphs are not as easy to discern as the ones representing the first anomaly type, but have slightly higher average entropy scores with respect to the benign ones. From these experiments, we noticed that when anomaly events massively impact the patterns of dependency graphs, the nodes representing the different service types become hard to profile, which impacts loss values and increases entropy scores (bad quality of prediction for impacted nodes).
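The segregation step can be sketched as a simple threshold on per-graph average entropy; the rule used here (benign mean plus k standard deviations) is our illustrative assumption, not the exact criterion of the paper:

```python
import statistics

def flag_anomalous(benign_avgs, graph_avg, k=2.0):
    """Flag a graph whose average node entropy exceeds the benign
    baseline mean by more than k standard deviations."""
    mu = statistics.mean(benign_avgs)
    sigma = statistics.stdev(benign_avgs)
    return graph_avg > mu + k * sigma

# Illustrative per-graph average entropies for benign graphs:
benign = [0.05, 0.07, 0.06, 0.04, 0.08]
suspicious = flag_anomalous(benign, 0.90)  # e.g., a scanning graph
normal = flag_anomalous(benign, 0.06)
```

In practice the baseline statistics would be estimated from the benign training and validation graphs, and k tuned to trade false positives against missed anomalies.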


Fig. 9. Entropies (Benign vs. Anomalous)

Among the tested anomalous graphs, “ChunkA140” represents a graph containing 280 nodes and 1008 edges, where a massive telnet scan has been performed from one container on other containers, leading to a change of the TCP network patterns observed on profiled graphs as well as the opening of new connections.

Figure 10 illustrates all peers connecting to the malicious node “646”, as well as an example of a neighborhood entropy score (1.95) for an attacked service (class “194”). Our approach showcases that, based on entropy scores computed over predictions on anomalous data, we can perform root cause analysis to identify malicious or problematic nodes.


Fig. 10. Dependency Graph for scanning Node “646”

V. RELATED WORK

Prior research efforts have been initiated to shift intrusion detection systems from traditional detection based on indicators of compromise and known signatures towards the use of machine learning techniques. For the sake of illustration, the works introduced in [13]–[18] used malware network traffic as a ground truth to collect diverse features based on flows, headers and signatures to train machine learning algorithms, leading to very encouraging detection results. However, the move of network infrastructures towards the cloud paradigm has opened the door to adapting machine learning techniques to the heterogeneous traffic observed in the cloud. Supervised machine learning approaches have been proposed in [19], [20] to identify the preliminary symptoms of service level agreement (SLA) violations in virtual machines. Other works have been introduced beyond the scope of SLA violations. This requires more effort to be invested in feature engineering, by considering system level attributes/metrics (e.g., CPU and memory consumption) and network level attributes (e.g., flow counters, session statistics, etc.). In [21], log processing-based approaches are proposed to predict faulty network conditions.

In [22], the authors propose to profile Network Function Virtualization (NFV) deployments by classifying Virtual Network Function (VNF) resource characteristics and correlating their performance indicators, in order to optimize resource allocation and VNF placement policies under VNF behavior uncertainty. Similar efforts on resource allocation optimization based on VNF performance profiling are presented in [23]–[25]. Those works propose profiling techniques for resource usage optimization purposes. In [26], the authors proposed a Network Intrusion Detection System (NIDS) based on GraphSAGE, namely E-GraphSAGE, where they classify connections between malicious peers through edge classification. This differs from our use-case, which aims to profile the behavior of nodes as a security baseline that can be used to infer anomalies. In addition, our work focuses on cloud native applications, where additional system-based features related to services (nodes) are considered, whereas E-GraphSAGE focuses on IP peers. In [27], the authors proposed MalChain, a quantitative data flow graph (QDFG) model to identify malicious service function chains. The model is based on monitoring global features (closeness and betweenness centrality) and micro-service local features based on statistical aggregates such as the mean, standard deviation, maximum and minimum. The QDFG expresses micro-service interactions through static graph vectors. In our approach, we rely on dynamic modeling of cloud native application asset interactions (e.g., Pods, worker machines, management) based on computational graphs represented through neural networks, instead of using static vectors to represent graphs. The reason for doing so is the dynamic nature of the dependency graphs representing cloud native applications, where the number of micro-services and worker machines can increase or decrease through provisioning and un-provisioning, depending on the resource load of the different workloads.

VI. CONCLUSION

In this paper, we proposed a novel cloud native application profiling approach, where we profiled, as a first attempt, the TCP network activities within a K8S setup. Our approach is based on graph neural networks to capture the interaction of cloud native application services through dependency graphs.

By ingesting the collected graphs into a graph neural network model, we infer computational graphs that profile the different service types collected on K8S premises. In addition to profiling, we used prediction entropy scores to identify anomalies found in the testing graphs. Our experiments have shown encouraging results by finding 19 out of 23 anomalies. Moreover, by further analyzing the entropy scores of the different nodes, our approach allows the identification of problematic nodes. In future work, we aim to improve our approach by considering a more extensive feature set, including system level attributes, cloud management attributes as well as specific application protocol attributes.

REFERENCES

[1] “ETSI NFV SEC: Network Functions Virtualisation (NFV) Release 3; Security; Security Management and Monitoring specification.”
[2] “3GPP Technical specification: 5G Security Assurance Specification (SCAS); Network Data Analytics Function (NWDAF).”
[3] F. Scarselli, M. Gori, A. C. Tsoi, M. Hagenbuchner, and G. Monfardini, “The graph neural network model,” IEEE Transactions on Neural Networks, vol. 20, no. 1, pp. 61–80, 2008.
[4] K. Xu, W. Hu, J. Leskovec, and S. Jegelka, “How powerful are graph neural networks?” arXiv preprint arXiv:1810.00826, 2018.
[5] C. Morris, M. Ritzert, M. Fey, W. L. Hamilton, J. E. Lenssen, G. Rattan, and M. Grohe, “Weisfeiler and Leman go neural: Higher-order graph neural networks,” in Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, no. 01, 2019, pp. 4602–4609.
[6] T. N. Kipf and M. Welling, “Semi-supervised classification with graph convolutional networks,” arXiv preprint arXiv:1609.02907, 2016.
[7] W. Fan, Y. Ma, Q. Li, Y. He, E. Zhao, J. Tang, and D. Yin, “Graph neural networks for social recommendation,” in The World Wide Web Conference, 2019, pp. 417–426.
[8] W. L. Hamilton, R. Ying, and J. Leskovec, “Inductive representation learning on large graphs,” in Proceedings of the 31st International Conference on Neural Information Processing Systems, 2017, pp. 1025–1035.
[9] “SystemTap,” https://sourceware.org/systemtap/, [Online; accessed 19-September-2021].
[10] “Sysdig agent,” https://docs.sysdig.com/en/docs/installation/sysdig-agent/, [Online; accessed 19-September-2021].
[11] “Helm, the package manager for Kubernetes,” https://helm.sh/, [Online; accessed 19-September-2021].
[12] “Berkeley Packet Filter (eBPF),” https://ebpf.io/, [Online; accessed 19-September-2021].
[13] L. Carl, R. Walsh, D. Lapsley, and W. Strayer, “Using machine learning techniques to identify botnet traffic,” in Local Computer Networks, Proceedings 2006 31st IEEE Conference on. IEEE, 2006.
[14] G. Gu, J. Zhang, and W. Lee, “BotSniffer: Detecting botnet command and control channels in network traffic,” 2008.
[15] V. Golovko, S. Bezobrazov, P. Kachurka, and L. Vaitsekhovich, “Neural network and artificial immune systems for malware and network intrusion detection,” in Advances in Machine Learning II. Springer, 2010, pp. 485–513.
[16] S. Nari and A. A. Ghorbani, “Automated malware classification based on network behavior,” in 2013 International Conference on Computing, Networking and Communications (ICNC). IEEE, 2013, pp. 642–647.
[17] A. Boukhtouta, S. A. Mokhov, N.-E. Lakhdari, M. Debbabi, and J. Paquet, “Network malware classification comparison using DPI and flow packet headers,” Journal of Computer Virology and Hacking Techniques, vol. 12, no. 2, pp. 69–100, 2016.
[18] A. Boukhtouta, N.-E. Lakhdari, S. A. Mokhov, and M. Debbabi, “Towards fingerprinting malicious traffic,” Procedia Computer Science, vol. 19, pp. 548–555, 2013.
[19] C. Sauvanaud, K. Lazri, M. Kaâniche, and K. Kanoun, “Anomaly detection and root cause localization in virtual network functions,” in 2016 IEEE 27th ISSRE. IEEE, 2016, pp. 196–206.
[20] ——, “Towards black-box anomaly detection in virtual network functions,” in 2016 46th Annual IEEE/IFIP International Conference on Dependable Systems and Networks Workshop. IEEE, 2016, pp. 254–257.
[21] C. Bertero, M. Roy, C. Sauvanaud, and G. Trédan, “Experience report: Log mining using natural language processing and application to anomaly detection,” in 2017 IEEE 28th ISSRE. IEEE, 2017, pp. 351–360.
[22] M. G. Khan, S. Bastani, J. Taheri, A. Kassler, and S. Deng, “NFV-Inspector: A systematic approach to profile and analyze virtual network functions,” in 2018 IEEE 7th International Conference on Cloud Networking (CloudNet). IEEE, 2018, pp. 1–7.
[23] S. Van Rossem, W. Tavernier, D. Colle, M. Pickavet, and P. Demeester, “Profile-based resource allocation for virtualized network functions,” IEEE Transactions on Network and Service Management, vol. 16, no. 4, pp. 1374–1388, 2019.
[24] M. Peuster and H. Karl, “Understand your chains: Towards performance profile-based network service management,” in 2016 Fifth European Workshop on Software-Defined Networks (EWSDN). IEEE, 2016, pp. 7–12.
[25] Z. Li, Z. Ge, A. Mahimkar, J. Wang, B. Y. Zhao, H. Zheng, J. Emmons, and L. Ogden, “Predictive analysis in network function virtualization,” in Proceedings of the Internet Measurement Conference 2018, 2018, pp. 161–167.
[26] W. W. Lo, S. Layeghy, M. Sarhan, M. Gallagher, and M. Portmann, “E-GraphSAGE: A graph neural network based intrusion detection system,” arXiv preprint arXiv:2103.16329, 2021.
[27] M. M. Ghorbani, F. F. Moghaddam, M. Zhang, M. Pourzandi, K. K. Nguyen, and M. Cheriet, “MalChain: Virtual application behaviour profiling by aggregated microservice data exchange graph,” in 2020 IEEE International Conference on Cloud Computing Technology and Science (CloudCom). IEEE, 2020, pp. 41–48.
