Cheat Sheet Data Engineer

Kishore.09
Academic year: 2023

Prerequisites

❖ Always use the Google Chrome browser in an Incognito window (Ctrl+Shift+N)

❖ Log in to https://console.cloud.google.com/ with your Gmail ID

Install Cloud SDK

Before installing the Cloud SDK, check whether it is already installed by typing gcloud --version in a command prompt. If it is not installed, follow the installation Link

Introduction to GCP  

 

Gcloud commands   

Set Project ID in Cloud Shell 

gcloud config set project PROJECT_ID   

Provision a VM in the Mumbai region (asia-south1)

gcloud compute instances create learn-gcp-gce-shell --zone=asia-south1-a   

Create Storage bucket 

gsutil mb gs://learn-gcp-pde-batch1-gcs-shell 

Storage & Database 

Create Multi-regional, Regional, Nearline, Coldline storage classes

gsutil mb -c multi_regional -l asia gs://[BUCKET-NAME]-mr

gsutil mb -c regional -l asia-south1 gs://[BUCKET-NAME]-r

gsutil mb -c standard -l asia gs://[BUCKET-NAME]-mr
gsutil mb -c standard -l asia-south1 gs://[BUCKET-NAME]-r

gsutil mb -c nearline -l asia-south1 gs://[BUCKET-NAME]-nl
gsutil mb -c coldline -l asia-south1 gs://[BUCKET-NAME]-cl
gsutil mb -c archive -l asia-south1 gs://[BUCKET-NAME]-ar

Copy recursively with multithreading

gsutil -m cp -r . gs://[BUCKET-NAME]-recursive-mulithreading/


Change ACL to public for objects in a storage bucket

gsutil acl ch -u AllUsers:R gs://[BUCKET-NAME]-mr/*

Remove ACL

gsutil acl ch -d AllUsers gs://[BUCKET-NAME]/

Cloud Spanner - Interleaved Table

CREATE TABLE Singers (
  SingerId   INT64 NOT NULL,
  FirstName  STRING(1024),
  LastName   STRING(1024),
  SingerInfo BYTES(MAX),
) PRIMARY KEY (SingerId);

CREATE TABLE Albums (
  SingerId   INT64 NOT NULL,
  AlbumId    INT64 NOT NULL,
  AlbumTitle STRING(MAX),
) PRIMARY KEY (SingerId, AlbumId),
  INTERLEAVE IN PARENT Singers ON DELETE CASCADE;
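
The DDL above only defines the schema. As an illustration (not part of the original notes), here is a minimal Python sketch of writing a parent Singers row and an interleaved Albums child row, assuming the google-cloud-spanner client library is installed and INSTANCE_ID / DATABASE_ID are placeholders:

from google.cloud import spanner

client = spanner.Client()
database = client.instance('INSTANCE_ID').database('DATABASE_ID')

with database.batch() as batch:
    # Parent row in Singers
    batch.insert(
        table='Singers',
        columns=('SingerId', 'FirstName', 'LastName'),
        values=[(1, 'Marc', 'Richards')])
    # Child row in Albums, interleaved under SingerId = 1
    batch.insert(
        table='Albums',
        columns=('SingerId', 'AlbumId', 'AlbumTitle'),
        values=[(1, 1, 'Total Junk')])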

Compute 

Set default region and zone

gcloud config set compute/region asia-south1
gcloud config set compute/zone asia-south1-a

Update VM

sudo apt-get update

Install Git in VM

sudo apt-get install git

Install Apache Web Server in VM

sudo apt-get install apache2

Install MySQL Client in VM

sudo apt-get install mysql-client

List Compute Engines


gcloud compute instances list

 

Cloud IAM  

 

Gcloud commands   

Primitive Roles 

gcloud iam roles describe roles/owner | less
gcloud iam roles describe roles/editor | less
gcloud iam roles describe roles/viewer | less

 

Predefined Roles 

gcloud iam roles describe roles/bigquery.dataOwner
gcloud iam roles describe roles/compute.admin

   

Policy & Bindings 

gcloud projects get-iam-policy $DEVSHELL_PROJECT_ID --format json > demo.json

Service Account

gcloud iam service-accounts create demo-learn-gcp-console --display-name "Demo Learn GCP Service Account"

gcloud iam service-accounts list (only user-managed service accounts are listed)

Switch Service Account

gcloud auth activate-service-account --key-file=<YOUR JSON FILE>

GCE + GCS + IAM 

1. Provision a GCE instance with git installed
2. Clone https://github.com/GoogleCloudPlatform/training-data-analyst
3. Change directory to courses/bdml_fundamentals/demos/earthquakevm
4. Install all the Python dependencies by running ./install_missing.sh
5. Ingest data on recent earthquakes from the USGS by running ./ingest.sh
6. Transform the raw data into an image using ./transform.py


Recommendation ML using DataProc + Cloud SQL

1. Clone https://github.com/GoogleCloudPlatform/training-data-analyst
2. Change directory to CPB100/lab3a for the SQL script

3. Create a private MySQL instance

gcloud beta sql instances create [INSTANCE_NAME] --database-version=MYSQL_5_7 --tier=db-n1-standard-1 --region=asia-south1 --root-password=demogcp --network=default --no-assign-ip

4. Create recommendation db in the instance

gcloud sql databases create recommendation_spark --instance=[INSTANCE_NAME]

5. Give GCS access to the Cloud SQL service account (the describe command below shows the service account email)

gcloud sql instances describe [INSTANCE_NAME]

gsutil iam ch serviceAccount:[SERVICE_ACCOUNT]:roles/storage.objectViewer gs://[BUCKET_NAME]

6. Create the schema in the DB instance

gcloud sql import sql [INSTANCE_NAME] gs://[BUCKET_NAME]/table_creation.sql

7. Import the Rating & Accommodation table data

gcloud sql import csv [INSTANCE_NAME] gs://[BUCKET_NAME]/rating.csv \ --database=recommendation_spark --table=Rating

gcloud sql import csv [INSTANCE_NAME] gs://[BUCKET_NAME]/accomodation.csv \ --database=recommendation_spark --table=Accommodation

8. Create a Dataproc cluster that auto-deletes after 45 minutes of idle time (--max-idle 2700s)

gcloud beta dataproc clusters create cluster-dataproc --region asia-south1 --subnet default --zone asia-south1-b --master-machine-type n1-standard-1 --master-boot-disk-size 20 --num-workers 2 --worker-machine-type n1-standard-1 --worker-boot-disk-size 20 --image-version 1.5-debian10 --max-idle 2700s

9. Change directory to CPB100/lab3b for the PySpark code
10. Edit train_and_apply.py to change the Cloud SQL IP address & password
11. Copy train_and_apply.py to the GCS bucket using gsutil

12. Submit a job in Dataproc with Job Type as PySpark

gcloud dataproc jobs submit pyspark --region asia-south1 --cluster cluster-dataproc gs://demo-gcp-mysql/train_and_apply.py

13. Log in to the VM, connect to MySQL, and execute the SQL below with a user ID

14. SELECT r.userid, r.accoid, r.prediction, a.title, a.location, a.price, a.rooms, a.rating, a.type
    FROM Recommendation AS r, Accommodation AS a
    WHERE r.accoid = a.id AND r.userid = 10;


 

BigData 

BigQuery - Concepts

The query below illustrates Execution Details & Slots

SELECT
  COUNT(*)
FROM
  `bigquery-samples.wikipedia_benchmark.Wiki10M`
WHERE
  title LIKE '%cloud%'

BigQuery initialization in Cloud Shell

bq init  // creates the .bigqueryrc file

BigQuery bq utilities

bq ls                  // list all datasets
bq ls DATASET          // list tables in a specific dataset
bq show DATASET.TABLE  // describe a table
bq ls -j -a            // list all jobs
bq show -j <JOBID>     // details of a job

BQ Bulk Loading

bq load PROJECT_ID:DATASET.TABLENAME bigquery_cloudshell.csv

bq query "select firstname from DATASET.TABLENAME"

BQ Bulk Loading - Buggy

bq load --max_bad_records 1 PROJECT_ID:DATASET.TABLENAME bigquery_cloudshell_buggy.csv

BQ CLI - Query Cost using --dry_run

bq query --dry_run --use_legacy_sql=false 'SELECT * FROM DATASET.TABLENAME'

Partition Summary using Legacy SQL Dialect

SELECT * FROM [PROJECT_ID:DATASET.TABLENAME$__PARTITIONS_SUMMARY__]


Create ONE day expiring partitions:

bq load --time_partitioning_type=DAY --time_partitioning_expiration=86400 --require_partition_filter --autodetect DATASET.expiresoon_partition_tb bigquery_cloudshell.csv

Partitioning on Bigger Dataset

No partitioning. The query below processes 840 MB

SELECT
  title
FROM
  `bigquery-public-data.stackoverflow.stackoverflow_posts`
WHERE
  DATE(creation_date) BETWEEN "2015-01-01" AND "2015-12-31"

Copy table from Public Dataset

First, create a dataset named stackoverflow in the US region

CREATE TABLE
  `PROJECT_ID.stackoverflow.partition_demo`
PARTITION BY
  DATE(creation_date)
OPTIONS (require_partition_filter=true) AS
SELECT
  *
FROM
  `bigquery-public-data.stackoverflow.stackoverflow_posts`

With partitioning. This query processes 170 MB

SELECT
  title
FROM
  `PROJECT_ID.stackoverflow.partition_demo`
WHERE
  DATE(creation_date) BETWEEN "2015-01-01" AND "2015-12-31"


No clustering. The query below processes 1.2 GB

SELECT
  title
FROM
  `bigquery-public-data.stackoverflow.stackoverflow_posts`
WHERE
  DATE(creation_date) BETWEEN "2015-01-01" AND "2015-12-31"
  AND tags = 'android'

CREATE TABLE
  `PROJECT_ID.stackoverflow.partition_clustering_demo`
PARTITION BY
  DATE(creation_date)
CLUSTER BY
  tags
OPTIONS (require_partition_filter=true) AS
SELECT
  *
FROM
  `bigquery-public-data.stackoverflow.stackoverflow_posts`

With partitioning + clustering. This query processes 235 MB

SELECT
  title
FROM
  `PROJECT_ID.stackoverflow.partition_clustering_demo`
WHERE
  DATE(creation_date) BETWEEN "2015-01-01" AND "2015-12-31"
  AND tags = 'android'

Connecting to BigQuery using Client Libraries

1. pip install google-cloud-bigquery
2. Create a Service Account with the BigQuery User role and download the key.json
3. set GOOGLE_APPLICATION_CREDENTIALS=key.json (Windows)
4. export GOOGLE_APPLICATION_CREDENTIALS=key.json (Linux)
5. python BigQuery-Python.py (a minimal sketch of such a script follows this list)
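
The contents of BigQuery-Python.py are not included in this cheat sheet; below is a minimal sketch of what such a script might contain (the public dataset and query are illustrative, and GOOGLE_APPLICATION_CREDENTIALS is assumed to point to the downloaded key.json):

from google.cloud import bigquery

# The client picks up the service-account key from GOOGLE_APPLICATION_CREDENTIALS
client = bigquery.Client()

query = """
    SELECT name, SUM(number) AS total
    FROM `bigquery-public-data.usa_names.usa_1910_2013`
    GROUP BY name
    ORDER BY total DESC
    LIMIT 10
"""

for row in client.query(query).result():
    print(row.name, row.total)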


Connecting to BigQuery via Google Compute Engine

sudo apt-get update
sudo apt-get install python-pip -y && pip install google-cloud-bigquery

Federated Queries - Cloud SQL

1. Enable the BigQuery Connection API
2. Create a connection in BigQuery
3. Run the query

SELECT * FROM
  EXTERNAL_QUERY('silken-eye-267917.asia-south1.demo', 'select name from city');

In-Built Functions

SELECT * FROM UNNEST([
  'Seattle WA', 'New York', 'Singapore'
]) AS city

-- TimeZone

SELECT CURRENT_DATETIME('Asia/Calcutta')

SELECT
  time_zone,
  DATETIME_DIFF(CURRENT_DATETIME(time_zone), CURRENT_DATETIME(), MINUTE) AS minutes_from_utc
FROM
  UNNEST(['America/Los_Angeles', 'Asia/Calcutta']) AS time_zone;

Dataproc

Job Type -> Spark
Main class or jar -> org.apache.spark.examples.SparkPi
Arguments -> 1000
Jar files -> file:///usr/lib/spark/examples/jars/spark-examples.jar

Web interfaces: 8088 (Hadoop job interface), 9870 (Hadoop admin interface)


Pub/ Sub

Create a topic and subscription, publish to the topic, then pull

gcloud pubsub topics create myTopic

gcloud pubsub subscriptions create --topic myTopic mySubscription

gcloud pubsub topics publish myTopic --message "Hello PubSub"

gcloud pubsub subscriptions pull --auto-ack --limit 10 mySubscription
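
The same publish/pull flow can also be driven from Python; below is a minimal sketch (not part of the original notes), assuming pip install google-cloud-pubsub (v2.x client style shown) and that the topic and subscription above already exist:

from google.cloud import pubsub_v1

project_id = 'PROJECT_ID'  # placeholder

# Publish one message to myTopic
publisher = pubsub_v1.PublisherClient()
topic_path = publisher.topic_path(project_id, 'myTopic')
message_id = publisher.publish(topic_path, b'Hello PubSub').result()
print('Published message', message_id)

# Pull up to 10 messages from mySubscription and acknowledge them
subscriber = pubsub_v1.SubscriberClient()
subscription_path = subscriber.subscription_path(project_id, 'mySubscription')
response = subscriber.pull(
    request={'subscription': subscription_path, 'max_messages': 10})
for received in response.received_messages:
    print(received.message.data)
    subscriber.acknowledge(
        request={'subscription': subscription_path, 'ack_ids': [received.ack_id]})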

Dataflow

Note: Make sure the BQ dataset & the regional Cloud Storage buckets are in the SAME region for this demo

1. Create an empty table in a us-central1 dataset in BigQuery with this schema:

bq mk YOUR_DATASET_NAME.batch_processing username:STRING,branch_id:STRING,product_sku:STRING,product_quantity:INTEGER,date:TIMESTAMP,product_name:STRING,store:STRING

2. Create a regional bucket named $DEVSHELL_PROJECT_ID-templates in the us-central1 region
3. Upload dataflow.zip to Cloud Shell and unzip it using unzip dataflow.zip
4. Change directory to dataflow
5. Copy the metadata file named SKU_Data_metadata to gs://$DEVSHELL_PROJECT_ID-templates/templates/
6. Run pip3 install -r requirements.txt --user
7. Obtain user access credentials and authenticate using gcloud auth application-default login
8. Run the command below to create a Dataflow template (a sketch of what such a pipeline script might look like follows this list)

python3 load-to-bq.py --runner DataflowRunner --project $DEVSHELL_PROJECT_ID --temp_location gs://$DEVSHELL_PROJECT_ID-templates/temp --staging_location gs://$DEVSHELL_PROJECT_ID-templates/staging --region us-central1 --template_location gs://$DEVSHELL_PROJECT_ID-templates/templates/SKU_Data --experiment=use_beam_bq_sink

9. Upload gcf.zip to Cloud Shell and unzip it using unzip gcf.zip
10. Change directory to gcf and execute the command in step 12 to deploy the Cloud Function
11. Create a regional bucket called $DEVSHELL_PROJECT_ID-batch-processing
12. Deploy the function:

gcloud functions deploy gcs_gcf_bq_trigger --entry-point=launch_dataflow_pipeline --region=us-central1 --runtime=python37 --trigger-bucket=$DEVSHELL_PROJECT_ID-batch-processing --set-env-vars PROJECT_ID=$DEVSHELL_PROJECT_ID,TEMPLATE_BUCKET=$DEVSHELL_PROJECT_ID-templates,BQ_TABLE=YOUR_DATASET_NAME.batch_processing -q

13. Upload data.csv to the $DEVSHELL_PROJECT_ID-batch-processing bucket and check that the Dataflow pipeline gets triggered
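
The load-to-bq.py referenced in step 8 ships inside dataflow.zip and is not reproduced in this cheat sheet. As a rough, hypothetical sketch only (the bucket path and column order are assumptions, and the real template most likely exposes the input file as a runtime parameter), a comparable Apache Beam pipeline that parses the CSV and writes to the batch_processing table could look like this:

import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions


def parse_csv(line):
    # Assumed column order, matching the BigQuery schema created in step 1
    (username, branch_id, product_sku, product_quantity,
     date, product_name, store) = line.split(',')
    return {
        'username': username,
        'branch_id': branch_id,
        'product_sku': product_sku,
        'product_quantity': int(product_quantity),
        'date': date,
        'product_name': product_name,
        'store': store,
    }


def run(argv=None):
    # Picks up --runner, --project, --temp_location, etc. from the command line
    options = PipelineOptions(argv)
    with beam.Pipeline(options=options) as p:
        (p
         | 'Read CSV' >> beam.io.ReadFromText(
             'gs://YOUR_BUCKET/data.csv', skip_header_lines=1)
         | 'Parse' >> beam.Map(parse_csv)
         | 'Write to BQ' >> beam.io.WriteToBigQuery(
             'YOUR_DATASET_NAME.batch_processing',
             create_disposition=beam.io.BigQueryDisposition.CREATE_NEVER,
             write_disposition=beam.io.BigQueryDisposition.WRITE_APPEND))


if __name__ == '__main__':
    run()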


Cloud Functions

Authenticate an HTTP function using

curl https://REGION-PROJECT_ID.cloudfunctions.net/FUNCTION_NAME \
  -H "Authorization: bearer $(gcloud auth print-identity-token)"
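
For reference, a minimal, hypothetical HTTP-triggered function (Python runtime, deployed with --trigger-http) that the curl call above could target:

def hello_http(request):
    # request is a flask.Request object provided by the Functions framework
    name = request.args.get('name', 'world')
    return 'Hello {}!'.format(name)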

DataLab

datalab create demo-datalab --network=default --zone=asia-southeast1-a --quiet

Cloud Memorystore

 

 

sudo apt-get install redis-tools

redis-cli -h [MEMORY-STORE-HOST] -p 6379 incr mycounter
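
The same counter increment from Python, as a sketch (assumes pip install redis and that the VM can reach the Memorystore host):

import redis

r = redis.Redis(host='MEMORY-STORE-HOST', port=6379)  # Memorystore private IP
print(r.incr('mycounter'))  # atomically increments and returns the new value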

BigTable

Demo using the HBase API

1. Open Cloud Shell
2. git clone https://github.com/GoogleCloudPlatform/cloud-bigtable-examples.git
3. cd cloud-bigtable-examples/quickstart
4. ./quickstart.sh
5. At the hbase prompt, create a table
6. create 'singer','personal_cf','album_cf'
7. List your tables: list
8. put 'singer', '68934a7b-5e15-4265-b114-d105d4a3ecac-john', 'personal_cf:firstname', 'john'
9. Use the scan command to scan the table and read the data you added:
10. scan 'singer'
11. scan 'singer', { LIMIT => 3, FILTER => "QualifierFilter (=,'binary:firstname')" }
12. scan 'singer', { COLUMNS => ['personal_cf:firstname','album_cf:name'], LIMIT => 3 }
13. Only 3 columns are shown: row key or index key {column 1}, column name {column 2} and value {column 3}
14. Delete the 'singer' table:
15. disable 'singer'
16. drop 'singer'


Demo using the Cloud Bigtable (cbt) utility

1. Open Cloud Shell
2. gcloud components list
   gcloud components install cbt (skip if it is already present)
3. echo project = $DEVSHELL_PROJECT_ID > ~/.cbtrc
4. echo instance = <BIGTABLE_INSTANCE_ID> >> ~/.cbtrc
5. cbt createtable singer2
6. cbt ls
7. cbt createfamily singer2 personal_cf
8. cbt createfamily singer2 album_cf
9. cbt ls singer2
10. cbt set singer2 630cb375-f60f-478c-9dea-48ea9f59eec9-john personal_cf:firstname=john
11. cbt set singer2 630cb375-f60f-478c-9dea-48ea9f59eec9-john album_cf:name='rock rock'
12. cbt set singer2 630cb375-f60f-478c-9dea-48ea9f59eec9-john album_cf:name='rock2 rock2'
13. cbt read singer2
    You can see the two cell versions, similar to what was shown with the HBase API
14. cbt deletetable singer2
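
As a further illustration (not part of the original demo), the google-cloud-bigtable Python client can perform the same write and read; the instance ID and row key below are placeholders:

from google.cloud import bigtable

client = bigtable.Client(project='PROJECT_ID', admin=True)
table = client.instance('BIGTABLE_INSTANCE_ID').table('singer2')

# Write one cell into the personal_cf column family
row = table.direct_row(b'630cb375-f60f-478c-9dea-48ea9f59eec9-john')
row.set_cell('personal_cf', b'firstname', b'john')
row.commit()

# Read the row back
print(table.read_row(b'630cb375-f60f-478c-9dea-48ea9f59eec9-john').cells)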

BigQuery ML

Create a Model

CREATE OR REPLACE MODEL
  `stackoverflow.sample_model`
OPTIONS (model_type='logistic_reg') AS
SELECT
  IF(totals.transactions IS NULL, 0, 1) AS label,
  IFNULL(device.operatingsystem, "") AS os,
  device.isMobile AS is_mobile,
  IFNULL(geoNetwork.country, "") AS country,
  IFNULL(totals.pageviews, 0) AS pageviews
FROM
  `bigquery-public-data.google_analytics_sample.ga_sessions_*`
WHERE
  _TABLE_SUFFIX BETWEEN '20160801' AND '20170631'
LIMIT 100000;


Evaluate a Model

SELECT
  *
FROM
  ml.EVALUATE(MODEL `stackoverflow.sample_model`, (
    SELECT
      IF(totals.transactions IS NULL, 0, 1) AS label,
      IFNULL(device.operatingsystem, "") AS os,
      device.isMobile AS is_mobile,
      IFNULL(geoNetwork.country, "") AS country,
      IFNULL(totals.pageviews, 0) AS pageviews
    FROM
      `bigquery-public-data.google_analytics_sample.ga_sessions_*`
    WHERE
      _TABLE_SUFFIX BETWEEN '20170701' AND '20170831'
));

Prediction

SELECT
  country,
  SUM(predicted_label) AS total_predicted_purchases
FROM
  ml.PREDICT(MODEL `stackoverflow.sample_model`, (
    SELECT
      IF(totals.transactions IS NULL, 0, 1) AS label,
      IFNULL(device.operatingsystem, "") AS os,
      device.isMobile AS is_mobile,
      IFNULL(geoNetwork.country, "") AS country,
      IFNULL(totals.pageviews, 0) AS pageviews
    FROM
      `bigquery-public-data.google_analytics_sample.ga_sessions_*`
    WHERE
      _TABLE_SUFFIX BETWEEN '20170701' AND '20170831'))
GROUP BY country
ORDER BY total_predicted_purchases DESC
LIMIT 10;

IoT Core

1. Create a Pub/Sub topic using gcloud pubsub topics create my-device-events
2. Click Create IoT registry
3. Enter my-registry for the Registry ID.
4. If you're in the US, select us-central1 for the Region. If you're outside the US, select your preferred region.
5. Select MQTT for the Protocol.
6. In the Default telemetry topic dropdown list, select Create a topic.
7. In the Create a topic dialog, enter my-device-events in the Name field.
8. Click Create in the Create a topic dialog.
9. The Device state topic and Certificate value fields are optional, so leave them blank.
10. Click Create on the Cloud IoT Core page.
11. Generate a device key pair

openssl req -x509 -newkey rsa:2048 -keyout rsa_private.pem -nodes -out rsa_cert.pem -subj "/CN=unused"

12. Create a device with a Device ID, pasting the rsa_cert.pem content generated in step 11 as an RS256_X509 public key
13. Clone the samples: git clone https://github.com/GoogleCloudPlatform/nodejs-docs-samples
14. Change directory: cd nodejs-docs-samples/iot/mqtt_example
15. Copy rsa_private.pem to nodejs-docs-samples/iot/mqtt_example
16. Install the Node.js dependencies using npm install
17. Create a subscription using

gcloud pubsub subscriptions create \
  projects/$DEVSHELL_PROJECT_ID/subscriptions/my-subscription-iot \
  --topic=projects/$DEVSHELL_PROJECT_ID/topics/my-device-events

18. Run the Node script to simulate data generation

node cloudiot_mqtt_example_nodejs.js \
  mqttDeviceDemo \
  --projectId=$DEVSHELL_PROJECT_ID \
  --cloudRegion=asia-east1 \
  --registryId=demo \
  --deviceId=my-device \
  --privateKeyFile=rsa_private.pem \
  --numMessages=25 \
  --algorithm=RS256

19. gcloud pubsub subscriptions pull --auto-ack --limit 10 my-subscription-iot


Kubeflow Pipeline

1. Go to https://console.cloud.google.com/kubernetes/list
2. Go to the Marketplace, search for Kubeflow Pipelines, and deploy it
3. To get a pipeline, clone the Kubeflow Pipelines repo using git clone https://github.com/kubeflow/pipelines.git
4. Install the Python Kubeflow Pipelines SDK using pip3 install kfp --user. This command installs the dsl-compile and kfp binaries under ~/.local/bin
5. Add ~/.local/bin to the PATH environment variable using export PATH=$PATH:~/.local/bin
6. Type which dsl-compile to confirm it was added to PATH
7. Change directory to pipelines/samples/core/parallel_join
8. Each pipeline is defined as a Python program (a minimal pipeline sketch follows this list). Before you can submit a pipeline to the Kubeflow Pipelines service, you must compile it to an intermediate representation: a YAML file compressed into a .tar.gz file. Hence use

dsl-compile --py parallel_join.py --output parallel_join.yaml

9. Download the YAML to the local file system and upload it in Kubeflow Pipelines
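
parallel_join.py itself is not reproduced here; the following is a minimal, hypothetical pipeline in the same style (kfp v1 SDK assumed), which dsl-compile, or the Compiler API shown at the bottom, can turn into a package you upload in the Kubeflow Pipelines UI:

import kfp
from kfp import dsl


def echo_op(text):
    # One lightweight container step that just echoes its argument
    return dsl.ContainerOp(
        name='echo',
        image='library/bash:4.4.23',
        command=['sh', '-c'],
        arguments=['echo "$0"', text])


@dsl.pipeline(name='parallel-join-demo',
              description='Two parallel steps joined by a third')
def demo_pipeline():
    step_a = echo_op('hello')
    step_b = echo_op('world')
    echo_op('done').after(step_a, step_b)  # runs only after both parallel steps


if __name__ == '__main__':
    # Equivalent to: dsl-compile --py this_file.py --output demo_pipeline.yaml
    kfp.compiler.Compiler().compile(demo_pipeline, 'demo_pipeline.yaml')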
