Prerequisites
❖ Always use the Google Chrome browser in an Incognito window (Ctrl+Shift+N)
❖ Log in to https://console.cloud.google.com/ with your Gmail ID
Install Cloud SDK
Before installing the Cloud SDK, check whether it is already installed by typing gcloud --version in a command prompt. If it is not installed, click on this link.
Introduction to GCP
gcloud commands
Set Project ID in Cloud Shell
gcloud config set project PROJECT_ID
Provision a VM in the Mumbai region
gcloud compute instances create learn-gcp-gce-shell --zone=asia-south1-a
Create Storage bucket
gsutil mb gs://learn-gcp-pde-batch1-gcs-shell
Storage & Database
Create Multi-Regional, Regional, Standard, Nearline, Coldline and Archive storage classes
gsutil mb -c multi_regional -l asia gs://[BUCKET-NAME]-mr
gsutil mb -c regional -l asia-south1 gs://[BUCKET-NAME]-r
gsutil mb -c standard -l asia gs://[BUCKET-NAME]-mr
gsutil mb -c standard -l asia-south1 gs://[BUCKET-NAME]-r
gsutil mb -c nearline -l asia-south1 gs://[BUCKET-NAME]-nl
gsutil mb -c coldline -l asia-south1 gs://[BUCKET-NAME]-cl
gsutil mb -c archive -l asia-south1 gs://[BUCKET-NAME]-ar
Copy objects recursively with multi-threading
gsutil -m cp -r . gs://[BUCKET-NAME]-recursive-multithreading/
Change ACL to public for objects in a storage bucket
gsutil acl ch -u AllUsers:R gs://[BUCKET-NAME]-mr/*
Remove ACL
gsutil acl ch -d AllUsers gs://[BUCKET-NAME]/
Cloud Spanner - Interleaved Tables
Interleaving physically co-locates each Albums row with its parent Singers row, so parent-child reads and joins stay local; ON DELETE CASCADE removes a singer's albums when the singer is deleted.
CREATE TABLE Singers (
  SingerId INT64 NOT NULL,
  FirstName STRING(1024),
  LastName STRING(1024),
  SingerInfo BYTES(MAX),
) PRIMARY KEY (SingerId);

CREATE TABLE Albums (
  SingerId INT64 NOT NULL,
  AlbumId INT64 NOT NULL,
  AlbumTitle STRING(MAX),
) PRIMARY KEY (SingerId, AlbumId),
  INTERLEAVE IN PARENT Singers ON DELETE CASCADE;
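A minimal Python sketch of writing rows into these interleaved tables with the google-cloud-spanner client; the instance and database IDs below are placeholders, not names used elsewhere in these notes.

# pip install google-cloud-spanner
from google.cloud import spanner

client = spanner.Client()                            # uses your default project & credentials
instance = client.instance("demo-spanner-instance")  # placeholder instance ID
database = instance.database("demo-db")              # placeholder database ID

# Albums rows share the SingerId prefix of their parent Singers row,
# so Spanner stores them interleaved under that singer.
with database.batch() as batch:
    batch.insert(
        table="Singers",
        columns=("SingerId", "FirstName", "LastName"),
        values=[(1, "John", "Doe")],
    )
    batch.insert(
        table="Albums",
        columns=("SingerId", "AlbumId", "AlbumTitle"),
        values=[(1, 100, "Total Junk")],
    )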
Compute
Set default region and zone
gcloud config set compute/region asia-south1
gcloud config set compute/zone asia-south1-a
Update VM
sudo apt-get update
Install Git in the VM
sudo apt-get install git
Install Apache Web Server in the VM
sudo apt-get install apache2
Install MySQL Client in VM
sudo apt-get install mysql-client
List Compute Engine instances
gcloud compute instances list
Cloud IAM
gcloud commands
Primitive Roles
gcloud iam roles describe roles/owner | less
gcloud iam roles describe roles/editor | less
gcloud iam roles describe roles/viewer | less
Predefined Roles
gcloud iam roles describe roles/bigquery.dataOwner
gcloud iam roles describe roles/compute.admin
Policy & Bindings
gcloud projects get-iam-policy $DEVSHELL_PROJECT_ID --format json > demo.json
Service Account
gcloud iam service-accounts create demo-learn-gcp-console --display-name "Demo Learn GCP Service Account"
gcloud iam service-accounts list // Only user-managed service accounts are listed
Switch Service Account
gcloud auth activate-service-account --key-file=<YOUR JSON FILE>
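The downloaded key can also be used directly from code instead of switching gcloud; a minimal sketch with the google-cloud-storage client, where the key path, project ID and the choice of client are illustrative assumptions:

# pip install google-cloud-storage
from google.cloud import storage
from google.oauth2 import service_account

# Placeholder path to the JSON key downloaded for the service account
credentials = service_account.Credentials.from_service_account_file("key.json")

client = storage.Client(project="PROJECT_ID", credentials=credentials)
for bucket in client.list_buckets():   # runs as the service account, not your user
    print(bucket.name)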
GCE + GCS + IAM
1. Provision a GCE instance with git installed
2. Clone https://github.com/GoogleCloudPlatform/training-data-analyst
3. Change directory to courses/bdml_fundamentals/demos/earthquakevm
4. Install all the Python dependencies by running ./install_missing.sh
5. Ingest data on recent earthquakes from the USGS by running ./ingest.sh
6. Transform the raw data into an image by running ./transform.py
Recommendation ML using DataProc + Cloud SQL
1. Clone https://github.com/GoogleCloudPlatform/training-data-analyst
2. Change directory to CPB100/lab3a for the SQL scripts
3. Create a private MySQL instance
gcloud beta sql instances create [INSTANCE_NAME] --database-version=MYSQL_5_7 --tier=db-n1-standard-1 --region=asia-south1 --root-password=demogcp --network=default --no-assign-ip
4. Create the recommendation_spark database in the instance
gcloud sql databases create recommendation_spark --instance=[INSTANCE_NAME]
5. Give the Cloud SQL service account access to GCS
gcloud sql instances describe [INSTANCE_NAME] // Note the serviceAccountEmailAddress field
gsutil iam ch serviceAccount:[SERVICE_ACCOUNT]:roles/storage.objectViewer gs://[BUCKET_NAME]
6. Create the schema in the DB instance
gcloud sql import sql [INSTANCE_NAME] gs://[BUCKET_NAME]/table_creation.sql
7. Import data into the Rating & Accommodation tables
gcloud sql import csv [INSTANCE_NAME] gs://[BUCKET_NAME]/rating.csv --database=recommendation_spark --table=Rating
gcloud sql import csv [INSTANCE_NAME] gs://[BUCKET_NAME]/accomodation.csv --database=recommendation_spark --table=Accommodation
8. Create a Dataproc cluster that is deleted after being idle for 45 minutes (--max-idle 2700s)
gcloud beta dataproc clusters create cluster-dataproc --region asia-south1 --subnet default --zone asia-south1-b --master-machine-type n1-standard-1 --master-boot-disk-size 20 --num-workers 2 --worker-machine-type n1-standard-1 --worker-boot-disk-size 20 --image-version 1.5-debian10 --max-idle 2700s
9. Change directory to CPB100/lab3b for the PySpark code
10. Edit train_and_apply.py to set the Cloud SQL IP address & password (a sketch of what this script does follows after this list)
11. Copy train_and_apply.py to a GCS bucket using gsutil
12. Submit Job in DataProc with Job Type as PySpark
gcloud dataproc jobs submit pyspark --region asia-south1 --cluster cluster-dataproc gs://demo-gcp-mysql/train_and_apply.py
13. Log in to the VM, connect to MySQL, and execute the SQL below with a user id
14. select r.userid, r.accoid, r.prediction, a.title, a.location, a.price, a.rooms, a.rating, a.type from Recommendation as r, Accommodation as a where r.accoid = a.id and r.userid = 10;
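The lab's train_and_apply.py is not reproduced here; the sketch below only illustrates the idea (train a collaborative-filtering model on the Rating table and write predictions to the Recommendation table) using Spark ML's ALS and a JDBC connection. The Cloud SQL IP, the column names and the availability of a MySQL JDBC driver on the cluster are assumptions.

# Illustrative sketch only -- not the lab's actual train_and_apply.py
from pyspark.sql import SparkSession
from pyspark.ml.recommendation import ALS

spark = SparkSession.builder.appName("recommendation-sketch").getOrCreate()

jdbc_url = "jdbc:mysql://CLOUDSQL_IP:3306/recommendation_spark"   # placeholder IP
props = {"user": "root", "password": "demogcp", "driver": "com.mysql.jdbc.Driver"}

# Assumes the Rating table has userId, accoId and rating columns
ratings = spark.read.jdbc(url=jdbc_url, table="Rating", properties=props)

als = ALS(userCol="userId", itemCol="accoId", ratingCol="rating", coldStartStrategy="drop")
model = als.fit(ratings)

# Score user/accommodation pairs and persist them for the final SQL join in step 14
predictions = model.transform(ratings.select("userId", "accoId"))
predictions.write.jdbc(url=jdbc_url, table="Recommendation", mode="append", properties=props)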
BigData
BigQuery - Concepts
A query to explore Execution Details & slot usage
SELECT
  COUNT(*)
FROM
  `bigquery-samples.wikipedia_benchmark.Wiki10M`
WHERE
  title LIKE '%cloud%'
BigQuery initialization in Cloud Shell
bq init // Will create the .bigqueryrc file
BigQuery bq utilities
bq ls // List all the datasets
bq ls DATASET // List tables in a specific dataset
bq show DATASET.TABLE // Describe a table
bq ls -j -a // List all the jobs
bq show -j <JOBID> // Details of a job
BQ Bulk Loading
bq load PROJECT_ID:DATASET.TABLENAME bigquery_cloudshell.csv
bq query 'select firstname from DATASET.TABLENAME'
BQ Bulk Loading - Buggy
bq load --max_bad_records 1 PROJECT_ID:DATASET.TABLENAME bigquery_cloudshell_buggy.csv
BQ CLI - Query Cost using --dry_run
bq query --dry_run --use_legacy_sql=false
'SELECT * FROM DATASET.TABLENAME'
Partition Summary using Legacy SQL Dialect
SELECT * FROM [PROJECT_ID.DATASET.TABLENAME$__PARTITIONS_SUMMARY__]
Create daily partitions that expire after ONE day (86400 seconds):
bq load --time_partitioning_type=DAY --time_partitioning_expiration=86400 --require_partition_filter --autodetect
DATASET.expiresoon_partition_tb bigquery_cloudshell.csv
Partitioning on a Bigger Dataset
No Partitioning. The query below processes 840MB
SELECT
  title
FROM
  `bigquery-public-data.stackoverflow.stackoverflow_posts`
WHERE
  DATE(creation_date) BETWEEN "2015-01-01"
  AND "2015-12-31"
Copy a table from a Public Dataset
First, create a dataset named stackoverflow in the US region
CREATE TABLE
`PROJECT_ID.stackoverflow.partition_demo`
PARTITION BY
DATE(creation_date)
OPTIONS(require_partition_filter=true) AS
SELECT * FROM
`bigquery-public-data.stackoverflow.stackoverflow_posts`
With Partitioning. This query processes 170MB
SELECT
  title
FROM
  `PROJECT_ID.stackoverflow.partition_demo`
WHERE
  DATE(creation_date) BETWEEN "2015-01-01"
  AND "2015-12-31"
No Clustering. The query below processes 1.2GB
SELECT
  title
FROM
  `bigquery-public-data.stackoverflow.stackoverflow_posts`
WHERE
  DATE(creation_date) BETWEEN "2015-01-01"
  AND "2015-12-31"
  AND tags = 'android'
CREATE TABLE
`PROJECT_ID.stackoverflow.partition_clustering_demo`
PARTITION BY
DATE(creation_date)
CLUSTER BY tags
OPTIONS(require_partition_filter=true) AS
SELECT * FROM
`bigquery-public-data.stackoverflow.stackoverflow_posts`
With Partitioning + Clustering. This query processes 235MB
SELECT
  title
FROM
  `PROJECT_ID.stackoverflow.partition_clustering_demo`
WHERE
  DATE(creation_date) BETWEEN "2015-01-01"
  AND "2015-12-31"
  AND tags = 'android'
Connecting to BigQuery using Client Libraries
1. pip install google-cloud-bigquery
2. Create a Service Account with the BigQuery User role and download the key.json
3. set GOOGLE_APPLICATION_CREDENTIALS=key.json (Windows)
4. export GOOGLE_APPLICATION_CREDENTIALS=key.json (Linux)
5. python BigQuery-Python.py (a sketch of such a script is shown below)
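BigQuery-Python.py itself is not included in these notes; the following is only a minimal sketch of what such a script could look like with the google-cloud-bigquery client, reusing the Wikipedia benchmark query from earlier.

# pip install google-cloud-bigquery
# Relies on GOOGLE_APPLICATION_CREDENTIALS pointing at the service-account key
from google.cloud import bigquery

client = bigquery.Client()   # project is taken from the credentials

query = """
    SELECT COUNT(*) AS cnt
    FROM `bigquery-samples.wikipedia_benchmark.Wiki10M`
    WHERE title LIKE '%cloud%'
"""

# dry_run=True estimates the bytes the query would process without running it
dry_cfg = bigquery.QueryJobConfig(dry_run=True, use_query_cache=False)
print("Bytes processed:", client.query(query, job_config=dry_cfg).total_bytes_processed)

for row in client.query(query).result():
    print(row.cnt)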
Connecting to BigQuery via Google Compute Engine
sudo apt-get update
sudo apt-get install python-pip -y && pip install google-cloud-bigquery
Federated Queries - Cloud SQL
1. Enable the BigQuery Connection API
2. Create a connection in BigQuery
3. Run the query
SELECT * FROM EXTERNAL_QUERY('silken-eye-267917.asia-south1.demo', 'select name from city');
In-Built Functions
SELECT * FROM UNNEST([
  'Seattle WA', 'New York', 'Singapore'
]) AS city

--TimeZone
SELECT CURRENT_DATETIME('Asia/Calcutta')

SELECT
  time_zone,
  DATETIME_DIFF(CURRENT_DATETIME(time_zone), CURRENT_DATETIME(), MINUTE) AS minutes_from_utc
FROM UNNEST(['America/Los_Angeles', 'Asia/Calcutta']) AS time_zone;
Dataproc
Job Type -> Spark
Main class or jar -> org.apache.spark.examples.SparkPi
Arguments -> 1000
Jar files -> file:///usr/lib/spark/examples/jars/spark-examples.jar
8088 (Hadoop job interface: YARN ResourceManager web UI)
9870 (Hadoop admin interface: HDFS NameNode web UI)
Pub/Sub
Create a topic and a subscription, publish a message, then pull it (create the subscription before publishing so the pulled message is actually retained for it)
gcloud pubsub topics create myTopic
gcloud pubsub subscriptions create --topic myTopic mySubscription
gcloud pubsub topics publish myTopic --message "Hello PubSub"
gcloud pubsub subscriptions pull --auto-ack --limit 10 mySubscription
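The same flow from Python, as a minimal sketch with a recent google-cloud-pubsub client (the request-dict call style assumes version 2.x or later; PROJECT_ID is a placeholder, and the topic and subscription names reuse the ones created above).

# pip install google-cloud-pubsub
from google.cloud import pubsub_v1

project_id = "PROJECT_ID"   # placeholder

publisher = pubsub_v1.PublisherClient()
topic_path = publisher.topic_path(project_id, "myTopic")
future = publisher.publish(topic_path, b"Hello PubSub")
print("Published message ID:", future.result())

subscriber = pubsub_v1.SubscriberClient()
subscription_path = subscriber.subscription_path(project_id, "mySubscription")
response = subscriber.pull(request={"subscription": subscription_path, "max_messages": 10})
if response.received_messages:
    for msg in response.received_messages:
        print(msg.message.data.decode())
    # Acknowledge so the messages are not redelivered
    subscriber.acknowledge(request={
        "subscription": subscription_path,
        "ack_ids": [m.ack_id for m in response.received_messages],
    })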
Dataflow
Note: Make sure the BQ dataset & the regional Cloud Storage buckets are in the SAME region for this demo
1. Create an empty table in a us-central1 dataset in BigQuery with this schema using
bq mk YOUR_DATASET_NAME.batch_processing username:STRING,branch_id:STRING,product_sku:STRING,product_quantity:INTEGER,date:TIMESTAMP,product_name:STRING,store:STRING
2. Create a Regional Bucket named $DEVSHELL_PROJECT_ID-templates in us-central1 region
3. Upload dataflow.zip to Cloud Shell and unzip it using unzip dataflow.zip
4. Change directory to dataflow
5. Copy Metadata file named SKU_Data_metadata to gs://$DEVSHELL_PROJECT_ID-templates/templates/
6. Run pip3 install -r requirements.txt --user
7. Obtain user access credentials & authenticate using gcloud auth application-default login
8. Run the command below to create a Dataflow template
python3 load-to-bq.py --runner DataFlowRunner --project $DEVSHELL_PROJECT_ID --temp_location gs://$DEVSHELL_PROJECT_ID-templates/temp --staging_location gs://$DEVSHELL_PROJECT_ID-templates/staging --region us-central1 --template_location gs://$DEVSHELL_PROJECT_ID-templates/templates/SKU_Data --experiment=use_beam_bq_sink
9. Upload gcf.zip to cloud shell and unzip it using unzip gcf.zip
10. Change directory to gcf; the function is deployed in step 12 (a sketch of its code follows after this list)
11. Create a Regional Bucket called $DEVSHELL_PROJECT_ID-batch-processing
12. Deploy the Cloud Function:
gcloud functions deploy gcs_gcf_bq_trigger --entry-point=launch_dataflow_pipeline --region=us-central1 --runtime=python37 --trigger-bucket=$DEVSHELL_PROJECT_ID-batch-processing --set-env-vars PROJECT_ID=$DEVSHELL_PROJECT_ID,TEMPLATE_BUCKET=$DEVSHELL_PROJECT_ID-templates,BQ_TABLE=YOUR_DATASET_NAME.batch_processing -q
13. Upload data.csv to the $DEVSHELL_PROJECT_ID-batch-processing Cloud Storage bucket and check that the Dataflow pipeline gets triggered
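The contents of gcf.zip are not included in these notes; the following is only a minimal sketch of what launch_dataflow_pipeline could look like, assuming a GCS-triggered background function that launches the SKU_Data template through the Dataflow templates REST API and that the template expects "input" and "output" parameters (those names are an assumption, since load-to-bq.py is not shown).

# main.py -- illustrative sketch only, not the actual code shipped in gcf.zip
# requirements.txt would need: google-api-python-client
import os
from googleapiclient.discovery import build

def launch_dataflow_pipeline(event, context):
    """Triggered when a file lands in the -batch-processing bucket."""
    project = os.environ["PROJECT_ID"]
    template_bucket = os.environ["TEMPLATE_BUCKET"]
    bq_table = os.environ["BQ_TABLE"]
    input_file = "gs://{}/{}".format(event["bucket"], event["name"])   # uploaded CSV

    dataflow = build("dataflow", "v1b3")   # uses the function's default credentials
    body = {
        "jobName": "batch-processing-" + event["name"].replace(".", "-"),
        # Parameter names depend on how load-to-bq.py defines its pipeline options
        "parameters": {"input": input_file, "output": bq_table},
    }
    dataflow.projects().locations().templates().launch(
        projectId=project,
        location="us-central1",
        gcsPath="gs://{}/templates/SKU_Data".format(template_bucket),
        body=body,
    ).execute()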
Cloud Functions
Call an authenticated HTTP function using
curl https://REGION-PROJECT_ID.cloudfunctions.net/FUNCTION_NAME \
-H "Authorization: bearer $(gcloud auth print-identity-token)"
DataLab
datalab create demo-datalab --network=default --zone=asia-southeast1-a --quiet
Cloud Memorystore
sudo apt-get install redis-tools
redis-cli -h [MEMORY-STORE-HOST] -p 6379 incr mycounter
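The same counter increment from Python, as a minimal sketch using the redis package; run it from a VM on the instance's authorized network, and replace the host placeholder with your Memorystore IP.

# pip install redis
import redis

# Placeholder: replace with your Memorystore instance's host IP
r = redis.Redis(host="MEMORY-STORE-HOST", port=6379)
print(r.incr("mycounter"))   # increments the counter and returns its new value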
BigTable
Demo using HBase APIs
1. Open Cloud Shell
2. git clone https://github.com/GoogleCloudPlatform/cloud-bigtable-examples.git
3. cd cloud-bigtable-examples/quickstart
4. ./quickstart.sh
5. At the hbase prompt, create a table
6. create 'singer','personal_cf','album_cf'
7. List your tables: list
8. put 'singer', '68934a7b-5e15-4265-b114-d105d4a3ecac-john', 'personal_cf:firstname', 'john'
9. Use the scan command to scan the table and read the data you added:
10. scan 'singer'
11. scan 'singer', { LIMIT => 3, FILTER => "QualifierFilter (=,'binary:firstname')" }
12. scan 'singer', { COLUMNS => ['personal_cf:firstname','album_cf:name'], LIMIT => 3 }
13. The output has only 3 columns: the row key or index key (column 1), the column name (column 2) and the value (column 3)
14. Delete the table singer:
15. disable 'singer'
16. drop 'singer'
Demo using the Cloud Bigtable (cbt) utility (a Python client sketch follows after these steps)
1. Open Cloud Shell
2. gcloud components list
gcloud components install cbt (skip if it is already installed)
3. echo project = $DEVSHELL_PROJECT_ID > ~/.cbtrc
4. echo instance = <BIGTABLE_INSTANCE_ID> >> ~/.cbtrc
5. cbt createtable singer2
6. cbt ls
7. cbt createfamily singer2 personal_cf
8. cbt createfamily singer2 album_cf
9. cbt ls singer2
10. cbt set singer2 630cb375-f60f-478c-9dea-48ea9f59eec9-john personal_cf:firstname=john
11. cbt set singer2 630cb375-f60f-478c-9dea-48ea9f59eec9-john album_cf:name='rock rock'
12. cbt set singer2 630cb375-f60f-478c-9dea-48ea9f59eec9-john album_cf:name='rock2 rock2'
13. cbt read singer2
→ You can see two versions of album_cf:name, as was shown using the HBase API
14. cbt deletetable singer2
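The same writes and reads from Python, as a minimal sketch with the google-cloud-bigtable client; the instance ID is a placeholder, while the table, column families and row key match the cbt demo above.

# pip install google-cloud-bigtable
from google.cloud import bigtable

client = bigtable.Client()                           # default project & credentials
instance = client.instance("BIGTABLE_INSTANCE_ID")   # placeholder instance ID
table = instance.table("singer2")

row_key = b"630cb375-f60f-478c-9dea-48ea9f59eec9-john"
row = table.direct_row(row_key)
row.set_cell("personal_cf", "firstname", b"john")
row.set_cell("album_cf", "name", b"rock rock")
row.commit()

# Read the row back; each column can hold multiple timestamped versions
result = table.read_row(row_key)
for family, columns in result.cells.items():
    for qualifier, cells in columns.items():
        for cell in cells:
            print(family, qualifier.decode(), cell.value.decode(), cell.timestamp)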
BigQuery ML
Create a Model
CREATE OR REPLACE MODEL `stackoverflow.sample_model`
OPTIONS (model_type='logistic_reg') AS
SELECT
  IF(totals.transactions IS NULL, 0, 1) AS label,
  IFNULL(device.operatingSystem, "") AS os,
  device.isMobile AS is_mobile,
  IFNULL(geoNetwork.country, "") AS country,
  IFNULL(totals.pageviews, 0) AS pageviews
FROM
  `bigquery-public-data.google_analytics_sample.ga_sessions_*`
WHERE
  _TABLE_SUFFIX BETWEEN '20160801' AND '20170630'
LIMIT 100000;
Evaluate a Model
SELECT
  *
FROM
  ML.EVALUATE(MODEL `stackoverflow.sample_model`, (
    SELECT
      IF(totals.transactions IS NULL, 0, 1) AS label,
      IFNULL(device.operatingSystem, "") AS os,
      device.isMobile AS is_mobile,
      IFNULL(geoNetwork.country, "") AS country,
      IFNULL(totals.pageviews, 0) AS pageviews
    FROM
      `bigquery-public-data.google_analytics_sample.ga_sessions_*`
    WHERE
      _TABLE_SUFFIX BETWEEN '20170701' AND '20170831'));
Prediction
SELECT
  country,
  SUM(predicted_label) AS total_predicted_purchases
FROM
  ML.PREDICT(MODEL `stackoverflow.sample_model`, (
    SELECT
      IF(totals.transactions IS NULL, 0, 1) AS label,
      IFNULL(device.operatingSystem, "") AS os,
      device.isMobile AS is_mobile,
      IFNULL(geoNetwork.country, "") AS country,
      IFNULL(totals.pageviews, 0) AS pageviews
    FROM
      `bigquery-public-data.google_analytics_sample.ga_sessions_*`
    WHERE
      _TABLE_SUFFIX BETWEEN '20170701' AND '20170831'))
GROUP BY
  country
ORDER BY
  total_predicted_purchases DESC
LIMIT
  10;
IoT Core
1. Create a Pub/Sub topic using gcloud pubsub topics create my-device-events
2. Click Create IoT registry
3. Enter my-registry for the Registry ID.
4. If you're in the US, select us-central1 for the Region. If you're outside the US, select your preferred region.
5. Select MQTT for the Protocol.
6. In the Default telemetry topic dropdown list, select Create a topic.
7. In the Create a topic dialog, enter my-device-events in the Name field.
8. Click Create in the Create a topic dialog.
9. The Device state topic and Certificate value fields are optional, so leave them blank.
10. Click Create on the Cloud IoT Core page.
11. Generate a device key pair
openssl req -x509 -newkey rsa:2048 -keyout rsa_private.pem -nodes -out rsa_cert.pem -subj "/CN=unused"
12. Create a device with a Device ID, pasting the rsa_cert.pem content generated in step 11 with RS256_X509 as the key format
13. Clone the samples repo: git clone https://github.com/GoogleCloudPlatform/nodejs-docs-samples
14. Change directory: cd nodejs-docs-samples/iot/mqtt_example
15. Copy rsa_private.pem to nodejs-docs-samples/iot/mqtt_example
16. Install the Node.js dependencies using npm install
17. Create a Subscription using
gcloud pubsub subscriptions create projects/$DEVSHELL_PROJECT_ID/subscriptions/my-subscription-iot --topic=projects/$DEVSHELL_PROJECT_ID/topics/my-device-events
18. Run the Node script to simulate data generation (adjust --cloudRegion, --registryId and --deviceId to match the registry and device you created above)
node cloudiot_mqtt_example_nodejs.js \
mqttDeviceDemo \
--projectId=$DEVSHELL_PROJECT_ID \
--cloudRegion=asia-east1 \
--registryId=demo \
--deviceId=my-device \
--privateKeyFile=rsa_private.pem \
--numMessages=25 \
--algorithm=RS256
19. gcloud pubsub subscriptions pull --auto-ack --limit 10 my-subscription-iot
Kubeflow Pipeline
1. Go to https://console.cloud.google.com/kubernetes/list
2. Go to the Marketplace, search for Kubeflow Pipelines, and deploy it
3. To get a pipeline, clone the Kubeflow Pipelines repo using git clone https://github.com/kubeflow/pipelines.git
4. Install Python Kubeflow Pipelines SDK using pip3 install kfp --user. This command installs the dsl-compile and kfp binaries under ~/.local/bin
5. Add ~/.local/bin to the PATH environment variable using export PATH=$PATH:~/.local/bin
6. Type which dsl-compile to confirm it is on the PATH
7. Change directory to pipelines/samples/core/parallel_join
8. Each pipeline is defined as a Python program. Before you can submit a pipeline to the Kubeflow Pipelines service, you must compile it to an intermediate representation (a YAML file, which may also be packaged as a .tar.gz file). Compile it with
dsl-compile --py parallel_join.py --output parallel_join.yaml
9. Download the YAML to your local file system and upload it in the Kubeflow Pipelines UI (a minimal example pipeline definition follows below)
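For reference, a pipeline definition for the KFP v1 SDK (the version that ships dsl-compile) looks roughly like the sketch below; this is a simplified illustration with a made-up busybox step, not the actual parallel_join.py sample.

# Minimal KFP v1 pipeline sketch; compile with:
#   dsl-compile --py my_pipeline.py --output my_pipeline.yaml
import kfp.dsl as dsl

def echo_op(text):
    # Each step runs as a container; here a busybox container just echoes its input
    return dsl.ContainerOp(
        name="echo",
        image="library/busybox",
        command=["sh", "-c"],
        arguments=['echo "%s"' % text],
    )

@dsl.pipeline(name="echo-demo", description="Two independent echo steps that run in parallel.")
def echo_pipeline():
    # No data dependency between the two ops, so Kubeflow schedules them in parallel
    echo_op("hello")
    echo_op("world")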