Training jobs can be:
- single-node training → one worker pool
- distributed training → multiple worker pools
Custom job:
It's the basic way to run your custom ML training code in Vertex AI.
Steps:
Create a custom container image → push it to a container registry → create the custom job
or
Use a prebuilt container instead of building your own image
- Custom jobs can run on a persistent resource instead of creating new compute resources during job startup.
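The custom-container steps above can be sketched with the Vertex AI SDK for Python. This is a minimal sketch under stated assumptions, not a definitive implementation: the helper names `single_node_pool` and `submit_custom_job`, the display name, and the default machine type are hypothetical, and the image is assumed to be already pushed to a registry.

```python
def single_node_pool(image_uri, machine_type="n1-standard-4", args=None):
    """Return worker_pool_specs with one pool of one replica (single-node training)."""
    return [
        {
            "machine_spec": {"machine_type": machine_type},  # assumed machine type
            "replica_count": 1,
            "container_spec": {"image_uri": image_uri, "args": list(args or [])},
        }
    ]


def submit_custom_job(project, location, staging_bucket, image_uri):
    """Create and run a CustomJob from a container image already in a registry."""
    # Imported lazily so the spec helper above works without the SDK installed.
    from google.cloud import aiplatform

    aiplatform.init(project=project, location=location, staging_bucket=staging_bucket)
    job = aiplatform.CustomJob(
        display_name="custom-training-job",  # hypothetical name
        worker_pool_specs=single_node_pool(image_uri),
    )
    job.run()  # synchronous by default: returns once the job has finished
    return job
```

Calling `submit_custom_job` needs Google Cloud credentials and real project, region, and bucket values. Recent SDK versions appear to accept a `persistent_resource_id` argument on `CustomJob` for running on a persistent resource; check the reference for the installed version before relying on it.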
A training pipeline in Vertex AI orchestrates custom training jobs with additional steps, such as loading a dataset or uploading the model to Vertex AI after the training job is successfully completed.
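Difference between aiplatform, aiplatform_v1 and aiplatform_v1beta1:
- aiplatform is the high-level Vertex AI SDK for Python and is the usual entry point.
- aiplatform_v1 and aiplatform_v1beta1 are lower-level clients pinned to a specific API version (v1 is the stable API, v1beta1 carries preview features); use them when an application needs to target that version directly.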
Custom Job Python API documentation link:
https://cloud.google.com/python/docs/reference/aiplatform/latest/google.cloud.aiplatform.CustomJob#google_cloud_aiplatform_CustomJob
Vertex AI Notebooks:
1- Vertex AI Workbench: a JupyterLab experience with advanced customization capabilities.
2- Colab Enterprise: a serverless, collaborative environment with AI-powered code assistance.
A TensorBoard instance is a managed service that tracks your machine learning (ML) training and experiments. It provides a graphical user interface (GUI) for visualizing your ML metrics and logs.
#####
worker_pool_specs: one worker pool (WorkerPoolSpec) for single-node training, or multiple worker pools for distributed training
staging_bucket: Cloud Storage bucket for artifacts produced by the custom job
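For distributed training, the worker pool list could be built as follows. This is a sketch only: `distributed_pools` is a hypothetical helper, the machine type is a placeholder, and it assumes the documented convention that the first pool holds exactly one primary replica while the second pool holds the additional workers.

```python
def distributed_pools(image_uri, worker_count, machine_type="n1-standard-4"):
    """Return worker_pool_specs with a primary pool plus a pool of extra workers."""
    if worker_count < 1:
        raise ValueError("distributed training needs at least one additional worker")

    def pool(replicas):
        # Every pool here reuses the same image and machine type (an assumption).
        return {
            "machine_spec": {"machine_type": machine_type},
            "replica_count": replicas,
            "container_spec": {"image_uri": image_uri},
        }

    # Pool 0: the single primary replica. Pool 1: the remaining workers.
    return [pool(1), pool(worker_count)]
```

The returned list would be passed as `worker_pool_specs` to `aiplatform.CustomJob`, together with a `staging_bucket` such as `gs://my-bucket` (placeholder), either on the job itself or through `aiplatform.init`.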
An experiment is a collection of training jobs that are designed to test a particular hypothesis about your data or your machine learning model.
An experiment run (experiment_run) is one execution within an experiment, for example a single training job with its own parameters and metrics.
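The relationship between an experiment and its runs can be sketched with the SDK's experiment-tracking calls (`init`, `start_run`, `log_params`, `log_metrics`, `end_run`). The helper name `log_experiment_run` and all example values are hypothetical; the SDK module is passed in as an argument so the function can be exercised without cloud access.

```python
def log_experiment_run(sdk, experiment, run, params, metrics):
    """Record one run under an experiment via the SDK's experiment-tracking calls."""
    sdk.init(experiment=experiment)  # select the experiment to log into
    sdk.start_run(run)               # open a run inside that experiment
    try:
        sdk.log_params(params)       # hyperparameters for this run
        sdk.log_metrics(metrics)     # summary metrics for this run
    finally:
        sdk.end_run()                # close the run even if logging fails


# Usage with the real SDK (needs credentials and a configured project);
# the names and values below are placeholders, not real results:
#   from google.cloud import aiplatform
#   log_experiment_run(aiplatform, "lr-sweep", "run-lr-001", {"lr": 0.01}, {"accuracy": 0.9})
```

Each call with a different `run` name adds another run to the same experiment, which is what makes runs comparable side by side. According to the CustomJob reference, `CustomJob.run` also takes `experiment` and `experiment_run` arguments to associate a job with a run.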