Purpose

Save the state of an ML model during training.

Usage

Save snapshots of the model during training. Restart a training job from a snapshot, or analyze the intermediate state of the model while it trains. Use checkpoints with managed spot instances to save cost.
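In the SageMaker Python SDK, checkpointing is enabled by pointing the training job at an S3 checkpoint location. A minimal sketch, assuming a PyTorch estimator and a hypothetical script name and bucket:

import sagemaker
from sagemaker.pytorch import PyTorch

# Illustrative S3 prefix; replace with your own bucket and job name.
checkpoint_s3_uri = "s3://my-bucket/my-training-job/checkpoints"

estimator = PyTorch(
    entry_point="train.py",               # your training script
    role=sagemaker.get_execution_role(),
    framework_version="1.13",
    py_version="py39",
    instance_count=1,
    instance_type="ml.p3.2xlarge",
    # SageMaker syncs this S3 prefix with the local checkpoint path.
    checkpoint_s3_uri=checkpoint_s3_uri,
    checkpoint_local_path="/opt/ml/checkpoints",
)

estimator.fit("s3://my-bucket/training-data")

Launching a later job with the same checkpoint_s3_uri lets it start from the checkpoints the earlier job saved.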

How does it work?

Training code runs in training containers on EC2 instances and uses callback functions to create checkpoints.

The default local location is /opt/ml/checkpoints inside the container, and SageMaker automatically syncs this directory with S3. At the start of the job, SageMaker copies any existing checkpoint data from S3 to /opt/ml/checkpoints; checkpoints added to S3 after the job has started are not copied. During training, SageMaker takes snapshots of /opt/ml/checkpoints and syncs the data to S3.
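Inside the training script, the framework's own checkpoint callback can simply target that local directory and let SageMaker handle the S3 sync. A minimal TensorFlow/Keras sketch (the model itself is only illustrative):

import os
import tensorflow as tf

CHECKPOINT_DIR = "/opt/ml/checkpoints"  # default directory synced with S3

model = tf.keras.Sequential([
    tf.keras.Input(shape=(10,)),
    tf.keras.layers.Dense(64, activation="relu"),
    tf.keras.layers.Dense(1),
])
model.compile(optimizer="adam", loss="mse")

# Save weights into the checkpoint directory at the end of every epoch.
checkpoint_cb = tf.keras.callbacks.ModelCheckpoint(
    filepath=os.path.join(CHECKPOINT_DIR, "ckpt-{epoch:02d}.weights.h5"),
    save_weights_only=True,
)

# model.fit(x_train, y_train, epochs=10, callbacks=[checkpoint_cb])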

Frameworks and built-in algorithms

Supported deep learning frameworks: TensorFlow, MXNet, PyTorch, and Hugging Face (for Hugging Face, specify the checkpoint output location using hyperparameters).
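For Hugging Face, one common pattern is to pass the checkpoint directory as a hyperparameter that the training script forwards to its Trainer as output_dir. A sketch with illustrative version pins, role ARN, and bucket name:

from sagemaker.huggingface import HuggingFace

huggingface_estimator = HuggingFace(
    entry_point="train.py",
    role="arn:aws:iam::123456789012:role/SageMakerRole",
    instance_type="ml.p3.2xlarge",
    instance_count=1,
    transformers_version="4.26",
    pytorch_version="1.13",
    py_version="py39",
    # The script is assumed to use output_dir as its checkpoint location,
    # which is the directory SageMaker syncs with S3.
    hyperparameters={
        "output_dir": "/opt/ml/checkpoints",
        "epochs": 3,
    },
    checkpoint_s3_uri="s3://my-bucket/hf-job/checkpoints",
)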

Supported built-in algorithms: Image Classification, Object Detection, Semantic Segmentation, and XGBoost (0.90-1 or later). For XGBoost in script mode, you need to configure checkpointing manually in your training script (a sketch follows).
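One way a script-mode XGBoost job can checkpoint manually: load a previous model from the checkpoint directory if the job was restarted, continue boosting from it, and save the result back to that directory. The file name here is illustrative:

import os
import xgboost as xgb

CHECKPOINT_DIR = "/opt/ml/checkpoints"
CHECKPOINT_FILE = os.path.join(CHECKPOINT_DIR, "xgboost-checkpoint")

def train(dtrain, params, num_boost_round=100):
    os.makedirs(CHECKPOINT_DIR, exist_ok=True)

    # Resume from an earlier checkpoint if one was synced down from S3.
    prev_model = CHECKPOINT_FILE if os.path.exists(CHECKPOINT_FILE) else None

    booster = xgb.train(
        params,
        dtrain,
        num_boost_round=num_boost_round,
        xgb_model=prev_model,   # continue boosting from the saved model
    )

    # Persist the model so SageMaker syncs it to S3.
    # A real script would also track how many boosting rounds remain.
    booster.save_model(CHECKPOINT_FILE)
    return booster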

Managed Spot Instances

SageMaker manages checkpoints and resumes the training job on the next available spot instance. If an algorithm that does not support checkpointing is used in a managed spot training job, SageMaker does not allow a maximum wait time greater than 1 hour, so that no more than an hour of training is lost to interruptions.
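Managed spot training and checkpointing are both configured on the estimator. A sketch with illustrative limits, image, and paths:

from sagemaker.estimator import Estimator

estimator = Estimator(
    image_uri="<training-image-uri>",    # built-in algorithm or custom image
    role="arn:aws:iam::123456789012:role/SageMakerRole",
    instance_count=1,
    instance_type="ml.m5.xlarge",
    # Managed spot training: the job may be interrupted and resumed.
    use_spot_instances=True,
    max_run=3600,    # maximum training time in seconds
    max_wait=7200,   # must be >= max_run; includes time spent waiting for spot capacity
    # Checkpoints let the resumed job continue where it left off.
    checkpoint_s3_uri="s3://my-bucket/spot-job/checkpoints",
)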

For custom training containers and other frameworks
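With a custom container or a framework not listed above, the training script itself is responsible for writing checkpoint files to the local checkpoint directory and for reloading the latest one at startup so a restarted job can resume. A minimal PyTorch-flavored sketch, with illustrative names:

import os
import torch

CHECKPOINT_DIR = "/opt/ml/checkpoints"  # or the path passed as checkpoint_local_path
CHECKPOINT_PATH = os.path.join(CHECKPOINT_DIR, "checkpoint.pt")

def load_checkpoint(model, optimizer):
    """Resume from the last checkpoint if one was synced down from S3."""
    if not os.path.exists(CHECKPOINT_PATH):
        return 0  # no checkpoint yet, start from epoch 0
    state = torch.load(CHECKPOINT_PATH)
    model.load_state_dict(state["model"])
    optimizer.load_state_dict(state["optimizer"])
    return state["epoch"] + 1

def save_checkpoint(model, optimizer, epoch):
    """Write state to the directory SageMaker syncs to S3."""
    os.makedirs(CHECKPOINT_DIR, exist_ok=True)
    torch.save(
        {"model": model.state_dict(),
         "optimizer": optimizer.state_dict(),
         "epoch": epoch},
        CHECKPOINT_PATH,
    )

# In the training loop:
#   start_epoch = load_checkpoint(model, optimizer)
#   for epoch in range(start_epoch, num_epochs):
#       ...train one epoch...
#       save_checkpoint(model, optimizer, epoch)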
