SageMaker Training

Training jobs accept one or more input paths and produce a model that can be deployed for inference.

  • Running locally, standard command line arguments are used for inputs and outputs as usual
  • Running remotely, data is uploaded to and downloaded from S3, and each path is tracked by SageMaker

Basic usage

Write a script that defines a main function and passes it to sagemaker_training_main.

from aws_sagemaker_remote import sagemaker_training_main

def main(args):
    # your code here
    pass

if __name__ == '__main__':
    sagemaker_training_main(
        main=main,
        # ...
    )

Pass the function argument run=True or the command line argument --sagemaker-run=yes to run the script remotely on SageMaker.
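
For example, assuming the script above is saved as train.py (file name hypothetical):

python train.py                      # runs locally
python train.py --sagemaker-run=yes  # runs remotely on SageMaker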

Path Handling

Inputs

Configure inputs by passing an inputs dictionary argument to sagemaker_training_main. See aws_sagemaker_remote.args.sagemaker_training_args().

For example, if your dictionary contains the key my_dataset (see the sketch after this list):

  • The command line argument --my-dataset accepts a local path or an S3 URL
  • Local paths are uploaded to S3
  • Data is downloaded from S3 to the training container
  • The location of the data on the container is read from the environment
  • Your main function is called with args.my_dataset set to the data's mount point on the container
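
A minimal sketch, assuming a local dataset at path/to/my_dataset (path hypothetical):

from aws_sagemaker_remote import sagemaker_training_main

def main(args):
    # Locally: args.my_dataset is the path given by --my-dataset.
    # On SageMaker: args.my_dataset is the data's mount point on the container.
    print(args.my_dataset)

if __name__ == '__main__':
    sagemaker_training_main(
        main=main,
        inputs={
            # Default for --my-dataset; accepts a local path or an S3 URL
            'my_dataset': 'path/to/my_dataset'
        },
    )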

Outputs

There are three output paths:

  • args.model_dir and --model-dir: Directory to export the trained inference model. Used when deploying the model for inference. Save everything you need for inference, but omit optimizer state to keep the inference deployment small.
  • args.output_dir and --output-dir: Directory for outputs (logs, images, etc.)
  • args.checkpoint_dir and --checkpoint-dir: Directory for training checkpoints. Save the model, optimizer, step count, and anything else you need to resume. This directory is backed up and restored if training is interrupted.

Running locally, these arguments are passed through unchanged. Running on SageMaker, they are automatically set to container mount points that are uploaded to S3.
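
A minimal sketch of a main function using the three paths; the helpers build_model and train_epoch are hypothetical stand-ins for your training code:

import os
import torch

def main(args):
    model, optimizer = build_model()  # hypothetical setup
    os.makedirs(args.checkpoint_dir, exist_ok=True)
    for epoch in range(10):
        train_epoch(model, optimizer)  # hypothetical training loop
        # Checkpoint everything needed to resume: model, optimizer, progress
        torch.save(
            {'model': model.state_dict(),
             'optimizer': optimizer.state_dict(),
             'epoch': epoch},
            os.path.join(args.checkpoint_dir, 'checkpoint.pt'))
    # Logs, images, and other artifacts
    os.makedirs(args.output_dir, exist_ok=True)
    with open(os.path.join(args.output_dir, 'log.txt'), 'w') as f:
        f.write('training complete')
    # Final inference model only; omit optimizer state
    os.makedirs(args.model_dir, exist_ok=True)
    torch.save(model, os.path.join(args.model_dir, 'model.pt'))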

Training Job Tracking

Use the SageMaker console to view a list of all training jobs. For each job, SageMaker tracks:

  • Training time
  • Container used
  • Link to CloudWatch logs
  • Path on S3 for each of:
    • Source code ZIP
    • Each input channel
    • Model output ZIP
    • Dependencies (if used)

Configuration

sagemaker_training_main adds a number of command line options to your script.

Option --sagemaker-run controls local or remote execution.

  • Set --sagemaker-run to a falsy value (no, false, 0) and the script will call your main function as usual, running locally.
  • Set --sagemaker-run to a truthy value (yes, true, 1) and the script will upload itself and any requirements or inputs to S3, execute remotely on SageMaker, save outputs to S3, and log results to the terminal.

Set --sagemaker-wait to a truthy value to tail logs and wait for completion, or to a falsy value to return as soon as the job starts.

Defaults are set in code and can be overridden on the command line. For example (see the sketch after this list):

  • Use the function argument training_image to set the default container image for your script
  • Use the command line argument --sagemaker-training-image to override the container image on a particular run
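
A minimal sketch of both, assuming a script named train.py and a hypothetical image tag my-image:latest. In code:

sagemaker_training_main(
    main=main,
    training_image='aws-sagemaker-remote-training:latest',
)

On the command line, for a particular run:

python train.py --sagemaker-run=yes --sagemaker-training-image=my-image:latest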

See aws_sagemaker_remote.args.sagemaker_training_args() and Command-Line Arguments for details.

Environment Customization

The environment can be customized in multiple ways.

  • Instance
    • Function argument training_instance
    • Command line argument --sagemaker-training-instance
    • Select instance type of machine running the container
  • Image
    • Function argument training_image
    • Command line argument --sagemaker-training-image
    • Accepts URI of Docker container image on ECR or DockerHub to run
    • Build a custom Docker image for major customizations
  • Requirements file
    • Place a requirements.txt file in your source directory to install additional packages (see Example Code below)
  • Dependencies
    • Function argument dependencies
      • Dictionary of [key]->[value]
      • Each key creates a command line argument --[key] (underscores become hyphens) that defaults to [value]
    • Each value is a directory containing a Python module that will be uploaded to S3, downloaded to SageMaker, and put on the PYTHONPATH
    • For example, if the directory mymodule contains the files __init__.py and myfile.py, and myfile.py contains def myfunction():..., pass dependencies={'mymodule': 'path/to/mymodule'} to sagemaker_training_main and then use from mymodule.myfile import myfunction in your script (see the sketch below)
    • Use module uploads for supporting code that is not installed from packages
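
A minimal sketch of the mymodule example above (paths hypothetical):

from aws_sagemaker_remote import sagemaker_training_main

def main(args):
    # mymodule was uploaded to S3, downloaded to the container,
    # and placed on the PYTHONPATH
    from mymodule.myfile import myfunction
    myfunction()

if __name__ == '__main__':
    sagemaker_training_main(
        main=main,
        dependencies={
            # Default for --mymodule; override on the command line
            'mymodule': 'path/to/mymodule'
        },
    )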

Spot Training

Save on training costs by using spot training. Rather than starting immediately, AWS runs the job when spare capacity is available, in exchange for cost savings.

  • --sagemaker-spot-instances=yes Use spot instances
  • --sagemaker-max-run Maximum training runtime in seconds
  • --sagemaker-max-wait Maximum time to wait for spot capacity in seconds; must be greater than --sagemaker-max-run
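
For example, a spot training run using the default time limits (script name hypothetical):

python train.py --sagemaker-run=yes --sagemaker-spot-instances=yes \
    --sagemaker-max-run=43200 --sagemaker-max-wait=86400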

Additional arguments

Any arguments passed to your script locally on the command line are passed to your script remotely and tracked by SageMaker. Internally, sagemaker_training_main uses argparse. To add additional command line flags:

  • Pass a list of kwargs dictionaries to additional_arguments

    sagemaker_training_main(
        # ...
        additional_arguments=[
            {
                'dest': '--filter-width',
                'default': 32,
                'help': 'Filter width'
            },
            {
                'dest': '--filter-height',
                'default': 32,
                'help': 'Filter height'
            }
        ]
    )
    
  • Pass a callback to argparse_callback

    from argparse import ArgumentParser

    def argparse_callback(parser: ArgumentParser):
        parser.add_argument(
            '--filter-width',
            default=32,
            type=int,
            help='Filter width')
        parser.add_argument(
            '--filter-height',
            default=32,
            type=int,
            help='Filter height')

    sagemaker_training_main(
        # ...
        argparse_callback=argparse_callback
    )
    

Command-Line Arguments

These command-line arguments were created using the following parameters. Command-line arguments are generated for each item in inputs and dependencies.

inputs={
    'input': 'path/to/input'
},
dependencies={
    'my_module': 'path/to/my_module'
}
usage: aws-sagemaker-remote-training [-h]
                                     [--sagemaker-profile SAGEMAKER_PROFILE]
                                     [--sagemaker-run [SAGEMAKER_RUN]]
                                     [--sagemaker-wait [SAGEMAKER_WAIT]]
                                     [--sagemaker-spot-instances [SAGEMAKER_SPOT_INSTANCES]]
                                     [--sagemaker-script SAGEMAKER_SCRIPT]
                                     [--sagemaker-source SAGEMAKER_SOURCE]
                                     [--sagemaker-training-instance SAGEMAKER_TRAINING_INSTANCE]
                                     [--sagemaker-training-image SAGEMAKER_TRAINING_IMAGE]
                                     [--sagemaker-training-image-path SAGEMAKER_TRAINING_IMAGE_PATH]
                                     [--sagemaker-training-image-accounts SAGEMAKER_TRAINING_IMAGE_ACCOUNTS]
                                     [--sagemaker-training-role SAGEMAKER_TRAINING_ROLE]
                                     [--sagemaker-base-job-name SAGEMAKER_BASE_JOB_NAME]
                                     [--sagemaker-job-name SAGEMAKER_JOB_NAME]
                                     [--sagemaker-experiment-name SAGEMAKER_EXPERIMENT_NAME]
                                     [--sagemaker-trial-name SAGEMAKER_TRIAL_NAME]
                                     [--sagemaker-volume-size SAGEMAKER_VOLUME_SIZE]
                                     [--sagemaker-max-run SAGEMAKER_MAX_RUN]
                                     [--sagemaker-max-wait SAGEMAKER_MAX_WAIT]
                                     [--sagemaker-output-json SAGEMAKER_OUTPUT_JSON]
                                     [--my-module MY_MODULE]
                                     [--model-dir MODEL_DIR]
                                     [--output-dir OUTPUT_DIR]
                                     [--checkpoint-dir CHECKPOINT_DIR]
                                     [--sagemaker-checkpoint-s3 SAGEMAKER_CHECKPOINT_S3]
                                     [--sagemaker-checkpoint-container SAGEMAKER_CHECKPOINT_CONTAINER]
                                     [--checkpoint-initial CHECKPOINT_INITIAL]
                                     [--input INPUT] [--input-mode INPUT_MODE]
                                     [--input-repeat INPUT_REPEAT]
                                     [--input-shuffle [INPUT_SHUFFLE]]

Named Arguments

--model-dir

Directory to save final model (default: output/model)

Default: “output/model”

--output-dir

Directory for logs, images, or other output files (default: “output/output”)

Default: “output/output”

--input-shuffle

Shuffle inputs

Default: False

SageMaker

SageMaker options

--sagemaker-profile

AWS profile for SageMaker session (default: [default])

Default: “default”

--sagemaker-run

Run training on SageMaker (yes/no default=False)

Default: False

--sagemaker-wait

Wait for SageMaker training to complete and tail log files (yes/no default=True)

Default: True

--sagemaker-spot-instances

Use spot instances for training (yes/no default=False)

Default: False

--sagemaker-script

Script to run on SageMaker. (default: [script.py])

Default: “script.py”

--sagemaker-source

Source to upload to SageMaker. Must contain the script. If blank, defaults to the directory containing the script. (default: [])

Default: “”

--sagemaker-training-instance

Instance type for training

Default: “ml.m5.large”

--sagemaker-training-image

Docker image for training

Default: “aws-sagemaker-remote-training:latest”

--sagemaker-training-image-path

Path to Dockerfile if image does not exist

Default: “/home/docs/checkouts/readthedocs.org/user_builds/aws-sagemaker-remote/checkouts/latest/aws_sagemaker_remote/ecr/training”

--sagemaker-training-image-accounts

Accounts for Docker build

Default: [‘763104351884’]

--sagemaker-training-role

IAM role for training

Default: “aws-sagemaker-remote-training-role”

--sagemaker-base-job-name

Base job name for tracking and organization on S3. A job name will be generated from the base job name unless a job name is specified.

Default: “training-job”

--sagemaker-job-name

Job name for tracking. Use --sagemaker-base-job-name instead and a job name will be automatically generated with a timestamp.

Default: “”

--sagemaker-experiment-name

Name of experiment in SageMaker tracking.

--sagemaker-trial-name

Name of experiment trial in SageMaker tracking.

--sagemaker-volume-size

Volume size in GB.

Default: 30

--sagemaker-max-run

Maximum runtime in seconds.

Default: 43200

--sagemaker-max-wait

Maximum time to wait for spot instances in seconds.

Default: 86400

--sagemaker-output-json

Output job details to JSON file.

Dependencies

Dependencies to upload to SageMaker

--my-module

Directory for dependency [my_module] (default: “path/to/my_module”)

Default: “path/to/my_module”

Checkpoints

Checkpointing options

--checkpoint-dir

Local directory to store checkpoints for resuming training (default: “output/checkpoint”)

Default: “output/checkpoint”

--sagemaker-checkpoint-s3

Location to store checkpoints on S3 or “default” (default: “default”)

Default: “default”

--sagemaker-checkpoint-container

Location to store checkpoints on container (default: “/opt/ml/checkpoints”)

Default: “/opt/ml/checkpoints”

--checkpoint-initial

Initial checkpoint

Inputs

Inputs (local or S3)

--input

Input channel [input]. Set to local path and it will be uploaded to S3 and downloaded to SageMaker. Set to S3 path and it will be downloaded to SageMaker. (default: [path/to/input])

Default: “path/to/input”

--input-mode

Input channel [input] mode. (default: [File])

Default: “File”

--input-repeat

Repeat input

Default: 1

Example Code

The following example creates a trainer with one input named input.

  • Running the file without arguments will run locally. The argument --input sets the input directory.
  • Running the file with --sagemaker-run=yes will run on SageMaker. The directory given by --input is uploaded to S3, downloaded to the SageMaker container, and args.input is automatically set to the mount point.

The example code uploads aws_sagemaker_remote from the local filesystem using the dependencies argument. Alternatively:

  • Add aws_sagemaker_remote to your Docker image.
  • Create a requirements.txt file including aws_sagemaker_remote and place it in your source directory (which defaults to the directory containing the script with the main function).

See mnist_training.py.

import argparse
from aws_sagemaker_remote.training.main import sagemaker_training_main
import torch
from torch import nn
from torch.utils import data
from torchvision import datasets
import torchvision.transforms as transforms
import aws_sagemaker_remote
import os


class Model(nn.Module):
    def __init__(self):
        super(Model, self).__init__()
        self.model = nn.Sequential(
            nn.Conv2d(in_channels=1, out_channels=32, kernel_size=3, stride=2),
            nn.LeakyReLU(),
            nn.Conv2d(in_channels=32, out_channels=64,
                      kernel_size=3, stride=2),
            nn.LeakyReLU(),
            nn.Conv2d(in_channels=64, out_channels=128,
                      kernel_size=3, stride=1),
            nn.LeakyReLU(),
            nn.Conv2d(in_channels=128, out_channels=10,
                      kernel_size=3, stride=1),
        )

    def forward(self, input):
        return torch.mean(self.model(input), dim=(2, 3))


def main(args):
    print("Training")
    batch_size = 32
    device = 'cuda' if torch.cuda.is_available() else 'cpu'
    dataset = data.DataLoader(
        datasets.MNIST(
            root=args.input, download=True, train=True,
            transform=transforms.ToTensor()
        ),
        batch_size=batch_size,
        shuffle=True, num_workers=2, drop_last=False)
    # Create model, optimizer, and criteria
    model = Model().to(device)
    optimizer = torch.optim.Adam(
        params=model.parameters(), lr=args.learning_rate)
    criteria = nn.CrossEntropyLoss()
    model.train()

    for i in range(args.epochs):
        for j, (pixels, labels) in enumerate(dataset):
            pixels, labels = pixels.to(device), labels.to(device)
            logits = model(pixels)
            loss = criteria(input=logits, target=labels)
            accuracy = torch.mean(
                torch.eq(torch.argmax(logits, dim=-1), labels).float())
            optimizer.zero_grad()  # clear gradients from the previous step
            loss.backward()
            optimizer.step()
            if j % 100 == 0:
                print("epoch {}, step {}, loss {}, accuracy {}".format(
                    i, j,
                    loss.item(), accuracy.item()
                ))
    os.makedirs(args.model_dir, exist_ok=True)
    torch.save(
        model, os.path.join(args.model_dir, 'model.pt')
    )


def argparse_callback(parser):
    parser.add_argument(
        '--learning-rate',
        default=1e-3,
        type=float,
        help='Learning rate')
    parser.add_argument(
        '--epochs',
        default=5,
        type=int,
        help='Epochs to train')


if __name__ == '__main__':
    sagemaker_training_main(
        script=__file__,
        main=main,
        inputs={
            'input': 'output/data'
        },
        dependencies={
            'aws_sagemaker_remote': aws_sagemaker_remote
        },
        argparse_callback=argparse_callback
    )