SageMaker Training

Training jobs accept one or more input paths and produce a model that can be deployed for inference.

  • Running locally, standard command line arguments are used for inputs and outputs as usual
  • Running remotely, data is uploaded to and downloaded from S3, and each path is tracked by SageMaker

Basic usage

Write a script that defines a main function and passes it to sagemaker_training_main.

from aws_sagemaker_remote import sagemaker_training_main

def main(args):
    # your code here
    pass

if __name__ == '__main__':
    sagemaker_training_main(
        main=main,
        # ...
    )

Pass the function argument run=True or the command line argument --sagemaker-run=yes to run the script remotely on SageMaker.
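
For example, assuming the script above is saved as train.py (file name hypothetical):

python train.py                      # runs locally
python train.py --sagemaker-run=yes  # runs remotely on SageMaker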

Path Handling

Inputs

Configure inputs by passing an inputs dictionary argument to sagemaker_training_main. See aws_sagemaker_remote.args.sagemaker_training_args().

For example, if your dictionary contains the key my_dataset (see the sketch after this list):

  • The command line argument --my-dataset accepts a local path or an S3 URL
  • Local paths are uploaded to S3
  • Data is downloaded from S3 to the training container
  • The location of the data on the container is read from the environment
  • Your main function is called with args.my_dataset set to the data's mount point on the container
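
A minimal sketch, assuming a local dataset at path/to/my_dataset (path hypothetical):

from aws_sagemaker_remote import sagemaker_training_main

def main(args):
    # Locally: args.my_dataset is the path given by --my-dataset.
    # On SageMaker: args.my_dataset is the data's mount point on the container.
    print(args.my_dataset)

if __name__ == '__main__':
    sagemaker_training_main(
        main=main,
        inputs={
            # Default for --my-dataset; accepts a local path or an S3 URL
            'my_dataset': 'path/to/my_dataset'
        },
    )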

Outputs

There are three output paths:

  • args.model_dir and --model-dir: Directory to export the trained inference model. Used when deploying the model for inference. Save everything you need for inference, but omit optimizer state to keep the inference deployment small.
  • args.output_dir and --output-dir: Directory for outputs (logs, images, etc.)
  • args.checkpoint_dir and --checkpoint-dir: Directory for training checkpoints. Save the model, optimizer, step count, and anything else you need to resume. This directory is backed up and restored if training is interrupted.

Running locally, these arguments are passed through unchanged. Running on SageMaker, they are automatically set to container mount points that are uploaded to S3.
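
A minimal sketch of a main function using the three paths; the helpers build_model and train_epoch are hypothetical stand-ins for your training code:

import os
import torch

def main(args):
    model, optimizer = build_model()  # hypothetical setup
    os.makedirs(args.checkpoint_dir, exist_ok=True)
    for epoch in range(10):
        train_epoch(model, optimizer)  # hypothetical training loop
        # Checkpoint everything needed to resume: model, optimizer, progress
        torch.save(
            {'model': model.state_dict(),
             'optimizer': optimizer.state_dict(),
             'epoch': epoch},
            os.path.join(args.checkpoint_dir, 'checkpoint.pt'))
    # Logs, images, and other artifacts
    os.makedirs(args.output_dir, exist_ok=True)
    with open(os.path.join(args.output_dir, 'log.txt'), 'w') as f:
        f.write('training complete')
    # Final inference model only; omit optimizer state
    os.makedirs(args.model_dir, exist_ok=True)
    torch.save(model, os.path.join(args.model_dir, 'model.pt'))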

Training Job Tracking

Use the SageMaker console to view a list of all training jobs. For each job, SageMaker tracks:

  • Training time
  • Container used
  • Link to CloudWatch logs
  • Path on S3 for each of:
    • Source code ZIP
    • Each input channel
    • Model output ZIP
    • Dependencies (if used)

Configuration

sagemaker_training_main adds a number of command line options to your script.

Option --sagemaker-run controls local or remote execution.

  • Set --sagemaker-run to a falsy value (no, false, 0) and the script will call your main function as usual, running locally.
  • Set --sagemaker-run to a truthy value (yes, true, 1) and the script will upload itself and any requirements or inputs to S3, execute remotely on SageMaker, save outputs to S3, and log results to the terminal.

Set --sagemaker-wait to a truthy value to tail logs and wait for completion, or to a falsy value to return as soon as the job starts.

Defaults are set in code and can be overridden on the command line. For example (see the sketch after this list):

  • Use the function argument training_image to set the default container image for your script
  • Use the command line argument --sagemaker-training-image to override the container image on a particular run
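
A minimal sketch of both, assuming a script named train.py and a hypothetical image tag my-image:latest. In code:

sagemaker_training_main(
    main=main,
    training_image='aws-sagemaker-remote-training:latest',
)

On the command line, for a particular run:

python train.py --sagemaker-run=yes --sagemaker-training-image=my-image:latest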

See aws_sagemaker_remote.args.sagemaker_training_args() and Command-Line Arguments for details.

Environment Customization

The environment can be customized in multiple ways.

  • Instance
    • Function argument training_instance
    • Command line argument --sagemaker-training-instance
    • Select instance type of machine running the container
  • Image
    • Function argument training_image
    • Command line argument --sagemaker-training-image
    • Accepts URI of Docker container image on ECR or DockerHub to run
    • Build a custom Docker image for major customizations
  • Requirements file
    • Place a requirements.txt file in your source directory to install additional packages (see Example Code below)
  • Dependencies
    • Function argument dependencies
      • Dictionary of [key]->[value]
      • Each key creates a command line argument --[key] (underscores become hyphens) that defaults to [value]
    • Each value is a directory containing a Python module that will be uploaded to S3, downloaded to SageMaker, and put on the PYTHONPATH
    • For example, if the directory mymodule contains the files __init__.py and myfile.py, and myfile.py contains def myfunction():..., pass dependencies={'mymodule': 'path/to/mymodule'} to sagemaker_training_main and then use from mymodule.myfile import myfunction in your script (see the sketch below)
    • Use module uploads for supporting code that is not installed from packages
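
A minimal sketch of the mymodule example above (paths hypothetical):

from aws_sagemaker_remote import sagemaker_training_main

def main(args):
    # mymodule was uploaded to S3, downloaded to the container,
    # and placed on the PYTHONPATH
    from mymodule.myfile import myfunction
    myfunction()

if __name__ == '__main__':
    sagemaker_training_main(
        main=main,
        dependencies={
            # Default for --mymodule; override on the command line
            'mymodule': 'path/to/mymodule'
        },
    )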

Spot Training

Save on training costs by using spot training. Rather than starting immediately, AWS runs the job when spare capacity is available, in exchange for cost savings.

  • --sagemaker-spot-instances=yes Use spot instances
  • --sagemaker-max-run Maximum training runtime in seconds
  • --sagemaker-max-wait Maximum time to wait for spot capacity in seconds; must be greater than --sagemaker-max-run
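
For example, a spot training run using the default time limits (script name hypothetical):

python train.py --sagemaker-run=yes --sagemaker-spot-instances=yes \
    --sagemaker-max-run=43200 --sagemaker-max-wait=86400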

Additional arguments

Any arguments passed to your script locally on the command line are passed to your script remotely and tracked by SageMaker. Internally, sagemaker_training_main uses argparse. To add additional command line flags:

  • Pass a list of kwargs dictionaries to additional_arguments

    sagemaker_training_main(
        # ...
        additional_arguments=[
            {
                'dest': '--filter-width',
                'default': 32,
                'help': 'Filter width'
            },
            {
                'dest': '--filter-height',
                'default': 32,
                'help': 'Filter height'
            }
        ]
    )
    
  • Pass a callback to argparse_callback

    from argparse import ArgumentParser

    def argparse_callback(parser: ArgumentParser):
        parser.add_argument(
            '--filter-width',
            default=32,
            type=int,
            help='Filter width')
        parser.add_argument(
            '--filter-height',
            default=32,
            type=int,
            help='Filter height')

    sagemaker_training_main(
        # ...
        argparse_callback=argparse_callback
    )
    

Command-Line Arguments

These command-line arguments were created using the following parameters. Command-line arguments are generated for each item in inputs and dependencies.

inputs={
    'input': 'path/to/input'
},
dependencies={
    'my_module': 'path/to/my_module'
}
usage: aws-sagemaker-remote-training [-h]
                                     [--sagemaker-profile SAGEMAKER_PROFILE]
                                     [--sagemaker-run [SAGEMAKER_RUN]]
                                     [--sagemaker-wait [SAGEMAKER_WAIT]]
                                     [--sagemaker-spot-instances [SAGEMAKER_SPOT_INSTANCES]]
                                     [--sagemaker-script SAGEMAKER_SCRIPT]
                                     [--sagemaker-source SAGEMAKER_SOURCE]
                                     [--sagemaker-training-instance SAGEMAKER_TRAINING_INSTANCE]
                                     [--sagemaker-training-image SAGEMAKER_TRAINING_IMAGE]
                                     [--sagemaker-training-image-path SAGEMAKER_TRAINING_IMAGE_PATH]
                                     [--sagemaker-training-image-accounts SAGEMAKER_TRAINING_IMAGE_ACCOUNTS]
                                     [--sagemaker-training-role SAGEMAKER_TRAINING_ROLE]
                                     [--sagemaker-base-job-name SAGEMAKER_BASE_JOB_NAME]
                                     [--sagemaker-job-name SAGEMAKER_JOB_NAME]
                                     [--sagemaker-experiment-name SAGEMAKER_EXPERIMENT_NAME]
                                     [--sagemaker-trial-name SAGEMAKER_TRIAL_NAME]
                                     [--sagemaker-volume-size SAGEMAKER_VOLUME_SIZE]
                                     [--sagemaker-max-run SAGEMAKER_MAX_RUN]
                                     [--sagemaker-max-wait SAGEMAKER_MAX_WAIT]
                                     [--sagemaker-output-json SAGEMAKER_OUTPUT_JSON]
                                     [--my-module MY_MODULE]
                                     [--model-dir MODEL_DIR]
                                     [--output-dir OUTPUT_DIR]
                                     [--checkpoint-dir CHECKPOINT_DIR]
                                     [--sagemaker-checkpoint-s3 SAGEMAKER_CHECKPOINT_S3]
                                     [--sagemaker-checkpoint-container SAGEMAKER_CHECKPOINT_CONTAINER]
                                     [--checkpoint-initial CHECKPOINT_INITIAL]
                                     [--input INPUT] [--input-mode INPUT_MODE]
                                     [--input-repeat INPUT_REPEAT]
                                     [--input-shuffle [INPUT_SHUFFLE]]

Named Arguments

--model-dir

Directory to save final model (default: output/model)

Default: “output/model”

--output-dir

Directory for logs, images, or other output files (default: “output/output”)

Default: “output/output”

--input-shuffle

Shuffle inputs

Default: False

SageMaker

SageMaker options

--sagemaker-profile

AWS profile for SageMaker session (default: [default])

Default: “default”

--sagemaker-run

Run training on SageMaker (yes/no default=False)

Default: False

--sagemaker-wait

Wait for SageMaker training to complete and tail log files (yes/no default=True)

Default: True

--sagemaker-spot-instances

Use spot instances for training (yes/no default=False)

Default: False

--sagemaker-script

Script to run on SageMaker. (default: [script.py])

Default: “script.py”

--sagemaker-source

Source to upload to SageMaker. Must contain the script. If blank, defaults to the directory containing the script. (default: [])

Default: “”

--sagemaker-training-instance

Instance type for training

Default: “ml.m5.large”

--sagemaker-training-image

Docker image for training

Default: “aws-sagemaker-remote-training:latest”

--sagemaker-training-image-path

Path to Dockerfile if image does not exist

Default: “/home/docs/checkouts/readthedocs.org/user_builds/aws-sagemaker-remote/checkouts/latest/aws_sagemaker_remote/ecr/training”

--sagemaker-training-image-accounts

Accounts for Docker build

Default: [‘763104351884’]

--sagemaker-training-role

IAM role for training

Default: “aws-sagemaker-remote-training-role”

--sagemaker-base-job-name

Base job name for tracking and organization on S3. A job name will be generated from the base job name unless a job name is specified.

Default: “training-job”

--sagemaker-job-name

Job name for tracking. Use --sagemaker-base-job-name instead and a job name will be automatically generated with a timestamp.

Default: “”

--sagemaker-experiment-name

Name of experiment in SageMaker tracking.

--sagemaker-trial-name

Name of experiment trial in SageMaker tracking.

--sagemaker-volume-size

Volume size in GB.

Default: 30

--sagemaker-max-run

Maximum runtime in seconds.

Default: 43200

--sagemaker-max-wait

Maximum time to wait for spot instances in seconds.

Default: 86400

--sagemaker-output-json

Output job details to JSON file.

Dependencies

Dependencies to upload to SageMaker

--my-module

Directory for dependency [my_module] (default: “path/to/my_module”)

Default: “path/to/my_module”

Checkpoints

Checkpointing options

--checkpoint-dir

Local directory to store checkpoints for resuming training (default: “output/checkpoint”)

Default: “output/checkpoint”

--sagemaker-checkpoint-s3

Location to store checkpoints on S3 or “default” (default: “default”)

Default: “default”

--sagemaker-checkpoint-container

Location to store checkpoints on container (default: “/opt/ml/checkpoints”)

Default: “/opt/ml/checkpoints”

--checkpoint-initial

Initial checkpoint

Inputs

Inputs (local or S3)

--input

Input channel [input]. Set to local path and it will be uploaded to S3 and downloaded to SageMaker. Set to S3 path and it will be downloaded to SageMaker. (default: [path/to/input])

Default: “path/to/input”

--input-mode

Input channel [input] mode. (default: [File])

Default: “File”

--input-repeat

Repeat input

Default: 1

Example Code

The following example creates a trainer with one input named input.

  • Running the file without arguments will run locally. The argument --input sets the input directory.
  • Running the file with --sagemaker-run=yes will run on SageMaker. The directory given by --input is uploaded to S3, downloaded to the SageMaker container, and args.input is automatically set to the mount point.

The example code uploads aws_sagemaker_remote from the local filesystem using the dependencies argument. Alternatively:

  • Add aws_sagemaker_remote to your Docker image.
  • Create a requirements.txt file including aws_sagemaker_remote and place it in your source directory (which defaults to the directory containing the script with the main function).

See mnist_training.py.

import argparse
from aws_sagemaker_remote.training.main import sagemaker_training_main
import torch
from torch import nn
from torch.utils import data
from torchvision import datasets
import torchvision.transforms as transforms
import aws_sagemaker_remote
import os


class Model(nn.Module):
    def __init__(self):
        super(Model, self).__init__()
        self.model = nn.Sequential(
            nn.Conv2d(in_channels=1, out_channels=32, kernel_size=3, stride=2),
            nn.LeakyReLU(),
            nn.Conv2d(in_channels=32, out_channels=64,
                      kernel_size=3, stride=2),
            nn.LeakyReLU(),
            nn.Conv2d(in_channels=64, out_channels=128,
                      kernel_size=3, stride=1),
            nn.LeakyReLU(),
            nn.Conv2d(in_channels=128, out_channels=10,
                      kernel_size=3, stride=1),
        )

    def forward(self, input):
        return torch.mean(self.model(input), dim=(2, 3))


def main(args):
    print("Training")
    batch_size = 32
    device = 'cuda' if torch.cuda.is_available() else 'cpu'
    dataset = data.DataLoader(
        datasets.MNIST(
            root=args.input, download=True, train=True,
            transform=transforms.ToTensor()
        ),
        batch_size=batch_size,
        shuffle=True, num_workers=2, drop_last=False)
    # Create model, optimizer, and criteria
    model = Model().to(device)
    optimizer = torch.optim.Adam(
        params=model.parameters(), lr=args.learning_rate)
    criteria = nn.CrossEntropyLoss()
    model.train()

    for i in range(args.epochs):
        for j, (pixels, labels) in enumerate(dataset):
            pixels, labels = pixels.to(device), labels.to(device)
            logits = model(pixels)
            loss = criteria(input=logits, target=labels)
            accuracy = torch.mean(
                torch.eq(torch.argmax(logits, dim=-1), labels).float())
            optimizer.zero_grad()  # clear gradients from the previous step
            loss.backward()
            optimizer.step()
            if j % 100 == 0:
                print("epoch {}, step {}, loss {}, accuracy {}".format(
                    i, j,
                    loss.item(), accuracy.item()
                ))
    os.makedirs(args.model_dir, exist_ok=True)
    torch.save(
        model, os.path.join(args.model_dir, 'model.pt')
    )


def argparse_callback(parser):
    parser.add_argument(
        '--learning-rate',
        default=1e-3,
        type=float,
        help='Learning rate')
    parser.add_argument(
        '--epochs',
        default=5,
        type=int,
        help='Epochs to train')


if __name__ == '__main__':
    sagemaker_training_main(
        script=__file__,
        main=main,
        inputs={
            'input': 'output/data'
        },
        dependencies={
            'aws_sagemaker_remote': aws_sagemaker_remote
        },
        argparse_callback=argparse_callback
    )