SageMaker Training¶
Training jobs accept a set of one or more input file paths and output a model that can be deployed for inference.
- Running locally, standard command line arguments for inputs and outputs are used as usual
- Running remotely, data is uploaded and downloaded using S3 for tracking
Basic usage¶
Write a script with a main function that calls sagemaker_training_main.
from aws_sagemaker_remote import sagemaker_training_main

def main(args):
    # your code here
    pass

if __name__ == '__main__':
    sagemaker_training_main(
        main=main,
        # ...
    )
Pass function argument run=True or command line argument --sagemaker-run=True to run the script remotely on SageMaker.
- Many command-line arguments are automatically added. See Command-Line Arguments.
- Parameters to sagemaker_training_main control which command-line arguments are automatically added and their default values. See aws_sagemaker_remote.training.main.sagemaker_training_main() and aws_sagemaker_remote.training.args.sagemaker_training_args().
Path Handling¶
Inputs¶
Configure inputs by passing an inputs dictionary argument to sagemaker_training_main.
See aws_sagemaker_remote.args.sagemaker_training_args().
For example, if your dictionary contains the key my_dataset:
- The command line argument --my-dataset accepts local paths or S3 URLs
  - Local paths are uploaded to S3
- Data is downloaded from S3 to the container
- The location of the data on the container is pulled from the environment
- Your main function is called with args.my_dataset set to the EFS mount on the container
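The mapping from an inputs key to a command-line flag can be illustrated with plain argparse. This is a minimal sketch, not the library's implementation: the helper names add_input_arguments and is_s3_url are hypothetical, and the rule "replace underscores with hyphens" is inferred from the my_dataset / --my-dataset example above.

```python
import argparse


def add_input_arguments(parser, inputs):
    """Add one flag per input channel (hypothetical helper, for illustration)."""
    for key, default in inputs.items():
        # Assumption: the key my_dataset becomes the flag --my-dataset
        flag = "--" + key.replace("_", "-")
        parser.add_argument(
            flag, dest=key, default=default,
            help="Input channel [{}] (local path or S3 URL)".format(key))
    return parser


def is_s3_url(path):
    """Return True if the value looks like an S3 URL rather than a local path."""
    return path.startswith("s3://")


parser = add_input_arguments(
    argparse.ArgumentParser(), {"my_dataset": "path/to/my_dataset"})

# The default comes from the dictionary; the command line can override it.
default_args = parser.parse_args([])
s3_args = parser.parse_args(["--my-dataset", "s3://my-bucket/my-dataset"])
print(default_args.my_dataset, is_s3_url(default_args.my_dataset))
print(s3_args.my_dataset, is_s3_url(s3_args.my_dataset))
```

Either way, the main function reads the value from args.my_dataset; only the place the data comes from differs.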
Outputs¶
There are three output paths:
- args.model_dir and --model-dir: Directory to export the trained inference model. Used when deploying the model for inference. Save everything you need for inference, but leave out optimizer state to keep the inference deployment small.
- args.output_dir and --output-dir: Directory for outputs (logs, images, etc.)
- args.checkpoint_dir and --checkpoint-dir: Directory for training checkpoints. Save the model, optimizer, step count, and anything else you need to resume. This will be backed up and restored if training is interrupted.
Running locally, arguments are passed through. If running on SageMaker, arguments are automatically set to mountpoints which are uploaded to S3.
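To make use of the checkpoint directory, a script has to write checkpoints as it trains and look for an existing one when it starts. The following is a minimal sketch under stated assumptions: save_checkpoint and load_latest_checkpoint are hypothetical helpers, the file naming scheme is invented for the example, and JSON stands in for whatever format your framework uses (for example torch.save).

```python
import json
import os
import tempfile


def save_checkpoint(checkpoint_dir, step, state):
    """Write one checkpoint file; zero-padded names sort in step order."""
    os.makedirs(checkpoint_dir, exist_ok=True)
    path = os.path.join(checkpoint_dir, "checkpoint-{:08d}.json".format(step))
    with open(path, "w") as f:
        json.dump({"step": step, "state": state}, f)
    return path


def load_latest_checkpoint(checkpoint_dir):
    """Return the newest checkpoint, or None when starting from scratch."""
    if not os.path.isdir(checkpoint_dir):
        return None
    names = sorted(
        name for name in os.listdir(checkpoint_dir)
        if name.startswith("checkpoint-") and name.endswith(".json"))
    if not names:
        return None
    with open(os.path.join(checkpoint_dir, names[-1])) as f:
        return json.load(f)


# Demo: a temporary directory stands in for args.checkpoint_dir
with tempfile.TemporaryDirectory() as checkpoint_dir:
    first_run = load_latest_checkpoint(checkpoint_dir)  # nothing saved yet
    save_checkpoint(checkpoint_dir, step=100, state={"loss": 0.5})
    save_checkpoint(checkpoint_dir, step=200, state={"loss": 0.25})
    resumed = load_latest_checkpoint(checkpoint_dir)  # picks step 200
```

In a real main function, the step stored in the restored checkpoint would set where the training loop resumes.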
Training Job Tracking¶
Use the SageMaker console to view a list of all training jobs. For each job, SageMaker tracks:
- Training time
- Container used
- Link to CloudWatch logs
- Path on S3 for each of:
  - Source code ZIP
  - Each input channel
  - Model output ZIP
  - Dependencies (if used)
Configuration¶
Many command line options are added by this command.
Option --sagemaker-run controls local or remote execution.
- Set --sagemaker-run to a falsy value (no, false, 0) and the script will call your main function as usual and run locally.
- Set --sagemaker-run to a truthy value (yes, true, 1) and the script will upload itself and any requirements or inputs to S3, execute remotely on SageMaker, and save outputs to S3, logging results to the terminal.
Set --sagemaker-wait truthy to tail logs and wait for completion, or falsy to return as soon as the job starts.
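The truthy and falsy values above can be parsed with a small argparse type function. This is a sketch of one common approach, not necessarily how the library does it: str_to_bool is a hypothetical helper, and nargs="?" with const=True is an assumption chosen so that a bare flag counts as true.

```python
import argparse

TRUTHY = {"yes", "true", "1"}
FALSY = {"no", "false", "0"}


def str_to_bool(value):
    """Convert yes/true/1 and no/false/0 (case-insensitive) to a bool."""
    if isinstance(value, bool):
        return value
    normalized = value.strip().lower()
    if normalized in TRUTHY:
        return True
    if normalized in FALSY:
        return False
    raise argparse.ArgumentTypeError(
        "Expected a boolean value, got {!r}".format(value))


parser = argparse.ArgumentParser()
# nargs="?" makes the value optional; const is used when the flag is bare
parser.add_argument("--sagemaker-run", type=str_to_bool,
                    nargs="?", const=True, default=False)
parser.add_argument("--sagemaker-wait", type=str_to_bool,
                    nargs="?", const=True, default=True)

local = parser.parse_args([])  # defaults: run locally
remote = parser.parse_args(["--sagemaker-run=yes", "--sagemaker-wait", "no"])
bare = parser.parse_args(["--sagemaker-run"])  # bare flag counts as true
print(local.sagemaker_run, remote.sagemaker_run, remote.sagemaker_wait)
```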
Defaults are set through code. Defaults can be overwritten on the command line. For example:
- Use the function argument training_image to set the default container image for your script
- Use the command line argument --sagemaker-training-image to override the container image on a particular run
See aws_sagemaker_remote.args.sagemaker_training_args() and Command-Line Arguments for details.
Environment Customization¶
The environment can be customized in multiple ways.
- Instance
  - Function argument training_instance
  - Command line argument --sagemaker-training-instance
  - Select instance type of machine running the container
- Image
  - Function argument training_image
  - Command line argument --sagemaker-training-image
  - Accepts URI of Docker container image on ECR or DockerHub to run
  - Build a custom Docker image for major customizations
- Requirements file
  - Create a file named requirements.txt in your source directory
  - The source directory defaults to the directory containing your script but can be overridden
  - Use for installing Python packages by listing one on each line. Standard requirements.txt file format [https://pip.pypa.io/en/stable/reference/pip_install/#requirements-file-format]
- Dependencies
  - Function argument dependencies
    - Dictionary of [key]->[value]
    - Each key will create command line argument --key that defaults to value
  - Each value is a directory containing a Python module that will be uploaded to S3, downloaded to SageMaker, and put on the PYTHONPATH
  - For example, if directory mymodule contains the files __init__.py and myfile.py, and myfile.py contains def myfunction():..., pass dependencies={'mymodule':'path/to/mymodule'} to sagemaker_training_main and then use from mymodule.myfile import myfunction in your script.
  - Use module uploads for supporting code that is not being installed from packages.
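The dependency example above relies on ordinary Python import rules: once the directory that contains mymodule is on the import path, from mymodule.myfile import myfunction resolves. The sketch below reproduces that locally with a temporary directory. It only illustrates the import mechanics; how the library uploads the module and arranges the path on SageMaker is not shown here.

```python
import importlib
import os
import sys
import tempfile

with tempfile.TemporaryDirectory() as root:
    # Build the layout from the example: mymodule/__init__.py and myfile.py
    module_dir = os.path.join(root, "mymodule")
    os.makedirs(module_dir)
    open(os.path.join(module_dir, "__init__.py"), "w").close()
    with open(os.path.join(module_dir, "myfile.py"), "w") as f:
        f.write("def myfunction():\n    return 'hello from mymodule'\n")

    # Put the parent of the package on the import path, as PYTHONPATH would
    sys.path.insert(0, root)
    try:
        importlib.invalidate_caches()
        from mymodule.myfile import myfunction
        result = myfunction()
    finally:
        sys.path.remove(root)

print(result)
```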
Spot Training¶
Save on training costs by using spot training. Rather than starting immediately, AWS runs the training job when spare capacity is available, in exchange for a lower price.
- --sagemaker-spot-instances=yes: Use spot instances
- --sagemaker-max-run: Maximum training runtime in seconds
- --sagemaker-max-wait: Maximum time to wait in seconds, must be greater than the runtime.
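The relationship between the two limits can be checked before submitting a job. This is a minimal sketch: spot_wait_budget is a hypothetical helper, the default values are the ones listed under Command-Line Arguments, and the check only rejects a wait limit shorter than the runtime limit (whether equal values are accepted is an assumption).

```python
def spot_wait_budget(spot_instances, max_run=43200, max_wait=86400):
    """Return the seconds a spot job may spend waiting beyond its runtime.

    Hypothetical helper: max_wait covers waiting for capacity plus training
    time, so it must not be smaller than max_run.
    """
    if max_run <= 0:
        raise ValueError("max_run must be positive")
    if not spot_instances:
        return 0  # on-demand jobs do not wait for spare capacity
    if max_wait < max_run:
        raise ValueError(
            "max_wait ({}) must not be less than max_run ({})".format(
                max_wait, max_run))
    return max_wait - max_run


# With the documented defaults, half of the wait limit is spare
print(spot_wait_budget(True))
```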
Additional arguments¶
Any arguments passed to your script locally on the command line are passed to your script remotely and tracked by SageMaker. Internally, sagemaker_training_main uses argparse. To add additional command-line flags, use one of the following:
Pass a list of kwargs dictionaries to additional_arguments:
sagemaker_training_main(
    # ...
    additional_arguments=[
        {
            'dest': '--filter-width',
            'default': 32,
            'help': 'Filter width'
        },
        {
            'dest': '--filter-height',
            'default': 32,
            'help': 'Filter height'
        }
    ]
)
Pass a callback to argparse_callback:
from argparse import ArgumentParser

def argparse_callback(parser: ArgumentParser):
    parser.add_argument(
        '--filter-width',
        default=32,
        help='Filter width')
    parser.add_argument(
        '--filter-height',
        default=32,
        help='Filter height')

sagemaker_training_main(
    # ...
    argparse_callback=argparse_callback
)
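The two styles are closely related: a list of kwargs dictionaries can be turned into add_argument calls with a short loop. The sketch below shows one plausible way to do that with plain argparse. It is an assumption about the mechanism, not the library's code, and add_additional_arguments is a hypothetical helper. The type entry is added here for illustration so that values given on the command line are converted to integers.

```python
import argparse


def add_additional_arguments(parser, additional_arguments):
    """Apply each kwargs dictionary to the parser (hypothetical helper)."""
    for kwargs in additional_arguments:
        kwargs = dict(kwargs)      # copy so the caller's dictionary is untouched
        flag = kwargs.pop("dest")  # assumption: 'dest' carries the flag name
        parser.add_argument(flag, **kwargs)
    return parser


parser = add_additional_arguments(argparse.ArgumentParser(), [
    {"dest": "--filter-width", "default": 32, "type": int,
     "help": "Filter width"},
    {"dest": "--filter-height", "default": 32, "type": int,
     "help": "Filter height"},
])

args = parser.parse_args(["--filter-width", "64"])
print(args.filter_width, args.filter_height)
```

Without a type entry, argparse leaves command-line values as strings while the default stays an integer, so specifying the type is usually worthwhile for numeric flags.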
Command-Line Arguments¶
These command-line arguments were created using the following parameters.
Command-line arguments are generated for each item in inputs and dependencies.
inputs={
    'input': 'path/to/input'
},
dependencies={
    'my_module': 'path/to/my_module'
}
usage: aws-sagemaker-remote-training [-h]
[--sagemaker-profile SAGEMAKER_PROFILE]
[--sagemaker-run [SAGEMAKER_RUN]]
[--sagemaker-wait [SAGEMAKER_WAIT]]
[--sagemaker-spot-instances [SAGEMAKER_SPOT_INSTANCES]]
[--sagemaker-script SAGEMAKER_SCRIPT]
[--sagemaker-source SAGEMAKER_SOURCE]
[--sagemaker-training-instance SAGEMAKER_TRAINING_INSTANCE]
[--sagemaker-training-image SAGEMAKER_TRAINING_IMAGE]
[--sagemaker-training-image-path SAGEMAKER_TRAINING_IMAGE_PATH]
[--sagemaker-training-image-accounts SAGEMAKER_TRAINING_IMAGE_ACCOUNTS]
[--sagemaker-training-role SAGEMAKER_TRAINING_ROLE]
[--sagemaker-base-job-name SAGEMAKER_BASE_JOB_NAME]
[--sagemaker-job-name SAGEMAKER_JOB_NAME]
[--sagemaker-experiment-name SAGEMAKER_EXPERIMENT_NAME]
[--sagemaker-trial-name SAGEMAKER_TRIAL_NAME]
[--sagemaker-volume-size SAGEMAKER_VOLUME_SIZE]
[--sagemaker-max-run SAGEMAKER_MAX_RUN]
[--sagemaker-max-wait SAGEMAKER_MAX_WAIT]
[--sagemaker-output-json SAGEMAKER_OUTPUT_JSON]
[--my-module MY_MODULE]
[--model-dir MODEL_DIR]
[--output-dir OUTPUT_DIR]
[--checkpoint-dir CHECKPOINT_DIR]
[--sagemaker-checkpoint-s3 SAGEMAKER_CHECKPOINT_S3]
[--sagemaker-checkpoint-container SAGEMAKER_CHECKPOINT_CONTAINER]
[--checkpoint-initial CHECKPOINT_INITIAL]
[--input INPUT] [--input-mode INPUT_MODE]
[--input-repeat INPUT_REPEAT]
[--input-shuffle [INPUT_SHUFFLE]]
Named Arguments¶
--model-dir | Directory to save final model. Default: “output/model”
--output-dir | Directory for logs, images, or other output files. Default: “output/output”
--input-shuffle | Shuffle inputs. Default: False
SageMaker¶
SageMaker options
--sagemaker-profile | AWS profile for SageMaker session. Default: “default”
--sagemaker-run | Run training on SageMaker (yes/no). Default: False
--sagemaker-wait | Wait for SageMaker training to complete and tail log files (yes/no). Default: True
--sagemaker-spot-instances | Use spot instances for training (yes/no). Default: False
--sagemaker-script | Script to run on SageMaker. Default: “script.py”
--sagemaker-source | Source to upload to SageMaker. Must contain the script. If blank, defaults to the directory containing the script. Default: “”
--sagemaker-training-instance | Instance type for training. Default: “ml.m5.large”
--sagemaker-training-image | Docker image for training. Default: “aws-sagemaker-remote-training:latest”
--sagemaker-training-image-path | Path to Dockerfile if image does not exist. Default: “/home/docs/checkouts/readthedocs.org/user_builds/aws-sagemaker-remote/checkouts/latest/aws_sagemaker_remote/ecr/training”
--sagemaker-training-image-accounts | Accounts for Docker build. Default: [‘763104351884’]
--sagemaker-training-role | IAM role for training. Default: “aws-sagemaker-remote-training-role”
--sagemaker-base-job-name | Base job name for tracking and organization on S3. A job name will be generated from the base job name unless a job name is specified. Default: “training-job”
--sagemaker-job-name | Job name for tracking. Use --sagemaker-base-job-name instead and a job name will be automatically generated with a timestamp. Default: “”
--sagemaker-experiment-name | Name of experiment in SageMaker tracking.
--sagemaker-trial-name | Name of experiment trial in SageMaker tracking.
--sagemaker-volume-size | Volume size in GB. Default: 30
--sagemaker-max-run | Maximum runtime in seconds. Default: 43200
--sagemaker-max-wait | Maximum time to wait for spot instances in seconds. Default: 86400
--sagemaker-output-json | Output job details to JSON file.
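The interaction between the base job name and the job name can be sketched as follows. This is an illustration only: make_job_name is a hypothetical helper and the timestamp format is an assumption, since the exact format the library generates is not stated here. The validation reflects SageMaker's rule that job names are at most 63 characters of letters, digits, and hyphens.

```python
import re
import time

# SageMaker job names: alphanumerics and hyphens, starting with an alphanumeric
JOB_NAME_PATTERN = re.compile(r"[a-zA-Z0-9](-*[a-zA-Z0-9])*")


def make_job_name(base_job_name="training-job", job_name="", now=None):
    """Use job_name if given, else append a UTC timestamp to the base name."""
    if job_name:
        name = job_name
    else:
        # Assumed format; time.gmtime(None) uses the current time
        stamp = time.strftime("%Y-%m-%d-%H-%M-%S", time.gmtime(now))
        name = "{}-{}".format(base_job_name, stamp)
    if len(name) > 63 or not JOB_NAME_PATTERN.fullmatch(name):
        raise ValueError("Invalid SageMaker job name: {!r}".format(name))
    return name


print(make_job_name())                       # base name plus current timestamp
print(make_job_name(job_name="my-fixed-job"))  # explicit name wins
```

Generating the name from a base name keeps repeated runs of the same script from colliding, which is why the explicit job name defaults to blank.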
Dependencies¶
Dependencies to upload to SageMaker
--my-module | Directory for dependency [my_module]. Default: “path/to/my_module”
Checkpoints¶
Checkpointing options
--checkpoint-dir | Local directory to store checkpoints for resuming training. Default: “output/checkpoint”
--sagemaker-checkpoint-s3 | Location to store checkpoints on S3 or “default”. Default: “default”
--sagemaker-checkpoint-container | Location to store checkpoints on container. Default: “/opt/ml/checkpoints”
--checkpoint-initial | Initial checkpoint
Inputs¶
Inputs (local or S3)
--input | Input channel [input]. Set to a local path and it will be uploaded to S3 and downloaded to SageMaker. Set to an S3 path and it will be downloaded to SageMaker. Default: “path/to/input”
--input-mode | Input channel [input] mode. Default: “File”
--input-repeat | Repeat input. Default: 1
Example Code¶
The following example creates a trainer with one input named input.
- Running the file without arguments will run locally. The argument --input sets the input directory.
- Running the file with --sagemaker-run=yes will run on SageMaker. The argument --input is uploaded to S3, downloaded to SageMaker, and automatically set to a mountpoint.
The example code uploads aws_sagemaker_remote from the local filesystem using the dependencies argument. Alternatively:
- Add aws_sagemaker_remote to your Docker image.
- Create a requirements.txt file including aws_sagemaker_remote. Place the file in your source directory (defaults to the directory containing the file with the main function).
See mnist_training.py.
import argparse
from aws_sagemaker_remote.training.main import sagemaker_training_main
import torch
from torch import nn
from torch.utils import data
from torchvision import datasets
import torchvision.transforms as transforms
import aws_sagemaker_remote
import os
class Model(nn.Module):
    def __init__(self):
        super(Model, self).__init__()
        self.model = nn.Sequential(
            nn.Conv2d(in_channels=1, out_channels=32, kernel_size=3, stride=2),
            nn.LeakyReLU(),
            nn.Conv2d(in_channels=32, out_channels=64,
                      kernel_size=3, stride=2),
            nn.LeakyReLU(),
            nn.Conv2d(in_channels=64, out_channels=128,
                      kernel_size=3, stride=1),
            nn.LeakyReLU(),
            nn.Conv2d(in_channels=128, out_channels=10,
                      kernel_size=3, stride=1),
        )

    def forward(self, input):
        return torch.mean(self.model(input), dim=(2, 3))
def main(args):
    print("Training")
    batch_size = 32
    device = 'cuda' if torch.cuda.is_available() else 'cpu'
    # Load MNIST from the input channel, downloading it if missing
    dataset = data.DataLoader(
        datasets.MNIST(
            root=args.input, download=True, train=True,
            transform=transforms.ToTensor()
        ),
        batch_size=batch_size,
        shuffle=True, num_workers=2, drop_last=False)
    # Create model, optimizer, and criteria
    model = Model().to(device)
    optimizer = torch.optim.Adam(
        params=model.parameters(), lr=args.learning_rate)
    criteria = nn.CrossEntropyLoss()
    model.train()
    for i in range(args.epochs):
        for j, (pixels, labels) in enumerate(dataset):
            pixels, labels = pixels.to(device), labels.to(device)
            # Clear gradients left over from the previous step
            optimizer.zero_grad()
            logits = model(pixels)
            loss = criteria(input=logits, target=labels)
            accuracy = torch.mean(
                torch.eq(torch.argmax(logits, dim=-1), labels).float())
            loss.backward()
            optimizer.step()
            if j % 100 == 0:
                print("epoch {}, step {}, loss {}, accuracy {}".format(
                    i, j,
                    loss.item(), accuracy.item()
                ))
    # Export the trained model for inference
    os.makedirs(args.model_dir, exist_ok=True)
    torch.save(
        model, os.path.join(args.model_dir, 'model.pt')
    )
def argparse_callback(parser):
    parser.add_argument(
        '--learning-rate',
        default=1e-3,
        type=float,
        help='Learning rate')
    parser.add_argument(
        '--epochs',
        default=5,
        type=int,
        help='Epochs to train')
if __name__ == '__main__':
    sagemaker_training_main(
        script=__file__,
        main=main,
        inputs={
            'input': 'output/data'
        },
        dependencies={
            'aws_sagemaker_remote': aws_sagemaker_remote
        },
        argparse_callback=argparse_callback
    )