SageMaker Processing¶
Processing jobs accept one or more input file paths and write to one or more output file paths, making them ideal for file conversion and other data preparation tasks. S3 files can be automatically copied to local storage, so existing scripts need little modification.
SageMaker processing is best suited for processing large files or when random access to files is required. Alternatively:
- If the process requires more code or a custom image that cannot run in a Lambda, but can be fully parallelized, use SageMaker batch transform to allocate a fleet of containers that process each object in S3. See the SageMaker batch transform documentation.
- If the process fits in a small JavaScript package, it can run faster, cheaper, and with better parallelization using S3 Batch and Lambda. See the S3 Batch documentation.
A SageMaker processing script can be run locally or remotely.
- Running locally, your command line arguments for inputs and outputs are passed to your function as usual.
- Running remotely, paths referenced by command line arguments are uploaded and downloaded through S3, and your function is executed remotely with command line arguments referencing local copies of those files.
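For example, a hypothetical `main` that treats its arguments as plain local directories works unchanged in both modes (the `data.txt` filename and the uppercasing step here are illustrative, not part of the library):

```python
import os

def main(args):
    # args.input and args.output are ordinary local paths in both modes;
    # on a remote run they point at local copies under the container mounts.
    with open(os.path.join(args.input, 'data.txt')) as f:
        text = f.read()
    with open(os.path.join(args.output, 'data_upper.txt'), 'w') as f:
        f.write(text.upper())
```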
Basic usage¶
Write a script with a `main` function that calls `sagemaker_processing_main`.

```python
from aws_sagemaker_remote import sagemaker_processing_main

def main(args):
    # your code here
    pass

if __name__ == '__main__':
    sagemaker_processing_main(
        main=main,
        # ...
    )
```
Pass function argument `run=True` or command line argument `--sagemaker-run=True` to run the script remotely on SageMaker.
- Many command-line arguments are automatically added. See Command-Line Arguments.
- Parameters to `sagemaker_processing_main` control which command-line arguments are automatically added and their default values. See `aws_sagemaker_remote.processing.main.sagemaker_processing_main()` and `aws_sagemaker_remote.processing.args.sagemaker_processing_args()`.
Processing Job Tracking¶
Use the SageMaker console to view a list of all processing jobs. For each job, SageMaker tracks:
- Processing time
- Container used
- Link to CloudWatch logs
- Path on S3 for each of:
- Script file
- Each input channel
- Each output channel
- Requirements file (if used)
- Configuration script (if used)
- Supporting code (if used)
Configuration¶
Calling `sagemaker_processing_main` adds many command line options to your script.
The option `--sagemaker-run` controls local or remote execution:

- Set `--sagemaker-run` to a falsy value (`no`, `false`, `0`) and the script calls your main function as usual and runs locally.
- Set `--sagemaker-run` to a truthy value (`yes`, `true`, `1`) and the script uploads itself and any requirements or inputs to S3, executes remotely on SageMaker, and saves outputs to S3, logging results to the terminal.

Set `--sagemaker-wait` truthy to tail logs and wait for completion, or falsy to return as soon as the job starts.
Defaults are set through code and can be overridden on the command line. For example:

- Use the function argument `image` to set the default container image for your script.
- Use the command line argument `--sagemaker-image` to override the container image on a particular run.
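The default-plus-override behavior can be sketched with plain `argparse` (a simplified illustration, not the library's actual implementation):

```python
import argparse

def build_parser(image='aws-sagemaker-remote-processing:latest'):
    # The function argument supplies the default;
    # the command line overrides it for a particular run.
    parser = argparse.ArgumentParser()
    parser.add_argument('--sagemaker-image', default=image)
    return parser

# Default taken from code:
args = build_parser(image='my-image:latest').parse_args([])
print(args.sagemaker_image)  # my-image:latest

# Overridden on the command line:
args = build_parser(image='my-image:latest').parse_args(
    ['--sagemaker-image', 'other-image:1'])
print(args.sagemaker_image)  # other-image:1
```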
See functions and commands (todo: links)
Environment Customization¶
The environment can be customized in multiple ways.

- Instance
  - Function argument `instance`
  - Command line argument `--sagemaker-instance`
  - Selects the instance type of the machine running the container
- Image
  - Function argument `image`
  - Command line argument `--sagemaker-image`
  - Accepts the URI of a Docker container image on ECR to run
  - Build a custom Docker image for major customizations
- Configuration script
  - Function argument `configuration_script`
  - Command line argument `--sagemaker-configuration-script`
  - Accepts the path to a text file. The file is uploaded to S3 and run with `source [file]`.
  - Bash script file for minor customization, e.g., `export MYVAR=value` or `yum install -y mypackage`
- Configuration command
  - Function argument `configuration_command`
  - Command line argument `--sagemaker-configuration-command`
  - Accepts a bash command to run
  - Bash command for minor customization, e.g., `export MYVAR=value && yum install -y mypackage`
- Requirements file
  - Function argument `requirements`
  - Command line argument `--sagemaker-requirements`
  - Accepts the path to a text file. The file is uploaded to S3 and run with `python -m pip install -r [file]`.
  - Use for installing Python packages, listed one per line. Standard `requirements.txt` file format: https://pip.pypa.io/en/stable/reference/pip_install/#requirements-file-format
- Module uploads
  - Function argument `modules`
  - Dictionary of `[key]->[value]`. Each key creates a command line argument `--key` that defaults to `value`.
  - Each `value` is a directory containing a Python module that will be uploaded to S3, downloaded to SageMaker, and put on the PYTHONPATH.
  - For example, if directory `mymodule` contains the files `__init__.py` and `myfile.py`, and `myfile.py` contains `def myfunction():...`, pass `modules={'mymodule':'path/to/mymodule'}` to `sagemaker_processing_main` and then use `from mymodule.myfile import myfunction` in your script.
  - Use module uploads for supporting code that is not installed from packages.
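What the module upload amounts to can be sketched locally: a directory containing a package is placed on the path, after which ordinary imports work. The layout below is created in a temporary directory purely for illustration:

```python
import os
import sys
import tempfile

# Build the example layout: mymodule/__init__.py and mymodule/myfile.py
mount = tempfile.mkdtemp()
pkg = os.path.join(mount, 'mymodule')
os.makedirs(pkg)
open(os.path.join(pkg, '__init__.py'), 'w').close()
with open(os.path.join(pkg, 'myfile.py'), 'w') as f:
    f.write('def myfunction():\n    return 42\n')

# On SageMaker the module mount directory is added to the PYTHONPATH;
# locally the equivalent is:
sys.path.insert(0, mount)
from mymodule.myfile import myfunction
print(myfunction())  # 42
```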
Additional arguments¶
- Any arguments passed to your script locally on the command line are passed to your script remotely and tracked by SageMaker.
- Internally, `sagemaker_processing_main` uses `argparse`. To add additional command-line flags, either:

Pass a list of kwargs dictionaries to `additional_arguments`:

```python
sagemaker_processing_main(
    # ...
    additional_arguments=[
        {
            'dest': '--filter-width',
            'default': 32,
            'help': 'Filter width'
        },
        {
            'dest': '--filter-height',
            'default': 32,
            'help': 'Filter height'
        }
    ]
)
```

Pass a callback to `argparse_callback`:

```python
from argparse import ArgumentParser

def argparse_callback(parser: ArgumentParser):
    parser.add_argument(
        '--filter-width', default=32, help='Filter width')
    parser.add_argument(
        '--filter-height', default=32, help='Filter height')

sagemaker_processing_main(
    # ...
    argparse_callback=argparse_callback
)
```
Note: local command-line arguments are parsed, stored on SageMaker, then used to generate a command line for your script.

- All flags are serialized into a string-to-string dictionary.
- All flags must have a single non-empty argument.
- Use CSV, JSON, or another encoding in a single string argument instead of repeated arguments.
- Explicitly passing empty arguments on the command line is not supported.
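Since every flag must carry a single non-empty string, list-valued parameters can be packed into one string and unpacked in your script. A CSV sketch (the `--filter-sizes` flag is hypothetical):

```python
import argparse

parser = argparse.ArgumentParser()
# One string argument encodes the whole list
parser.add_argument('--filter-sizes', default='32,64,128',
                    help='Comma-separated filter sizes')
args = parser.parse_args([])
sizes = [int(s) for s in args.filter_sizes.split(',')]
print(sizes)  # [32, 64, 128]
```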
Command-Line Arguments¶
These command-line arguments were created using the following parameters. Command-line arguments are generated for each item in `inputs`, `outputs`, and `dependencies`.

```python
inputs={
    'input': '/path/to/input'
},
outputs={
    'output': ('/path/to/output', 'default')
},
dependencies={
    'my_module': '/path/to/my_module'
}
```
```
usage: aws-sagemaker-remote-processing [-h]
       [--sagemaker-profile SAGEMAKER_PROFILE]
       [--sagemaker-run [SAGEMAKER_RUN]]
       [--sagemaker-wait [SAGEMAKER_WAIT]]
       [--sagemaker-script SAGEMAKER_SCRIPT]
       [--sagemaker-python SAGEMAKER_PYTHON]
       [--sagemaker-job-name SAGEMAKER_JOB_NAME]
       [--sagemaker-base-job-name SAGEMAKER_BASE_JOB_NAME]
       [--sagemaker-runtime-seconds SAGEMAKER_RUNTIME_SECONDS]
       [--sagemaker-role SAGEMAKER_ROLE]
       [--sagemaker-requirements SAGEMAKER_REQUIREMENTS]
       [--sagemaker-configuration-script SAGEMAKER_CONFIGURATION_SCRIPT]
       [--sagemaker-configuration-command SAGEMAKER_CONFIGURATION_COMMAND]
       [--sagemaker-image SAGEMAKER_IMAGE]
       [--sagemaker-image-path SAGEMAKER_IMAGE_PATH]
       [--sagemaker-image-accounts SAGEMAKER_IMAGE_ACCOUNTS]
       [--sagemaker-instance SAGEMAKER_INSTANCE]
       [--sagemaker-volume-size SAGEMAKER_VOLUME_SIZE]
       [--sagemaker-output-json SAGEMAKER_OUTPUT_JSON]
       [--sagemaker-input-mount SAGEMAKER_INPUT_MOUNT]
       [--input INPUT]
       [--input-mode INPUT_MODE]
       [--sagemaker-output-mount SAGEMAKER_OUTPUT_MOUNT]
       [--output OUTPUT]
       [--output-s3 OUTPUT_S3]
       [--output-mode OUTPUT_MODE]
       [--sagemaker-module-mount SAGEMAKER_MODULE_MOUNT]
       [--my-module MY_MODULE]
```
SageMaker¶
SageMaker options:

- `--sagemaker-profile`: AWS profile for the SageMaker session. Default: `default`
- `--sagemaker-run`: Run processing on SageMaker (yes/no). Default: `False`
- `--sagemaker-wait`: Wait for SageMaker processing to complete and tail logs (yes/no). Default: `True`
- `--sagemaker-script`: Python script to execute. Default: `script.py`
- `--sagemaker-python`: Python executable to use in the container. Default: `python3`
- `--sagemaker-job-name`: Job name for SageMaker processing. If not provided, generated from the base job name. Leave blank for most use cases. Default: empty
- `--sagemaker-base-job-name`: Base job name for SageMaker processing. The job name is generated from the base name and a timestamp. Default: `processing-job`
- `--sagemaker-runtime-seconds`: SageMaker maximum runtime in seconds. Default: `3600`
- `--sagemaker-role`: AWS role for SageMaker execution. Default: `aws-sagemaker-remote-processing-role`
- `--sagemaker-requirements`: Requirements file to install on SageMaker. Default: `None`
- `--sagemaker-configuration-script`: Bash configuration script to source on SageMaker. Default: `None`
- `--sagemaker-configuration-command`: Bash command to run on SageMaker for configuration (e.g., `pip install aws_sagemaker_remote && export MYVAR=MYVALUE`). Default: `None`
- `--sagemaker-image`: AWS ECR image URI of the Docker image to run SageMaker processing. Default: `aws-sagemaker-remote-processing:latest`
- `--sagemaker-image-path`: Path to the Dockerfile if the image does not exist yet. Default: `aws_sagemaker_remote/ecr/processing` within the installed package
- `--sagemaker-image-accounts`: Accounts required to build the Dockerfile. Default: `763104351884`
- `--sagemaker-instance`: AWS SageMaker instance type to run processing. Default: `ml.t3.medium`
- `--sagemaker-volume-size`: AWS SageMaker volume size in GB. Default: `30`
- `--sagemaker-output-json`: Write SageMaker processing job details to a JSON file. Default: `None`
Inputs¶
Input options:

- `--sagemaker-input-mount`: Mount point for inputs. If running on SageMaker, inputs are mounted here. If running locally, S3 inputs are downloaded here. No effect on local inputs when running locally. Default: `/opt/ml/processing/input`
- `--input`: Input `input`. Local path or path on S3. If running locally, local paths are used directly and S3 paths are downloaded to `[--sagemaker-input-mount]/input`. If running on SageMaker, local paths are uploaded to S3 and then downloaded to `[--sagemaker-input-mount]/input`, and S3 paths are downloaded to `[--sagemaker-input-mount]/input`. Default: `/path/to/input`
- `--input-mode`: Input `input` mode. `File` or `Pipe`. Default: `File`
Output¶
Output options:

- `--sagemaker-output-mount`: Mount point for outputs. If running on SageMaker, outputs written here are uploaded to S3. If running locally, S3 outputs written here are uploaded to S3. No effect on local outputs when running locally. Default: `/opt/ml/processing/output`
- `--output`: Output `output` local path. If running locally, set to a local path. Default: `/path/to/output`
- `--output-s3`: Output `output` S3 URI. Results are uploaded to this URI. An empty string automatically generates a URI. Default: `default`
- `--output-mode`: Output `output` mode. `Continuous` or `EndOfJob`. Default: `EndOfJob`
Modules¶
Module options:

- `--sagemaker-module-mount`: Mount point for modules. If running on SageMaker, modules are mounted here and this directory is added to the PYTHONPATH. Default: `/opt/ml/processing/modules`
- `--my-module`: Directory of the `my_module` module. If running on SageMaker, modules are uploaded and placed on the PYTHONPATH. Default: `/path/to/my_module`
Example Code¶
The following example creates a processor with no inputs and one output named `output`.

- Running the file without arguments runs locally. The argument `--output` sets the output directory.
- Running the file with `--sagemaker-run=yes` runs on SageMaker. The argument `--output` is automatically set to a mount point on SageMaker and outputs are uploaded to S3. Use `--output-s3` to set the S3 output path, or leave it as `default` to automatically generate an appropriate path based on the job name.

The example code uploads `aws_sagemaker_remote` from the local filesystem using the `dependencies` argument. Alternatively:

- Add `aws_sagemaker_remote` to your Docker image.
- Create a `requirements.txt` file including `aws_sagemaker_remote`. Pass the path of the requirements file to the `requirements` function argument or the `--sagemaker-requirements` command-line argument.
- Create a bash script including `pip install aws-sagemaker-remote`. Pass the path of the script to the `configuration_script` function argument or the `--sagemaker-configuration-script` command-line argument.
- Pass `pip install aws-sagemaker-remote` to the `configuration_command` function argument or the `--sagemaker-configuration-command` command-line argument.
See mnist_processor.py.
```python
from torchvision.datasets import MNIST
import torchvision.transforms as transforms
from aws_sagemaker_remote.processing import sagemaker_processing_main
import aws_sagemaker_remote


def main(args):
    # Main function runs locally or remotely
    dataroot = args.output
    MNIST(
        root=dataroot, download=True, train=True,
        transform=transforms.ToTensor()
    )
    MNIST(
        root=dataroot, download=True, train=False,
        transform=transforms.ToTensor()
    )
    print("Downloaded MNIST")


if __name__ == '__main__':
    sagemaker_processing_main(
        script=__file__,  # script path for remote execution
        main=main,  # main function for local execution
        outputs={
            # Add the command line flag `--output`
            # flag: default path
            'output': 'output/data'
        },
        dependencies={
            # Add a module to SageMaker
            # module name: module path
            'aws_sagemaker_remote': aws_sagemaker_remote
        },
        configuration_command='pip3 install --upgrade sagemaker sagemaker-experiments',
        # Name the job
        base_job_name='demo-mnist-processor'
    )
```