SageMaker Processing¶
Processing jobs accept one or more input file paths and write to one or more output file paths, making them ideal for file conversion and other data preparation tasks. S3 files can be automatically copied to local storage, so existing scripts need little modification.
SageMaker processing is best suited for processing large files or when random access to files is required. Alternatively:
- If the process requires more code or a custom image that cannot run in a Lambda, but can be fully parallelized, use SageMaker batch transform to allocate a fleet of containers that process each object in S3. See the SageMaker batch transform documentation.
- If the process fits in a small JavaScript package, it can run faster, cheaper, and with better parallelization using S3 Batch and Lambda. See the S3 Batch documentation.
A SageMaker processing script can be run locally or remotely.
- Running locally, your command line arguments for inputs and outputs are passed to your function as usual.
- Running remotely, paths referenced by command line arguments are uploaded and downloaded through S3, and your function is executed remotely with command line arguments referencing local copies of those files.
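For example, a hypothetical `main` that treats its arguments as plain local directories works unchanged in both modes (the `data.txt` filename and the uppercasing step here are illustrative, not part of the library):

```python
import os

def main(args):
    # args.input and args.output are ordinary local paths in both modes;
    # on a remote run they point at local copies under the container mounts.
    with open(os.path.join(args.input, 'data.txt')) as f:
        text = f.read()
    with open(os.path.join(args.output, 'data_upper.txt'), 'w') as f:
        f.write(text.upper())
```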
Basic usage¶
Write a script with a `main` function that calls `sagemaker_processing_main`.

```python
from aws_sagemaker_remote import sagemaker_processing_main

def main(args):
    # your code here
    pass

if __name__ == '__main__':
    sagemaker_processing_main(
        main=main,
        # ...
    )
```
Pass function argument `run=True` or command line argument `--sagemaker-run=True` to run the script remotely on SageMaker.
- Many command-line arguments are automatically added. See Command-Line Arguments.
- Parameters to `sagemaker_processing_main` control which command-line arguments are automatically added and their default values. See `aws_sagemaker_remote.processing.main.sagemaker_processing_main()` and `aws_sagemaker_remote.processing.args.sagemaker_processing_args()`.
Processing Job Tracking¶
Use the SageMaker console to view a list of all processing jobs. For each job, SageMaker tracks:
- Processing time
- Container used
- Link to CloudWatch logs
- Path on S3 for each of:
- Script file
- Each input channel
- Each output channel
- Requirements file (if used)
- Configuration script (if used)
- Supporting code (if used)
Configuration¶
Calling `sagemaker_processing_main` adds many command line options to your script.
The option `--sagemaker-run` controls local or remote execution:

- Set `--sagemaker-run` to a falsy value (`no`, `false`, `0`) and the script calls your main function as usual and runs locally.
- Set `--sagemaker-run` to a truthy value (`yes`, `true`, `1`) and the script uploads itself and any requirements or inputs to S3, executes remotely on SageMaker, and saves outputs to S3, logging results to the terminal.

Set `--sagemaker-wait` truthy to tail logs and wait for completion, or falsy to return as soon as the job starts.
Defaults are set through code and can be overridden on the command line. For example:

- Use the function argument `image` to set the default container image for your script.
- Use the command line argument `--sagemaker-image` to override the container image on a particular run.
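The default-plus-override behavior can be sketched with plain `argparse` (a simplified illustration, not the library's actual implementation):

```python
import argparse

def build_parser(image='aws-sagemaker-remote-processing:latest'):
    # The function argument supplies the default;
    # the command line overrides it for a particular run.
    parser = argparse.ArgumentParser()
    parser.add_argument('--sagemaker-image', default=image)
    return parser

# Default taken from code:
args = build_parser(image='my-image:latest').parse_args([])
print(args.sagemaker_image)  # my-image:latest

# Overridden on the command line:
args = build_parser(image='my-image:latest').parse_args(
    ['--sagemaker-image', 'other-image:1'])
print(args.sagemaker_image)  # other-image:1
```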
See functions and commands (todo: links)
Environment Customization¶
The environment can be customized in multiple ways.

- Instance
  - Function argument `instance`
  - Command line argument `--sagemaker-instance`
  - Selects the instance type of the machine running the container
- Image
  - Function argument `image`
  - Command line argument `--sagemaker-image`
  - Accepts the URI of a Docker container image on ECR to run
  - Build a custom Docker image for major customizations
- Configuration script
  - Function argument `configuration_script`
  - Command line argument `--sagemaker-configuration-script`
  - Accepts the path to a text file. The file is uploaded to S3 and run with `source [file]`.
  - Bash script file for minor customization, e.g., `export MYVAR=value` or `yum install -y mypackage`
- Configuration command
  - Function argument `configuration_command`
  - Command line argument `--sagemaker-configuration-command`
  - Accepts a bash command to run
  - Bash command for minor customization, e.g., `export MYVAR=value && yum install -y mypackage`
- Requirements file
  - Function argument `requirements`
  - Command line argument `--sagemaker-requirements`
  - Accepts the path to a text file. The file is uploaded to S3 and run with `python -m pip install -r [file]`.
  - Use for installing Python packages, listed one per line. Standard `requirements.txt` file format: https://pip.pypa.io/en/stable/reference/pip_install/#requirements-file-format
- Module uploads
  - Function argument `modules`
  - Dictionary of `[key]->[value]`. Each key creates a command line argument `--key` that defaults to `value`.
  - Each `value` is a directory containing a Python module that will be uploaded to S3, downloaded to SageMaker, and put on the PYTHONPATH.
  - For example, if directory `mymodule` contains the files `__init__.py` and `myfile.py`, and `myfile.py` contains `def myfunction():...`, pass `modules={'mymodule':'path/to/mymodule'}` to `sagemaker_processing_main` and then use `from mymodule.myfile import myfunction` in your script.
  - Use module uploads for supporting code that is not installed from packages.
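What the module upload amounts to can be sketched locally: a directory containing a package is placed on the path, after which ordinary imports work. The layout below is created in a temporary directory purely for illustration:

```python
import os
import sys
import tempfile

# Build the example layout: mymodule/__init__.py and mymodule/myfile.py
mount = tempfile.mkdtemp()
pkg = os.path.join(mount, 'mymodule')
os.makedirs(pkg)
open(os.path.join(pkg, '__init__.py'), 'w').close()
with open(os.path.join(pkg, 'myfile.py'), 'w') as f:
    f.write('def myfunction():\n    return 42\n')

# On SageMaker the module mount directory is added to the PYTHONPATH;
# locally the equivalent is:
sys.path.insert(0, mount)
from mymodule.myfile import myfunction
print(myfunction())  # 42
```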
Additional arguments¶
- Any arguments passed to your script locally on the command line are passed to your script remotely and tracked by SageMaker.
- Internally, `sagemaker_processing_main` uses `argparse`. To add additional command-line flags, either:

Pass a list of kwargs dictionaries to `additional_arguments`:

```python
sagemaker_processing_main(
    # ...
    additional_arguments=[
        {
            'dest': '--filter-width',
            'default': 32,
            'help': 'Filter width'
        },
        {
            'dest': '--filter-height',
            'default': 32,
            'help': 'Filter height'
        }
    ]
)
```

Pass a callback to `argparse_callback`:

```python
from argparse import ArgumentParser

def argparse_callback(parser: ArgumentParser):
    parser.add_argument(
        '--filter-width', default=32, help='Filter width')
    parser.add_argument(
        '--filter-height', default=32, help='Filter height')

sagemaker_processing_main(
    # ...
    argparse_callback=argparse_callback
)
```
Note: local command-line arguments are parsed, stored on SageMaker, then used to generate a command line for your script.

- All flags are serialized into a string-to-string dictionary.
- All flags must have a single non-empty argument.
- Use CSV, JSON, or another encoding in a single string argument instead of repeated arguments.
- Explicitly passing empty arguments on the command line is not supported.
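Since every flag must carry a single non-empty string, list-valued parameters can be packed into one string and unpacked in your script. A CSV sketch (the `--filter-sizes` flag is hypothetical):

```python
import argparse

parser = argparse.ArgumentParser()
# One string argument encodes the whole list
parser.add_argument('--filter-sizes', default='32,64,128',
                    help='Comma-separated filter sizes')
args = parser.parse_args([])
sizes = [int(s) for s in args.filter_sizes.split(',')]
print(sizes)  # [32, 64, 128]
```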
Command-Line Arguments¶
These command-line arguments were created using the following parameters. Command-line arguments are generated for each item in `inputs`, `outputs`, and `dependencies`.

```python
inputs={
    'input': '/path/to/input'
},
outputs={
    'output': ('/path/to/output', 'default')
},
dependencies={
    'my_module': '/path/to/my_module'
}
```
```
usage: aws-sagemaker-remote-processing [-h]
       [--sagemaker-profile SAGEMAKER_PROFILE]
       [--sagemaker-run [SAGEMAKER_RUN]]
       [--sagemaker-wait [SAGEMAKER_WAIT]]
       [--sagemaker-script SAGEMAKER_SCRIPT]
       [--sagemaker-python SAGEMAKER_PYTHON]
       [--sagemaker-job-name SAGEMAKER_JOB_NAME]
       [--sagemaker-base-job-name SAGEMAKER_BASE_JOB_NAME]
       [--sagemaker-runtime-seconds SAGEMAKER_RUNTIME_SECONDS]
       [--sagemaker-role SAGEMAKER_ROLE]
       [--sagemaker-requirements SAGEMAKER_REQUIREMENTS]
       [--sagemaker-configuration-script SAGEMAKER_CONFIGURATION_SCRIPT]
       [--sagemaker-configuration-command SAGEMAKER_CONFIGURATION_COMMAND]
       [--sagemaker-image SAGEMAKER_IMAGE]
       [--sagemaker-image-path SAGEMAKER_IMAGE_PATH]
       [--sagemaker-image-accounts SAGEMAKER_IMAGE_ACCOUNTS]
       [--sagemaker-instance SAGEMAKER_INSTANCE]
       [--sagemaker-volume-size SAGEMAKER_VOLUME_SIZE]
       [--sagemaker-output-json SAGEMAKER_OUTPUT_JSON]
       [--sagemaker-input-mount SAGEMAKER_INPUT_MOUNT]
       [--input INPUT]
       [--input-mode INPUT_MODE]
       [--sagemaker-output-mount SAGEMAKER_OUTPUT_MOUNT]
       [--output OUTPUT]
       [--output-s3 OUTPUT_S3]
       [--output-mode OUTPUT_MODE]
       [--sagemaker-module-mount SAGEMAKER_MODULE_MOUNT]
       [--my-module MY_MODULE]
```
SageMaker¶
SageMaker options:

- `--sagemaker-profile`: AWS profile for the SageMaker session. Default: `default`
- `--sagemaker-run`: Run processing on SageMaker (yes/no). Default: `False`
- `--sagemaker-wait`: Wait for SageMaker processing to complete and tail logs (yes/no). Default: `True`
- `--sagemaker-script`: Python script to execute. Default: `script.py`
- `--sagemaker-python`: Python executable to use in the container. Default: `python3`
- `--sagemaker-job-name`: Job name for SageMaker processing. If not provided, generated from the base job name. Leave blank for most use cases. Default: empty
- `--sagemaker-base-job-name`: Base job name for SageMaker processing. The job name is generated from the base name and a timestamp. Default: `processing-job`
- `--sagemaker-runtime-seconds`: SageMaker maximum runtime in seconds. Default: `3600`
- `--sagemaker-role`: AWS role for SageMaker execution. Default: `aws-sagemaker-remote-processing-role`
- `--sagemaker-requirements`: Requirements file to install on SageMaker. Default: `None`
- `--sagemaker-configuration-script`: Bash configuration script to source on SageMaker. Default: `None`
- `--sagemaker-configuration-command`: Bash command to run on SageMaker for configuration (e.g., `pip install aws_sagemaker_remote && export MYVAR=MYVALUE`). Default: `None`
- `--sagemaker-image`: AWS ECR image URI of the Docker image to run SageMaker processing. Default: `aws-sagemaker-remote-processing:latest`
- `--sagemaker-image-path`: Path to the Dockerfile if the image does not exist yet. Default: `aws_sagemaker_remote/ecr/processing` within the installed package
- `--sagemaker-image-accounts`: Accounts required to build the Dockerfile. Default: `763104351884`
- `--sagemaker-instance`: AWS SageMaker instance type to run processing. Default: `ml.t3.medium`
- `--sagemaker-volume-size`: AWS SageMaker volume size in GB. Default: `30`
- `--sagemaker-output-json`: Write SageMaker processing job details to a JSON file. Default: `None`
Inputs¶
Input options:

- `--sagemaker-input-mount`: Mount point for inputs. If running on SageMaker, inputs are mounted here. If running locally, S3 inputs are downloaded here. No effect on local inputs when running locally. Default: `/opt/ml/processing/input`
- `--input`: Input `input`. Local path or path on S3. If running locally, local paths are used directly and S3 paths are downloaded to `[--sagemaker-input-mount]/input`. If running on SageMaker, local paths are uploaded to S3 and then downloaded to `[--sagemaker-input-mount]/input`, and S3 paths are downloaded to `[--sagemaker-input-mount]/input`. Default: `/path/to/input`
- `--input-mode`: Input `input` mode. `File` or `Pipe`. Default: `File`
Output¶
Output options:

- `--sagemaker-output-mount`: Mount point for outputs. If running on SageMaker, outputs written here are uploaded to S3. If running locally, S3 outputs written here are uploaded to S3. No effect on local outputs when running locally. Default: `/opt/ml/processing/output`
- `--output`: Output `output` local path. If running locally, set to a local path. Default: `/path/to/output`
- `--output-s3`: Output `output` S3 URI. Results are uploaded to this URI. An empty string automatically generates a URI. Default: `default`
- `--output-mode`: Output `output` mode. `Continuous` or `EndOfJob`. Default: `EndOfJob`
Modules¶
Module options:

- `--sagemaker-module-mount`: Mount point for modules. If running on SageMaker, modules are mounted here and this directory is added to the PYTHONPATH. Default: `/opt/ml/processing/modules`
- `--my-module`: Directory of the `my_module` module. If running on SageMaker, modules are uploaded and placed on the PYTHONPATH. Default: `/path/to/my_module`
Example Code¶
The following example creates a processor with no inputs and one output named `output`.

- Running the file without arguments runs locally. The argument `--output` sets the output directory.
- Running the file with `--sagemaker-run=yes` runs on SageMaker. The argument `--output` is automatically set to a mount point on SageMaker and outputs are uploaded to S3. Use `--output-s3` to set the S3 output path, or leave it as `default` to automatically generate an appropriate path based on the job name.

The example code uploads `aws_sagemaker_remote` from the local filesystem using the `dependencies` argument. Alternatively:

- Add `aws_sagemaker_remote` to your Docker image.
- Create a `requirements.txt` file including `aws_sagemaker_remote`. Pass the path of the requirements file to the `requirements` function argument or the `--sagemaker-requirements` command-line argument.
- Create a bash script including `pip install aws-sagemaker-remote`. Pass the path of the script to the `configuration_script` function argument or the `--sagemaker-configuration-script` command-line argument.
- Pass `pip install aws-sagemaker-remote` to the `configuration_command` function argument or the `--sagemaker-configuration-command` command-line argument.
See mnist_processor.py.
```python
from torchvision.datasets import MNIST
import torchvision.transforms as transforms
from aws_sagemaker_remote.processing import sagemaker_processing_main
import aws_sagemaker_remote


def main(args):
    # Main function runs locally or remotely
    dataroot = args.output
    MNIST(
        root=dataroot, download=True, train=True,
        transform=transforms.ToTensor()
    )
    MNIST(
        root=dataroot, download=True, train=False,
        transform=transforms.ToTensor()
    )
    print("Downloaded MNIST")


if __name__ == '__main__':
    sagemaker_processing_main(
        script=__file__,  # script path for remote execution
        main=main,  # main function for local execution
        outputs={
            # Add the command line flag `--output`
            # flag: default path
            'output': 'output/data'
        },
        dependencies={
            # Add a module to SageMaker
            # module name: module path
            'aws_sagemaker_remote': aws_sagemaker_remote
        },
        configuration_command='pip3 install --upgrade sagemaker sagemaker-experiments',
        # Name the job
        base_job_name='demo-mnist-processor'
    )
```