Processing

Processing jobs accept one or more input file paths and write to one or more output file paths. They are well suited to file conversion and other data preparation tasks.

  • When running locally, inputs and outputs are passed as ordinary command-line arguments
  • When running remotely, data is uploaded to and downloaded from S3 so that it can be tracked

Basic usage

Write a script with a main function that calls sagemaker_processing_main.

from aws_sagemaker_remote import sagemaker_processing_main

def main(args):
    # your code here
    pass

if __name__ == '__main__':
    sagemaker_processing_main(
        main=main,
        # ...
    )

Pass the function argument run=True or the command-line argument --sagemaker-run=True to run the script remotely on SageMaker.

Processing Job Tracking

Use the SageMaker console to view a list of all processing jobs. For each job, SageMaker tracks:

  • Processing time
  • Container used
  • Link to CloudWatch logs
  • Path on S3 for each of:
    • Script file
    • Each input channel
    • Each output channel
    • Requirements file (if used)
    • Configuration script (if used)
    • Supporting code (if used)

Configuration

sagemaker_processing_main adds many command-line options to your script.

Option --sagemaker-run controls local or remote execution.

  • If --sagemaker-run is set to a falsy value (no, false, 0), the script calls your main function as usual and runs locally.
  • If --sagemaker-run is set to a truthy value (yes, true, 1), the script uploads itself and any requirements or inputs to S3, executes remotely on SageMaker, saves outputs to S3, and logs results to the terminal.

Set --sagemaker-wait to a truthy value to tail logs and wait for completion, or to a falsy value to return as soon as the job starts.
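As a rough illustration of the yes/no flags described above, the following sketch parses truthy and falsy strings with argparse. The helper str2bool, the function build_parser, and the choice to treat a bare flag as "yes" are assumptions made for this example; they are not taken from the library's implementation.

```python
import argparse


def str2bool(value):
    """Convert a yes/no style string to a bool (hypothetical helper)."""
    if isinstance(value, bool):
        return value
    text = value.strip().lower()
    if text in ("yes", "true", "1"):
        return True
    if text in ("no", "false", "0"):
        return False
    raise argparse.ArgumentTypeError("expected yes/no, got %r" % value)


def build_parser():
    """Build a parser with two yes/no flags modelled on the options above."""
    parser = argparse.ArgumentParser()
    # nargs='?' with const=True lets a bare `--sagemaker-run` mean "yes"
    parser.add_argument("--sagemaker-run", type=str2bool, nargs="?",
                        const=True, default=False)
    parser.add_argument("--sagemaker-wait", type=str2bool, nargs="?",
                        const=True, default=True)
    return parser
```

With this sketch, build_parser().parse_args(["--sagemaker-run=yes"]) sets sagemaker_run to True, and omitting the flag leaves it at its default of False.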

Defaults are set through code. Defaults can be overwritten on the command line. For example:

  • Use the function argument image to set the default container image for your script
  • Use the command line argument --sagemaker-image to override the container image on a particular run
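The relationship between code defaults and command-line overrides can be sketched with plain argparse. The function parse_image is hypothetical and only mirrors the pattern described above; the default image name is the one listed in the option reference.

```python
import argparse


def parse_image(argv, image="aws-sagemaker-remote-processing:latest"):
    """Return the image to use: the code default unless overridden in argv."""
    parser = argparse.ArgumentParser()
    # The function argument supplies the default for the command-line flag
    parser.add_argument("--sagemaker-image", default=image)
    return parser.parse_args(argv).sagemaker_image
```

Calling parse_image([], image="my-image:1") returns the code default, while parse_image(["--sagemaker-image", "other:2"], image="my-image:1") returns the value given on the command line.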

See functions and commands (todo: links)

Environment Customization

The environment can be customized in multiple ways.

  • Instance
    • Function argument instance
    • Command line argument --sagemaker-instance
    • Select instance type of machine running the container
  • Image
    • Function argument image
    • Command line argument --sagemaker-image
    • Accepts URI of Docker container image on ECR to run
    • Build a custom Docker image for major customizations
  • Configuration script
    • Function argument configuration_script
    • Command line argument --sagemaker-configuration-script
    • Accepts path to a text file. Will upload text file to S3 and run source [file].
    • Bash script file for minor customization, e.g., export MYVAR=value or yum install -y mypackage
  • Configuration command
    • Function argument configuration_command
    • Command line argument --sagemaker-configuration-command
    • Accepts a bash command to run.
    • Bash command for minor customization, e.g., export MYVAR=value && yum install -y mypackage
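  • Requirements file
    • Function argument requirements
    • Command line argument --sagemaker-requirements
    • Accepts path to a requirements file to install on SageMaker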
  • Module uploads
    • Function argument dependencies
      • Dictionary of [key]->[value]
      • Each key will create command line argument --key that defaults to value
    • Each value is a directory containing a Python module that will be uploaded to S3, downloaded to SageMaker, and put on the PYTHONPATH
    • For example, if directory mymodule contains the files __init__.py and myfile.py, and myfile.py contains def myfunction():..., pass dependencies={'mymodule':'path/to/mymodule'} to sagemaker_processing_main and then use from mymodule.myfile import myfunction in your script.
    • Use module uploads for supporting code that is not being installed from packages.
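To make the module layout concrete, the sketch below recreates the mymodule example in a temporary directory and imports it after putting the parent directory on the import path. It runs entirely locally and only illustrates what "put on the PYTHONPATH" means for your script; the temporary directory and the return value of myfunction are made up for the example.

```python
import importlib
import os
import sys
import tempfile

# Build a throwaway directory that mirrors the layout described above
root = tempfile.mkdtemp()
module_dir = os.path.join(root, "mymodule")
os.makedirs(module_dir)
with open(os.path.join(module_dir, "__init__.py"), "w") as f:
    f.write("")
with open(os.path.join(module_dir, "myfile.py"), "w") as f:
    f.write("def myfunction():\n    return 'hello from mymodule'\n")

# Putting the parent directory on the import path makes `mymodule` importable
sys.path.insert(0, root)
importlib.invalidate_caches()

from mymodule.myfile import myfunction

result = myfunction()
```

On SageMaker the same import statement works because the uploaded module directory is placed under the module mount point, which is added to PYTHONPATH.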

Additional arguments

Any arguments passed to your script locally on the command line are passed to your script remotely and tracked by SageMaker.

Internally, sagemaker_processing_main uses argparse. To add additional command-line flags:

  • Pass a list of kwargs dictionaries to additional_arguments

    sagemaker_processing_main(
      #...
      additional_arguments = [
        {
          'dest': '--filter-width',
          'default':32,
          'help':'Filter width'
        },
        {
          'dest':'--filter-height',
          'default':32,
          'help':'Filter height'
        }
      ]
    )
    
  • Pass a callback to argparse_callback

    from argparse import ArgumentParser

    def argparse_callback(parser: ArgumentParser):
      # Register extra flags on the parser used by sagemaker_processing_main
      parser.add_argument(
        '--filter-width',
        default=32,
        help='Filter width')
      parser.add_argument(
        '--filter-height',
        default=32,
        help='Filter height')

    sagemaker_processing_main(
      # ...
      argparse_callback=argparse_callback
    )
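One way a list of kwargs dictionaries could be applied to an argparse parser is sketched below. The function add_arguments is hypothetical, and treating the 'dest' entry as the flag name is an assumption based on the example dictionaries; the library's own handling may differ.

```python
import argparse


def add_arguments(parser, additional_arguments):
    """Add one argparse argument per kwargs dictionary (illustrative only)."""
    for kwargs in additional_arguments:
        kwargs = dict(kwargs)        # copy so the caller's dict is untouched
        flag = kwargs.pop("dest")    # e.g. '--filter-width'
        parser.add_argument(flag, **kwargs)
    return parser


parser = add_arguments(
    argparse.ArgumentParser(),
    [
        {"dest": "--filter-width", "default": 32, "help": "Filter width"},
        {"dest": "--filter-height", "default": 32, "help": "Filter height"},
    ],
)
args = parser.parse_args(["--filter-width", "64"])
```

Note that argparse returns values given on the command line as strings unless a type is supplied, so args.filter_width is the string '64' here while args.filter_height keeps its integer default. Adding 'type': int to each dictionary would be one way to avoid the mismatch.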
    

Note: local command-line arguments are parsed, stored on SageMaker, then used to generate a command line for your script.

  • All flags are serialized into a string-to-string dictionary.
  • All flags must have a single non-empty argument.
  • To pass multiple values, encode them in a single string argument (CSV, JSON, or similar) instead of repeating the argument.
  • Explicitly passing empty arguments on the command line is not supported.
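The round trip described in the note can be sketched with standard-library tools. The helpers to_string_dict and to_command_line and the --channels flag are hypothetical; they only illustrate why each flag needs a single string value and how JSON can carry a list through that restriction.

```python
import argparse
import json


def to_string_dict(namespace):
    """Serialize parsed flags into a str -> str dictionary."""
    return {
        key: value if isinstance(value, str) else json.dumps(value)
        for key, value in vars(namespace).items()
    }


def to_command_line(string_dict):
    """Regenerate `--flag value` pairs from the serialized dictionary."""
    argv = []
    for key, value in string_dict.items():
        argv.extend(["--" + key.replace("_", "-"), value])
    return argv


parser = argparse.ArgumentParser()
parser.add_argument("--filter-width", type=int, default=32)
# A list is passed as one JSON string instead of a repeated flag
parser.add_argument("--channels", type=json.loads, default=[1, 2, 3])

local_args = parser.parse_args(["--channels", "[8, 16]"])
serialized = to_string_dict(local_args)
remote_args = parser.parse_args(to_command_line(serialized))
```

After the round trip, remote_args holds the same values as local_args, which is the property the restrictions above are meant to preserve.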

Command-Line Arguments

These command-line arguments were created using the following parameters. Command-line arguments are generated for each item in inputs, outputs and dependencies.

inputs={
    'input': '/path/to/input'
},
outputs={
    'output': ('/path/to/output', 'default')
},
dependencies={
    'my_module': '/path/to/my_module'
}

usage: aws-sagemaker-remote-processing [-h]
                                       [--sagemaker-profile SAGEMAKER_PROFILE]
                                       [--sagemaker-run [SAGEMAKER_RUN]]
                                       [--sagemaker-wait [SAGEMAKER_WAIT]]
                                       [--sagemaker-script SAGEMAKER_SCRIPT]
                                       [--sagemaker-python SAGEMAKER_PYTHON]
                                       [--sagemaker-job-name SAGEMAKER_JOB_NAME]
                                       [--sagemaker-base-job-name SAGEMAKER_BASE_JOB_NAME]
                                       [--sagemaker-runtime-seconds SAGEMAKER_RUNTIME_SECONDS]
                                       [--sagemaker-role SAGEMAKER_ROLE]
                                       [--sagemaker-requirements SAGEMAKER_REQUIREMENTS]
                                       [--sagemaker-configuration-script SAGEMAKER_CONFIGURATION_SCRIPT]
                                       [--sagemaker-configuration-command SAGEMAKER_CONFIGURATION_COMMAND]
                                       [--sagemaker-image SAGEMAKER_IMAGE]
                                       [--sagemaker-image-path SAGEMAKER_IMAGE_PATH]
                                       [--sagemaker-image-accounts SAGEMAKER_IMAGE_ACCOUNTS]
                                       [--sagemaker-instance SAGEMAKER_INSTANCE]
                                       [--sagemaker-volume-size SAGEMAKER_VOLUME_SIZE]
                                       [--sagemaker-output-json SAGEMAKER_OUTPUT_JSON]
                                       [--sagemaker-input-mount SAGEMAKER_INPUT_MOUNT]
                                       [--input INPUT]
                                       [--input-mode INPUT_MODE]
                                       [--sagemaker-output-mount SAGEMAKER_OUTPUT_MOUNT]
                                       [--output OUTPUT]
                                       [--output-s3 OUTPUT_S3]
                                       [--output-mode OUTPUT_MODE]
                                       [--sagemaker-module-mount SAGEMAKER_MODULE_MOUNT]
                                       [--my-module MY_MODULE]

SageMaker

SageMaker options

--sagemaker-profile
 

AWS profile for SageMaker session (default: [default])

Default: “default”

--sagemaker-run
 

Run processing on SageMaker (yes/no default=False)

Default: False

--sagemaker-wait
 

Wait for SageMaker processing to complete and tail logs (yes/no default=True)

Default: True

--sagemaker-script
 

Python script to execute (default: [script.py])

Default: “script.py”

--sagemaker-python
 

Python executable to use in container (default: [python3])

Default: “python3”

--sagemaker-job-name
 

Job name for SageMaker processing. If not provided, will be generated from base job name. Leave blank for most use-cases. (default: [])

Default: “”

--sagemaker-base-job-name
 

Base job name for SageMaker processing. Job name will be generated from the base name and a timestamp (default: [processing-job])

Default: “processing-job”

--sagemaker-runtime-seconds
 

SageMaker maximum runtime in seconds (default: [3600])

Default: 3600

--sagemaker-role
 

AWS role for SageMaker execution (default: [aws-sagemaker-remote-processing-role])

Default: “aws-sagemaker-remote-processing-role”

--sagemaker-requirements
 Requirements file to install on SageMaker (default: [None])
--sagemaker-configuration-script
 Bash configuration script to source on SageMaker (default: [None])
--sagemaker-configuration-command
 Bash command to run on SageMaker for configuration (e.g., pip install aws_sagemaker_remote && export MYVAR=MYVALUE) (default: [None])
--sagemaker-image
 

AWS ECR image URI of Docker image to run SageMaker processing (default: [aws-sagemaker-remote-processing:latest])

Default: “aws-sagemaker-remote-processing:latest”

--sagemaker-image-path
 

Path to Dockerfile if image does not exist yet (default: [/home/docs/checkouts/readthedocs.org/user_builds/aws-sagemaker-remote/checkouts/stable/aws_sagemaker_remote/ecr/processing])

Default: “/home/docs/checkouts/readthedocs.org/user_builds/aws-sagemaker-remote/checkouts/stable/aws_sagemaker_remote/ecr/processing”

--sagemaker-image-accounts
 

Accounts required to build Dockerfile (default: [763104351884])

Default: “763104351884”

--sagemaker-instance
 

AWS SageMaker instance to run processing (default: [ml.t3.medium])

Default: “ml.t3.medium”

--sagemaker-volume-size
 

AWS SageMaker volume size in GB (default: [30])

Default: 30

--sagemaker-output-json
 Write SageMaker training details to JSON file (default: [None])

Inputs

Input options

--sagemaker-input-mount
 

Mount point for inputs. If running on SageMaker, inputs are mounted here. If running locally, S3 inputs are downloaded here. No effect on local inputs when running locally. (default: [/opt/ml/processing/input])

Default: “/opt/ml/processing/input”

--input

Input [input]. Local path or path on S3. If running locally, local paths are used directly. If running locally, S3 paths are downloaded to [--sagemaker-input-mount/input]. If running on SageMaker, local paths are uploaded to S3 then S3 data is downloaded to [--sagemaker-input-mount/input]. If running on SageMaker, S3 paths are downloaded to [--sagemaker-input-mount/input]. (default: [/path/to/input])

Default: “/path/to/input”

--input-mode

Input [input] mode. File or Pipe. (default: [File])

Default: “File”
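The four cases in the --input description reduce to a single rule: only a local path used in a local run is read in place, and everything else is read from the mount point. The sketch below encodes that rule. The function resolve_input and its parameter names are hypothetical, and no upload or download is performed; it only computes where the script would read from.

```python
import posixpath


def resolve_input(path, name="input", on_sagemaker=False,
                  mount="/opt/ml/processing/input"):
    """Return the path the script reads from (hypothetical helper)."""
    is_s3 = path.startswith("s3://")
    if not on_sagemaker and not is_s3:
        # Local run with a local path: the path is used directly
        return path
    # Every other case ends up under <mount>/<name>
    return posixpath.join(mount, name)
```

For example, resolve_input("/data/raw") returns the path unchanged, while resolve_input("s3://bucket/raw") returns /opt/ml/processing/input/input.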

Output

Output options

--sagemaker-output-mount
 

Mount point for outputs. If running on SageMaker, outputs written here are uploaded to S3. If running locally, S3 outputs written here are uploaded to S3. No effect on local outputs when running locally. (default: [/opt/ml/processing/output])

Default: “/opt/ml/processing/output”

--output

Output [output] local path. If running locally, set to a local path. (default: [/path/to/output])

Default: “/path/to/output”

--output-s3

Output [output] S3 URI. Upload results to this URI. Empty string automatically generates a URI. (default: [default])

Default: “default”

--output-mode

Output [output] mode. Set to Continuous or EndOfJob. (default: [EndOfJob])

Default: “EndOfJob”

Modules

Module options

--sagemaker-module-mount
 

Mount point for modules. If running on SageMaker, modules are mounted here and this directory is added to PYTHONPATH (default: [/opt/ml/processing/modules])

Default: “/opt/ml/processing/modules”

--my-module

Directory of [my_module] module. If running on SageMaker, modules are uploaded and placed on PYTHONPATH. (default: [/path/to/my_module])

Default: “/path/to/my_module”

Example Code

The following example creates a processor with no inputs and one output named output.

  • Running the file without arguments will run locally. The argument --output sets the output directory.
  • Running the file with --sagemaker-run=yes will run on SageMaker. The argument --output is automatically set to a mountpoint on SageMaker and outputs are uploaded to S3. Use --output-s3 to set the S3 output path, or leave it as default to automatically generate an appropriate path based on the job name.

The example code uploads aws_sagemaker_remote from the local filesystem using the dependencies argument. Alternatively:

  • Add aws_sagemaker_remote to your Docker image.
  • Create a requirements.txt file including aws_sagemaker_remote. Pass the path of the requirements file to the requirements function argument or the --sagemaker-requirements command-line argument.
  • Create a bash script including pip install aws-sagemaker-remote. Pass the path of the script to the configuration_script function argument or the --sagemaker-configuration-script command-line argument.
  • Pass pip install aws-sagemaker-remote to the configuration_command function argument or the --sagemaker-configuration-command command-line argument.

See mnist_processor.py.

import argparse
import os
import pprint
from torch import nn
from torch.utils import data
from torchvision.datasets import MNIST
import torchvision.transforms as transforms
from aws_sagemaker_remote.processing import sagemaker_processing_main
import aws_sagemaker_remote

def main(args):
    # Main function runs locally or remotely
    dataroot = args.output
    MNIST(
        root=dataroot, download=True, train=True,
        transform=transforms.ToTensor()
    )
    MNIST(
        root=dataroot, download=True, train=False,
        transform=transforms.ToTensor()
    )
    print("Downloaded MNIST")


if __name__ == '__main__':
    sagemaker_processing_main(
        script=__file__, # script path for remote execution
        main=main, # main function for local execution
        outputs={
            # Add the command line flag `output`
            # flag: default path
            'output': 'output/data'
        },
        dependencies={
            # Add a module to SageMaker
            # module name: module path
            'aws_sagemaker_remote': aws_sagemaker_remote
        },
        configuration_command='pip3 install --upgrade sagemaker sagemaker-experiments',
        # Name the job
        base_job_name='demo-mnist-processor'
    )