S3 Batch Processing

S3 batch processing is best used when you need to run a massively parallel process across files in S3 and your processing code is small enough to package as a Lambda function.

  • If the process is not parallel across files, use SageMaker processing, which allocates a machine and makes the S3 files available locally for Python processing. See the SageMaker processing documentation.
  • If the process requires more code or a custom image that cannot run in a Lambda, but can be fully parallelized, use SageMaker batch transform to allocate a fleet of containers that process each object in S3. See the SageMaker transform documentation.


There are two steps to writing a process for S3 Batch:

  • Write a Lambda function that performs the parallel processing
  • Write a Python wrapper that configures deployment of the function and any arguments it needs


Write a Lambda function.

  • Create a folder containing your Lambda function:
      • package.json containing dependencies
      • index.js exporting a function named handler
  • See the required input and output format in the AWS S3 Batch documentation
  • You may use import statements and packages; the function is automatically bundled with webpack
// Command-line flags are passed to the Lambda as environment variables
const myCustomFlag = process.env.MY_CUSTOM_FLAG

// Handler receives list of tasks and returns a result for each one
async function handler(event) {
    let results = await Promise.all(event.tasks.map(
        // Each task identifies one S3 object from the manifest
        async ({
            taskId,
            s3Key,
            s3BucketArn
        }) => {
            // The bucket name is the last segment of the bucket ARN
            let bucket = s3BucketArn.split(":").pop()
            let path = `s3://${bucket}/${s3Key}`
            // do something with the file
            return {
                taskId: taskId,
                resultCode: "Succeeded",
                resultString: "Arbitrary result string for report"
            }
        }
    ))
    return {
        invocationSchemaVersion: "1.0",
        treatMissingKeysAs: "PermanentFailure",
        invocationId: event.invocationId,
        results: results
    }
}

// Export the handler
export { handler }
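
For a quick local check of the shapes involved, the request and response can be mirrored in Python. This is only a sketch: the event fields (invocationId, tasks, taskId, s3Key, s3BucketArn) follow the handler above, while `bucket_from_arn`, `handle`, `sample_event`, and all sample values are hypothetical names and data made up for illustration. The deployed Lambda itself remains the JavaScript handler.

```python
import json


def bucket_from_arn(bucket_arn: str) -> str:
    """Return the bucket name, the last ':'-separated segment of the ARN."""
    return bucket_arn.split(":")[-1]


def handle(event: dict) -> dict:
    """Python mirror of the JavaScript handler: one result per task."""
    results = []
    for task in event["tasks"]:
        bucket = bucket_from_arn(task["s3BucketArn"])
        path = f"s3://{bucket}/{task['s3Key']}"
        results.append({
            "taskId": task["taskId"],
            "resultCode": "Succeeded",
            "resultString": f"Processed {path}",
        })
    return {
        "invocationSchemaVersion": "1.0",
        "treatMissingKeysAs": "PermanentFailure",
        "invocationId": event["invocationId"],
        "results": results,
    }


# Hypothetical sample event; the IDs, bucket, and key are made up
sample_event = {
    "invocationSchemaVersion": "1.0",
    "invocationId": "example-invocation-id",
    "tasks": [
        {
            "taskId": "example-task-id",
            "s3Key": "data/file.txt",
            "s3BucketArn": "arn:aws:s3:::my-example-bucket",
        }
    ],
}

response = handle(sample_event)
print(json.dumps(response, indent=2))
```

Each entry in results echoes the taskId it was given, which is how the job report matches a result to the object that produced it.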


Write a Python wrapper referencing the folder containing your Lambda to generate a CLI. Running this wrapper will:

  • Create roles and permissions (if necessary)
  • Build and deploy the Lambda (if necessary)
  • Tag a version of the Lambda with environment arguments specified by the command line
  • Create a batch processing job in S3 using that Lambda tag
  • Optionally confirm the job; otherwise it can be confirmed in the AWS S3 console
  • Optionally save the job ID in a JSON file for later reference
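from aws_sagemaker_remote.batch.main import BatchCommand, BatchConfig
# Assumption: run_command is taken to live in aws_sagemaker_remote.commands; verify against the installed package
from aws_sagemaker_remote.commands import run_command
import os
import argparse

def argparse_callback(parser: argparse.ArgumentParser):
    """Add any custom arguments you require"""
    parser.add_argument(
        '--my-custom-flag', default="Default value", type=str, help='My custom flag'
    )

def env_callback(args):
    """Map custom arguments to Lambda environment variables"""
    return {
        "MY_CUSTOM_FLAG": args.my_custom_flag
    }

def command():
    """Define defaults for your command"""
    # Assumption: the keyword names below are inferred from the CLI defaults
    # (stack name, code directory, description); check them against BatchConfig
    return BatchCommand(
        config=BatchConfig(
            stack_name='my-unique-stack-name',
            code_dir=os.path.abspath(os.path.join(
                __file__, '../lambda'
            )),
            description='Demo batch processing',
            argparse_callback=argparse_callback,
            env_callback=env_callback
        )
    )

def main():
    # Assumption: run_command builds the parser, parses the arguments, and runs the command
    run_command(
        command=command(),
        description="Demo batch processing"
    )

if __name__ == '__main__':
    main()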

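To see how the two callbacks fit together, the sketch below runs them against a plain `argparse` parser, without deploying anything. It assumes the wrapper calls `argparse_callback` before parsing and `env_callback` afterwards, which is inferred from their descriptions rather than confirmed from the library. The parser here is a stand-in, and the argument list passed to `parse_args` is a made-up example.

```python
import argparse


def argparse_callback(parser: argparse.ArgumentParser) -> None:
    """Register the custom flag, as in the wrapper."""
    parser.add_argument(
        "--my-custom-flag", default="Default value", type=str, help="My custom flag"
    )


def env_callback(args: argparse.Namespace) -> dict:
    """Map parsed arguments to Lambda environment variables."""
    return {"MY_CUSTOM_FLAG": args.my_custom_flag}


# Stand-in for the parser that the wrapper builds (assumption)
parser = argparse.ArgumentParser(description="Demo batch processing")
argparse_callback(parser)

# Environment when the flag is omitted, and when it is set explicitly
default_env = env_callback(parser.parse_args([]))
custom_env = env_callback(parser.parse_args(["--my-custom-flag", "hello"]))

print(default_env)
print(custom_env)
```

The Lambda then reads the resulting value as `process.env.MY_CUSTOM_FLAG`.
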
Command-Line Interface

The above code generates the following command-line interface, which can be used to build, deploy, and run the batch job.

usage: demo-batch [-h] [--profile PROFILE] [--output-json OUTPUT_JSON]
                  [--stack-name STACK_NAME] [--code-dir CODE_DIR]
                  [--deploy [DEPLOY]] [--deploy-only [DEPLOY_ONLY]]
                  [--confirmation-required [CONFIRMATION_REQUIRED]]
                  [--development [DEVELOPMENT]] --manifest MANIFEST
                  [--report REPORT] [--description DESCRIPTION]
                  [--timeout TIMEOUT] [--ignore IGNORE] [--memory MEMORY]
                  [--my-custom-flag MY_CUSTOM_FLAG]
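
The bracketed form `[--deploy [DEPLOY]]` in the usage indicates a flag whose value is optional. One common way to get that behavior with `argparse` is `nargs="?"` together with `const`, sketched below. This is an illustration of the pattern, not necessarily how the library implements it; `str2bool` and the set of accepted spellings are hypothetical.

```python
import argparse


def str2bool(value: str) -> bool:
    """Parse common yes/no spellings into a boolean (hypothetical helper)."""
    normalized = value.strip().lower()
    if normalized in ("yes", "true", "t", "y", "1"):
        return True
    if normalized in ("no", "false", "f", "n", "0"):
        return False
    raise argparse.ArgumentTypeError(f"boolean value expected, got {value!r}")


parser = argparse.ArgumentParser(prog="demo-batch")
for flag in ("--deploy", "--deploy-only"):
    # nargs="?" makes the value optional; const is used when the flag is bare
    parser.add_argument(flag, type=str2bool, nargs="?", const=True, default=False)

omitted = parser.parse_args([])
bare = parser.parse_args(["--deploy"])
explicit = parser.parse_args(["--deploy", "yes", "--deploy-only", "yes"])
disabled = parser.parse_args(["--deploy", "no"])

print(omitted.deploy, bare.deploy, explicit.deploy_only, disabled.deploy)
```

Under this pattern, `--deploy` on its own and `--deploy yes` are equivalent, and leaving the flag out falls back to the default.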

Named Arguments


--profile AWS profile name

Default: “default”

--output-json Output job information to JSON file

--stack-name AWS CloudFormation stack name to which resources are deployed

Default: “my-unique-stack-name”

--code-dir Directory of Lambda code

Default: “/home/docs/checkouts/readthedocs.org/user_builds/aws-sagemaker-remote/checkouts/latest/demo/demo_batch/lambda”

--deploy Force Lambda deployment even if function already exists

Default: False

--deploy-only Deploy and exit. Use --deploy yes --deploy-only yes to force deployment and exit

Default: False

--confirmation-required Require confirmation in console to run job

Default: True

--development Webpack in development mode

Default: False

--manifest File manifest to process. Must be a CSV with first column containing an S3 bucket and second column containing an S3 key.

--report S3 path to store report

Default: “aws-sagemaker-remote/batch-reports,sagemaker”

--description Description of batch job

Default: “Demo batch processing”

--timeout Hard timeout of Lambda in seconds

Default: 30

--ignore Number of columns of input CSV to ignore. Job will fail if CSV does not have 2 + ignore columns. For example, if your CSV has bucket, key, and 5 more columns, set ignore to 5.

Default: 0

--memory Memory to allocate

Default: 128

--my-custom-flag My custom flag

Default: “Default value”
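
As an illustration of the `--manifest` and `--ignore` rules, the following sketch writes a manifest with the standard `csv` module and checks that every row has exactly 2 + ignore columns. The helper names `write_manifest` and `check_columns`, the bucket and keys, and the extra label column are hypothetical. The check approximates the documented rule and is not the library's own validation.

```python
import csv
import io


def write_manifest(rows, stream) -> None:
    """Write rows of (bucket, key, *extra columns) as CSV."""
    writer = csv.writer(stream)
    for row in rows:
        writer.writerow(row)


def check_columns(stream, ignore: int = 0) -> int:
    """Return the row count, raising if any row lacks 2 + ignore columns."""
    expected = 2 + ignore
    count = 0
    for line_number, row in enumerate(csv.reader(stream), start=1):
        if len(row) != expected:
            raise ValueError(
                f"row {line_number}: expected {expected} columns, got {len(row)}"
            )
        count += 1
    return count


# Hypothetical objects: bucket, key, and one extra column to be ignored
rows = [
    ("my-example-bucket", "data/a.wav", "label-a"),
    ("my-example-bucket", "data/b.wav", "label-b"),
]

buffer = io.StringIO()
write_manifest(rows, buffer)
buffer.seek(0)
row_count = check_columns(buffer, ignore=1)  # one extra column, so ignore=1
print(row_count)
```

A manifest shaped like this one would be submitted with `--ignore 1`; with the default of 0 the extra column would cause the column check to fail.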