S3 Batch Processing¶
S3 batch processing is best suited to massively parallel workloads across files in S3 where the processing code is small enough to fit in a Lambda.
- If the process is not parallel across files, use SageMaker Processing, which allocates a machine and makes S3 files available locally for Python processing. See the SageMaker Processing documentation.
- If the process requires more code or a custom image than a Lambda allows, but can still be fully parallelized, use SageMaker batch transform to allocate a fleet of containers that process each object in S3. See the SageMaker transform documentation.
Usage¶
There are two steps to writing a process for S3 Batch:
- Write a Lambda function that performs the parallel processing
- Write a Python wrapper that configures deployment of the function and any arguments
Lambda¶
Write a Lambda function.
- Create a folder containing your Lambda function:
  - `package.json` containing dependencies
  - `index.js` exporting a function named `handler`
- See the required input and output format in the AWS S3 Batch documentation
- You may use import statements and packages; the function will be automatically webpacked
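Put together, the Lambda folder might look like this (a minimal sketch; the file names are required, the contents are yours):

```
lambda/
├── package.json   # dependencies for the handler
└── index.js       # exports the `handler` function
```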
```javascript
// Command-line flags are passed to the Lambda as environment variables
const myCustomFlag = process.env.MY_CUSTOM_FLAG

// Handler receives a list of tasks and returns a result for each one
async function handler(event) {
  let results = await Promise.all(event.tasks.map(
    async ({
      taskId,
      s3Key,
      s3BucketArn
    }) => {
      let bucket = s3BucketArn.split(":").pop()
      let path = `s3://${bucket}/${s3Key}`
      // do something with the file
      return {
        taskId: taskId,
        resultCode: "Succeeded",
        resultString: "Arbitrary result string for report"
      }
    }
  ))
  return {
    invocationSchemaVersion: "1.0",
    treatMissingKeysAs: "PermanentFailure",
    invocationId: event.invocationId,
    results: results
  }
}

// Export the handler
export { handler }
```
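For reference, the request/response contract the handler above implements can be sketched in Python. This is a minimal simulation of the event shape, not the AWS runtime; field names are taken from the handler code, and the bucket and key values are hypothetical:

```python
# Simulate the S3 Batch event/response contract implemented by the handler above.
def handle_event(event):
    results = []
    for task in event["tasks"]:
        # The bucket name is the final component of the bucket ARN
        bucket = task["s3BucketArn"].split(":")[-1]
        path = f"s3://{bucket}/{task['s3Key']}"
        # do something with the file, then report a per-task status
        results.append({
            "taskId": task["taskId"],
            "resultCode": "Succeeded",
            "resultString": f"Processed {path}",
        })
    return {
        "invocationSchemaVersion": "1.0",
        "treatMissingKeysAs": "PermanentFailure",
        "invocationId": event["invocationId"],
        "results": results,
    }

# Example event with a single task (hypothetical bucket and key)
event = {
    "invocationSchemaVersion": "1.0",
    "invocationId": "example-invocation",
    "tasks": [{
        "taskId": "task-1",
        "s3Key": "data/file.txt",
        "s3BucketArn": "arn:aws:s3:::my-bucket",
    }],
}
response = handle_event(event)
```

Note that each task gets its own `resultCode`, so one failing object does not fail the whole invocation.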
Python¶
Write a Python wrapper referencing the folder containing your Lambda to generate a CLI. Running this wrapper will:
- Create roles and permissions (if necessary)
- Build and deploy the Lambda (if necessary)
- Tag a version of the Lambda with environment arguments specified by the command line
- Create a batch processing job in S3 using that Lambda tag
- Optionally confirm the job, otherwise it can be confirmed in the AWS S3 console
- Optionally save Job ID in a JSON file for later reference
```python
from aws_sagemaker_remote.batch.main import BatchCommand, BatchConfig
import os
import argparse


def argparse_callback(parser: argparse.ArgumentParser):
    """
    Add any custom arguments you require
    """
    parser.add_argument(
        '--my-custom-flag', default="Default value", type=str,
        help='My custom flag'
    )


def env_callback(args):
    """
    Map custom arguments to Lambda environment variables
    """
    return {
        "MY_CUSTOM_FLAG": args.my_custom_flag
    }


def command():
    """
    Define defaults for your command
    """
    return BatchCommand(
        config=BatchConfig(
            stack_name='my-unique-stack-name',
            code_dir=os.path.abspath(os.path.join(
                __file__, '../lambda'
            )),
            description='Demo batch processing',
            argparse_callback=argparse_callback,
            env_callback=env_callback,
            webpack=True
        )
    )


def main():
    command().run_command(
        description="Demo batch processing"
    )


if __name__ == '__main__':
    main()
```
Command-Line Interface¶
The above code generates the following command-line interface, which can be used to build, deploy, and run the batch job.
```
usage: demo-batch [-h] [--profile PROFILE] [--output-json OUTPUT_JSON]
                  [--stack-name STACK_NAME] [--code-dir CODE_DIR]
                  [--deploy [DEPLOY]] [--deploy-only [DEPLOY_ONLY]]
                  [--confirmation-required [CONFIRMATION_REQUIRED]]
                  [--development [DEVELOPMENT]] --manifest MANIFEST
                  [--report REPORT] [--description DESCRIPTION]
                  [--timeout TIMEOUT] [--ignore IGNORE] [--memory MEMORY]
                  [--my-custom-flag MY_CUSTOM_FLAG]
```
Named Arguments¶

| Argument | Description | Default |
|---|---|---|
| `--profile` | AWS profile name | "default" |
| `--output-json` | Output job information to JSON file | |
| `--stack-name` | AWS CloudFormation stack name to which resources are deployed | "my-unique-stack-name" |
| `--code-dir` | Directory of Lambda code | "/home/docs/checkouts/readthedocs.org/user_builds/aws-sagemaker-remote/checkouts/latest/demo/demo_batch/lambda" |
| `--deploy` | Force Lambda deployment even if function already exists | False |
| `--deploy-only` | Deploy and exit. Use `--deploy yes --deploy-only yes` to force deployment and exit | False |
| `--confirmation-required` | Require confirmation in console to run job | True |
| `--development` | Webpack in development mode | False |
| `--manifest` | File manifest to process (required). Must be a CSV with the first column containing an S3 bucket and the second column containing an S3 key | |
| `--report` | S3 path to store report | "aws-sagemaker-remote/batch-reports,sagemaker" |
| `--description` | Description of batch job | "Demo batch processing" |
| `--timeout` | Hard timeout of Lambda in seconds | 30 |
| `--ignore` | Number of columns of input CSV to ignore. Job will fail if CSV does not have 2+ignore columns. For example, if your CSV has bucket, key, and 5 more columns, set ignore to 5 | 0 |
| `--memory` | Memory to allocate | 128 |
| `--my-custom-flag` | My custom flag | "Default value" |
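A manifest in the format `--manifest` expects can be generated with Python's standard `csv` module. A minimal sketch, with hypothetical bucket and key names; if you append extra columns beyond bucket and key, pass a matching `--ignore` count:

```python
import csv

# Write a manifest: first column is the S3 bucket, second is the S3 key.
rows = [
    ("my-bucket", "data/file1.txt"),
    ("my-bucket", "data/file2.txt"),
]
with open("manifest.csv", "w", newline="") as f:
    csv.writer(f).writerows(rows)

# Read it back to check the shape the batch job expects:
# every row must have at least 2 + ignore columns.
with open("manifest.csv", newline="") as f:
    manifest = list(csv.reader(f))
```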