9 reasons you should consider using Step Functions for microservices orchestration.

Siddharth Malani
5 min read · Oct 5, 2020

1. Ease of Integration

Step Functions makes it really easy to wire up Lambdas or other microservices. The biggest advantage is that you are relieved of most of the plumbing work. With a simple declarative JSON document you wire up all of the microservices, instead of manually configuring SQS, SNS and the IAM role changes each Lambda needs to work with those services.

Here is what you would normally need to wire the microservices up. You would create and manage whichever of the resources below you use by hand, which takes a good amount of time.

  • Lambdas
  • SNS
  • SQS
  • Kinesis
  • Kafka
  • IAM permissions

With Step Functions all you need is one clean JSON document. You still have to create your Lambdas, but the declarative JSON that wires them all together is really simple. The messaging infrastructure is taken care of for you, so there is no need to worry about SQS, SNS and so on.

{
  "Comment": "A Hello World example",
  "StartAt": "Pass",
  "States": {
    "Pass": {
      "Comment": "Comments...",
      "Type": "Pass",
      "Next": "Hello World example?"
    },
    "Hello World example?": {
      "Comment": "Comments...",
      "Type": "Choice",
      "Choices": [
        {
          "Variable": "$.IsHelloWorldExample",
          "BooleanEquals": true,
          "Next": "Yes"
        },
        {
          "Variable": "$.IsHelloWorldExample",
          "BooleanEquals": false,
          "Next": "No"
        }
      ],
      "Default": "Yes"
    },
    "Yes": {
      "Type": "Task",
      "Resource": "arn:aws:states:::lambda:invoke",
      "Parameters": {
        "FunctionName": "arn:aws:lambda:ap-southeast-2:123456789:function:MyLambdaFunction",
        "Payload": {
          "Input.$": "$"
        }
      },
      "Next": "Wait 3 sec"
    },
    "No": {
      "Type": "Fail",
      "Cause": "Not Hello World"
    },
    "Wait 3 sec": {
      "Comment": "A Wait state delays the state machine.",
      "Type": "Wait",
      "Seconds": 3,
      "Next": "Parallel State"
    },
    "Parallel State": {
      "Comment": "A Parallel state can be used for parallel flows.",
      "Type": "Parallel",
      "Next": "Hello World",
      "Branches": [
        {
          "StartAt": "Hello",
          "States": {
            "Hello": {
              "Type": "Pass",
              "End": true
            }
          }
        },
        {
          "StartAt": "World",
          "States": {
            "World": {
              "Type": "Pass",
              "End": true
            }
          }
        }
      ]
    },
    "Hello World": {
      "Type": "Pass",
      "End": true
    }
  }
}

The JSON above gives you a nice looking flow chart in the Step Functions console, which makes the workflow very easy to understand.
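Because the whole workflow lives in one JSON document, it is also easy to sanity-check before deploying. The following is only a minimal sketch, not an official validator: it uses a trimmed-down copy of the definition above, and the helper name find_dangling_targets is made up for illustration. It checks only top-level StartAt, Next and Default references and does not descend into Parallel branches.

```python
import json

# A trimmed-down definition in the same shape as the example above.
DEFINITION = json.loads("""
{
  "StartAt": "Hello World example?",
  "States": {
    "Hello World example?": {
      "Type": "Choice",
      "Choices": [
        {"Variable": "$.IsHelloWorldExample", "BooleanEquals": true, "Next": "Yes"},
        {"Variable": "$.IsHelloWorldExample", "BooleanEquals": false, "Next": "No"}
      ],
      "Default": "Yes"
    },
    "Yes": {"Type": "Pass", "End": true},
    "No": {"Type": "Fail", "Cause": "Not Hello World"}
  }
}
""")


def find_dangling_targets(definition):
    """Return the sorted state names that are referenced but never defined.

    Hypothetical helper: looks at StartAt, Next, Default and Choices[].Next
    at the top level only.
    """
    states = definition["States"]
    targets = {definition["StartAt"]}
    for state in states.values():
        for key in ("Next", "Default"):
            if key in state:
                targets.add(state[key])
        for choice in state.get("Choices", []):
            if "Next" in choice:
                targets.add(choice["Next"])
    return sorted(targets - set(states))


# An empty list means every referenced state exists.
print(find_dangling_targets(DEFINITION))
```

A typo in a Next value would show up in the returned list instead of surfacing at deploy time.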

2. Less Code to Develop

Since you do not have to do the plumbing work yourself, you write significantly less IaC (such as Terraform or CloudFormation) to put your workflow together. This has a big positive impact on your project timelines.

As stated above, a single JSON document wires all your Lambdas together.

3. Ease of coding

Passing the event object across microservices is greatly simplified.

For example, your Lambdas can simply pass on the event object with data added, removed or manipulated.

# lambda 1 code
def my_handler(event, context):
    event['message'] = "pass this to next"
    return event

You can pass objects along without handling serialization yourself, as long as everything you pass is JSON serializable.

# lambda 2 code
def my_handler(event, context):
    print(event['message'])
    return event
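To try the hand-off between the two Lambdas without deploying anything, you can imitate it locally. This is a rough sketch under assumptions: the handlers are renamed lambda_1_handler and lambda_2_handler so they can live in one file, run_chain is a made-up helper, and it assumes each Task passes the handler's return value straight on as the next input. With the lambda:invoke integration used in the JSON earlier, the input and result are wrapped, so the paths would differ.

```python
import json


def lambda_1_handler(event, context):
    # Add a field for the next step to read.
    event["message"] = "pass this to next"
    return event


def lambda_2_handler(event, context):
    # Read the field written by the previous step.
    print(event["message"])
    return event


def run_chain(handlers, event):
    """Call each handler in order, passing its output on as JSON.

    The JSON round trip imitates the fact that state data travels between
    steps as JSON, so non-serializable objects fail here too.
    """
    for handler in handlers:
        result = handler(event, None)
        event = json.loads(json.dumps(result))
    return event


final_event = run_chain([lambda_1_handler, lambda_2_handler], {"file_id": "12345"})
```

If a handler returns something that is not JSON serializable, json.dumps raises a TypeError in this sketch, which is a cheap way to catch the problem early.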

If your messages are likely to exceed 256 KB, you can keep the objects themselves in S3 and use the messages only for metadata, such as the S3 location.
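The S3 approach can be sketched roughly as below. Everything here is illustrative: to_state_payload, the "inline" and "s3" keys, the bucket and key names, and the store callable are assumptions, and an in-memory dictionary stands in for S3 so the example runs offline. In a real Lambda the store callable could wrap boto3's put_object.

```python
import json

MAX_PAYLOAD_BYTES = 256 * 1024  # size limit quoted in the text


def to_state_payload(data, store, bucket, key):
    """Return the data inline if it fits, else store it and return a pointer.

    `store` is a hypothetical callable taking (bucket, key, body_bytes).
    """
    body = json.dumps(data).encode("utf-8")
    if len(body) <= MAX_PAYLOAD_BYTES:
        return {"inline": data}
    store(bucket, key, body)
    return {"s3": {"bucket": bucket, "key": key}}


# In-memory stand-in for an S3 upload, used so the sketch runs offline.
fake_s3 = {}


def fake_store(bucket, key, body):
    fake_s3[(bucket, key)] = body


small = to_state_payload({"id": 1}, fake_store, "my-bucket", "small.json")
large = to_state_payload({"blob": "x" * 300_000}, fake_store, "my-bucket", "large.json")
print(small)
print(large)
```

The next Lambda then checks which key is present and, for the pointer case, reads the object back from S3.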

4. Ease of debugging

Debugging is greatly simplified. You can view the inputs and outputs of each step in a flow (an execution), and each execution can be given a custom name. For example, if you were processing file 12345, you could use a naming convention such as 12345-<guid>. The GUID keeps the name unique in case you have to re-process 12345, and the prefix lets you do a wildcard search for that file's executions in the console.

The UI also gives you a full flow indicator with green/red signalling to identify problems. In the example below the execution name is prefixed with a serial. Executions can be started by automation, such as an S3 event or a Lambda, using a naming convention of your choice.

import boto3

client = boto3.client('stepfunctions')

# The execution name carries the file serial as a prefix, followed by a GUID.
response = client.start_execution(
    stateMachineArn='aws:states:.......',
    name='12345-a057ac36-ad21-644a-eb89-a3bac32c77c9',
    input="{\"first_name\" : \"test\"}"
)
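The name and input can also be generated instead of hard-coded. This is a small sketch with made-up helper names (build_execution_name, build_start_execution_kwargs) and a placeholder ARN string; it only builds the arguments and leaves the actual start_execution call commented out so it runs without AWS credentials. The length check reflects the 80-character limit on execution names.

```python
import json
import uuid

MAX_NAME_LENGTH = 80  # execution names are limited to 80 characters


def build_execution_name(file_id):
    """Build '<file_id>-<guid>' so re-processing the same file stays unique."""
    name = f"{file_id}-{uuid.uuid4()}"
    if len(name) > MAX_NAME_LENGTH:
        raise ValueError(f"execution name too long: {len(name)} characters")
    return name


def build_start_execution_kwargs(state_machine_arn, file_id, payload):
    """Assemble keyword arguments for boto3's start_execution call."""
    return {
        "stateMachineArn": state_machine_arn,
        "name": build_execution_name(file_id),
        "input": json.dumps(payload),  # avoids hand-escaping quotes
    }


kwargs = build_start_execution_kwargs(
    "<your-state-machine-arn>",  # placeholder, not a real ARN
    "12345",
    {"first_name": "test"},
)
print(kwargs["name"])
# client.start_execution(**kwargs)  # uncomment with a real client and ARN
```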

5. Error Handling

Error handling is also really easy with Step Functions. More details can be found in the Step Functions documentation, but the main thing I like is seeing, at a high level, the exception stack trace printed next to the step that failed. It really speeds up fixing issues.
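Beyond the stack trace in the console, a Task state can declare its own retries and a fallback using the Retry and Catch fields of the state definition. The sketch below builds such a state as a Python dictionary and prints it as JSON. The helper with_error_handling, the fallback state name HandleFailure, and the retry numbers are illustrative assumptions, not recommendations.

```python
import json


def with_error_handling(task_state, fallback_state):
    """Return a copy of a Task state with Retry and Catch rules added."""
    state = dict(task_state)
    # Retry failed tasks a few times with exponential backoff.
    state["Retry"] = [
        {
            "ErrorEquals": ["States.TaskFailed"],
            "IntervalSeconds": 2,
            "MaxAttempts": 3,
            "BackoffRate": 2.0,
        }
    ]
    # If retries are exhausted, keep the input, attach the error and move on.
    state["Catch"] = [
        {
            "ErrorEquals": ["States.ALL"],
            "ResultPath": "$.error",
            "Next": fallback_state,
        }
    ]
    return state


task = {
    "Type": "Task",
    "Resource": "arn:aws:states:::lambda:invoke",
    "Next": "Wait 3 sec",
}
handled = with_error_handling(task, "HandleFailure")  # hypothetical state name
print(json.dumps(handled, indent=2))
```

The printed state can replace a Task entry in the definition, provided a state with the fallback name exists.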

6. Traceability of workflows

As stated above, each workflow is easy to trace for audit purposes. Instead of following logs across several microservices you get it all in one place which is very easy to track.

7. Speed

Standard workflows can process up to 2,000 executions and 4,000 state transitions per second. There are also Express workflows, which support an execution rate of up to 100,000 per second and nearly unlimited state transitions. For most microservices architectures this is more than adequate.

8. Flexibility

Step Functions offers a very cool way to harness the polymorphic behaviour of Lambdas. I have written another blog post that explains it in detail, as it is too involved to cover here. Please follow this link to learn more.

https://medium.com/@siddharthmalani/invoking-versions-of-lambda-dynamically-with-stepfunctions-fc1fdfcfb33f

9. Encryption

Messages flowing through Step Functions states are encrypted by default, so no additional work is required.

https://docs.aws.amazon.com/step-functions/latest/dg/security-encryption.html

Limits

Standard workflows

  • 2,000 per second execution rate
  • 4,000 per second state transition rate

Express workflows

  • 100,000 per second execution rate
  • Nearly unlimited state transition rate

Message size limits

256 KB, increased from 32 KB in September 2020.

Step Functions Execution History

Execution history is kept for up to 90 days.

Step Functions Transitions per Instance

For very long running instances, such as those with conditional loops, you need to consider whether you will hit the limit of 25,000 entries in a single execution's event history, which effectively caps the number of transitions. It is very unlikely you will hit this limit, but there is a workaround if you do.
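A quick back-of-the-envelope check can tell you whether a loop is at risk. This is only a sketch: max_loop_iterations is a made-up helper, and how much one pass of your loop counts against the limit (10 in the example) is an assumption you would need to measure from a real execution.

```python
LIMIT_PER_EXECUTION = 25_000  # limit quoted in the text


def max_loop_iterations(cost_per_iteration, fixed_cost=0, limit=LIMIT_PER_EXECUTION):
    """Estimate how many loop passes fit within the per-execution limit.

    cost_per_iteration: how much one pass counts against the limit (assumed).
    fixed_cost: what the steps outside the loop use up (assumed).
    """
    if cost_per_iteration <= 0:
        raise ValueError("cost_per_iteration must be positive")
    return max(0, (limit - fixed_cost) // cost_per_iteration)


# Example: each pass costs 10 and the rest of the workflow costs 20.
print(max_loop_iterations(10, fixed_cost=20))  # 2498
```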
