Getting started

Caveat: this is a rough draft and we are still working on the documentation.

Now that we have installed tc and understood its features in the abstract, let's walk through a basic tutorial: building an ETL (Enhance-Transform-Load) flow out of serverless entities.

Along the way, we will learn the core concepts in tc.

1. Bootstrap permissions

Let's create some base IAM roles and policies for your sandbox. tc maps environments to AWS profiles. There can be several sandboxes per environment/account. For the sake of this example, let's say we have a profile called dev. This dev profile/account can have several dev sandboxes. Let's name our sandbox john.

tc create -s john -e dev -c base-roles
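To verify the bootstrap, you can check that the base role exists using the standard AWS CLI (tc-base-lambda-role is the default execution role referenced later in this tutorial):

aws iam get-role --role-name tc-base-lambda-role --profile dev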

2. Our first function

A simple function looks like this. Let's call it enhancer. Add a file named handler.py in the directory etl/enhancer.

etl/enhancer/handler.py:

def handler(event, context):
  return {'enhancer': 'abc'}

In the etl directory, we can now create the function by running the following command.

tc create -s <sandbox-name> -e <env>

Example: tc create -s john -e dev

This creates a lambda function named enhancer_john with the base role (tc-base-lambda-role) as the execution role.
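You can also sanity-check the deployed function directly with the standard AWS CLI (on AWS CLI v2 the --cli-binary-format flag is needed for an inline JSON payload):

aws lambda invoke --function-name enhancer_john \
  --payload '{"somedata": 123}' \
  --cli-binary-format raw-in-base64-out \
  --profile dev out.json
cat out.json
=> {"enhancer": "abc"}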

Info

AWS Lambda is the default implementation for the function entity. env here is typically the AWS profile.

3. Namespace your functions

Our etl directory now contains just one function called enhancer. Let's create the transformer and loader functions. Add the following files.

etl/transformer/handler.py

def handler(event, context):
  return {'transformer': 'ABC'}

etl/loader/handler.py

def handler(event, context):
  return {'loader': 'ABC'}

We should have the following directory contents:

etl
|-- enhancer
|   `-- handler.py
|-- loader
|   `-- handler.py
|-- topology.yml
`-- transformer
    `-- handler.py

Now that we have these 3 functions, we may want to refer to them collectively as etl. Let's create a file named topology.yml with the following contents:

name: etl

name is the namespace for this collection of functions. Now, in the etl directory, we can run the following command to create a sandbox:

tc create -s john -e dev

You should see output like the following:

Compiling topology
Resolving topology etl
1 nodes, 3 functions, 0 mutations, 0 events, 0 routes, 0 queues
Building transformer (python3.10/code)
Building enhancer (python3.10/code)
Building loader (python3.10/code)
Creating functor etl@john.dev/0.0.1
Creating function enhancer (211 B)
Creating function transformer (211 B)
Creating function loader (211 B)
Checking state enhancer (ok)
Time elapsed: 5.585 seconds

The resulting lambda functions are named 'namespace_function-name_sandbox' (for example, etl_enhancer_john). If the name is sufficiently long, tc abbreviates it.

We can test these functions independently:

cd enhancer
tc invoke -s john -e dev -p '{"somedata": 123}'
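This should print the handler's response:

=> {"enhancer": "abc"}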

Info

The word service is overloaded; tc encourages the use of functor or topology to refer to a collection of entities.

4. Define the function DAG (flow)

Now that we have these functions working in isolation, we may want to create a DAG of these functions. Let's define that flow:

name: etl

functions:
  enhancer:
    root: true
    function: transformer
  transformer:
    function: loader
  loader:
    end: true

tc dynamically figures out the orchestrator to use. By default, it uses Step Functions (Express) to orchestrate the flow. tc automatically generates an intimidating Step Functions definition, which you can inspect by running tc compile -c flow.
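For reference, the generated definition is plain Amazon States Language. A much-simplified sketch of the shape of a three-Task chain (the actual output of tc compile -c flow will be more elaborate, and the function ARNs below are placeholders):

{
  "StartAt": "enhancer",
  "States": {
    "enhancer": {
      "Type": "Task",
      "Resource": "arn:aws:lambda:<region>:<account>:function:etl_enhancer_john",
      "Next": "transformer"
    },
    "transformer": {
      "Type": "Task",
      "Resource": "arn:aws:lambda:<region>:<account>:function:etl_transformer_john",
      "Next": "loader"
    },
    "loader": {
      "Type": "Task",
      "Resource": "arn:aws:lambda:<region>:<account>:function:etl_loader_john",
      "End": true
    }
  }
}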

Run tc update -s john -e dev to create/update the flow.

5. Add a REST API to invoke ETL

name: etl

routes:
  /api/etl:
    method: POST
    function: enhancer

functions:
  enhancer:
    root: true
    function: transformer
  transformer:
    function: loader
  loader:
    end: true

Run tc update -s john -e dev -c routes to update the routes.
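Once the route is deployed, it can be exercised with curl (the API Gateway hostname and stage below are placeholders; yours will differ):

curl -X POST https://<api-id>.execute-api.<region>.amazonaws.com/<stage>/api/etl -d '{"hello": "world"}'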

6. Notify on completion

name: etl

routes:
  /api/etl:
    method: POST
    function: enhancer

functions:
  enhancer:
    root: true
    function: transformer
  transformer:
    function: loader
  loader:
    event: Notify

events:
  Notify:
    channel: Subscription

channels:
  Subscription:
    function: default

Let's make loader emit an event that pushes the status message to a websocket channel. Run tc update -s john -e dev to create/update the events and channels.

curl https://seuz7un8rc.execute-api.us-west-2.amazonaws.com/test/api/etl -X POST -d '{"hello": "world"}'
=> {"enhancer": "abc"}
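To observe the Notify events, any websocket client can subscribe to the channel. A hypothetical Python client using the websockets package (the wss endpoint is a placeholder; use the endpoint of the websocket API provisioned for the channel):

# hypothetical subscriber; replace the endpoint with your channel's wss URL
import asyncio
import websockets

async def listen():
    async with websockets.connect("wss://<ws-api-id>.execute-api.<region>.amazonaws.com/<stage>") as ws:
        while True:
            print(await ws.recv())

asyncio.run(listen())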

7. Implement the functions

So far, we have created a topology with basic functions, events, routes, and a flow to connect them all. The functions themselves don't do much yet. Real functions have dependencies, different runtimes or languages, platform-specific shared libraries, and so forth. For example, we may want the enhancer to have some dependencies specified in, say, pyproject.toml or requirements.txt. Let's add a file named function.json in the enhancer directory.

enhancer/function.json


{
  "name": "enhancer",
  "description": "Ultimate enhancer",
  "runtime": {
    "lang": "python3.12",
    "package_type": "zip",
    "handler": "handler.handler"
  },
  "build": {
    "kind": "Inline",
    "command": "zip -9 -q lambda.zip *.py"
  }
}

and let's say we have the following dependencies in pyproject.toml:

enhancer/pyproject.toml

[tool.poetry]
name = "enhancer"
version = "0.1.0"
description = ""
authors = ["fu <foo@fubar.com>"]

[tool.poetry.dependencies]
simplejson = "^3.19.2"
botocore = "^1.31.73"
boto3 = "^1.28.73"
pyyaml = "6.0.2"

Now update the function we created by running this from the etl directory:

tc update -s john -e dev -c enhancer

The above command builds the dependencies in a Docker container locally and updates the function code with the dependencies.
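Once updated, the handler can import its bundled dependencies; for example, using the simplejson dependency declared above:

import simplejson as json

def handler(event, context):
    # simplejson is available because it was packaged with the function code
    return {'enhancer': json.dumps(event)}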

Info

The -c argument takes an entity category (events, functions, mutations, routes, etc.) or the name of the entity itself; in this case, the function name.

There are several ways to package the dependencies, depending on the runtime, the size of the dependencies, and so forth. Layers are another option. Let's try building the transformer using layers. Add the following in transformer/function.json:

transformer/function.json


{
  "name": "transformer",
  "description": "Ultimate Transformer",
  "runtime": {
    "lang": "python3.12",
    "package_type": "zip",
    "handler": "handler.handler",
	"layers": ["transformer-deps"]
  }
}

The layers can be built independently of creating/deploying the code, as they don't change that often.


tc build --kind layer --name transformer-deps --publish -e dev
tc update -s john -e dev -c layers

With the above commands, we built the dependencies in a Docker container and updated the function(s) to use the latest version of the layer. See Build for details about building functions.
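You can confirm the published layer versions with the standard AWS CLI:

aws lambda list-layer-versions --layer-name transformer-deps --profile dev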

8. Making it recursive

We can make loader itself another sub-topology with its own DAG of entities and still treat etl as the root topology (or functor). Let's add a topology file in loader.

etl/loader/topology.yml

name: loader
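The tree now contains a nested topology:

etl
|-- enhancer
|   `-- handler.py
|-- loader
|   |-- handler.py
|   `-- topology.yml
|-- topology.yml
`-- transformer
    `-- handler.py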

Now we can recursively create the topologies from the root topology directory:

tc create -s john -e dev --recursive

9. Configuring infrastructure

At times, we require more infrastructure-specific configuration: specific permissions, environment variables, runtime configuration, and so on.

We can specify an infra path in the topology:

name: etl
infra: "../infra/etl"
routes: ..

In the specified infra directory, we can add environment/runtime variables for, say, enhancer:

../infra/etl/vars/enhancer.json

{
  "memory_size": 1024,
  "timeout": 800,
  "environment": {
    "GOOGLE_API_KEY": "ssm:/goo/api-key",
    "KEY": "VALUE"
  },
  "tags": {
    "developer": "john"
  }
}
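Inside the function, these show up as ordinary environment variables; for example (the ssm: reference is presumably resolved at deploy time):

import os

def handler(event, context):
    # GOOGLE_API_KEY is injected via the vars file above
    api_key = os.environ.get('GOOGLE_API_KEY')
    return {'has_key': api_key is not None}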

If we need specific IAM permissions, add a role file:

../infra/etl/roles/enhancer.json

{
    "Statement": [
        {
            "Action": [
                "s3:PutObject",
                "s3:ListBucketVersions",
                "s3:ListBucket"
            ],
            "Effect": "Allow",
            "Resource": [
                "arn:aws:s3:::bucket/*",
                "arn:aws:s3:::bucket"
            ],
            "Sid": "AllowAccessToS3Bucket1"
        }
    ],
    "Version": "2012-10-17"
}
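With this policy attached, the function can write to the bucket; a minimal sketch (the bucket name is the placeholder from the policy above):

import boto3

s3 = boto3.client('s3')

def handler(event, context):
    # permitted by the s3:PutObject statement in the role above
    s3.put_object(Bucket='bucket', Key='etl/output.json', Body=b'{}')
    return {'status': 'written'}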

We may also need additional configuration that is specific to the provider (AWS etc.). Add a key called config with the value as the path to the file.

name: etl
infra: "../infra/etl"
config: "../tc.yaml"
routes: ..

See Config for details.