Scalable ELT Pipelines with Snowpipe, Streams, and Tasks

Shruti N
May 30, 2025


Snowflake provides built‑in features to automate ELT (extract, load, transform) pipelines. Snowpipe continuously loads new data files from cloud storage into a Snowflake table. Streams track the “delta” of a table – they capture exactly which rows were inserted, updated, or deleted since the stream was last read.

Tasks run SQL code on a schedule or when triggered by an event. By combining these, we can set up a pipeline that automatically ingests data, detects new rows, and processes them without manual intervention. For example, Snowpipe ingests new files as they arrive, a stream on the raw table records those new rows, and a task can be set to run SQL to transform or move those rows into a final table.

Snowpipe: Continuous Ingestion

Snowpipe is Snowflake’s automated, serverless loading service. When we create a pipe object, we give it a COPY statement that points to a cloud stage (e.g. S3, GCS, or Azure) and a target table. Snowpipe then auto-ingests new files.

We can trigger loads via cloud event notifications or API calls. In practice, Snowpipe loads data from arriving files in small batches, making it available within minutes. For example:
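A minimal pipe definition might look like the sketch below; it assumes the stage and raw_table already exist and that the stage's file format is configured:

    CREATE OR REPLACE PIPE my_pipe
      AUTO_INGEST = TRUE        -- rely on cloud event notifications for new files
    AS
      COPY INTO raw_table
      FROM @my_stage;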

Here my_pipe will watch the @my_stage location and automatically load each new file into raw_table.

Snowpipe handles the compute behind the scenes (it is serverless), so we don’t need to manage any loading servers. This lets the pipeline scale: even if many files arrive, Snowpipe will queue and load them in parallel using Snowflake’s managed resources.

Streams: Capturing Changes

A stream in Snowflake tracks row-level changes in a table. Technically, a stream object “records the delta of change data capture (CDC) information” for its source table. In simpler terms, once we have a table that Snowpipe is loading into (say raw_table), we create a stream on it:
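For example, a standard stream can be created with a single statement (by default it tracks inserts, updates, and deletes on the table):

    CREATE OR REPLACE STREAM raw_stream ON TABLE raw_table;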

This raw_stream will see every row that’s inserted into raw_table. When you query the stream, it returns exactly the new (or changed) rows since the last read. This means our pipeline only processes new data, not the entire table.

Streams make incremental ELT easy: downstream tasks can read from the stream and automatically pick up only the new inserts or updates. (Under the hood, Snowflake remembers how far each stream has been consumed so it can present just the next set of changes.)
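As an illustration, querying the stream directly returns only the pending change rows, along with Snowflake's change-tracking metadata columns:

    -- Shows only rows changed since the stream was last consumed
    SELECT METADATA$ACTION, METADATA$ISUPDATE, t.*
    FROM raw_stream t;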

Tasks: Automated Transformations

A task is a Snowflake object that runs a SQL statement on a schedule or when triggered. We can create a task to perform transformations, upserts, or other business logic. For example, a task might merge data from the raw stream into a dimension table. Tasks can be defined with a CRON schedule, or with a WHEN clause to watch a stream. For instance:
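A sketch of such a task, assuming illustrative columns (id, value), a warehouse named transform_wh, and processed_table as the target:

    CREATE OR REPLACE TASK transform_task
      WAREHOUSE = transform_wh                      -- assumed warehouse name
      SCHEDULE = '1 MINUTE'                         -- check once per minute
      WHEN SYSTEM$STREAM_HAS_DATA('RAW_STREAM')     -- run only if the stream has rows
    AS
      MERGE INTO processed_table t
      USING raw_stream s
        ON t.id = s.id
      WHEN MATCHED THEN UPDATE SET t.value = s.value
      WHEN NOT MATCHED THEN INSERT (id, value) VALUES (s.id, s.value);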

This example task checks raw_stream every minute and only runs the MERGE if raw_stream has new rows. We can also chain tasks (task graphs) so that one task runs after another. Snowflake efficiently skips tasks when there’s no work: “If no data is in the stream, Snowflake skips the run without using compute resources.” In other words, idle tasks don’t waste warehouse time.
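As a sketch of chaining, a hypothetical downstream task can be made to run after transform_task completes by using an AFTER clause instead of a schedule:

    -- aggregate_task and daily_summary are illustrative names
    CREATE OR REPLACE TASK aggregate_task
      WAREHOUSE = transform_wh
      AFTER transform_task
    AS
      INSERT INTO daily_summary (summary_date, row_count)
      SELECT CURRENT_DATE, COUNT(*) FROM processed_table;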

By using tasks, we can automate the transform step of ELT right inside Snowflake, and we can scale by assigning larger or multi-cluster warehouses as needed.

How They Work Together

These three components combine into an automated pipeline:

  • First, put new data files in a cloud stage (for example, an S3 bucket).
  • Snowpipe’s pipe watches that stage and loads each file into a raw table as soon as it appears.
  • The stream on that raw table captures every new row that Snowpipe inserts.
  • A task (or tasks) is set up to run when the stream has data or on a schedule. The task’s SQL reads from the stream and applies whatever transformations or merges are needed into final tables.

At a high level, the flow is: new data in cloud storage goes through Snowpipe into the raw table; the stream watches the raw table; when new rows appear, the task is triggered to process them into the final table.

Pipeline Setup Steps

  • Set up the stage and pipe. Configure an external stage pointing to your storage and create a Snowpipe pipe, as shown in the consolidated script after this list.
    • The pipe wraps the COPY command from @my_stage into raw_table. New files arriving in the stage will auto-ingest into raw_table via Snowpipe.
  • Create the raw table and stream. Make a table to hold the raw data, then create a stream on it (also shown in the script below).
    • The stream will capture all rows inserted into raw_table. (If Snowpipe loads multiple files, the stream accumulates those inserts.)
  • Define tasks for transformations. For each step we need, we create a task, using either a time schedule or a trigger on the stream (see the example task in the script below).
    • The task runs as soon as new data is in raw_stream and inserts/merges it into processed_table. If the stream is empty, Snowflake will simply skip the run. Finally, run ALTER TASK transform_task RESUME; to activate it, since tasks are created in a suspended state.
  • (Optional) Chain tasks and additional streams. We can create multiple streams on the same raw table if we want parallel pipelines or different data paths. Similarly, tasks can be chained using AFTER clauses or by having one task’s output feed another’s input. This allows building complex, multi-stage pipelines entirely in Snowflake.
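The steps above can be sketched end-to-end as follows. This is a minimal example with assumed details: an S3 bucket path, CSV files with id and value columns, and a warehouse named transform_wh. Access credentials (or a storage integration) for the stage and the cloud event-notification setup for auto-ingest are omitted.

    -- 1. External stage over cloud storage (URL is a placeholder; add credentials or a storage integration)
    CREATE OR REPLACE STAGE my_stage
      URL = 's3://my-bucket/data/'
      FILE_FORMAT = (TYPE = 'CSV' SKIP_HEADER = 1);

    -- 2. Raw landing table and final table
    CREATE OR REPLACE TABLE raw_table (id NUMBER, value STRING);
    CREATE OR REPLACE TABLE processed_table (id NUMBER, value STRING);

    -- 3. Pipe that auto-ingests new files from the stage
    CREATE OR REPLACE PIPE my_pipe AUTO_INGEST = TRUE AS
      COPY INTO raw_table
      FROM @my_stage;

    -- 4. Stream that records rows landing in raw_table
    CREATE OR REPLACE STREAM raw_stream ON TABLE raw_table;

    -- 5. Task that merges new rows into processed_table whenever the stream has data
    CREATE OR REPLACE TASK transform_task
      WAREHOUSE = transform_wh
      SCHEDULE = '1 MINUTE'
      WHEN SYSTEM$STREAM_HAS_DATA('RAW_STREAM')
    AS
      MERGE INTO processed_table t
      USING raw_stream s ON t.id = s.id
      WHEN MATCHED THEN UPDATE SET t.value = s.value
      WHEN NOT MATCHED THEN INSERT (id, value) VALUES (s.id, s.value);

    -- 6. Tasks are created suspended; resume to activate
    ALTER TASK transform_task RESUME;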

Best Practices and Tips

  • Use auto-ingest for low latency. Enable AUTO_INGEST=TRUE (and cloud event notifications) so Snowpipe loads files immediately. This minimizes delay from arrival to availability.
  • Keep tasks focused. Make tasks do one clear operation (e.g. loading a table or calling a proc). This keeps warehouses from getting too hot and makes it easy to scale out with multiple tasks.
  • Idempotent loads. Write your task SQL to handle duplicates or re-runs safely (for example, using MERGE). This way if something retries, it won’t corrupt data.
  • Size files appropriately. For Snowpipe, aim for moderate file sizes (tens to hundreds of MB) to balance ingestion speed and overhead. Very tiny files can overwhelm the system with many loads.
  • Use multiple streams if needed. If one table feeds different tasks (e.g. splitting data to two targets), create separate streams so each task sees all rows it needs.
  • Monitor and scale compute. Assign a warehouse size that matches the workload; Snowflake can use multi-cluster warehouses to handle spikes. Monitor task history and warehouse usage to adjust (see the snippets after this list).
  • Handle failures. Use Snowflake’s error handling (like VALIDATION_MODE=RETURN_ERRORS) or task retries. You can also include error logging tables or make tasks call stored procedures with try/catch.
  • Suspend tasks if idle. If a pipeline is paused (e.g. no data for a while), you can suspend the task with ALTER TASK ... SUSPEND to save compute, and resume it when you expect data (also shown below).
  • Use Streams for CDC. If your sources can send deletes or updates, streams will capture those too. Ensure your tasks handle those cases if needed (e.g. delete rows in target).
  • Document your pipeline. Label pipes, streams, and tasks clearly, and comment your SQL. This makes it easier for the team to understand the flow.
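A few of these operational tips can be sketched as SQL; the task and table names follow the examples above, and TASK_HISTORY is Snowflake's information-schema table function:

    -- Validate staged files without loading them (error handling)
    COPY INTO raw_table FROM @my_stage VALIDATION_MODE = RETURN_ERRORS;

    -- Inspect recent task runs (monitoring)
    SELECT name, state, scheduled_time, error_message
    FROM TABLE(INFORMATION_SCHEMA.TASK_HISTORY(TASK_NAME => 'TRANSFORM_TASK'))
    ORDER BY scheduled_time DESC
    LIMIT 20;

    -- Pause a task while no data is expected, then resume it later
    ALTER TASK transform_task SUSPEND;
    ALTER TASK transform_task RESUME;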

By following these guidelines, we can build an automated, elastic ELT pipeline in Snowflake. Snowpipe handles ingestion without extra servers, streams let us process just the new data, and tasks let us run SQL workflows on demand. Together, they enable data engineers to focus on business logic instead of plumbing, while Snowflake scales the actual work under the hood.