Capturing CDC Data in Snowflake

Shruti N
Apr 29, 2025By Shruti N

Capturing CDC Data in Snowflake:

In modern data architectures, data changes constantly. Orders are updated, user profiles change, inventory levels fluctuate.

Capturing these changes efficiently is critical to building reliable analytics, reporting, and machine learning models.

This is where Change Data Capture (CDC) comes in.

Traditionally, CDC involved complex setups, third-party tools, or heavy extract-transform-load (ETL) pipelines.

Snowflake simplifies CDC with built-in features that make it easier to track changes without rebuilding entire tables.

What is Change Data Capture (CDC)?

CDC is the process of recording changes β€” inserts, updates, and deletes β€” in a table over time.

Instead of scanning an entire table to find differences:

  • CDC lets us capture only the rows that changed.
  • This minimises compute, reduces costs, and speeds up data pipelines.

In Snowflake, CDC is built around two main objects:

  • Streams: Track changes to a table.
  • Tasks: Automate processing the captured changes.

πŸ‘‰ CDC is like tracking edits in a shared Google Doc instead of re-reading the entire document each time.

How Snowflake Captures CDC Data

1. Streams: Tracking the Changes

A Stream in Snowflake records changes made to a table (inserts, updates, deletes) in real-time.

When we query a stream:

  • It shows us the changes since the last time it was queried.
  • It does not modify the source table.

Think of Streams as "change tracking tables" automatically maintained by Snowflake.

Key Points About Streams:

  • Track inserts, updates, and deletes.
  • Support metadata columns like:
    • METADATA$ACTION (INSERT/DELETE/UPDATE)
    • METADATA$ISUPDATE
    • METADATA$ROW_ID
  • Streams reference the source table’s micro-partitions instead of storing data separately.
     
    2. Tasks: Automating Change Processing

A Task in Snowflake is a scheduled process that automates SQL execution.

When used with Streams:

  • A Task can regularly check if there are new changes.
  • It can move, transform, or load CDC data to another table automatically.

Tasks allow us to:

  • Set schedules (e.g., every 5 minutes).
  • Define dependencies (e.g., only run if the stream has new data).
  • Changes happen in the source table.
  • Stream captures these changes.
  • Task reads from the stream and updates the target table.

Capturing CDC Data in Snowflake

Step 1: Create the Source Table

πŸ‘‰ Creates the main table where order data is stored.

Step 2: Create a Stream on the Table

πŸ‘‰ Enables tracking of inserts, updates, and deletes on the orders table.

Step 3: Query the Stream

πŸ‘‰ Retrieves the change records (CDC data) from the stream including metadata.

 Step 4: Process the Changes into a History Table

πŸ‘‰ Creates a table to store the change history of the orders table.

πŸ‘‰ Captures changes from the stream and logs them into the history table.

Step 5: Automate with a Task

πŸ‘‰ Creates an automated task that loads changes into the history table every hour if there are updates.

Real-World Scenario: E-commerce Platform Orders

Imagine running an e-commerce platform where:

  • Customers frequently place new orders.
  • Some orders are updated (e.g., address change, payment status).
  • Occasionally, orders are cancelled (deleted).

Every time a customer places or modifies an order, you want your Snowflake database to automatically capture and process these changes.

Using CDC:

  • We capture these activities efficiently without reloading the full orders table.
  • Our analytics dashboard updates in near real-time.
  • Compliance teams get an auditable trail of order activities.

CDC Using External Tools

Some popular tools help automate CDC without much manual coding:

  • Fivetran
  • Hevo Data
  • HVR
  • Qlik Replicate

These tools:

  • Connect to your source (like MySQL, Postgres).
  • Detect changes automatically.
  • Push changes into Snowflake tables.

Usually, you just set up connectors, schedule syncs, and you're done!

 
Real-Life Example:

Suppose your application database is PostgreSQL.

Instead of building custom scripts, you use Fivetran:

  • Fivetran captures new orders.
  • Fivetran pushes them into Snowflake every 5 minutes.
  • Your reports stay fresh with minimal work.

Manual CDC Approach (DIY)

If your source doesn’t support built-in CDC, you can still do it manually:

  • Add created_at and updated_at timestamps in your tables.
  • Query only records modified since the last sync.

Step-by-Step

Step 1: Add Timestamp Columns

Step 2: Query for Changed Data

Replace timestamp with the last sync time.

Step 3: Insert or Update into Snowflake

Using simple UPSERT/MERGE logic.

Real-Life Example:

  • You have a legacy MySQL database without CDC.
  • You add updated_at timestamps to all important tables.
  • Now you can pull only changed rows every night and load them into Snowflake.

Best Practices for Capturing CDC Data in Snowflake

  1. Monitor Stream Offsets
    • Streams maintain offsets.
    • Regularly consume the stream to avoid loss of changes.
  2. Handle Deletes Properly
    • Deletions appear in streams.
    • Design how you treat them (e.g., soft-delete, permanent delete).
  3. Optimise Stream Usage
    • Use append-only streams if you only track inserts.
    • Use full streams for inserts, updates, and deletes.
  4. Design for Reprocessing
    • Ensure Tasks are idempotent.
    • Plan for safe retries without duplicates.
  5. Audit and Alert
    • Monitor Task failures.
    • Set up alerts if a Task fails or a Stream is unconsumed.

Common CDC Pitfalls to Avoid

  1. Stream Retention Overruns
    • Problem: Streams expire after retention period.
    • Tip: Query regularly or increase DATA_RETENTION_TIME_IN_DAYS.
  2. Ignoring Deletes in the Stream
    • Problem: Deleted rows may appear active downstream.
    • Tip: Process DELETE actions properly.
  3. Task Failures Without Alerting
    • Problem: Silent Task failures.
    • Tip: Set up monitoring and alerting.
  4. Overcomplicating the CDC Pipeline
    • Problem: Complex transformations during CDC capture.
    • Tip: Keep CDC capture lightweight.
  5. Not Handling Reprocessing Carefully
    • Problem: Duplicate data when reprocessing.
    • Tip: Build idempotent pipelines.

Why CDC in Snowflake Matters

  • Saves Money: Only processes changed data, not full tables.
  • Real-time Data: Your dashboards are always up-to-date.
  • Easy to Manage: Using Streams and Tasks, CDC pipelines are clean and automated.
  • Flexible Options: Whether you want built-in tools or external ones, Snowflake supports both.

Conclusion

Capturing CDC data in Snowflake is simple yet powerful.

Whether you use built-in Streams and Tasks, external tools, or manual methods, you can:

  • Process less data
  • Get faster results
  • Save costs
  • Improve reliability

Building efficient CDC pipelines lets us unlock real-time analytics and smarter business decisions, with minimal hassle