This guide explains how to configure Databricks tables that use IDENTITY primary keys (auto-generated values) so that INSERT operations can be enabled in Streamline using a Correlation ID column. It does not apply to tables without IDENTITY primary keys.
Key Points
⚠️ INSERT operations are disabled by default for all tables
⚠️ Adding a Correlation ID column only enables INSERT operations for tables with IDENTITY primary keys
⚠️ Adding a Correlation ID column does not enable INSERTs or provide any benefit for tables without IDENTITY primary keys
✅ UPDATE operations work without any special configuration (no Correlation ID needed)
🔧 This guide shows how to enable INSERTs by adding a Correlation ID column
⏱️ Setup takes 5-10 minutes per table
📊 Minimal impact: Only adds one STRING column to your table
💡 Quick Check: If you only need to UPDATE existing records (not create new ones), you can stop reading now—you don't need a Correlation ID!
Understanding the Problem
Databricks IDENTITY columns have a limitation that affects data integration platforms like Streamline. When inserting new records into tables with IDENTITY primary keys, Databricks does not provide a reliable way to retrieve the generated ID value.
Why This Matters
When Streamline inserts data into your Databricks tables, it needs to know the primary key values of inserted records to:
Build relationships between parent and child tables
Track changes in the dataset
Maintain data integrity across related tables
Example Scenario
-- You have two related tables
CREATE TABLE accounts (
account_id BIGINT GENERATED BY DEFAULT AS IDENTITY PRIMARY KEY,
account_name STRING,
streamline_corr_id STRING
) TBLPROPERTIES('delta.feature.allowColumnDefaults' = 'supported');
CREATE TABLE contacts (
contact_id BIGINT GENERATED BY DEFAULT AS IDENTITY PRIMARY KEY,
account_id BIGINT, -- Foreign key to accounts
contact_name STRING,
streamline_corr_id STRING,
FOREIGN KEY (account_id) REFERENCES accounts(account_id)
) TBLPROPERTIES('delta.feature.allowColumnDefaults' = 'supported');
-- Streamline needs to:
-- 1. Insert into accounts → Get the generated account_id (e.g., 12345)
-- 2. Insert into contacts using that account_id → Link the contact to account 12345
The Technical Limitation
Databricks IDENTITY columns have an important limitation for data integration:
- There is no standard way to retrieve the auto-generated IDENTITY value immediately after an INSERT through JDBC/SQL.
- Functions and patterns commonly used in other databases (such as INSERT … RETURNING, “last inserted ID” helpers, or getGeneratedKeys() over JDBC) are not supported or do not behave reliably for Databricks IDENTITY.
- Databricks does not allow uuid() to be used in DEFAULT expressions, so you cannot define a DEFAULT UUID primary key column.
Because of this, Streamline cannot reliably obtain the generated primary key value after inserting into an IDENTITY column. The Correlation ID pattern is the workaround: Streamline inserts a known value into a separate column and then queries back using that value to retrieve the new primary key.
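For illustration, here is a minimal sketch of the pattern that is not available (the table name is hypothetical): Databricks rejects uuid() in a DEFAULT expression, so a self-populating UUID primary key column cannot be defined.
-- ❌ Illustrative only: rejected because uuid() is not allowed in DEFAULT expressions
CREATE TABLE accounts_uuid_attempt (
  account_id STRING DEFAULT uuid(),
  account_name STRING
) TBLPROPERTIES('delta.feature.allowColumnDefaults' = 'supported');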
Reference: Databricks IDENTITY Columns Documentation
What is a Correlation ID?
A Correlation ID is a special column you add to your Databricks table that Streamline uses to retrieve the auto-generated primary key after an INSERT operation.
ℹ️ Note: Correlation ID is ONLY needed for INSERT operations. If you only plan to UPDATE existing records, you do not need a Correlation ID column.
How It Works
When Streamline inserts a record, it:
1. Generates a unique UUID (e.g., a7f3c8e1-4b2d-4c9a-8f1e-9d6b5a2c3f4e)
2. Inserts the record with this UUID in your Correlation ID column
3. Queries the table using the UUID to find the newly inserted record
4. Retrieves the auto-generated primary key from the query result
5. Uses that primary key for relationships and tracking (see the SQL sketch below)
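Conceptually, the sequence looks like the following SQL sketch against the accounts table from the earlier example (the UUID and the account name are illustrative values; Streamline runs the equivalent statements for you):
-- 1. Streamline generates a UUID, e.g. 'a7f3c8e1-4b2d-4c9a-8f1e-9d6b5a2c3f4e'
-- 2. It inserts the record with that UUID in the Correlation ID column
INSERT INTO accounts (account_name, streamline_corr_id)
VALUES ('Acme Corp', 'a7f3c8e1-4b2d-4c9a-8f1e-9d6b5a2c3f4e');
-- 3. It queries back by Correlation ID to find the newly inserted row
-- 4. The generated account_id in the result is the primary key it keeps for tracking
SELECT account_id
FROM accounts
WHERE streamline_corr_id = 'a7f3c8e1-4b2d-4c9a-8f1e-9d6b5a2c3f4e';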
Analogy
Think of the Correlation ID as a "tracking number" that helps Streamline find the exact record it just inserted, similar to how you track a package with a tracking number.
Important Implementation Details for Databricks
IDENTITY Column Types
Databricks supports two types of IDENTITY columns:
1. GENERATED ALWAYS AS IDENTITY
- Database always generates the value
- Cannot provide explicit values
- Use case: Strict auto-increment only
2. GENERATED BY DEFAULT AS IDENTITY
- Database generates value if not provided
- Can provide explicit values when needed
- Use case: Flexible - supports both auto-increment and explicit values
ℹ️ Streamline does not require one option over the other.
Use GENERATED ALWAYS when you want the database to strictly control IDs, or GENERATED BY DEFAULT when you need flexibility (e.g., backfills, migrations, or tests that provide explicit IDs).
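As a quick sketch (table names are illustrative), the two variants are declared like this:
-- Strict: the database always generates the ID; explicit values are rejected
CREATE TABLE accounts_strict (
  account_id BIGINT GENERATED ALWAYS AS IDENTITY PRIMARY KEY,
  account_name STRING,
  streamline_corr_id STRING
);
-- Flexible: the database generates the ID only when you do not supply one
CREATE TABLE accounts_flexible (
  account_id BIGINT GENERATED BY DEFAULT AS IDENTITY PRIMARY KEY,
  account_name STRING,
  streamline_corr_id STRING
);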
When Do You Need This?
You NEED a Correlation ID when:
Your table meets BOTH of these conditions:
1. Primary key uses IDENTITY (auto-generated values):
CREATE TABLE my_table (
  id BIGINT GENERATED BY DEFAULT AS IDENTITY PRIMARY KEY,
  name STRING,
  streamline_corr_id STRING
) TBLPROPERTIES('delta.feature.allowColumnDefaults' = 'supported');
2. You want to perform INSERT operations (create new records) from Streamline
How to Know
In Streamline's Deliver data interface, if INSERT operations are disabled for your table, it means the table has an IDENTITY primary key without a Correlation ID column. Follow the setup steps in this guide to enable INSERT operations.
Common Use Cases
Creating new records in tables with IDENTITY primary keys
Parent-child table relationships (e.g., Accounts → Contacts)
Multi-table datasets requiring inserts
Tables where you don't control the primary key generation
Step-by-Step Setup Guide
Prerequisites
Access to your Databricks workspace with ALTER TABLE privileges
Knowledge of which tables will receive data from Streamline
Basic understanding of SQL
Step 1: Add Correlation ID Column to Your Databricks Table
For each table with IDENTITY primary keys that will receive data from Streamline, add a Correlation ID column.
1.1 Choose a Column Name
Pick a descriptive name for your Correlation ID column. We recommend:
streamline_corr_id (recommended). Other names (such as integration_corr_id or sync_corr_id) are supported, but this guide and all examples assume streamline_corr_id.
1.2 Add the Column
-- Connect to your Databricks workspace
USE CATALOG your_catalog_name;
USE SCHEMA your_schema_name;
-- Add the Correlation ID column
-- Note: STRING is used because Databricks stores UUIDs as strings
-- UUIDs are 36 characters long (e.g., 'a7f3c8e1-4b2d-4c9a-8f1e-9d6b5a2c3f4e')
ALTER TABLE your_table_name
ADD COLUMN streamline_corr_id STRING;
Example:
-- For an accounts table
USE CATALOG sales_catalog;
USE SCHEMA public;
ALTER TABLE accounts ADD COLUMN streamline_corr_id STRING;
-- For a contacts table (if it also has IDENTITY)
ALTER TABLE contacts ADD COLUMN streamline_corr_id STRING;
1.3 Create an Index (⚠️ CRITICAL for Performance)
Since Streamline queries this column after every insert, adding an index significantly improves performance:
-- Create a Bloom Filter index on the Correlation ID column
CREATE BLOOMFILTER INDEX
ON TABLE your_table_name
FOR COLUMNS(streamline_corr_id);
Example:
CREATE BLOOMFILTER INDEX
ON TABLE accounts
FOR COLUMNS(streamline_corr_id);

CREATE BLOOMFILTER INDEX
ON TABLE contacts
FOR COLUMNS(streamline_corr_id);
⚠️ Performance Impact: Without an index, queries can be 10-100x slower on large tables!
Note: Bloom Filter indexes are optimized for equality lookups (e.g., WHERE streamline_corr_id = 'some-uuid').
For additional performance optimization on large tables, you can also use Z-Ordering:
OPTIMIZE your_table_name ZORDER BY (streamline_corr_id);
Use both together for the best performance characteristics: Bloom Filter for fast equality predicate pruning, and Z-Ordering to cluster data by the correlation ID.
1.4 Verify the Column Was Added
-- Check the table structure
DESCRIBE TABLE your_table_name;
-- You should see your new column listed
Step 2: Configure Databricks Integration in Streamline
Now that your Databricks tables are prepared, configure the integration in Streamline.
2.1 Navigate to Integrations
Log in to Streamline
Go to Integrations > Data Sources
Click Add New Integration
Select Databricks
2.2 Enter Connection Details
Provide your Databricks connection information:
| Field | Description | Example |
|---|---|---|
| Host | Your Databricks workspace URL | myworkspace.cloud.databricks.com |
| Client ID | OAuth 2.0 Client ID for authentication | your-client-id |
| Client Secret | OAuth 2.0 Client Secret for authentication | your-client-secret |
| Warehouse | SQL Warehouse HTTP Path ID | abc123def456 |
| Correlation ID Field Name | The column name you added in Step 1 | streamline_corr_id |
⚠️ Important: The Correlation ID Field Name must exactly match the column name you created in Step 1, including case.
📝 Note on Databricks Connection Details
Host: Your Databricks workspace URL typically follows one of these patterns:
{workspace-name}.cloud.databricks.com (AWS)
{workspace-name}.azuredatabricks.net (Azure)
{workspace-name}.gcp.databricks.com (GCP)
Client ID & Client Secret: These are OAuth 2.0 credentials for service principal authentication. You can create these in your Databricks workspace under Settings → Identity and access → Service principals.
Warehouse: This is the HTTP Path ID of your SQL Warehouse, not the warehouse name. You can find this in your Databricks workspace under SQL Warehouses → Select your warehouse → Connection details →
HTTP Path (it looks like /sql/1.0/warehouses/abc123def456). Use only the ID portion (e.g., abc123def456).
2.3 Save and Test Connection
Click Test Connection to verify connectivity
If successful, click Save Integration
Step 3: Update Existing Integration (If Applicable)
⚠️ For Customers with Existing Databricks Integrations
If you already have a Databricks data source/integration configured in Streamline and want to add Correlation ID support:
3.1 Add the Correlation ID Column
First, complete Step 1 to add the Correlation ID column to your Databricks table(s).
3.2 Update Your Integration in Streamline
You need to update your existing Databricks integration to include the Correlation ID field name.
🚧 IMPORTANT WARNING: Editing existing data sources is not currently available in Streamline. Until this feature is implemented, you will need to recreate your data source with the Correlation ID configuration.
To recreate your data source:
1. Document your current settings:
Note all connection details (Host, Client ID, Client Secret, Warehouse, etc.)
Note which tables/entities are being used
Note any workflows or delivery steps using this integration
2. Remove the old integration:
Go to Integrations > Data Sources
Find your Databricks integration
Delete/remove it
3. Create a new integration:
Follow Step 2 above
Use the same connection details as before
Add the Correlation ID Field Name this time
4. Reconfigure workflows:
Update any workflows that were using the old integration
Point them to the new integration
5. Test thoroughly:
Test that data is flowing correctly
Verify INSERT operations are now enabled
💡 Future Update: The ability to edit existing data sources will be added in a future Streamline release, eliminating the need to recreate integrations.
Step 4: Use Deliver Data in Streamline
Once configured, deliver data works automatically!
4.1 Select Your Dataset
Navigate to your Workflow
Select the Entity (Dataset) that corresponds to your Databricks table
Add a Deliver data step
4.2 Automatic Handling
When Streamline performs an insert operation:
✅ Automatically generates a unique UUID for the Correlation ID
✅ Inserts the record with the UUID in your Correlation ID column
✅ Retrieves the auto-generated primary key using the UUID
✅ Handles relationships between parent and child tables seamlessly
You don't need to do anything else! The system handles everything behind the scenes.
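If you want to spot-check a test delivery, a query like the following (assuming the accounts table and streamline_corr_id column from the earlier examples) lists the rows Streamline inserted, each with its generated primary key and its Correlation ID:
-- Rows inserted by Streamline have a populated Correlation ID;
-- pre-existing rows keep NULL in this column
SELECT account_id, account_name, streamline_corr_id
FROM accounts
WHERE streamline_corr_id IS NOT NULL
ORDER BY account_id DESC
LIMIT 10;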
Scenarios That Don't Require Correlation ID
✅ You DON'T need a Correlation ID when:
Your table uses one of these alternative primary key strategies:
Scenario 1: Explicit Primary Key Values
If your application provides explicit primary key values (not auto-generated):
CREATE TABLE accounts (
account_id BIGINT PRIMARY KEY, -- No IDENTITY - values provided by application
account_name STRING
);
Why it works: Streamline knows the PK value before inserting because you provide it.
Scenario 2: Update-Only Operations
If you're only updating existing records (not inserting new ones), you don't need a Correlation ID regardless of your primary key type.
Example:
-- Update operations work fine without Correlation ID
UPDATE accounts SET account_name = 'New Name' WHERE account_id = 12345;
In Streamline: If your workflow only updates existing records and never creates new ones, you can skip the Correlation ID setup entirely.
Scenario 3: Read-Only Operations
If you're only reading data from Databricks (no inserts/updates), you don't need a Correlation ID regardless of your primary key type.
Best Practices
1. Consistent Naming
Use the same Correlation ID column name across all your Databricks tables for consistency:
-- ✅ Good: Same name everywhere
ALTER TABLE accounts ADD COLUMN streamline_corr_id STRING;
ALTER TABLE contacts ADD COLUMN streamline_corr_id STRING;
ALTER TABLE orders ADD COLUMN streamline_corr_id STRING;
2. Always Add an Index
Performance degrades significantly without an index, especially for large tables:
-- ⚠️ CRITICAL for performance
CREATE BLOOMFILTER INDEX
ON TABLE accounts
FOR COLUMNS(streamline_corr_id);
Performance Impact (for Streamline correlation ID lookups only):
With an index on streamline_corr_id: ~10–50 ms per lookup
Without an index: ~1–10 seconds per lookup (up to 100x slower for these lookups)
This indexing recommendation affects only the internal queries Streamline runs against the streamline_corr_id column (for example, WHERE streamline_corr_id = 'some-uuid').
It does not change the performance characteristics of your other application queries that do not filter on this column.
Bloom Filter indexes are specifically optimized for equality lookups on columns like streamline_corr_id.
For even better performance on very large tables, combine this with Z-Ordering:
OPTIMIZE accounts ZORDER BY (streamline_corr_id);
3. Don't Populate Existing Records
The Correlation ID should remain NULL for existing records. Only Streamline populates this field during insert operations.
-- ❌ DON'T DO THIS
UPDATE accounts SET streamline_corr_id = uuid();
-- ✅ DO THIS (leave existing records as NULL)
-- Only Streamline populates this column during INSERT
4. Document Your Tables
Add a comment to document the purpose of the Correlation ID column:
COMMENT ON COLUMN accounts.streamline_corr_id IS
'Used by Streamline for data integration. Managed automatically. Do not modify.';
5. Consider Future Tables
⚠️ Important for Databricks:
When creating new tables, we recommend using IDENTITY columns with the correlation ID pattern:
-- ✅ RECOMMENDED: IDENTITY with correlation ID
CREATE TABLE accounts (
account_id BIGINT GENERATED BY DEFAULT AS IDENTITY PRIMARY KEY,
account_name STRING,
created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP(),
streamline_corr_id STRING
) TBLPROPERTIES('delta.feature.allowColumnDefaults' = 'supported');
-- Create Bloom Filter index on correlation ID
CREATE BLOOMFILTER INDEX
ON TABLE accounts
FOR COLUMNS(streamline_corr_id);
Reference: CREATE TABLE
Frequently Asked Questions
Q: I already have a Databricks integration in Streamline. What do I do?
A: You'll need to add the Correlation ID column to your Databricks table, then recreate your integration in Streamline with the Correlation ID field configured. See Step 3: Update Existing Integration for detailed instructions.
⚠️ Note: Editing existing integrations is not currently available. You must recreate the integration until this feature is implemented.
Q: Does the Correlation ID affect my existing data?
A: No. Existing records will have NULL in the Correlation ID column. Only new records inserted by Streamline will have values populated.
Q: Can I use the Correlation ID for my own purposes?
A: No. The Correlation ID column is exclusively managed by Streamline. Do not read from, write to, or depend on this column in your own applications.
Q: What if I have multiple Streamline integrations to the same table?
A: Use the same Correlation ID column for all integrations. Each insert operation generates a unique UUID, so there's no conflict.
Q: Can I delete or modify Correlation ID values?
A: Once inserted, you should not modify Correlation ID values. However, you can safely delete old Correlation ID values if needed (set to NULL) for records that were successfully inserted and no longer need tracking.
Q: Why can't I insert records into my Databricks table in Streamline?
A: If INSERT operations are disabled in Streamline's Deliver data interface, your table has an IDENTITY primary key without a Correlation ID column configured. By default, tables with IDENTITY primary keys are not "insertable" in Streamline until you complete the Correlation ID setup. Follow the steps in this guide to enable INSERT operations.
Q: I only need to UPDATE records, do I still need a Correlation ID?
A: No! Correlation ID is only required for INSERT operations. UPDATE operations work perfectly fine without a Correlation ID, regardless of whether your table has an IDENTITY primary key.
Troubleshooting
INSERT operations are still disabled after adding Correlation ID
Possible causes:
1. Correlation ID field name mismatch
- Verify the column name in Databricks matches exactly (case-sensitive)
- Check your Streamline integration configuration
2. Data catalog not refreshed
- In Streamline, rescan your Databricks integration (data source)
- Wait 1–2 minutes for the rescan to complete
INSERT operations are slow
Possible causes:
1. Missing Bloom Filter index on correlation ID column
Create the index:
CREATE BLOOMFILTER INDEX
ON TABLE your_table
FOR COLUMNS(streamline_corr_id);
This is the most common cause of slow lookups after INSERT.
2. Large table without statistics
Run:
ANALYZE TABLE your_table COMPUTE STATISTICS;
3. Warehouse undersized
Consider scaling up your SQL Warehouse
Check warehouse query history for performance metrics
Error: "Column 'streamline_corr_id' not found"
Solution:
Verify the column exists:
DESCRIBE TABLE your_table;
If missing, add it:
ALTER TABLE your_table ADD COLUMN streamline_corr_id STRING;
Create a Bloom Filter index:
CREATE BLOOMFILTER INDEX
ON TABLE your_table
FOR COLUMNS(streamline_corr_id);
Error: "AUTOINCREMENT is not a valid keyword"
Solution:
Databricks does NOT support AUTOINCREMENT.
Use GENERATED BY DEFAULT AS IDENTITY instead:
-- ❌ WRONG
CREATE TABLE accounts (
account_id BIGINT PRIMARY KEY AUTOINCREMENT,
account_name STRING
);
-- ✅ CORRECT
CREATE TABLE accounts (
account_id BIGINT GENERATED BY DEFAULT AS IDENTITY PRIMARY KEY,
account_name STRING,
streamline_corr_id STRING
) TBLPROPERTIES('delta.feature.allowColumnDefaults' = 'supported');
Support
If you encounter issues not covered in this guide:
- Check Insert Availability: Verify if INSERT operations are enabled for your table in Streamline's Deliver data interface
- Check Streamline Logs: Look for error messages related to your Databricks integration
- Verify Configuration: Ensure the Correlation ID column name matches exactly in both Databricks and Streamline
- Refresh Data Catalog: After adding the Correlation ID column, refresh your data catalog in Streamline
- Test Simple Insert: Try inserting a single record to isolate the problem
- Contact Support: Reach out to Streamline support with:
- Your Databricks table structure (DESCRIBE TABLE)
- Streamline integration configuration
- Error messages or logs
Quick Reference Card
✅ Checklist for Setting Up Correlation ID
For New Integrations:
- Identify tables with IDENTITY primary keys
- Add Correlation ID column to each table (STRING, e.g., streamline_corr_id)
- Create Bloom Filter index on Correlation ID column (CRITICAL!)
- Verify TBLPROPERTIES includes delta.feature.allowColumnDefaults
- Configure Databricks integration in Streamline
- Enter Correlation ID field name in integration settings
- Test connection
- Perform test insert operation
- Verify data is flowing correctly
For Existing Integrations:
- Add Correlation ID column to your Databricks table(s)
- Create Bloom Filter index on Correlation ID column
- Document current integration settings
- Remove old integration from Streamline
- Create new integration with Correlation ID field name
- Reconfigure workflows to use new integration
- Test INSERT operations
📋 Quick SQL Script Template
-- ============================================
-- Databricks Correlation ID Setup Script
-- ============================================
-- Replace placeholders with your actual values
USE CATALOG <your_catalog>;
USE SCHEMA <your_schema>;
-- Step 1: Add Correlation ID column
ALTER TABLE <your_table>
ADD COLUMN streamline_corr_id STRING;
-- Step 2: Add index for performance (CRITICAL!)
CREATE BLOOMFILTER INDEX
ON TABLE <your_table>
FOR COLUMNS(streamline_corr_id);
-- Step 3: Add documentation comment
COMMENT ON COLUMN <your_table>.streamline_corr_id IS
'Used by Streamline for data integration. Managed automatically. Do not modify.';
-- Step 4: Verify the column was added
DESCRIBE TABLE <your_table>;
-- Step 5: Verify TBLPROPERTIES (should include allowColumnDefaults)
SHOW CREATE TABLE <your_table>;
-- ============================================
-- Example: Creating a NEW table with IDENTITY
-- ============================================
CREATE TABLE <your_table> (
id BIGINT GENERATED BY DEFAULT AS IDENTITY PRIMARY KEY,
name STRING,
created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP(),
streamline_corr_id STRING
) TBLPROPERTIES('delta.feature.allowColumnDefaults' = 'supported');
-- Create Bloom Filter index on correlation ID
CREATE BLOOMFILTER INDEX
ON TABLE <your_table>
FOR COLUMNS(streamline_corr_id);
-- Verify
DESCRIBE TABLE <your_table>;
SHOW CREATE TABLE <your_table>;
Known Limitations
Editing Existing Data Sources
Current Status: Editing existing Databricks data sources/integrations is not currently available in Streamline.
Impact: If you have an existing Databricks integration and want to add Correlation ID support, you must recreate the integration.
Workaround: See Step 3: Update Existing Integration for detailed instructions on how to safely recreate your integration.
Planned Enhancement: The ability to edit existing data sources will be added in a future release. This will allow you to:
Add or update the Correlation ID field name
Modify other connection settings
Avoid recreating integrations
Timeline: To be determined. Check Streamline release notes for updates.