This guide explains how to configure Databricks tables that use IDENTITY primary keys (auto-generated values) so that INSERT operations can be enabled in Streamline using a Correlation ID column. It does not apply to tables without IDENTITY primary keys.
Key Points
⚠️ INSERT operations are disabled by default for all tables
⚠️ Adding a Correlation ID column only enables INSERT operations for tables with IDENTITY primary keys
⚠️ Adding a Correlation ID column does not enable INSERTs or provide any benefit for tables without IDENTITY primary keys
✅ UPDATE operations work without any special configuration (no Correlation ID needed)
🔧 This guide shows how to enable INSERTs by adding a Correlation ID column
⏱️ Setup takes 5-10 minutes per table
📊 Minimal impact: Only adds one STRING column to your table
💡 Quick Check: If you only need to UPDATE existing records (not create new ones), you can stop reading now—you don't need a Correlation ID!
Understanding the Problem
Databricks IDENTITY columns have a limitation that affects data integration platforms like Streamline. When inserting new records into tables with IDENTITY primary keys, Databricks does not provide a reliable way to retrieve the generated ID value.
Why This Matters
When Streamline inserts data into your Databricks tables, it needs to know the primary key values of inserted records to:
Build relationships between parent and child tables
Track changes in the dataset
Maintain data integrity across related tables
Example Scenario
-- You have two related tables
CREATE TABLE accounts (
account_id BIGINT GENERATED BY DEFAULT AS IDENTITY PRIMARY KEY,
account_name STRING,
streamline_corr_id STRING
) TBLPROPERTIES('delta.feature.allowColumnDefaults' = 'supported');
CREATE TABLE contacts (
contact_id BIGINT GENERATED BY DEFAULT AS IDENTITY PRIMARY KEY,
account_id BIGINT, -- Foreign key to accounts
contact_name STRING,
streamline_corr_id STRING,
FOREIGN KEY (account_id) REFERENCES accounts(account_id)
) TBLPROPERTIES('delta.feature.allowColumnDefaults' = 'supported');
-- Streamline needs to:
-- 1. Insert into accounts → Get the generated account_id (e.g., 12345)
-- 2. Insert into contacts using that account_id → Link the contact to account 12345
The Technical Limitation
Databricks IDENTITY columns have an important limitation for data integration:
- There is no standard way to retrieve the auto-generated IDENTITY value immediately after an INSERT through JDBC/SQL.
- Functions and patterns commonly used in other databases (such as INSERT … RETURNING, “last inserted ID” helpers, or getGeneratedKeys() over JDBC) are not supported or do not behave reliably for Databricks IDENTITY.
- Databricks does not allow uuid() to be used in DEFAULT expressions, so you cannot define a DEFAULT UUID primary key column.
Because of this, Streamline cannot reliably obtain the generated primary key value after inserting into an IDENTITY column. The Correlation ID pattern is the workaround: Streamline inserts a known value into a separate column and then queries back using that value to retrieve the new primary key.
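For illustration, here is a minimal sketch of the pattern that is not available (the table name is hypothetical): Databricks rejects uuid() in a DEFAULT expression, so a self-populating UUID primary key column cannot be defined.
-- ❌ Illustrative only: rejected because uuid() is not allowed in DEFAULT expressions
CREATE TABLE accounts_uuid_attempt (
  account_id STRING DEFAULT uuid(),
  account_name STRING
) TBLPROPERTIES('delta.feature.allowColumnDefaults' = 'supported');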
Reference: Databricks IDENTITY Columns Documentation
What is a Correlation ID?
A Correlation ID is a special column you add to your Databricks table that Streamline uses to retrieve the auto-generated primary key after an INSERT operation.
ℹ️ Note: Correlation ID is ONLY needed for INSERT operations. If you only plan to UPDATE existing records, you do not need a Correlation ID column.
How It Works
When Streamline inserts a record, it:
1. Generates a unique UUID (e.g., a7f3c8e1-4b2d-4c9a-8f1e-9d6b5a2c3f4e)
2. Inserts the record with this UUID in your Correlation ID column
3. Queries the table using the UUID to find the newly inserted record
4. Retrieves the auto-generated primary key from the query result
5. Uses that primary key for relationships and tracking (see the SQL sketch below)
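Conceptually, the sequence looks like the following SQL sketch against the accounts table from the earlier example (the UUID and the account name are illustrative values; Streamline runs the equivalent statements for you):
-- 1. Streamline generates a UUID, e.g. 'a7f3c8e1-4b2d-4c9a-8f1e-9d6b5a2c3f4e'
-- 2. It inserts the record with that UUID in the Correlation ID column
INSERT INTO accounts (account_name, streamline_corr_id)
VALUES ('Acme Corp', 'a7f3c8e1-4b2d-4c9a-8f1e-9d6b5a2c3f4e');
-- 3. It queries back by Correlation ID to find the newly inserted row
-- 4. The generated account_id in the result is the primary key it keeps for tracking
SELECT account_id
FROM accounts
WHERE streamline_corr_id = 'a7f3c8e1-4b2d-4c9a-8f1e-9d6b5a2c3f4e';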
Analogy
Think of the Correlation ID as a "tracking number" that helps Streamline find the exact record it just inserted, similar to how you track a package with a tracking number.
Important Implementation Details for Databricks
IDENTITY Column Types
Databricks supports two types of IDENTITY columns:
1. GENERATED ALWAYS AS IDENTITY
- Database always generates the value
- Cannot provide explicit values
- Use case: Strict auto-increment only
2. GENERATED BY DEFAULT AS IDENTITY
- Database generates value if not provided
- Can provide explicit values when needed
- Use case: Flexible - supports both auto-increment and explicit values
ℹ️ Streamline does not require one option over the other.
Use GENERATED ALWAYS when you want the database to strictly control IDs, or GENERATED BY DEFAULT when you need flexibility (e.g., backfills, migrations, or tests that provide explicit IDs).
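As a quick sketch (table names are illustrative), the two variants are declared like this:
-- Strict: the database always generates the ID; explicit values are rejected
CREATE TABLE accounts_strict (
  account_id BIGINT GENERATED ALWAYS AS IDENTITY PRIMARY KEY,
  account_name STRING,
  streamline_corr_id STRING
);
-- Flexible: the database generates the ID only when you do not supply one
CREATE TABLE accounts_flexible (
  account_id BIGINT GENERATED BY DEFAULT AS IDENTITY PRIMARY KEY,
  account_name STRING,
  streamline_corr_id STRING
);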
When Do You Need This?
You NEED a Correlation ID when:
Your table meets BOTH of these conditions:
1. Primary key uses IDENTITY (auto-generated values):
CREATE TABLE my_table (
  id BIGINT GENERATED BY DEFAULT AS IDENTITY PRIMARY KEY,
  name STRING,
  streamline_corr_id STRING
) TBLPROPERTIES('delta.feature.allowColumnDefaults' = 'supported');
2. You want to perform INSERT operations (create new records) from Streamline
How to Know
In Streamline's Deliver data interface, if INSERT operations are disabled for your table, it means the table has an IDENTITY primary key without a Correlation ID column. Follow the setup steps in this guide to enable INSERT operations.
Common Use Cases
Creating new records in tables with IDENTITY primary keys
Parent-child table relationships (e.g., Accounts → Contacts)
Multi-table datasets requiring inserts
Tables where you don't control the primary key generation
Step-by-Step Setup Guide
Prerequisites
Access to your Databricks workspace with ALTER TABLE privileges
Knowledge of which tables will receive data from Streamline
Basic understanding of SQL
Step 1: Add Correlation ID Column to Your Databricks Table
For each table with IDENTITY primary keys that will receive data from Streamline, add a Correlation ID column.
1.1 Choose a Column Name
Pick a descriptive name for your Correlation ID column. We recommend:
streamline_corr_id (recommended). Other names (such as integration_corr_id or sync_corr_id) are supported, but this guide and all examples assume streamline_corr_id.
1.2 Add the Column
-- Connect to your Databricks workspace
USE CATALOG your_catalog_name;
USE SCHEMA your_schema_name;
-- Add the Correlation ID column
-- Note: STRING is used because Databricks stores UUIDs as strings
-- UUIDs are 36 characters long (e.g., 'a7f3c8e1-4b2d-4c9a-8f1e-9d6b5a2c3f4e')
ALTER TABLE your_table_name
ADD COLUMN streamline_corr_id STRING;
Example:
-- For an accounts table
USE CATALOG sales_catalog;
USE SCHEMA public;
ALTER TABLE accounts ADD COLUMN streamline_corr_id STRING;
-- For a contacts table (if it also has IDENTITY)
ALTER TABLE contacts ADD COLUMN streamline_corr_id STRING;
1.3 Create an Index (⚠️ CRITICAL for Performance)
Since Streamline queries this column after every insert, adding an index significantly improves performance:
-- Create a Bloom Filter index on the Correlation ID column
CREATE BLOOMFILTER INDEX
ON TABLE your_table_name
FOR COLUMNS(streamline_corr_id);
Example:
CREATE BLOOMFILTER INDEX
ON TABLE accounts
FOR COLUMNS(streamline_corr_id);

CREATE BLOOMFILTER INDEX
ON TABLE contacts
FOR COLUMNS(streamline_corr_id);
⚠️ Performance Impact: Without an index, queries can be 10-100x slower on large tables!
Note: Bloom Filter indexes are optimized for equality lookups (e.g., WHERE streamline_corr_id = 'some-uuid').
For additional performance optimization on large tables, you can also use Z-Ordering:
OPTIMIZE your_table_name ZORDER BY (streamline_corr_id);
Use both together for the best performance characteristics: Bloom Filter for fast equality predicate pruning, and Z-Ordering to cluster data by the correlation ID.
1.4 Verify the Column Was Added
-- Check the table structure
DESCRIBE TABLE your_table_name;
-- You should see your new column listed
Step 2: Configure Databricks Integration in Streamline
Now that your Databricks tables are prepared, configure the integration in Streamline.
2.1 Navigate to Integrations
Log in to Streamline
Go to Integrations > Data Sources
Click Add New Integration
Select Databricks
2.2 Enter Connection Details
Provide your Databricks connection information:
| Field | Description | Example |
|---|---|---|
| Host | Your Databricks workspace URL | myworkspace.cloud.databricks.com |
| Client ID | OAuth 2.0 Client ID for authentication | your-client-id |
| Client Secret | OAuth 2.0 Client Secret for authentication | your-client-secret |
| Warehouse | SQL Warehouse HTTP Path ID | abc123def456 |
| Correlation ID Field Name | The column name you added in Step 1 | streamline_corr_id |
⚠️ Important: The Correlation ID Field Name must exactly match the column name you created in Step 1, including case.
📝 Note on Databricks Connection Details
Host: Your Databricks workspace URL typically follows one of these patterns:
{workspace-name}.cloud.databricks.com (AWS)
{workspace-name}.azuredatabricks.net (Azure)
{workspace-name}.gcp.databricks.com (GCP)
Client ID & Client Secret: These are OAuth 2.0 credentials for service principal authentication. You can create these in your Databricks workspace under Settings → Identity and access → Service principals.
Warehouse: This is the HTTP Path ID of your SQL Warehouse, not the warehouse name. You can find this in your Databricks workspace under SQL Warehouses → Select your warehouse → Connection details →
HTTP Path (it looks like /sql/1.0/warehouses/abc123def456). Use only the ID portion (e.g., abc123def456).
2.3 Save and Test Connection
Click Test Connection to verify connectivity
If successful, click Save Integration
Step 3: Update Existing Integration (If Applicable)
⚠️ For Customers with Existing Databricks Integrations
If you already have a Databricks data source/integration configured in Streamline and want to add Correlation ID support:
3.1 Add the Correlation ID Column
First, complete Step 1 to add the Correlation ID column to your Databricks table(s).
3.2 Update Your Integration in Streamline
You need to update your existing Databricks integration to include the Correlation ID field name.
🚧 IMPORTANT WARNING: Editing existing data sources is not currently available in Streamline. Until this feature is implemented, you will need to recreate your data source with the Correlation ID configuration.
To recreate your data source:
1. Document your current settings:
Note all connection details (Host, Client ID, Client Secret, Warehouse, etc.)
Note which tables/entities are being used
Note any workflows or delivery steps using this integration
2. Remove the old integration:
Go to Integrations > Data Sources
Find your Databricks integration
Delete/remove it
3. Create a new integration:
Follow Step 2 above
Use the same connection details as before
Add the Correlation ID Field Name this time
4. Reconfigure workflows:
Update any workflows that were using the old integration
Point them to the new integration
5. Test thoroughly:
Test that data is flowing correctly
Verify INSERT operations are now enabled
💡 Future Update: The ability to edit existing data sources will be added in a future Streamline release, eliminating the need to recreate integrations.
Step 4: Use Deliver Data in Streamline
Once configured, deliver data works automatically!
4.1 Select Your Dataset
Navigate to your Workflow
Select the Entity (Dataset) that corresponds to your Databricks table
Add a Deliver data step
4.2 Automatic Handling
When Streamline performs an insert operation:
✅ Automatically generates a unique UUID for the Correlation ID
✅ Inserts the record with the UUID in your Correlation ID column
✅ Retrieves the auto-generated primary key using the UUID
✅ Handles relationships between parent and child tables seamlessly
You don't need to do anything else! The system handles everything behind the scenes.
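If you want to spot-check a test delivery, a query like the following (assuming the accounts table and streamline_corr_id column from the earlier examples) lists the rows Streamline inserted, each with its generated primary key and its Correlation ID:
-- Rows inserted by Streamline have a populated Correlation ID;
-- pre-existing rows keep NULL in this column
SELECT account_id, account_name, streamline_corr_id
FROM accounts
WHERE streamline_corr_id IS NOT NULL
ORDER BY account_id DESC
LIMIT 10;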
Scenarios That Don't Require Correlation ID
✅ You DON'T need a Correlation ID when:
Your table uses one of these alternative primary key strategies:
Scenario 1: Explicit Primary Key Values
If your application provides explicit primary key values (not auto-generated):
CREATE TABLE accounts (
account_id BIGINT PRIMARY KEY, -- No IDENTITY - values provided by application
account_name STRING
);
Why it works: Streamline knows the PK value before inserting because you provide it.
Scenario 2: Update-Only Operations
If you're only updating existing records (not inserting new ones), you don't need a Correlation ID regardless of your primary key type.
Example:
-- Update operations work fine without Correlation ID
UPDATE accounts SET account_name = 'New Name' WHERE account_id = 12345;
In Streamline: If your workflow only updates existing records and never creates new ones, you can skip the Correlation ID setup entirely.
Scenario 3: Read-Only Operations
If you're only reading data from Databricks (no inserts/updates), you don't need a Correlation ID regardless of your primary key type.
Best Practices
1. Consistent Naming
Use the same Correlation ID column name across all your Databricks tables for consistency:
-- ✅ Good: Same name everywhere
ALTER TABLE accounts ADD COLUMN streamline_corr_id STRING;
ALTER TABLE contacts ADD COLUMN streamline_corr_id STRING;
ALTER TABLE orders ADD COLUMN streamline_corr_id STRING;
2. Always Add an Index
Performance degrades significantly without an index, especially for large tables:
-- ⚠️ CRITICAL for performance
CREATE BLOOMFILTER INDEX
ON TABLE accounts
FOR COLUMNS(streamline_corr_id);
Performance Impact (for Streamline correlation ID lookups only):
With an index on streamline_corr_id: ~10–50 ms per lookup
Without an index: ~1–10 seconds per lookup (up to 100x slower for these lookups)
This indexing recommendation affects only the internal queries Streamline runs against the streamline_corr_id column (for example, WHERE streamline_corr_id = 'some-uuid').
It does not change the performance characteristics of your other application queries that do not filter on this column.
Bloom Filter indexes are specifically optimized for equality lookups on columns like streamline_corr_id.
For even better performance on very large tables, combine this with Z-Ordering:
OPTIMIZE accounts ZORDER BY (streamline_corr_id);
3. Don't Populate Existing Records
The Correlation ID should remain NULL for existing records. Only Streamline populates this field during insert operations.
-- ❌ DON'T DO THIS
UPDATE accounts SET streamline_corr_id = uuid();
-- ✅ DO THIS (leave existing records as NULL)
-- Only Streamline populates this column during INSERT
4. Document Your Tables
Add a comment to document the purpose of the Correlation ID column:
COMMENT ON COLUMN accounts.streamline_corr_id IS
'Used by Streamline for data integration. Managed automatically. Do not modify.';
5. Consider Future Tables
⚠️ Important for Databricks:
When creating new tables, we recommend using IDENTITY columns with the correlation ID pattern:
-- ✅ RECOMMENDED: IDENTITY with correlation ID
CREATE TABLE accounts (
account_id BIGINT GENERATED BY DEFAULT AS IDENTITY PRIMARY KEY,
account_name STRING,
created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP(),
streamline_corr_id STRING
) TBLPROPERTIES('delta.feature.allowColumnDefaults' = 'supported');
-- Create Bloom Filter index on correlation ID
CREATE BLOOMFILTER INDEX
ON TABLE accounts
FOR COLUMNS(streamline_corr_id);
Reference: CREATE TABLE
Frequently Asked Questions
Q: I already have a Databricks integration in Streamline. What do I do?
A: You'll need to add the Correlation ID column to your Databricks table, then recreate your integration in Streamline with the Correlation ID field configured. See Step 3: Update Existing Integration for detailed instructions.
⚠️ Note: Editing existing integrations is not currently available. You must recreate the integration until this feature is implemented.
Q: Does the Correlation ID affect my existing data?
A: No. Existing records will have NULL in the Correlation ID column. Only new records inserted by Streamline will have values populated.
Q: Can I use the Correlation ID for my own purposes?
A: No. The Correlation ID column is exclusively managed by Streamline. Do not read from, write to, or depend on this column in your own applications.
Q: What if I have multiple Streamline integrations to the same table?
A: Use the same Correlation ID column for all integrations. Each insert operation generates a unique UUID, so there's no conflict.
Q: Can I delete or modify Correlation ID values?
A: Once inserted, you should not modify Correlation ID values. However, you can safely delete old Correlation ID values if needed (set to NULL) for records that were successfully inserted and no longer need tracking.
Q: Why can't I insert records into my Databricks table in Streamline?
A: If INSERT operations are disabled in Streamline's Deliver data interface, your table has an IDENTITY primary key without a Correlation ID column configured. By default, tables with IDENTITY primary keys are not "insertable" in Streamline until you complete the Correlation ID setup. Follow the steps in this guide to enable INSERT operations.
Q: I only need to UPDATE records, do I still need a Correlation ID?
A: No! Correlation ID is only required for INSERT operations. UPDATE operations work perfectly fine without a Correlation ID, regardless of whether your table has an IDENTITY primary key.
Troubleshooting
INSERT operations are still disabled after adding Correlation ID
Possible causes:
1. Correlation ID field name mismatch
- Verify the column name in Databricks matches exactly (case-sensitive)
- Check your Streamline integration configuration
2. Data catalog not refreshed
- In Streamline, rescan your Databricks integration (data source)
- Wait 1–2 minutes for the rescan to complete
INSERT operations are slow
Possible causes:
1. Missing Bloom Filter index on correlation ID column
Create the index:
CREATE BLOOMFILTER INDEX
ON TABLE your_table
FOR COLUMNS(streamline_corr_id);
This is the most common cause of slow lookups after INSERT.
2. Large table without statistics
Run:
ANALYZE TABLE your_table COMPUTE STATISTICS;
3. Warehouse undersized
Consider scaling up your SQL Warehouse
Check warehouse query history for performance metrics
Error: "Column 'streamline_corr_id' not found"
Solution:
Verify the column exists:
DESCRIBE TABLE your_table;
If missing, add it:
ALTER TABLE your_table ADD COLUMN streamline_corr_id STRING;
Create a Bloom Filter index:
CREATE BLOOMFILTER INDEX
ON TABLE your_table
FOR COLUMNS(streamline_corr_id);
Error: "AUTOINCREMENT is not a valid keyword"
Solution:
Databricks does NOT support AUTOINCREMENT.
Use GENERATED BY DEFAULT AS IDENTITY instead:
-- ❌ WRONG
CREATE TABLE accounts (
account_id BIGINT PRIMARY KEY AUTOINCREMENT,
account_name STRING
);
-- ✅ CORRECT
CREATE TABLE accounts (
account_id BIGINT GENERATED BY DEFAULT AS IDENTITY PRIMARY KEY,
account_name STRING,
streamline_corr_id STRING
) TBLPROPERTIES('delta.feature.allowColumnDefaults' = 'supported');
Support
If you encounter issues not covered in this guide:
- Check Insert Availability: Verify if INSERT operations are enabled for your table in Streamline's Deliver data interface
- Check Streamline Logs: Look for error messages related to your Databricks integration
- Verify Configuration: Ensure the Correlation ID column name matches exactly in both Databricks and Streamline
- Refresh Data Catalog: After adding the Correlation ID column, refresh your data catalog in Streamline
- Test Simple Insert: Try inserting a single record to isolate the problem
- Contact Support: Reach out to Streamline support with:
- Your Databricks table structure (DESCRIBE TABLE)
- Streamline integration configuration
- Error messages or logs
Quick Reference Card
✅ Checklist for Setting Up Correlation ID
For New Integrations:
- Identify tables with IDENTITY primary keys
- Add Correlation ID column to each table (STRING, e.g., streamline_corr_id)
- Create Bloom Filter index on Correlation ID column (CRITICAL!)
- Verify TBLPROPERTIES includes delta.feature.allowColumnDefaults
- Configure Databricks integration in Streamline
- Enter Correlation ID field name in integration settings
- Test connection
- Perform test insert operation
- Verify data is flowing correctly
For Existing Integrations:
- Add Correlation ID column to your Databricks table(s)
- Create Bloom Filter index on Correlation ID column
- Document current integration settings
- Remove old integration from Streamline
- Create new integration with Correlation ID field name
- Reconfigure workflows to use new integration
- Test INSERT operations
📋 Quick SQL Script Template
-- ============================================
-- Databricks Correlation ID Setup Script
-- ============================================
-- Replace placeholders with your actual values
USE CATALOG <your_catalog>;
USE SCHEMA <your_schema>;
-- Step 1: Add Correlation ID column
ALTER TABLE <your_table>
ADD COLUMN streamline_corr_id STRING;
-- Step 2: Add index for performance (CRITICAL!)
CREATE BLOOMFILTER INDEX
ON TABLE <your_table>
FOR COLUMNS(streamline_corr_id);
-- Step 3: Add documentation comment
COMMENT ON COLUMN <your_table>.streamline_corr_id IS
'Used by Streamline for data integration. Managed automatically. Do not modify.';
-- Step 4: Verify the column was added
DESCRIBE TABLE <your_table>;
-- Step 5: Verify TBLPROPERTIES (should include allowColumnDefaults)
SHOW CREATE TABLE <your_table>;
-- ============================================
-- Example: Creating a NEW table with IDENTITY
-- ============================================
CREATE TABLE <your_table> (
id BIGINT GENERATED BY DEFAULT AS IDENTITY PRIMARY KEY,
name STRING,
created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP(),
streamline_corr_id STRING
) TBLPROPERTIES('delta.feature.allowColumnDefaults' = 'supported');
-- Create Bloom Filter index on correlation ID
CREATE BLOOMFILTER INDEX
ON TABLE <your_table>
FOR COLUMNS(streamline_corr_id);
-- Verify
DESCRIBE TABLE <your_table>;
SHOW CREATE TABLE <your_table>;
Known Limitations
Editing Existing Data Sources
Current Status: Editing existing Databricks data sources/integrations is not currently available in Streamline.
Impact: If you have an existing Databricks integration and want to add Correlation ID support, you must recreate the integration.
Workaround: See Step 3: Update Existing Integration for detailed instructions on how to safely recreate your integration.
Planned Enhancement: The ability to edit existing data sources will be added in a future release. This will allow you to:
Add or update the Correlation ID field name
Modify other connection settings
Avoid recreating integrations
Timeline: To be determined. Check Streamline release notes for updates.