One of Streamline’s standout features is its ability to automatically detect and tag sensitive data across your data fabric. These tags help you stay on top of security and compliance by highlighting areas that may require special handling.
What Are Sensitive Data Tags?
Streamline uses tags to identify and label data sources, entities, and datasets that may contain sensitive information. These tags are visible right in the data catalog, making it easy to know where sensitive data lives.
[Screenshot: example view of sensitive data tags in the data catalog]
How Does Streamline Detect Sensitive Data?
We use a process called data classification. Here’s how it works:
Automatic Detection:
When a data source is connected, our machine learning model runs automatically—usually in just a second or two.
Classification by Data Class:
The model inspects both the contents and the structure of your tables to assign a data class to each field.
View the Results:
You can see the assigned data classes by clicking on an entity in your data catalog.
What Is Data Classification?
Data classification is the process of labeling data based on its type—like identifying that a column contains Social Security Numbers or email addresses.
For example:
Each column in your data source is assigned a single data class (like “Full Name” or “IP Address”). If a field doesn’t match any known type, it’s marked as “Other.”
Note: The classification is based on data content, not the column name. For instance, patient_full_name and doctor_full_name would both be classified as “Full Name.”
Supported Data Classes
Streamline supports a wide and growing list of data classes, grouped into categories like:
ID Number: Social Security Number, Passport Number, etc.
Contact: Email Address, Phone Number
Medical: ICD Codes, MRNs, UDI
Financial: Bank Account Numbers, Payment Card Info
Location, Name, Demographics, and more
You can find the full list of data classes in the definitions.py file of our classifier code.
Some data classes are currently limited to specific systems like Salesforce or EHRs—these are noted in the list.
Note: Custom data classes are not currently available.
How Are Tags Like PII, PHI, and PCI Applied?
Streamline currently supports 4 sensitive data tags:
Tag | Description |
---|---|
PII | Personally Identifiable Information |
PHI | Protected Health Information |
PCI | Payment Card Information |
HIPAA | Combination of PII and PHI |
When Are These Tags Applied?
Each tag is triggered based on the presence of certain data classes:
PII: Triggered by things like Social Security Numbers, Names, IPs, etc.
PHI: Includes medical codes, patient IDs, diagnosis data, etc.
PCI: Covers credit card numbers, expiration dates, and similar data.
HIPAA: Appears when both PII and PHI tags are present, or if a Medical Record Number (MRN) or UDI is found.
These lists evolve as we support more data classes.
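As a rough illustration of the rules above, the sketch below derives tags from a set of data classes. The class-to-tag mapping and the names CLASS_TO_TAG, HIPAA_TRIGGERS, and derive_tags are assumptions made for this example, not Streamline's actual code or its complete lists.

```python
# Hypothetical mapping from data class to the tag it triggers.
# Class names follow the article; the mapping itself is an assumption.
CLASS_TO_TAG = {
    "Social Security Number": "PII",
    "Full Name": "PII",
    "IP Address": "PII",
    "ICD Code": "PHI",
    "Medical Record Number": "PHI",  # assumed: treated as a patient ID
    "UDI": "PHI",                    # assumed
    "Payment Card Number": "PCI",
}

# Data classes that trigger the HIPAA tag on their own.
HIPAA_TRIGGERS = {"Medical Record Number", "UDI"}


def derive_tags(data_classes):
    """Return the set of sensitive data tags implied by the data classes."""
    classes = set(data_classes)
    tags = {CLASS_TO_TAG[c] for c in classes if c in CLASS_TO_TAG}
    # HIPAA appears when both PII and PHI are present,
    # or when an MRN or UDI is found.
    if {"PII", "PHI"} <= tags or classes & HIPAA_TRIGGERS:
        tags.add("HIPAA")
    return tags
```

For instance, a dataset with a “Full Name” column and an “ICD Code” column would receive PII, PHI, and HIPAA under this sketch, while a dataset with only “Full Name” would receive PII alone.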
How Does Streamline Classify Different Data Sources?
Table-like Data Sources (e.g., SQL Databases)
We classify data at the column level based on:
- Sample values from the data
- Column names and table structure
Why classify columns instead of rows?
- It’s more accurate
- It fits how data fabrics typically structure data
Standard Schema Sources (e.g., Salesforce, EHRs)
These systems have well-documented structures. Here’s how we handle them:
Salesforce: We used an LLM to analyze documentation and assign default classifications to standard fields. Custom fields fall back to the ML model.
EHR Systems (FHIR-based): Fields are classified manually during setup.
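The Salesforce handling described above amounts to a lookup with a fallback. The sketch below shows that pattern under stated assumptions: the STANDARD_FIELD_CLASSES table, its two entries, and the ml_classify callable are hypothetical stand-ins, not Streamline's real defaults or model interface.

```python
# Hypothetical default classifications for standard Salesforce fields.
# Keys are (object, field); the entries are illustrative only.
STANDARD_FIELD_CLASSES = {
    ("Contact", "Email"): "Email Address",
    ("Contact", "Phone"): "Phone Number",
}


def classify_salesforce_field(obj, field, sample_values, ml_classify):
    """Use the default class for a standard field, else fall back to the model.

    ml_classify is any callable that takes sample values and returns a
    data class; it stands in for the ML model here.
    """
    default = STANDARD_FIELD_CLASSES.get((obj, field))
    if default is not None:
        return default
    # Custom fields (Salesforce API names ending in "__c") are not in the
    # table, so they reach the model.
    return ml_classify(sample_values)
```

A call for ("Contact", "Email") returns the stored default without touching the model, while a custom field such as Loyalty_Id__c (a made-up name) is passed to ml_classify.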
How the Classifier Works
Our classification engine combines:
An expert system for rule-based identification
A custom-trained language model, trained on public and synthetic data
Together, these components assign a best-fit data class—or “Other” if no match is found.
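One plausible way to combine a rule-based component with a model is a simple cascade: try the rules first, then ask the model, and return “Other” when neither is confident. The sketch below assumes that ordering, a single SSN rule, a model that returns a (label, confidence) pair, and a min_confidence threshold; none of these details are confirmed by the article.

```python
import re
from typing import Callable, Optional, Sequence, Tuple

SSN = re.compile(r"\d{3}-\d{2}-\d{4}")


def rule_based(values: Sequence[str]) -> Optional[str]:
    """Expert-system stand-in: return a class only when a rule fires."""
    if values and all(SSN.fullmatch(v) for v in values):
        return "Social Security Number"
    return None


def classify(
    values: Sequence[str],
    model: Callable[[Sequence[str]], Tuple[str, float]],
    min_confidence: float = 0.5,
) -> str:
    """Return a best-fit data class, or "Other" if nothing matches well."""
    rule_hit = rule_based(values)
    if rule_hit is not None:
        return rule_hit  # assumed: rules take priority over the model
    label, confidence = model(values)
    return label if confidence >= min_confidence else "Other"
```

Passing the model in as a callable keeps the sketch runnable with a stub, for example `lambda values: ("Full Name", 0.9)`, in place of a trained language model.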