Connecting Databricks to Collate: A Complete Guide
Collate is an AI-enabled platform designed to help data teams organize, govern, and optimize their data assets. It focuses on automating tasks like data discovery, quality assurance, observability, and compliance to boost productivity and reduce costs. With over 90 connectors, Collate allows you to quickly find and collaborate on key data assets across various sources. You can generate secure and permission-aware insights from a unified knowledge graph, ensure compliance with regulations such as GDPR, and build a self-service data culture that accelerates development and problem resolution.
Databricks has become one of the most popular platforms for data analytics and machine learning, but managing metadata, lineage, and data quality across large datasets can quickly become overwhelming. That's where Collate steps in, offering a comprehensive data catalog solution that seamlessly integrates with Databricks, bringing all your data management needs under one roof. A companion video to this blog is available on YouTube.
Setting Up the Connection: Surprisingly Simple
The initial setup process for connecting Collate to Databricks is straightforward. Starting from Collate's landing page, navigate to Settings > Services > Databases, where Databricks can be easily located among the extensive list of supported connectors. The process looks like this:
1. Navigate to Settings: Begin by accessing the services section in Collate's settings.

2. Add New Database Service: Select "Services", then "Databases", then "Add New Service" and search for Databricks in the service list.

3. Configure Connection Details:
- Enter a descriptive name for your database service
- Provide database credentials (username, password)
- Specify the host endpoint and port
- Provide your Databricks token
- Provide the HTTP Path

Press Test Connection first to verify connectivity. Note, however, that if the SQL warehouse is stopped, the test itself may be what wakes it up, so the first attempt can fail if the warehouse doesn't start quickly enough. If that happens, wait a couple of minutes and retry; it should then succeed. Testing your connection is always best practice. Once done, click Next.
Collate uses filters to control which databases, schemas, or tables are ingested, matched by name or by regular expression (regex). Out of the box, it excludes system schemas such as "information_schema" and "performance_schema" to focus on user data.
Accept the defaults for a full ingest, or customize: for instance, include only schemas matching "^prod_.*" to target production data. Use the filtering options to control which databases, schemas, or tables are imported, keeping the catalog focused and free of unnecessary bloat.
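To make the filter semantics concrete, here is a sketch using plain Python `re` (not Collate's actual matcher; the schema names are made up):

```python
import re

def filter_schemas(schemas, include_pattern=None, exclude_pattern=None):
    """Keep schemas matching the include pattern; drop excluded ones first."""
    kept = []
    for name in schemas:
        if exclude_pattern and re.match(exclude_pattern, name):
            continue
        if include_pattern and not re.match(include_pattern, name):
            continue
        kept.append(name)
    return kept

schemas = ["prod_sales", "prod_hr", "dev_sandbox", "information_schema"]
print(filter_schemas(schemas, include_pattern=r"^prod_.*"))
# -> ['prod_sales', 'prod_hr']
```

The same pattern applies at the database and table levels.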

Metadata Ingestion and Agents
One of Collate's standout features is its comprehensive agent ecosystem. Once the initial connection is established, Collate automatically launches its Metadata Agent, which begins ingesting metadata immediately. But this is just the beginning: each agent in the ecosystem is designed to extract a different type of value from your Databricks data.
We can check the status of the Metadata Agent by navigating to the database service we just created and selecting Agents.

Available Agents
- Metadata Agent: Brings in database structure and metadata
- Usage Agent: Captures query patterns and data popularity statistics
- Lineage Agent: Tracks data lineage and relationships
- Profiler Agent: Provides detailed table metrics (e.g., row counts)
- Auto Classification Agent: Automatically categorizes tables
- dbt Agent: Integrates with dbt for enhanced data transformation insights
Agents Further Explained
The Metadata Agent serves as the foundation, running on a scheduled basis (typically Sundays) to continuously sync your Databricks metadata. This agent ensures your catalog stays current with schema changes, new tables, and structural modifications happening in your Databricks environment.
The Lineage Agent tracks data movement and transformations, creating visual maps of how data flows through your systems. While sample datasets might not show extensive lineage due to limited data movement, production environments reveal complex data relationships that become invaluable for impact analysis and troubleshooting.
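To see why lineage becomes invaluable for impact analysis, consider a toy downstream traversal over an adjacency map. The graph below is invented for illustration; Collate builds and visualizes the real one for you.

```python
from collections import deque

def downstream_impact(lineage, start):
    """Breadth-first walk collecting everything downstream of `start`."""
    impacted, queue = set(), deque([start])
    while queue:
        node = queue.popleft()
        for child in lineage.get(node, []):
            if child not in impacted:
                impacted.add(child)
                queue.append(child)
    return impacted

# Hypothetical flow: raw table -> staging -> two marts.
lineage = {
    "raw.orders": ["staging.orders_clean"],
    "staging.orders_clean": ["marts.daily_revenue", "marts.customer_ltv"],
}
print(downstream_impact(lineage, "raw.orders"))
```

A change to `raw.orders` would surface all three downstream assets, which is exactly the question impact analysis answers.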
The Usage Agent captures query patterns and data access information, providing insights into how your Databricks data is actually being utilized. This intelligence helps identify popular datasets, optimize performance, and understand user behavior patterns.
The optional dbt Agent lets you bring in a related dbt manifest file and use it to curate lineage, descriptions, and other metadata.
Intelligent Data Profiling and Classification
The Profiler Agent can be configured to sample your data statistically, making it practical even for massive datasets. You can configure percentage-based sampling (typically 20-30%) to achieve statistical significance. Alternatively, for organizations dealing with huge datasets, you can specify a fixed number of records instead, avoiding a percentage sample that would still yield an impractically large number of rows.
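The trade-off between percentage-based and fixed-count sampling can be expressed as a small helper. The 30% default and the cap below are illustrative choices, not Collate's internals.

```python
def profiler_sample_size(row_count, percent=30, max_rows=None):
    """Rows to profile: a percentage of the table, optionally capped."""
    sample = row_count * percent // 100
    if max_rows is not None:
        sample = min(sample, max_rows)
    return sample

# A 1M-row table at 30% is still 300,000 rows; a cap keeps it bounded.
print(profiler_sample_size(1_000_000, percent=30))                    # 300000
print(profiler_sample_size(1_000_000, percent=30, max_rows=100_000))  # 100000
```

For billion-row tables, the fixed cap is usually the setting that keeps profiling runs affordable.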
The Auto Classification Agent represents a significant tool for automated data governance. This AI-powered agent automatically identifies and tags various data types, including PII, datetime fields, IP addresses, and other sensitive information. For organizations with thousands of tables, this automation is essential: manually tagging 20,000 tables isn't just impractical, it's virtually impossible at scale.
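As a drastically simplified sketch of what rule-based classification looks like (Collate's agent uses far richer detection; the regexes below only catch obvious e-mail and IPv4 strings):

```python
import re

# Illustrative patterns only -- real classifiers handle many more cases.
PII_PATTERNS = {
    "email": re.compile(r"^[\w.+-]+@[\w-]+\.[\w.]+$"),
    "ipv4": re.compile(r"^(\d{1,3}\.){3}\d{1,3}$"),
}

def classify_values(values):
    """Return the set of PII tags whose pattern matches any sample value."""
    tags = set()
    for value in values:
        for tag, pattern in PII_PATTERNS.items():
            if pattern.match(value):
                tags.add(tag)
    return tags

print(classify_values(["alice@example.com", "10.0.0.1", "hello"]))
```

Even this toy version shows why automation wins at scale: the same rules run unchanged over every column of every table.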
Sample Data Handling and Privacy Controls
Collate takes a privacy-first approach to collecting sample data. By default, the system doesn't store actual data samples, preventing accidental ingestion of sensitive information. However, if you do need sample data for a better understanding of your datasets, you can explicitly enable this feature in the Auto Classification Agent settings. This deliberate approach ensures organizations maintain control over their data exposure.
Smart Filtering and Tag-Based Management
Collate introduces an innovative classification filter system that goes beyond simple naming conventions. Instead of manually specifying every table you want to profile or monitor, you can create tag-based rules to automate the process. For example, you might tag specific tables as "gold-certified" or "sample-data" and then create automation rules that apply specific agents to any data matching those tags. This approach scales with you as your data estate grows.
This tag-based approach also extends to profiling decisions. Rather than maintaining endless lists of table names, you can establish governance policies that automatically apply profiling, quality checks, or monitoring based on data classification tags.
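Conceptually, tag-based automation maps tags to agent actions rather than individual tables to actions. A hypothetical sketch (the tag names come from the example above; the rule structure is invented):

```python
# Hypothetical rule set: any asset carrying a tag gets these agents applied.
RULES = {
    "gold-certified": ["profiler", "quality-tests", "usage"],
    "sample-data": ["profiler"],
}

def agents_for(asset_tags):
    """Union of agents triggered by an asset's tags."""
    agents = set()
    for tag in asset_tags:
        agents.update(RULES.get(tag, []))
    return agents

print(agents_for({"gold-certified"}))
print(agents_for({"sample-data", "untagged-extra"}))
```

The payoff is that a newly created table inherits the right monitoring the moment it is tagged, with no list of table names to maintain.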
AI Agents

The AI Documentation Agent addresses one of the most common pain points in data management: missing or inadequate documentation. This agent automatically suggests descriptions for tables and columns that lack documentation, allowing teams to accept, reject, or process these suggestions in bulk. Teams can even configure it to enforce descriptions, ensuring comprehensive documentation coverage.
The AI Quality Agent offers a smart starting point for data quality initiatives. By analyzing metadata patterns, it suggests appropriate data quality tests from Collate's extensive test library, giving organizations a solid foundation for their data quality programs without requiring extensive manual configuration.
Finally, there is the AI Tiering Agent, which automatically classifies data assets from Tier 1 to Tier 5 based on their importance. This classification considers multiple factors, including query frequency, lineage position, upstream and downstream dependencies, and the overall impact on the data ecosystem.
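One way to picture tiering is a weighted score over usage and dependency signals, bucketed into tiers. The weights and thresholds below are invented for illustration; the actual agent weighs many more factors.

```python
def suggest_tier(query_frequency, downstream_count):
    """Map usage/dependency signals to a tier; Tier 1 = most critical."""
    score = query_frequency * 0.5 + downstream_count * 2.0
    if score >= 100:
        return 1
    if score >= 50:
        return 2
    if score >= 20:
        return 3
    if score >= 5:
        return 4
    return 5

print(suggest_tier(query_frequency=180, downstream_count=12))  # heavily used -> Tier 1
print(suggest_tier(query_frequency=2, downstream_count=0))     # rarely touched -> Tier 5
```

The intuition matches the text: assets that are queried often and feed many downstream consumers score high and land in the top tiers.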
Conclusion
In summary, Collate's real-time ingestion allows metadata to appear in the platform as the agent processes it, offering immediate access to table structures, column definitions, and data types. Each table view provides detailed column information, including types and descriptions, forming a centralized data asset catalog.
Collate complements Databricks' storage and processing strengths by incorporating observability, governance, and quality controls, which help organize and validate data assets.
From data discovery and documentation to lineage tracking and quality monitoring, this integration supports effective data management across scales, from sample datasets to petabyte environments, and accommodates single or multi-database setups.
To explore further, consider the Collate Free Tier for managed OpenMetadata or the Product Sandbox with demo data.