Connecting Collate® to BigQuery: Unlocking Google's Data Ecosystem

Introduction

The data landscape continues to evolve rapidly, and Google BigQuery has emerged as a competitor to data warehousing solutions such as Snowflake and Databricks. As more organizations migrate from legacy SQL Server environments to Google Cloud Platform (GCP), comprehensive data observability becomes critical. This is where Collate's BigQuery connector comes in, offering deep integration with Google's ecosystem to provide lineage, usage monitoring, and cost observability.

BigQuery implements a serverless architecture that decouples compute from storage, allowing organizations to scale each independently based on workload demands. This differs from traditional data warehouse architectures, which require pre-provisioned infrastructure and capacity planning. Collate also works with other Google Cloud services, including Looker for visualization and Google Cloud Storage (GCS) for data lake functionality. This article covers the integration of BigQuery with Collate. A companion video to this blog is available on YouTube.

Setting Up the Connection

Connecting Collate to BigQuery takes only a few steps. From Collate's landing page, navigate to Settings > Services > Databases, where BigQuery appears in the list of supported connectors. The process looks like this:

1. Navigate to Settings: Begin by accessing the services section in Collate's settings.

2. Add New Database Service: Select "Services", then "Databases", then "Add New Service", and search for BigQuery in the service list.

3. Configure Connection Details:

The connector comes pre-configured with the standard Google API endpoints. As with Power BI, what differs from one deployment to the next is the authentication, not the endpoint configuration. Collate prefills fields such as:

Host And Port
Authentication URI
Token URI
Authentication Provider x509 Certificate URL

The connection process involves several key components:

Authentication Options: Collate supports multiple authentication methods, including service account credentials, external accounts, and application default credentials. This flexibility ensures compatibility with various organizational security policies.

Project Scope Management: You can configure the connector to ingest metadata from a single project or multiple projects, providing flexibility in structuring your data discovery. The project serves as the top-level organizational unit, containing databases and schemas that are organized underneath.

Service Account Configuration: The process requires specific credentials, including the project ID, private key ID, private key, client email, and client ID. While this list might seem extensive, all of these values come from the service account key that your GCP administrator can provide, and the remaining fields are standard Google API settings.
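As a rough illustration of the credentials described above, the following Python sketch checks a service account key for the values the connector asks for and fills in the standard endpoints. The field names follow the JSON key file that GCP issues for a service account; the helper names `missing_fields` and `with_default_endpoints` are hypothetical, and nothing here is Collate's own API.

```python
# Field names as they appear in a GCP service account JSON key file.
REQUIRED_FIELDS = (
    "project_id",
    "private_key_id",
    "private_key",
    "client_email",
    "client_id",
)

# Standard Google endpoints; these are the values that rarely need changing.
DEFAULT_ENDPOINTS = {
    "auth_uri": "https://accounts.google.com/o/oauth2/auth",
    "token_uri": "https://oauth2.googleapis.com/token",
    "auth_provider_x509_cert_url": "https://www.googleapis.com/oauth2/v1/certs",
}


def missing_fields(key: dict) -> list[str]:
    """Return the required fields that are absent or empty (hypothetical helper)."""
    return [name for name in REQUIRED_FIELDS if not key.get(name)]


def with_default_endpoints(key: dict) -> dict:
    """Add the standard endpoints without overwriting values already present."""
    return {**DEFAULT_ENDPOINTS, **key}
```

Running a check like `missing_fields` before pasting values into the form can catch an incomplete key early, before the connection test.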

Before going further, press the Test Connection button to confirm connectivity. If the service is not already running and the test is what wakes it up, the first attempt may fail because the service does not start quickly enough. In that case, wait a couple of minutes and run the test again; it should succeed. Once the test passes, click Next.
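The same wait-and-retry habit can be scripted when connectivity checks are automated. The sketch below is a generic retry loop, not a Collate API: `check` stands in for whatever call performs the connection test, and the attempt count and two-minute wait are assumptions chosen to mirror the advice above.

```python
import time
from typing import Callable


def retry_connection_check(
    check: Callable[[], bool],
    attempts: int = 3,
    wait_seconds: float = 120.0,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Run `check` up to `attempts` times, pausing between failed tries."""
    for attempt in range(1, attempts + 1):
        if check():
            return True
        # Only wait if another attempt is still to come.
        if attempt < attempts:
            sleep(wait_seconds)
    return False
```

The `sleep` parameter is injectable so the loop can be exercised without actually waiting.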

Collate uses filters to control which databases, schemas, and tables are ingested, matching them by name or by regular expression (regex). Out of the box, it excludes system schemas, such as "information_schema" or "performance_schema", to focus on user data.

Accept the defaults for a full ingest, or customize them: for instance, include only schemas matching "^prod_.*" to target production data. Narrowing the filters keeps unneeded assets out of the catalog.
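To see how include and exclude patterns like these behave, here is a small Python sketch. It is only a model of the idea: the function name `filter_names` is hypothetical, and the matching rules (patterns anchored at the start of the name, excludes applied after includes) are assumptions rather than a description of Collate's internals.

```python
import re


def filter_names(names, includes=(), excludes=()):
    """Keep names matching any include pattern (all names if none are given),
    then drop names matching any exclude pattern."""
    kept = []
    for name in names:
        if includes and not any(re.match(p, name) for p in includes):
            continue
        if any(re.match(p, name) for p in excludes):
            continue
        kept.append(name)
    return kept


schemas = ["prod_sales", "prod_finance", "staging_sales", "information_schema"]

print(filter_names(schemas, includes=[r"^prod_.*"]))
# ['prod_sales', 'prod_finance']

print(filter_names(schemas, excludes=[r"^information_schema$"]))
# ['prod_sales', 'prod_finance', 'staging_sales']
```

Trying patterns against a list of real schema names in this way is a cheap check before running an ingestion.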

Once the initial connection is established, Collate automatically launches its Metadata Agent, which begins ingesting metadata immediately. The connector also supports incremental ingestion, so subsequent metadata updates are efficient and do not unnecessarily burden your BigQuery instance. This is particularly important for cost-conscious organizations monitoring their BigQuery usage.

Comprehensive Metadata Integration

The BigQuery connector provides comprehensive metadata coverage that extends beyond basic table and column discovery. The integration captures:

Complete metadata hierarchy: From projects down to individual columns
Query usage patterns: Understanding how data assets are actually being used
Column-level lineage: Tracking data flow at the most granular level
Auto-classification capabilities: Automatically categorizing sensitive data
Reverse metadata functionality: Pushing descriptions and tags back to BigQuery

One limitation is the absence of owner information, which Google's API doesn't currently provide. However, this doesn't significantly impact the overall data governance capabilities that Collate delivers.

Seamless Navigation and User Experience

The BigQuery database view in Collate includes a View in BigQuery option. This button opens GCP in a new browser window and takes you directly to the database, provided you have login permissions, which makes it easy to reach the data source.

Evolving with the Ecosystem

Collate's BigQuery connector is designed to evolve in tandem with Google's platform. As BigQuery introduces new features and API endpoints, Collate commits to incorporating them, so that your data governance capabilities keep pace with your technology stack.

Organizations investing in BigQuery can be confident that their governance and observability tools will continue to leverage the platform's latest capabilities.

Broader GCP Integration

The BigQuery connector is one part of Collate's Google Cloud integration. The platform also supports GCS for data lake scenarios and Looker for business intelligence governance. This coverage enables organizations to maintain consistent data governance across their entire GCP ecosystem.

For organizations considering or already implementing GCP-based data stacks, Collate provides the governance foundation necessary to maintain control, understanding, and compliance across your data landscape.

Conclusion

The BigQuery connector represents a mature, well-designed integration that grows with both your data needs and Google's platform evolution. Whether you're migrating from traditional data warehouses or building a new data stack from scratch, this integration provides the foundation for sustainable and scalable data governance within the Google Cloud ecosystem.

To explore further, consider the Collate Free Tier for managed OpenMetadata or the Product Sandbox with demo data.
