BigQuery is a powerful data warehousing tool and a popular choice among engineering teams looking to work with massive amounts of data.
But as is the case with any cloud-based service, BigQuery costs money. And without proper management, overspending can become a problem.
Among respondents in a 2023 Oomnitza survey, a staggering 53% spent at least 10% of their annual budget on underutilized, unmanaged, or unused cloud resources; 19% wasted at least 20% of their budget.
That's a significant amount of money gone to waste that could be put toward other business priorities. With the increasing adoption of cloud services like BigQuery, it's crucial to implement cost optimization strategies to ensure you are not overspending on cloud resources.
In this article, we will examine the best practices for optimizing BigQuery costs, including strategies for managing data storage and query usage.
Understanding BigQuery Pricing and Billing
On March 29, 2023, Google introduced new pricing editions for BigQuery: Standard, Enterprise, and Enterprise Plus. Alongside the editions, capacity can now be purchased in smaller, autoscaling increments, which makes pricing simpler and more predictable and lowers the barrier to getting started with the platform.
Note: This pricing takes effect on July 5, 2023.
Standard Edition:
Offers core BigQuery features at a competitive rate
Is ideal for small to medium-sized businesses with ad-hoc development and test workloads
Enterprise Edition:
Supports advanced security and compliance features
Is designed for large enterprises with sensitive data and complex regulatory requirements
Offers powerful machine learning and data management capabilities
Enterprise Plus Edition:
Includes all features of the previous editions
Is designed for mission-critical workloads with advanced support options
Is ideal for customers with high uptime, availability, and recovery requirements
Components of BigQuery Pricing
Let's use an example to better illustrate how GCP Cloud Logging pricing works. If you generate 500 GB of log data monthly from a web application, the first 50 GB falls under the free allotment, so the ingestion cost would be 450 GB × $0.50 per GB = $225 per month.
To retain the data for 90 days (beyond the default 30-day retention), you would be charged $0.01 per GB per month, adding 450 GB × $0.01 = $4.50 per month in retention costs. The total cost for GCP Cloud Logging in this scenario would be roughly $229.50 per month.
BigQuery comes with two major cost drivers that need to be factored in when using the platform: compute and storage.
Compute Pricing
Based on the volume of data handled by each query, BigQuery offers two primary pricing models for compute:
On-demand pricing: With this pricing model, you are charged for the number of bytes processed by each query, rounded up to the nearest megabyte, with a minimum of 10 MB of data processed per query. At the time of writing, the price is $5 per TB, and the first 1 TB of data processed per month is free. So, for example, a month in which your queries scan 5 TB would cost (5 − 1) × $5 = $20. This pricing strategy is appropriate for workloads with variable usage patterns or sporadic query volumes.
Capacity pricing: With this pricing model, you are charged based on the compute capacity used to run queries, measured in slots (virtual CPUs) over time. This model is suitable for workloads with predictable query volumes and uniform usage patterns.
Storage Pricing
BigQuery users are charged for two types of logical storage: active and long-term.
Active logical storage covers data in tables or partitions modified in the past 90 days. Data that has not been modified in the past 90 days is categorized as long-term logical storage, which costs half the price of active logical storage.
Please keep in mind that BigQuery storage costs are based on logical storage, rather than physical storage, which may require additional consideration when making cost estimations.
Plus, when you load data into BigQuery, you begin to incur charges based on the amount of uncompressed data stored in the tables.
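To see where those storage charges come from, you can query BigQuery's INFORMATION_SCHEMA.TABLE_STORAGE view, which breaks logical storage down per table. A sketch (the `region-us` qualifier is an assumption; adjust it to wherever your data lives):

```sql
-- Per-table breakdown of active vs. long-term logical storage,
-- largest tables first. Adjust the region qualifier as needed.
SELECT
  table_schema,
  table_name,
  ROUND(active_logical_bytes / POW(1024, 3), 2) AS active_gib,
  ROUND(long_term_logical_bytes / POW(1024, 3), 2) AS long_term_gib
FROM `region-us`.INFORMATION_SCHEMA.TABLE_STORAGE
ORDER BY active_logical_bytes DESC
LIMIT 20;
```

Tables with large active footprints that rarely change are good candidates for review, since they'll shift to cheaper long-term pricing once they go 90 days without modification.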
The following are the logical data sizes for common data types supported by BigQuery:
BOOL: 1 byte
INT64: 8 bytes
FLOAT64: 8 bytes
NUMERIC: 16 bytes
DATE, DATETIME, TIME, and TIMESTAMP: 8 bytes each
STRING: 2 bytes plus the UTF-8 encoded string size
BYTES: 2 bytes plus the number of bytes in the value
Reduce the Amount of Data Processed
Reducing the amount of data BigQuery has to process is one of the easiest ways to minimize costs and improve performance.
Let's say you have a large dataset of customer transactions and you want to work out the average transaction value per customer location.
You might start with a query like this:
SELECT customer_location, AVG(transaction_value)
FROM mydataset.transactions
GROUP BY customer_location
But if your dataset is chock full of these transactions, the query could turn out to be rather expensive. You could modify the query to only select the necessary columns and filter the data by date range:
SELECT customer_location, AVG(transaction_value)
FROM mydataset.transactions
WHERE transaction_date BETWEEN '2022-01-01' AND '2022-12-31'
GROUP BY customer_location
By specifying a date range, you're only processing a subset of your dataset, which can greatly reduce the amount of data being queried and lower your costs.
Avoid SQL Anti-Patterns
Anti-patterns are the opposite of best practices. In SQL, they can result in slow query performance, increased query costs, and a poor user experience. To optimize query performance and reduce costs, you should identify and avoid these anti-patterns.
Using SELECT * Queries
One of the most common anti-patterns involves using SELECT * queries. This can increase the amount of data scanned and negatively impact query performance. Ideally, you should select only the columns you want to analyze.
Make sure you check the amount of data to be processed (shown in the query editor) before running your queries.
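You can also audit what has already run. As a sketch, the INFORMATION_SCHEMA.JOBS_BY_PROJECT view exposes bytes billed per job; the `region-us` qualifier and the $5/TB on-demand rate used for the cost estimate are assumptions to adjust for your setup:

```sql
-- Last week's most expensive queries by bytes billed, with a rough
-- on-demand cost estimate at $5 per TB.
SELECT
  user_email,
  query,
  total_bytes_billed,
  ROUND(total_bytes_billed / POW(1024, 4) * 5, 2) AS approx_cost_usd
FROM `region-us`.INFORMATION_SCHEMA.JOBS_BY_PROJECT
WHERE job_type = 'QUERY'
  AND creation_time > TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 7 DAY)
ORDER BY total_bytes_billed DESC
LIMIT 10;
```

Queries that show up here repeatedly are the first place to look for SELECT * and other anti-patterns.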
Not Optimizing Joins
Joining tables can be a costly operation in BigQuery, particularly when dealing with large tables. An anti-pattern is joining tables without optimizing the query first.
For instance, joining a 1 billion row table and a 500 million row table without optimization can lead to scanning all 1.5 billion rows, even if only a subset of the data is needed.
To optimize the query, you could sort or cluster both tables on the join key, and use a smaller subset of data for the join operation. This can significantly reduce the amount of data scanned, improve query performance, and reduce costs.
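As a sketch of that approach, you could recreate both tables clustered on the join key and filter the inputs before joining; all table and column names below are hypothetical:

```sql
-- Recreate both large tables clustered on the join key so BigQuery
-- can prune blocks during the join. Names are illustrative.
CREATE TABLE mydataset.orders_clustered
CLUSTER BY customer_id AS
SELECT * FROM mydataset.orders;

CREATE TABLE mydataset.customers_clustered
CLUSTER BY customer_id AS
SELECT * FROM mydataset.customers;

-- Join on the clustering key, filtering first to shrink the inputs.
SELECT o.customer_id, SUM(o.order_value) AS total_value
FROM mydataset.orders_clustered AS o
JOIN mydataset.customers_clustered AS c
  ON o.customer_id = c.customer_id
WHERE o.order_date >= '2022-01-01'
GROUP BY o.customer_id;
```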
Misusing the LIMIT Clause
Whenever you use a LIMIT clause in a query, the entire table is still scanned, and the query is billed based on the total amount of data read, not just the number of rows returned by the LIMIT clause.
For example, if you run the query "SELECT * FROM my_table LIMIT 100", BigQuery will still scan the entire table "my_table", and you will be charged for reading all the bytes in that table, regardless of whether the query returns 100 rows or the entire table.
And so using LIMIT 1 has the same effect on cost as using LIMIT 100.
One way to avoid overspending here is to use the "maximum bytes billed" setting in the Query settings to limit the amount of data that can be read and billed for a single query.
Query Performance Optimization
Now that you know which anti-patterns to avoid, let's take a look at some ways in which you can analyze and optimize query performance.
Query Plan Analysis
Query Plan Analysis is a BigQuery feature that allows you to view and understand the execution plan of a query.
This is a detailed breakdown of the steps that BigQuery will take to execute a query, including the sequence of operations, the resources required, and the estimated costs.
To access the query plan details of a completed query in the Google Cloud console, you need to click on the Execution Details button, which is located near the Results button.
Cache Warming
To improve query performance, BigQuery caches the results of recent queries.
The problem is, the cache may be cold for the first query execution, resulting in longer query times.
Cache warming is a technique wherein you run a query beforehand to ensure that its results are cached before the actual query execution.
This technique is particularly useful for dashboard queries that require fast response times. For example, the following query warms up the cache for a given query:
SELECT COUNT(*) FROM mydataset.mytable WHERE date = '2022-01-01'
Data Sampling
Data sampling is another method of figuring out how long a query will take to run. Simply add the TABLESAMPLE clause to your SQL query and you can test-run it against a portion of your dataset:
SELECT * FROM mydataset.mytable TABLESAMPLE SYSTEM (10 PERCENT)
Data Processing Optimization
Data processing is where the rubber meets the road in BigQuery. Here are a few ways you can reduce spending without impacting performance.
Partitioning
Partitioning is all about dividing a large table into smaller, more manageable parts based on a column in the table called the partitioning column.
Partitioning can reduce query processing time and costs by allowing BigQuery to scan only the relevant partitions instead of the entire table.
For example, if you have a table that contains sales data for several years, you can partition the table by year. This way, when you run a query to get sales data for a specific year, BigQuery only needs to scan the partition for that year, instead of scanning the entire table.
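That setup can be sketched with a partitioned copy of the table; the table and column names are illustrative:

```sql
-- Partition the sales table by year of the sale date.
CREATE TABLE mydataset.sales_partitioned
PARTITION BY DATE_TRUNC(sale_date, YEAR) AS
SELECT * FROM mydataset.sales;

-- Queries that filter on the partitioning column only scan
-- the matching partitions, not the whole table.
SELECT SUM(sale_amount) AS total_2022_sales
FROM mydataset.sales_partitioned
WHERE sale_date BETWEEN '2022-01-01' AND '2022-12-31';
```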
Optimizing Data Ingestion
By optimizing data ingestion, you can improve BigQuery's data loading efficiency. Some techniques include:
Using streaming inserts: Streaming inserts allow you to insert data into BigQuery in real-time
Using batching: Batching means grouping multiple data inserts together and loading them into BigQuery at once
Using load jobs: Load jobs allow you to load data from Google Cloud Storage, Google Drive, or a local file system into BigQuery
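For example, a load job can be expressed directly in SQL with the LOAD DATA statement; the bucket path and table name below are placeholders:

```sql
-- Bulk-load CSV files from Cloud Storage in a single load job.
LOAD DATA INTO mydataset.mytable
FROM FILES (
  format = 'CSV',
  uris = ['gs://my-bucket/exports/*.csv'],
  skip_leading_rows = 1
);
```

Unlike streaming inserts, load jobs of this kind use the free shared load capacity, which makes batching the cheaper option when real-time freshness isn't required.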
Setting Table Expiration Times
Defining appropriate expiration times lets you set a limit on how long data should be kept in a table. This helps to remove unused data and reduces the amount of data that needs to be stored and processed.
For example, if you only need the past five years of data in a date-partitioned table, you can set a partition expiration of six years. Partitions older than six years are then deleted automatically, reducing the amount of data that needs to be processed or stored.
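Both flavors of expiration can be set with ALTER TABLE ... SET OPTIONS; the table names below are placeholders:

```sql
-- Expire partitions older than roughly six years on a
-- date-partitioned table.
ALTER TABLE mydataset.sales_partitioned
SET OPTIONS (partition_expiration_days = 2190);

-- Or expire an entire scratch table 30 days from now.
ALTER TABLE mydataset.temp_results
SET OPTIONS (
  expiration_timestamp = TIMESTAMP_ADD(CURRENT_TIMESTAMP(), INTERVAL 30 DAY)
);
```

Note the difference: partition expiration trims old data from a table that stays in use, while table expiration deletes the whole table at the given time.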
Cost-Effective Storage Options
For data you rarely query, it can be cheaper to export it out of BigQuery into Cloud Storage, which provides several storage options:
Multi-regional storage is best suited for frequently accessed data that requires high availability.
Regional storage is less expensive and provides data access within a single region.
Coldline storage is the least expensive and is ideal for long-term storage of infrequently accessed data, such as archives or backups.
Make sure you choose the right option for your data, as each one comes with a different level of durability, availability, and cost.
Use a FinOps Platform To Reduce Spending
FinOps platforms like Finout are cloud cost management tools that can help you optimize your cloud spending across multiple cloud providers, including BigQuery.
Some of the key features of a FinOps platform include:
Cost allocation: FinOps platforms provide detailed cost allocation reports that enable users to understand their cloud spend and identify areas where costs can be reduced.
Resource optimization: FinOps platforms use machine learning algorithms to identify unused or underutilized cloud resources and suggest optimizations to reduce costs. These optimizations may include resizing or shutting down instances, changing storage options, or using spot instances.
Budgeting and forecasting: FinOps platforms enable users to set budgets and forecast future cloud spend based on historical usage and trends. This helps users plan and allocate their cloud spending more effectively and avoid unexpected costs.
Automation: FinOps platforms provide automation capabilities to optimize cloud resources, track spending, and generate real-time alerts so you can act on areas of improvement.
Using a FinOps platform helps users optimize their BigQuery usage and reduce costs by providing valuable insights, recommendations, and automation to manage their cloud spending more efficiently.
We've looked at some of the easiest ways to start cutting your BigQuery spending. As you can see, many of these methods come down to reducing the amount of data stored or processed so that you don't incur needless costs.
There are other best practices you should research, and more importantly, experiment with so you can see what's working and what isn't.
And if you're looking for a tool that will do the heavy lifting for you, try Finout. Our FinOps-certified platform will give you a detailed breakdown as well as recommendations for reducing BigQuery spending.