BigQuery is a powerful data warehousing tool and a popular choice among engineering teams looking to work with massive amounts of data.
But as is the case with any cloud-based service, BigQuery costs money. And without proper management, overspending can become a problem.
Among respondents in a 2023 Oomnitza Survey, a staggering 53% spent at least 10% of their annual budget on underutilized, unmanaged, or unused cloud resources; 19% wasted at least 20% of their budget.
That's a significant amount of money going to waste that could be spent on other business priorities. With the increasing adoption of cloud services like BigQuery, it's crucial to implement cost optimization strategies to ensure you are not overspending on cloud resources.
In this article, we will examine the best practices for optimizing BigQuery costs, including strategies for managing data storage and query usage.
On March 29, 2023, Google introduced new pricing editions for BigQuery, including a "flex" option intended to make pricing simpler and more predictable. It lets you purchase BigQuery capacity in smaller increments, so you can start small while you get used to the platform.
Note: This pricing will be effective from July 5th, 2023.
Reference: https://cloud.google.com/blog/products/data-analytics/introducing-new-bigquery-pricing-editions
Standard Edition:
Enterprise Edition:
Enterprise Plus Edition:
Let's use an example to better illustrate how GCP Cloud Logging pricing works. Say a web application generates 500 GB of log data per month. After the 50 GB free allotment, the remaining 450 GB is billed at $0.50 per GB, so ingestion costs about $225 per month.
To retain the data for 90 days (60 days beyond the default 30-day retention period), you would be charged $0.01 per GB per month for the extended retention, which works out to roughly $9 for 450 GB of log data over those extra 60 days. That brings the total cost for GCP Cloud Logging in this scenario to roughly $234 per month.
BigQuery comes with two major cost drivers that need to be factored in when using the platform: compute (the queries you run) and storage (the data you keep).
For compute, BigQuery offers two primary pricing models: on-demand pricing, where you are billed for the volume of data scanned by each query, and capacity-based pricing, where you pay for slot capacity.
BigQuery users are charged for two types of logical storage: active and long-term.
Data in tables or partitions that have not been modified in the past 90 days is categorized as long-term logical storage, which costs half the price of active logical storage.
Please keep in mind that BigQuery storage costs are based on logical storage, rather than physical storage, which may require additional consideration when making cost estimations.
Plus, when you load data into BigQuery, you start incurring charges based on the uncompressed size of the data stored in your tables.
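If you want to see where those storage charges are coming from, the INFORMATION_SCHEMA.TABLE_STORAGE view breaks each table down into active and long-term bytes. Here's a minimal sketch, assuming your datasets live in the US multi-region (adjust the region qualifier otherwise):
```
-- List the largest tables and how much of each is already billed
-- at the cheaper long-term rate.
SELECT
  table_schema,
  table_name,
  ROUND(active_logical_bytes / POW(1024, 3), 2) AS active_gib,
  ROUND(long_term_logical_bytes / POW(1024, 3), 2) AS long_term_gib
FROM `region-us`.INFORMATION_SCHEMA.TABLE_STORAGE
ORDER BY active_logical_bytes DESC
LIMIT 20;
```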
The data size of each data type supported by BigQuery is documented in the reference below:
Reference: https://cloud.google.com/bigquery/docs/reference/standard-sql/data-types
Google has done a good job with its own recommendations for reducing BigQuery spending, but don't worry, we'll give you a few additional pointers.
Reducing the amount of data BigQuery has to process is one of the easiest ways to minimize costs and improve performance.
Let's say you have a large dataset with customer transactions and you want to work out the average transaction value per customer location.
You might start with a query like this:
```
SELECT customer_location, AVG(transaction_value)
FROM customer_transactions
GROUP BY customer_location
```
But if your dataset is chock full of these transactions, the query could turn out to be rather expensive. You could modify the query to only select the necessary columns and filter the data by date range:
```
SELECT customer_location, AVG(transaction_value)
FROM customer_transactions
WHERE transaction_date BETWEEN '2022-01-01' AND '2022-12-31'
GROUP BY customer_location
```
By specifying a date range, you're only processing a subset of your dataset, which can greatly reduce the amount of data being queried and lower your costs.
One of the most common anti-patterns is using SELECT * queries. This increases the amount of data scanned and can negatively impact query performance. Ideally, you should select only the columns you actually need to analyze.
Make sure you check the estimated amount of data a query will process (the query validator in the console shows this before you run it).
Joining tables can be a costly operation in BigQuery, particularly when dealing with large tables. An anti-pattern is joining tables without optimizing the query first.
For instance, joining a 1 billion row table and a 500 million row table without optimization can lead to scanning all 1.5 billion rows, even if only a subset of the data is needed.
To optimize the query, you could cluster both tables on the join key and restrict the join to the smaller subset of data you actually need, as in the sketch below. This can significantly reduce the amount of data scanned, improve query performance, and reduce costs.
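Here's a rough sketch of what that could look like, assuming hypothetical transactions and customers tables joined on a customer_id column, with a DATE column called transaction_date:
```
-- Cluster both tables on the join key so BigQuery can prune blocks during the join.
CREATE TABLE mydataset.transactions_clustered
PARTITION BY transaction_date
CLUSTER BY customer_id AS
SELECT * FROM mydataset.transactions;

CREATE TABLE mydataset.customers_clustered
CLUSTER BY customer_id AS
SELECT * FROM mydataset.customers;

-- Join only the subset you actually need instead of both full tables.
SELECT c.customer_id, SUM(t.transaction_value) AS total_value
FROM mydataset.transactions_clustered AS t
JOIN mydataset.customers_clustered AS c
  ON t.customer_id = c.customer_id
WHERE t.transaction_date BETWEEN '2022-01-01' AND '2022-01-31'
GROUP BY c.customer_id;
```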
Whenever you use a LIMIT clause in a query, the entire table is still scanned, and the query is billed based on the total amount of data read, not just the number of rows returned by the LIMIT clause.
For example, if you run the query "SELECT * FROM my_table LIMIT 100", BigQuery will still scan the entire table "my_table", and you will be charged for reading all the bytes in that table, regardless of whether the query returns 100 rows or the entire table.
And so using LIMIT 1 has the same effect on cost as using LIMIT 100.
One way to avoid overspending here is to use the "maximum bytes billed" setting in the Query settings to limit the amount of data that can be read and billed for a single query; if the query would exceed that limit, it fails without incurring a charge.
Now that you know which anti-patterns to avoid, let's take a look at some ways in which you can analyze and optimize query performance.
Query Plan Analysis is a BigQuery feature that allows you to view and understand the execution plan of a query.
This is a detailed breakdown of the steps BigQuery takes to execute a query, including the sequence of stages, the work performed in each, and the resources consumed.
To access the query plan details of a completed query in the Google Cloud console, click the Execution details tab, located next to the Results tab.
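To spot which queries are worth optimizing in the first place, you can also query BigQuery's job metadata in bulk. Here's a small sketch, assuming your jobs run in the US multi-region, that lists last week's heaviest queries by bytes processed:
```
-- Find the most expensive queries from the past 7 days.
SELECT
  user_email,
  query,
  total_bytes_processed,
  total_slot_ms
FROM `region-us`.INFORMATION_SCHEMA.JOBS_BY_PROJECT
WHERE job_type = 'QUERY'
  AND creation_time > TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 7 DAY)
ORDER BY total_bytes_processed DESC
LIMIT 20;
```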
To improve query performance, BigQuery caches query results.
The problem is that the cache is cold the first time a query runs, resulting in longer query times.
Cache warming is a technique wherein you run the query ahead of time to ensure its results are already cached before the actual query execution.
This technique is particularly useful for dashboard queries that require fast response times. For example, running the following query ahead of time means its result is already cached when the dashboard asks for it:
```
SELECT COUNT(*) FROM mydataset.mytable WHERE date = '2022-01-01'
```
Data sampling is another method of estimating how long a query will take to run. Simply add the TABLESAMPLE clause to your SQL query and you can test-run it against a portion of your dataset.
```
SELECT * FROM mydataset.mytable TABLESAMPLE SYSTEM (10 PERCENT)
```
Data processing is where the rubber meets the road in BigQuery. Here are a few ways you can reduce spending without impacting performance.
Partitioning is all about dividing a large table into smaller, more manageable parts based on a column in the table called the partitioning column.
Partitioning can reduce query processing time and costs by allowing BigQuery to scan only the relevant partitions instead of the entire table.
For example, if you have a table that contains sales data for several years, you can partition the table by year. This way, when you run a query to get sales data for a specific year, BigQuery only needs to scan the partition for that year, instead of scanning the entire table.
Reference: https://cloud.google.com/bigquery/docs/partitioned-tables
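Here's a minimal sketch of that approach, assuming a hypothetical sales table with a DATE column called sale_date:
```
-- Create a copy of the table partitioned by year.
CREATE TABLE mydataset.sales_partitioned
PARTITION BY DATE_TRUNC(sale_date, YEAR) AS
SELECT * FROM mydataset.sales;

-- Filtering on the partitioning column means only the matching partitions are scanned.
SELECT SUM(sale_amount) AS total_sales
FROM mydataset.sales_partitioned
WHERE sale_date BETWEEN '2022-01-01' AND '2022-12-31';
```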
By optimizing data ingestion, you can improve BigQuery's data loading efficiency. Some techniques include:
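One common technique is to prefer batch load jobs over streaming inserts where latency allows: batch loads from Cloud Storage don't incur ingestion charges by default, while streaming inserts are billed by volume. A hedged sketch using the LOAD DATA statement, with a placeholder table name and bucket path:
```
-- Batch-load Parquet files from Cloud Storage (hypothetical bucket and table).
-- Load jobs use the free shared slot pool by default, unlike streaming inserts.
LOAD DATA INTO mydataset.raw_events
FROM FILES (
  format = 'PARQUET',
  uris = ['gs://my-bucket/events/*.parquet']
);
```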
Defining appropriate table expiration times lets you set a time limit for how long data should be kept in a table. This helps to remove unused data and reduces the amount of data that needs to be processed.
For example, if you only need the past 5 years of data in a date-partitioned table, you can set a partition expiration of 6 years. This way, partitions older than 6 years are automatically deleted, reducing the amount of data that needs to be processed or stored.
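As a quick sketch, assuming the hypothetical partitioned sales table from the earlier example (2,190 days is roughly 6 years):
```
-- Automatically delete partitions older than ~6 years.
ALTER TABLE mydataset.sales_partitioned
SET OPTIONS (partition_expiration_days = 2190);

-- Or give an entire table an absolute expiration timestamp.
ALTER TABLE mydataset.staging_extract
SET OPTIONS (expiration_timestamp = TIMESTAMP '2029-01-01 00:00:00 UTC');
```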
BigQuery provides three storage tiers:
Make sure you choose the right tier for your data storage as each tier comes with a different level of durability, availability, and cost.
FinOps platforms like Finout are cloud cost management tools that can help you optimize your cloud spending across multiple cloud providers, including BigQuery.
Some of the key features of a FinOps platform include:
Using a FinOps platform helps users optimize their BigQuery usage and reduce costs by providing valuable insights, recommendations, and automation to manage their cloud spending more efficiently.
We've looked at some of the easiest ways to start cutting BigQuery spending. As you can see, many of these methods come down to reducing the amount of data stored or processed so that you don't incur needless costs.
There are other best practices you should research, and more importantly, experiment with so you can see what's working and what isn't.
And if you're looking for a tool that will do the heavy lifting for you, try Finout. Our FinOps-certified platform will give you a detailed breakdown as well as recommendations for reducing BigQuery spending.