Top Google BigQuery Interview Questions (2024)
What is Google BigQuery?
Explain the Google BigQuery architecture?
How to batch load data into GCP BigQuery from Cloud Storage or from a local file?
How to create a time-unit column-partitioned table in GCP BigQuery?
What are Window Functions and why use them in GCP BigQuery?
How to use Window Functions, with some examples?
How to get top 3 rank results for each group using Window Function in GCP BigQuery?
In GCP BigQuery, can you generate an extra column in the SQL query that adds a serial-number suffix to distinguish duplicate ids?
How to troubleshoot common SQL errors with BigQuery?
What query would you use to find each user between two dates in GCP BigQuery?
Q: What is Google BigQuery?
Ans:
Google BigQuery is a cloud-based big data analytics web service for processing very large data sets. It is a fully managed, serverless data warehouse that scales to petabytes of data, offered as a Platform as a Service (PaaS) with ANSI SQL querying and built-in machine learning capabilities.
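For example, because BigQuery exposes standard SQL, anyone can query one of Google's public datasets directly. A quick illustration against the publicly readable usa_names dataset (assuming access to the bigquery-public-data project):
SELECT
  name,
  SUM(number) AS total
FROM
  `bigquery-public-data.usa_names.usa_1910_2013`
GROUP BY
  name
ORDER BY
  total DESC
LIMIT 10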
Q: Explain the Google BigQuery architecture?
Ans:
BigQuery makes use of a large number of multi-tenant services that are driven through low-level Google infrastructure technologies like Dremel, Colossus, Jupiter, and Borg.
Dremel
- Compute is Dremel, a large multi-tenant cluster that runs SQL queries.
- Dremel creates execution trees from SQL queries. The tree's leaves, known as slots, handle the heavy work of reading data from storage and performing any required computations. The aggregation is performed by the tree's branches, which are called "mixers."
Colossus
- Storage is Colossus, Google's global distributed file system.
- To store data in Colossus, BigQuery uses Capacitor, a compressed columnar storage format that is well suited to reading large amounts of structured data.
- Colossus also takes care of replication, recovery (in the event of disk failure), and distributed management (so there is no single point of failure). Colossus lets BigQuery users scale seamlessly to dozens of petabytes of stored data.
Jupiter
- The petabit Jupiter network allows Compute and Storage to communicate with one another.
- Between storage and compute sits the "shuffle" tier, which uses Google's Jupiter network to move data very quickly from one place to another.
Borg
- BigQuery is orchestrated via Borg, Google's precursor to Kubernetes.
- For data processing, BigQuery relies on Borg, which simultaneously runs thousands of Dremel jobs across clusters made up of tens of thousands of machines. Borg handles fault tolerance as well as the assignment of compute capacity for Dremel jobs.
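The slots described above are observable: slot consumption per job can be inspected through INFORMATION_SCHEMA. A minimal sketch, assuming your jobs run in the us region:
SELECT
  job_id,
  total_slot_ms,
  total_bytes_processed
FROM
  `region-us`.INFORMATION_SCHEMA.JOBS_BY_PROJECT
WHERE
  creation_time > TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 1 DAY)
ORDER BY
  total_slot_ms DESC
LIMIT 10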
Q: How to batch load data into GCP BigQuery from Cloud Storage or from a local file?
Ans:
As a batch operation, you can load data into BigQuery from Cloud Storage or a local file.
Loading data from Cloud Storage
The source data can be in any of the following formats:
- Avro
- Comma-separated values (CSV)
- JSON (newline-delimited)
- ORC
- Parquet
- Firestore exports stored in Cloud Storage
Go to Create table -> select Cloud Storage to load data from Cloud Storage.
Loading data from local files
Go to Create table -> select Upload to load data from local files.
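Cloud Storage loads can also be expressed in SQL. A minimal sketch using BigQuery's LOAD DATA statement; the dataset, table, and bucket names here are hypothetical:
LOAD DATA INTO mydataset.mytable
FROM FILES (
  format = 'CSV',
  uris = ['gs://mybucket/path/file.csv']
);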
Q: How to create a time-unit column-partitioned table in GCP BigQuery?
Ans:
To create a table, we should have at least the following permissions:
- bigquery.tables.create to create the table
- bigquery.tables.updateData to write data to the table using a query job, a load job, or a copy job
- bigquery.jobs.create to run a query job, load job, or copy job that writes data to the table
import com.google.cloud.bigquery.BigQuery;
import com.google.cloud.bigquery.BigQueryException;
import com.google.cloud.bigquery.BigQueryOptions;
import com.google.cloud.bigquery.Field;
import com.google.cloud.bigquery.Schema;
import com.google.cloud.bigquery.StandardSQLTypeName;
import com.google.cloud.bigquery.StandardTableDefinition;
import com.google.cloud.bigquery.TableId;
import com.google.cloud.bigquery.TableInfo;
import com.google.cloud.bigquery.TimePartitioning;

public class CreatePartitionedTable {

  public static void main(String[] args) {
    String datasetName = "TECHGEEKNEXT_DATASET";
    String tableName = "USER";
    Schema schema =
        Schema.of(
            Field.of("name", StandardSQLTypeName.STRING),
            Field.of("address", StandardSQLTypeName.STRING),
            Field.of("date", StandardSQLTypeName.DATE));
    createPartitionedTable(datasetName, tableName, schema);
  }

  public static void createPartitionedTable(String datasetName, String tableName, Schema schema) {
    try {
      BigQuery bigquery = BigQueryOptions.getDefaultInstance().getService();
      TableId tableId = TableId.of(datasetName, tableName);

      TimePartitioning partitioning =
          TimePartitioning.newBuilder(TimePartitioning.Type.DAY)
              // name of the column to use for partitioning
              .setField("date")
              // partitions expire after 90 days
              .setExpirationMs(7776000000L)
              .build();

      StandardTableDefinition tableDefinition =
          StandardTableDefinition.newBuilder()
              .setSchema(schema)
              .setTimePartitioning(partitioning)
              .build();

      TableInfo tableInfo = TableInfo.newBuilder(tableId, tableDefinition).build();
      bigquery.create(tableInfo);
      System.out.println("TechGeekNext - Partitioned table created successfully");
    } catch (BigQueryException e) {
      System.out.println("TechGeekNext - Partitioned table not created. \n" + e.toString());
    }
  }
}
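The same kind of time-unit column-partitioned table can also be created with a DDL statement. A minimal SQL sketch, assuming the dataset, table, and column names from the Java example above:
CREATE TABLE `TECHGEEKNEXT_DATASET.USER` (
  name STRING,
  address STRING,
  date DATE
)
PARTITION BY date
OPTIONS (partition_expiration_days = 90);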
Q: What are Window Functions and why use them in GCP BigQuery?
Ans:
Window functions compute values across several rows. Unlike aggregate functions, which return a single value for a group of rows, window functions return a single value for each row.
Window functions are quite useful in a variety of situations. Here are a few examples:
- Running/Cumulative Total
- Moving Average
- Rank rows by custom criteria and groupings
- Finding the year-over-year % Change
window_function_name ([arguments]) OVER (
[PARTITION BY partition_expression]
[ORDER BY expression ASC | DESC]
[ROWS frame_clause])
PARTITION BY specifies how to group rows into partitions, ORDER BY specifies how to order the rows within each partition, and ROWS specifies which rows to consider within those sorted partitions.
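For example, a running total can be computed with SUM as a window function. A minimal sketch, assuming a hypothetical orders table with user_id, order_date, and amount columns:
SELECT
  user_id,
  order_date,
  amount,
  SUM(amount) OVER (
    PARTITION BY user_id
    ORDER BY order_date
    ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW
  ) AS running_total
FROM
  orders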
Q: How to get top 3 rank results for each group using Window Function in GCP BigQuery?
Ans:
In this example, RANK ranks every player by points within each season; a subquery then returns the top three ranked players for each season.
SELECT
  *
FROM
  (
    SELECT
      season,
      RANK() OVER (PARTITION BY season ORDER BY points DESC) AS points_rank,
      player,
      points
    FROM
      top_scorers
  ) AS table_1
WHERE
  points_rank <= 3
Output
season | points_rank | player       | points
-------|-------------|--------------|-------
2019   | 1           | Wally Hong   | 2828
2019   | 2           | Joe Walker   | 2100
2019   | 3           | Kevin Dave   | 2055
2020   | 1           | Tim Jang     | 2535
2020   | 2           | Monit Lin    | 1934
2020   | 3           | Smith Bardon | 1833
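The same result can also be obtained with BigQuery's QUALIFY clause, which filters on window function results without a subquery. BigQuery requires a WHERE, GROUP BY, or HAVING clause alongside QUALIFY, hence the WHERE TRUE in this sketch:
SELECT
  season,
  RANK() OVER (PARTITION BY season ORDER BY points DESC) AS points_rank,
  player,
  points
FROM
  top_scorers
WHERE TRUE
QUALIFY points_rank <= 3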
Q: In GCP BigQuery, can you generate an extra column in the SQL query that adds a serial-number suffix to distinguish duplicate ids?
Ans:
In BigQuery Standard SQL (the values are cast to STRING, since the || operator does not accept INT64 operands):
SELECT *,
  CAST(id AS STRING) || '-' || CAST(ROW_NUMBER() OVER (PARTITION BY id) AS STRING) AS extra_column
FROM TechGeekNextTable
It generates output like the following:
Row id extra_column
---------------------------
1 10000 10000-1
2 10000 10000-2
3 10001 10001-1
4 10001 10001-2
5 10002 10002-1
6 10003 10003-1
7 10003 10003-2
8 10003 10003-3
Q: How to troubleshoot common SQL errors with BigQuery?
Ans:
Use the query validator to check your query syntax as you type. If you run a query that contains errors, the query fails and the error is recorded in the job details.
When the query is error-free, a green checkmark appears in the query validator; you can then click Run to execute the query and see the results.
Q: What query would you use to find each user between two dates in GCP BigQuery?
Ans:
SELECT
  TIMESTAMP_TRUNC(timestamp, DAY) AS Day,
  user_id,
  COUNT(1) AS Number
FROM
  `table`
WHERE
  timestamp >= '2024-08-22 00:00:00 UTC'
  AND timestamp <= '2024-08-25 23:59:59 UTC'
GROUP BY
  1, 2
ORDER BY
  Day
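One caveat: an inclusive upper bound of 23:59:59 can miss events with sub-second timestamps in the final second of the range. A half-open interval is a common safer variant:
WHERE
  timestamp >= TIMESTAMP '2024-08-22 00:00:00 UTC'
  AND timestamp < TIMESTAMP '2024-08-26 00:00:00 UTC'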