Google Professional Data Engineer Real Exam Questions
The questions for Professional Data Engineer were last updated on Dec 31, 2024.
- Exam Code: Professional Data Engineer
- Exam Name: Google Certified Professional – Data Engineer
- Certification Provider: Google
- Latest update: Dec 31, 2024
You are deploying 10,000 new Internet of Things devices to collect temperature data in your warehouses globally. You need to process, store and analyze these very large datasets in real time.
What should you do?
- A . Send the data to Google Cloud Datastore and then export to BigQuery.
- B . Send the data to Google Cloud Pub/Sub, stream Cloud Pub/Sub to Google Cloud Dataflow, and store the data in Google BigQuery.
- C . Send the data to Cloud Storage and then spin up an Apache Hadoop cluster as needed in Google Cloud Dataproc whenever analysis is required.
- D . Export logs in batch to Google Cloud Storage and then spin up a Google Cloud SQL instance, import the data from Cloud Storage, and run an analysis as needed.
You are building a model to make clothing recommendations. You know a user’s fashion preference is likely to change over time, so you build a data pipeline to stream new data back to the model as it becomes available.
How should you use this data to train the model?
- A . Continuously retrain the model on just the new data.
- B . Continuously retrain the model on a combination of existing data and the new data.
- C . Train on the existing data while using the new data as your test set.
- D . Train on the new data while using the existing data as your test set.
Which of the following IAM roles does your Compute Engine account require to be able to run pipeline jobs?
- A . dataflow.worker
- B . dataflow.compute
- C . dataflow.developer
- D . dataflow.viewer
You work for a large fast food restaurant chain with over 400,000 employees. You store employee information in Google BigQuery in a Users table consisting of a FirstName field and a LastName field. A member of IT is building an application and asks you to modify the schema and data in BigQuery so the application can query a FullName field consisting of the value of the FirstName field concatenated with a space, followed by the value of the LastName field for each employee.
How can you make that data available while minimizing cost?
- A . Create a view in BigQuery that concatenates the FirstName and LastName field values to produce the FullName.
- B . Add a new column called FullName to the Users table. Run an UPDATE statement that updates the FullName column for each user with the concatenation of the FirstName and LastName values.
- C . Create a Google Cloud Dataflow job that queries BigQuery for the entire Users table, concatenates the FirstName value and LastName value for each user, and loads the proper values for FirstName, LastName, and FullName into a new table in BigQuery.
- D . Use BigQuery to export the data for the table to a CSV file. Create a Google Cloud Dataproc job to process the CSV file and output a new CSV file containing the proper values for FirstName, LastName and FullName. Run a BigQuery load job to load the new CSV file into BigQuery.
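As a rough illustration of what a view like the one in option A does, the sketch below builds a FullName column at query time without storing a second copy of the data. It is only a stand-in: it uses Python's built-in sqlite3 module rather than BigQuery, the view name UsersWithFullName and the two sample rows are made up for the example, and SQLite's `||` operator takes the place of a BigQuery concatenation expression.

```python
import sqlite3

# sqlite3 is a stand-in for BigQuery; table and column names follow the question.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Users (FirstName TEXT, LastName TEXT)")

# Illustrative sample rows (not from the question).
conn.executemany(
    "INSERT INTO Users VALUES (?, ?)",
    [("Ada", "Lovelace"), ("Alan", "Turing")],
)

# The view stores only the query definition, not the concatenated values.
# UsersWithFullName is a hypothetical name chosen for this sketch.
conn.execute(
    "CREATE VIEW UsersWithFullName AS "
    "SELECT FirstName, LastName, "
    "FirstName || ' ' || LastName AS FullName "
    "FROM Users"
)

# The application queries FullName as if it were a regular column.
full_names = [
    row[0]
    for row in conn.execute(
        "SELECT FullName FROM UsersWithFullName ORDER BY LastName"
    )
]
print(full_names)
```

The underlying Users table is never rewritten here, which is the property the question's cost constraint is probing.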
Which methods can be used to reduce the number of rows processed by BigQuery?
- A . Splitting tables into multiple tables; putting data in partitions
- B . Splitting tables into multiple tables; putting data in partitions; using the LIMIT clause
- C . Putting data in partitions; using the LIMIT clause
- D . Splitting tables into multiple tables; using the LIMIT clause
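To make the idea of "rows processed" concrete, here is a toy model of how putting data in partitions can reduce the rows a query has to read. It is a simplified sketch in plain Python, not BigQuery's actual storage engine; the PartitionedTable class, its method names, and the sample rows are all hypothetical.

```python
from collections import defaultdict


class PartitionedTable:
    """Toy table that keeps rows grouped by a partition key (a day string)."""

    def __init__(self):
        self.partitions = defaultdict(list)

    def insert(self, day, row):
        self.partitions[day].append(row)

    def query_day_pruned(self, day):
        # Only the matching partition is read.
        rows = self.partitions.get(day, [])
        return list(rows), len(rows)

    def query_day_full_scan(self, day):
        # Every row in every partition is read, then filtered.
        scanned = 0
        matches = []
        for part_day, rows in self.partitions.items():
            for row in rows:
                scanned += 1
                if part_day == day:
                    matches.append(row)
        return matches, scanned


table = PartitionedTable()
for day, reading in [
    ("2024-01-01", 20.5), ("2024-01-01", 21.0),
    ("2024-01-02", 19.0), ("2024-01-02", 22.0), ("2024-01-02", 21.5),
    ("2024-01-03", 18.5),
]:
    table.insert(day, reading)

pruned_rows, pruned_scanned = table.query_day_pruned("2024-01-02")
full_rows, full_scanned = table.query_day_full_scan("2024-01-02")
print(pruned_scanned, full_scanned)
```

Both queries return the same three rows, but the pruned query touches only the rows in one partition while the full scan touches all six.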
You work for an economic consulting firm that helps companies identify economic trends as they happen. As part of your analysis, you use Google BigQuery to correlate customer data with the average prices of the 100 most common goods sold, including bread, gasoline, milk, and others. The average prices of these goods are updated every 30 minutes. You want to make sure this data stays up to date so you can combine it with other data in BigQuery as cheaply as possible.
What should you do?
- A . Load the data every 30 minutes into a new partitioned table in BigQuery.
- B . Store and update the data in a regional Google Cloud Storage bucket and create a federated data source in BigQuery.
- C . Store the data in Google Cloud Datastore. Use Google Cloud Dataflow to query BigQuery and combine the data programmatically with the data stored in Cloud Datastore.
- D . Store the data in a file in a regional Google Cloud Storage bucket. Use Cloud Dataflow to query BigQuery and combine the data programmatically with the data stored in Google Cloud Storage.
What Dataflow concept determines when a Window’s contents should be output based on certain criteria being met?
- A . Sessions
- B . OutputCriteria
- C . Windows
- D . Triggers
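The notion of emitting a window's contents once some criterion is met can be sketched with a small simulation. This is a toy model in plain Python and does not use the Dataflow SDK; the function run_with_count_trigger, its parameters, and the sample events are assumptions made for illustration. Events are assigned to fixed windows by timestamp, and a pane is emitted whenever a window has buffered a given number of elements.

```python
from collections import defaultdict


def run_with_count_trigger(events, window_size, fire_every):
    """Assign (timestamp, value) events to fixed windows and emit a pane
    each time a window has buffered `fire_every` elements."""
    buffers = defaultdict(list)
    panes = []
    for timestamp, value in events:
        # Fixed windowing: each event belongs to exactly one window.
        window_start = (timestamp // window_size) * window_size
        buffers[window_start].append(value)
        # Emission criterion: fire once enough elements have arrived.
        if len(buffers[window_start]) == fire_every:
            panes.append((window_start, buffers[window_start]))
            buffers[window_start] = []  # discard elements already emitted
    return panes


# Hypothetical (timestamp, temperature) events, arriving out of order.
events = [(1, 20.5), (3, 21.0), (12, 19.0), (7, 22.0), (8, 21.5), (15, 18.5)]
panes = run_with_count_trigger(events, window_size=10, fire_every=2)
for window_start, values in panes:
    print(window_start, values)
```

In this sketch the windowing step decides where each element belongs, while the count check decides when results are output; the same window can emit more than one pane.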
You are developing a software application using Google’s Dataflow SDK, and want to use conditionals, for loops, and other complex programming structures to create a branching pipeline.
Which component will be used for the data processing operation?
- A . PCollection
- B . Transform
- C . Pipeline
- D . Sink API
Given the record streams MJTelco is interested in ingesting per day, they are concerned about the cost of Google BigQuery increasing. MJTelco asks you to provide a design solution. They require a single large data table called tracking_table. Additionally, they want to minimize the cost of daily queries while performing fine-grained analysis of each day’s events. They also want to use streaming ingestion.
What should you do?
- A . Create a table called tracking_table and include a DATE column.
- B . Create a partitioned table called tracking_table and include a TIMESTAMP column.
- C . Create sharded tables for each day following the pattern tracking_table_YYYYMMDD.
- D . Create a table called tracking_table with a TIMESTAMP column to represent the day.
Flowlogistic’s management has determined that the current Apache Kafka servers cannot handle the data volume for their real-time inventory tracking system. You need to build a new system on Google Cloud Platform (GCP) that will feed the proprietary tracking software. The system must be able to ingest data from a variety of global sources, process and query in real-time, and store the data reliably.
Which combination of GCP products should you choose?
- A . Cloud Pub/Sub, Cloud Dataflow, and Cloud Storage
- B . Cloud Pub/Sub, Cloud Dataflow, and Local SSD
- C . Cloud Pub/Sub, Cloud SQL, and Cloud Storage
- D . Cloud Load Balancing, Cloud Dataflow, and Cloud Storage