Google Professional Data Engineer Real Exam Questions
The questions for Professional Data Engineer were last updated on Nov 22, 2024.
- Exam Code: Professional Data Engineer
- Exam Name: Google Certified Professional – Data Engineer
- Certification Provider: Google
- Latest update: Nov 22, 2024
How would you query specific partitions in a BigQuery table?
- A . Use the DAY column in the WHERE clause
- B . Use the EXTRACT(DAY) clause
- C . Use the _PARTITIONTIME pseudo-column in the WHERE clause
- D . Use DATE BETWEEN in the WHERE clause
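For reference, a minimal sketch of filtering an ingestion-time partitioned table on its partition pseudo-column with the BigQuery Java client; the project, dataset, table, and date range below are placeholders, not values from the question.

```java
import com.google.cloud.bigquery.BigQuery;
import com.google.cloud.bigquery.BigQueryOptions;
import com.google.cloud.bigquery.QueryJobConfiguration;
import com.google.cloud.bigquery.TableResult;

public class QueryPartitions {
  public static void main(String[] args) throws InterruptedException {
    BigQuery bigquery = BigQueryOptions.getDefaultInstance().getService();

    // Restrict the scan to two daily partitions via the _PARTITIONTIME pseudo-column.
    String query =
        "SELECT * FROM `my-project.my_dataset.my_table` "
            + "WHERE _PARTITIONTIME BETWEEN TIMESTAMP('2024-11-01') AND TIMESTAMP('2024-11-02')";

    QueryJobConfiguration config =
        QueryJobConfiguration.newBuilder(query).setUseLegacySql(false).build();
    TableResult result = bigquery.query(config);
    result.iterateAll().forEach(row -> System.out.println(row));
  }
}
```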
Your company is in a highly regulated industry. One of your requirements is to ensure individual users have access only to the minimum amount of information required to do their jobs. You want to enforce this requirement with Google BigQuery.
Which three approaches can you take? (Choose three.)
- A . Disable writes to certain tables.
- B . Restrict access to tables by role.
- C . Ensure that the data is encrypted at all times.
- D . Restrict BigQuery API access to approved users.
- E . Segregate data across multiple tables or databases.
- F . Use Google Stackdriver Audit Logging to determine policy violations.
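As an illustration of restricting access by role, here is a hedged sketch that grants one user read-only access at the dataset level with the BigQuery Java client; the dataset name and email address are hypothetical.

```java
import com.google.cloud.bigquery.Acl;
import com.google.cloud.bigquery.BigQuery;
import com.google.cloud.bigquery.BigQueryOptions;
import com.google.cloud.bigquery.Dataset;

import java.util.ArrayList;
import java.util.List;

public class RestrictDatasetAccess {
  public static void main(String[] args) {
    BigQuery bigquery = BigQueryOptions.getDefaultInstance().getService();
    Dataset dataset = bigquery.getDataset("reporting_dataset"); // hypothetical dataset

    // Copy the current ACL and add a read-only entry for a single analyst.
    List<Acl> acl = new ArrayList<>(dataset.getAcl());
    acl.add(Acl.of(new Acl.User("analyst@example.com"), Acl.Role.READER));

    // The analyst can now query tables in this dataset but cannot modify them.
    dataset.toBuilder().setAcl(acl).build().update();
  }
}
```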
Your company is performing data preprocessing for a learning algorithm in Google Cloud Dataflow. Numerous data logs are being generated during this step, and the team wants to analyze them. Due to the dynamic nature of the campaign, the data is growing exponentially every hour.
The data scientists have written the following code to read the data for the new key features in the logs.
```java
BigQueryIO.Read
    .named("ReadLogData")
    .from("clouddataflow-readonly:samples.log_data")
```
You want to improve the performance of this data read.
What should you do?
- A . Specify the TableReference object in the code.
- B . Use .fromQuery operation to read specific fields from the table.
- C . Use both the Google BigQuery TableSchema and TableFieldSchema classes.
- D . Call a transform that returns TableRow objects, where each element in the PCollection represents a single row in the table.
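A hedged sketch of reading only specific fields with .fromQuery, written in the same pre-Beam Dataflow SDK style as the snippet above; the selected column names are assumptions, not the actual schema of samples.log_data.

```java
import com.google.api.services.bigquery.model.TableRow;
import com.google.cloud.dataflow.sdk.Pipeline;
import com.google.cloud.dataflow.sdk.io.BigQueryIO;
import com.google.cloud.dataflow.sdk.options.PipelineOptionsFactory;
import com.google.cloud.dataflow.sdk.values.PCollection;

public class ReadLogFeatures {
  public static void main(String[] args) {
    Pipeline p = Pipeline.create(PipelineOptionsFactory.fromArgs(args).create());

    // Read only the columns the feature-extraction step needs instead of the full table.
    // The column names below are placeholders.
    PCollection<TableRow> features = p.apply(
        BigQueryIO.Read
            .named("ReadLogData")
            .fromQuery("SELECT user_id, event_time, feature_value "
                + "FROM [clouddataflow-readonly:samples.log_data]"));

    p.run();
  }
}
```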
Your company is running their first dynamic campaign, serving different offers by analyzing real-time data during the holiday season. The data scientists are collecting terabytes of data that rapidly grows every hour during their 30-day campaign. They are using Google Cloud Dataflow to preprocess the data and collect the feature (signals) data that is needed for the machine learning model in Google Cloud Bigtable. The team is observing suboptimal performance with reads and writes of their initial load of 10 TB of data. They want to improve this performance while minimizing cost.
What should they do?
- A . Redefine the schema by evenly distributing reads and writes across the row space of the table.
- B . The performance issue should be resolved over time as the size of the Bigtable cluster is increased.
- C . Redesign the schema to use a single row key to identify values that need to be updated frequently in the cluster.
- D . Redesign the schema to use row keys based on numeric IDs that increase sequentially per user viewing the offers.
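To make the row-key idea concrete, a minimal sketch of a key that spreads sequential per-user traffic across the Bigtable key space; the salting scheme and field layout are assumptions chosen only for illustration.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

public class RowKeyDesign {
  // Build a row key that spreads writes across the key space instead of
  // concentrating them on sequentially increasing IDs.
  static String buildRowKey(String userId, long eventTimestampMillis)
      throws NoSuchAlgorithmException {
    MessageDigest md = MessageDigest.getInstance("MD5");
    byte[] digest = md.digest(userId.getBytes(StandardCharsets.UTF_8));
    // Use the first two bytes of the hash as a salt prefix so consecutive
    // users land on different tablets.
    String saltPrefix = String.format("%02x%02x", digest[0] & 0xff, digest[1] & 0xff);
    return saltPrefix + "#" + userId + "#" + eventTimestampMillis;
  }

  public static void main(String[] args) throws NoSuchAlgorithmException {
    System.out.println(buildRowKey("user-000123", 1700000000000L));
  }
}
```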
You are working on a sensitive project involving private user data. You have set up a project on Google Cloud Platform to house your work internally. An external consultant is going to assist with coding a complex transformation in a Google Cloud Dataflow pipeline for your project.
How should you maintain users’ privacy?
- A . Grant the consultant the Viewer role on the project.
- B . Grant the consultant the Cloud Dataflow Developer role on the project.
- C . Create a service account and allow the consultant to log on with it.
- D . Create an anonymized sample of the data for the consultant to work with in a different project.
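One way to produce an anonymized working copy is to hash direct identifiers and write a small random sample into a table in a separate project. The sketch below does this with the BigQuery Java client; every project, dataset, table, and column name is hypothetical, and the 1% sampling rate is arbitrary.

```java
import com.google.cloud.bigquery.BigQuery;
import com.google.cloud.bigquery.BigQueryOptions;
import com.google.cloud.bigquery.QueryJobConfiguration;
import com.google.cloud.bigquery.TableId;

public class CreateAnonymizedSample {
  public static void main(String[] args) throws InterruptedException {
    BigQuery bigquery = BigQueryOptions.getDefaultInstance().getService();

    // Hash the direct identifier and keep only a small random sample of rows.
    String query =
        "SELECT TO_HEX(SHA256(user_id)) AS user_hash, event_time, event_type "
            + "FROM `internal-project.prod_dataset.events` "
            + "WHERE RAND() < 0.01";

    QueryJobConfiguration config = QueryJobConfiguration.newBuilder(query)
        .setDestinationTable(TableId.of("consultant-project", "sample_dataset", "events_sample"))
        .setUseLegacySql(false)
        .build();

    bigquery.query(config);
  }
}
```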
You want to use a BigQuery table as a data sink. In which writing mode(s) can you use BigQuery as a sink?
- A . Both batch and streaming
- B . BigQuery cannot be used as a sink
- C . Only batch
- D . Only streaming
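For context, a sketch of a BigQuery sink in the same pre-Beam Dataflow SDK style as the earlier read snippet; the table reference and schema fields are placeholders. The same transform can be applied in batch pipelines and, via streaming inserts, in streaming pipelines.

```java
import com.google.api.services.bigquery.model.TableFieldSchema;
import com.google.api.services.bigquery.model.TableSchema;
import com.google.cloud.dataflow.sdk.io.BigQueryIO;

import java.util.Arrays;

public class WriteToBigQuery {
  // Build a sink transform with an explicit schema and append semantics.
  static BigQueryIO.Write.Bound bigQuerySink() {
    TableSchema schema = new TableSchema().setFields(Arrays.asList(
        new TableFieldSchema().setName("user_id").setType("STRING"),
        new TableFieldSchema().setName("event_time").setType("TIMESTAMP")));

    return BigQueryIO.Write
        .named("WriteLogData")
        .to("my-project:my_dataset.my_table")
        .withSchema(schema)
        .withWriteDisposition(BigQueryIO.Write.WriteDisposition.WRITE_APPEND);
  }
}
```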
You have spent a few days loading data from comma-separated values (CSV) files into the Google BigQuery table CLICK_STREAM. The column DT stores the epoch time of click events. For convenience, you chose a simple schema where every field is treated as the STRING type. Now, you want to compute web session durations of users who visit your site, and you want to change the column's data type to TIMESTAMP. You want to minimize the migration effort without making future queries computationally expensive.
What should you do?
- A . Delete the table CLICK_STREAM, and then re-create it such that the column DT is of the TIMESTAMP type. Reload the data.
- B . Add a column TS of the TIMESTAMP type to the table CLICK_STREAM, and populate it with the numeric values from the column DT for each row. Reference the column TS instead of the column DT from now on.
- C . Create a view CLICK_STREAM_V, where strings from the column DT are cast into TIMESTAMP values. Reference the view CLICK_STREAM_V instead of the table CLICK_STREAM from now on.
- D . Add two columns to the table CLICK_STREAM: TS of the TIMESTAMP type and IS_NEW of the BOOLEAN type. Reload all data in append mode. For each appended row, set the value of IS_NEW to true. For future queries, reference the column TS instead of the column DT, with the WHERE clause ensuring that the value of IS_NEW must be true.
- E . Construct a query to return every row of the table CLICK_STREAM, while using the built-in function to cast strings from the column DT into TIMESTAMP values. Run the query into a destination table NEW_CLICK_STREAM, in which the column TS is the TIMESTAMP type. Reference the table NEW_CLICK_STREAM instead of the table CLICK_STREAM from now on. In the future, new data is loaded into the table NEW_CLICK_STREAM.
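To illustrate the view-based approach in option C, a hedged sketch that defines CLICK_STREAM_V with the BigQuery Java client. It assumes DT holds epoch seconds (TIMESTAMP_MILLIS would be needed for milliseconds), and the project and dataset names are placeholders.

```java
import com.google.cloud.bigquery.BigQuery;
import com.google.cloud.bigquery.BigQueryOptions;
import com.google.cloud.bigquery.TableId;
import com.google.cloud.bigquery.TableInfo;
import com.google.cloud.bigquery.ViewDefinition;

public class CreateClickStreamView {
  public static void main(String[] args) {
    BigQuery bigquery = BigQueryOptions.getDefaultInstance().getService();

    // Cast the STRING epoch column to a TIMESTAMP on the fly; assumes DT holds
    // epoch seconds.
    String viewQuery =
        "SELECT * EXCEPT(DT), TIMESTAMP_SECONDS(CAST(DT AS INT64)) AS DT "
            + "FROM `my-project.my_dataset.CLICK_STREAM`";

    TableInfo view = TableInfo.of(
        TableId.of("my_dataset", "CLICK_STREAM_V"),
        ViewDefinition.newBuilder(viewQuery).setUseLegacySql(false).build());
    bigquery.create(view);
  }
}
```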
Which of the following statements about the Wide & Deep Learning model are true? (Select 2 answers.)
- A . The wide model is used for memorization, while the deep model is used for generalization.
- B . A good use for the wide and deep model is a recommender system.
- C . The wide model is used for generalization, while the deep model is used for memorization.
- D . A good use for the wide and deep model is a small-scale linear regression problem.
By default, which of the following windowing behaviors does Dataflow apply to unbounded data sets?
- A . Windows at every 100 MB of data
- B . Single, Global Window
- C . Windows at every 1 minute
- D . Windows at every 10 minutes
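For reference, applying the global window explicitly, which matches Dataflow's default for any PCollection; this is a sketch written against the same pre-Beam SDK as the earlier snippets.

```java
import com.google.cloud.dataflow.sdk.transforms.windowing.GlobalWindows;
import com.google.cloud.dataflow.sdk.transforms.windowing.Window;
import com.google.cloud.dataflow.sdk.values.PCollection;

public class DefaultWindowing {
  // Explicitly assigning GlobalWindows is equivalent to the default windowing
  // Dataflow applies when no Window transform is specified.
  static PCollection<String> applyDefaultWindow(PCollection<String> input) {
    return input.apply(Window.<String>into(new GlobalWindows()));
  }
}
```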
Your company uses a proprietary system to send inventory data every 6 hours to a data ingestion service in the cloud. Transmitted data includes a payload of several fields and the timestamp of the transmission. If there are any concerns about a transmission, the system re-transmits the data.
How should you deduplicate the data most efficiently?
- A . Assign global unique identifiers (GUID) to each data entry.
- B . Compute the hash value of each data entry, and compare it with all historical data.
- C . Store each data entry as the primary key in a separate database and apply an index.
- D . Maintain a database table to store the hash value and other metadata for each data entry.
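As a sketch of the identifier idea, the helper below derives a deterministic ID from the payload and transmission timestamp, so a re-transmitted record maps to the same value and can be dropped downstream; the payload format shown is made up for illustration.

```java
import java.nio.charset.StandardCharsets;
import java.util.UUID;

public class DedupKey {
  // Derive a deterministic identifier from the payload and transmission
  // timestamp so re-transmitted records map to the same ID and can be
  // recognized as duplicates downstream.
  static String dedupId(String payload, long transmissionTimestampMillis) {
    String material = transmissionTimestampMillis + "|" + payload;
    return UUID.nameUUIDFromBytes(material.getBytes(StandardCharsets.UTF_8)).toString();
  }

  public static void main(String[] args) {
    String first = dedupId("{\"sku\":\"A-42\",\"qty\":7}", 1700000000000L);
    String retry = dedupId("{\"sku\":\"A-42\",\"qty\":7}", 1700000000000L);
    System.out.println(first.equals(retry)); // true: the retry is recognizable as a duplicate
  }
}
```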