mrc_certified_dea

Test title: mrc_certified_dea
Description: Master Cheat Sheet



A dataset has been defined using Delta Live Tables and includes an expectation clause: `CONSTRAINT valid_age EXPECT (age > 0) ON VIOLATION FAIL UPDATE`. What is the expected behavior when an incoming batch of data that violates this constraint is processed?

- Records that violate the expectation are dropped from the target dataset and recorded as invalid in the event log.
- Records that violate the expectation are added to the target dataset and recorded as invalid in the event log.
- Records that violate the expectation are added to the target dataset and flagged as invalid in a field added to the target dataset.
- Records that violate the expectation cause the job to fail.
- Records that violate the expectation are dropped from the target dataset and loaded into a quarantine table.

Which of the following events will not happen when a cluster terminates in Databricks?

- Associated virtual machines and operational memory will be purged.
- Attached volume storage will be deleted.
- Network connections between nodes will be removed.
- Execution state and cache will be cleared out.
- Code and data files that you saved earlier will be lost.

As a data engineer, you are working on a task to retain only those versions of the sales Delta table that are no older than the last 7 days. You have observed that the directory associated with this table contains a large number of data files for all the historical versions since inception. Which query will you run to delete the files of the sales table that are older than 1 week?

- `VACUUM sales RETAIN 168 HOURS`
- `VACUUM sales RETAIN 7 DAYS`
- `VACUUM sales RETAIN 10080 MINUTES`
- `VACUUM sales RETAIN 1 WEEK`
- `VACUUM sales RETAIN 604800 SECONDS`

A junior data engineer has created a global temporary view called orders_vw and is trying to run the query `SELECT * FROM orders_vw` against it, but he is getting an error. As a senior data engineer, what correction will you suggest to your colleague so that the query runs?
- `SELECT * FROM global_temporary.orders_vw`
- `SELECT * FROM global_temp.orders_vw`
- `SELECT * FROM gbl_temp.orders_vw`
- `SELECT * FROM gbl_temporary.orders_vw`
- `SELECT * FROM gbl_tmp.orders_vw`

Which of the following statements does Databricks use to allow insert, update, and delete operations to be run as a single atomic transaction on a Delta table?

- `UPSERT`
- `MERGE`
- `UPDATE`
- `INSERT INTO`
- `DELETE`

You have to create a Delta table players which contains the following columns:

| playerId | playerName | playerRuns |
|---|---|---|
| 1 | Virat | 23726 |
| 2 | Yuvraj | 11778 |
| ... | ... | ... |

Which of the following SQL DDL commands creates an empty Delta table in the above format, and only if a table with this name does not already exist?

- `CREATE OR REPLACE TABLE players (playerId INT, playerName STRING, playerRuns LONG) USING DELTA;`
- `CREATE TABLE IF NOT EXISTS players (playerId INT, playerName STRING, playerRuns LONG);`
- `CREATE OR REPLACE TABLE players WITH COLUMNS (playerId INT, playerName STRING, playerRuns LONG) USING DELTA;`
- `CREATE TABLE players IF NOT EXISTS (playerId INT, playerName STRING, playerRuns LONG) USING DELTA;`
- `CREATE OR REPLACE TABLE players AS SELECT playerId INT, playerName STRING, playerRuns LONG USING DELTA;`

You need to create a table employees which will have the following columns:

```sql
CREATE TABLE employees (
  id INT,
  name STRING,
  gender STRING,
  dateOfBirth DATE,
  age INT ___________,
  salary LONG
) USING DELTA;
```

Here, age is a calculated column that is inferred from the dateOfBirth column. Choose the correct statement below to fill in the blank so that this SQL code block executes.

- `GENERATED DEFAULT AS (CAST(DATEDIFF(CURRENT_DATE, dateOfBirth) / 365) AS INT)`
- `GENERATED ALWAYS AS (CAST(DATEDIFF(CURRENT_DATE, dateOfBirth) / 365) AS FLOAT)`
- `GENERATED DEFAULT AS (CAST(DATEDIFF(CURRENT_DATE, dateOfBirth) / 365) AS FLOAT)`
- `GENERATED ALWAYS AS (CAST(DATEDIFF(CURRENT_DATE, dateOfBirth) / 365) AS INT)`
- `GENERATED AS (CAST(DATEDIFF(CURRENT_DATE, dateOfBirth) / 365) AS FLOAT)`
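The arithmetic behind the generated-column expression in the question above can be sketched in plain Python (no Spark required) to see what `CAST(DATEDIFF(...) / 365 AS INT)` produces. The helper name `derived_age` is hypothetical, not part of any Databricks API; note that dividing by 365 is an approximation that ignores leap days.

```python
from datetime import date

def derived_age(date_of_birth: date, today: date) -> int:
    # Spark SQL's DATEDIFF(today, dob) is the whole number of days between
    # the two dates; dividing by 365 and truncating mirrors CAST(... AS INT).
    days_elapsed = (today - date_of_birth).days
    return int(days_elapsed / 365)

print(derived_age(date(2000, 1, 1), date(2024, 1, 1)))  # 24
```

Because of the /365 approximation, the result can differ by a year from a true calendar age near birthdays, which is worth knowing even though the exam question only tests the `GENERATED ALWAYS AS` syntax.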
A team of data analysts is testing data in a table called sales using SQL. They want some help from the data engineering team on this table, but the data engineers use Python in their Databricks notebooks. How can the data engineers query this table using PySpark?

- `spark.table("select * from sales")`
- `"select * from sales"`
- `spark.sql("select * from sales")`
- `spark.sql("sales")`
- There is no way to access the sales table in a Python notebook.

Which of the following SQL DDL statements can be used to create a UDF that converts a distance value from miles to km and vice versa? Note: if any other distance measure is provided, the function should return the same value.

- `CREATE UDF convert_distance(distance DOUBLE, measure STRING) RETURN CASE WHEN measure = 'Miles' THEN (distance * 1.60934) WHEN measure = 'KM' THEN (distance * 0.621371) ELSE distance END;`
- `CREATE UDF FUNCTION convert_distance(distance DOUBLE, measure STRING) RETURN CASE WHEN measure = 'Miles' THEN (distance * 1.60934) WHEN measure = 'KM' THEN (distance * 0.621371) ELSE distance END;`
- `CREATE FUNCTION convert_distance(distance DOUBLE, measure STRING) RETURN CASE WHEN measure = 'Miles' THEN (distance * 1.60934) WHEN measure = 'KM' THEN (distance * 0.621371) ELSE distance END;`
- `CREATE FUNCTION convert_distance(distance DOUBLE, measure STRING) RETURNS DOUBLE RETURN CASE WHEN measure = 'Miles' THEN (distance * 1.60934) WHEN measure = 'KM' THEN (distance * 0.621371) ELSE distance END;`
- `CREATE UDF convert_distance(distance DOUBLE, measure STRING) RETURNS DOUBLE RETURN CASE WHEN measure = 'Miles' THEN (distance * 1.60934) WHEN measure = 'KM' THEN (distance * 0.621371) ELSE distance END;`

As part of ingesting data, which of the following options will help you overwrite both the existing data and the schema of a Delta table?

- `INSERT OVERWRITE`
- `INSERT INTO`
- `MERGE INTO`
- `COPY INTO`
- `CREATE OR REPLACE TABLE`

How can you inject the Python variables table_name and database_name into a SQL query and execute it using PySpark?
- `spark.sql(f"SELECT * FROM {database_name}+{table_name}")`
- `spark.sql(f"SELECT * FROM [database_name].[table_name]")`
- `spark.sql(f"SELECT * FROM {database_name}.{table_name}")`
- `spark.sql(f"SELECT * FROM (database_name).(table_name)")`
- `spark.sql("SELECT * FROM .")`

Which of the following SQL commands will return the count of distinct product names from an existing Delta table called products?

- `SELECT COUNT(DISTINCT product_name) FROM products`
- `SELECT COUNT(UNIQUE product_name) FROM products`
- `SELECT UNIQUE(COUNT(product_name)) FROM products`
- `SELECT DISTINCT(COUNT product_name) FROM products`
- `SELECT DISTINCT(product_name) FROM products`

Which of the following data workloads will use a Bronze table as its destination?

- A job that aggregates cleaned data to create standard summary statistics.
- A job that queries aggregated data to publish key insights into a dashboard.
- A job that develops a feature set for a machine learning application.
- A job that enriches data by parsing its timestamps into a human-readable format.
- A job that ingests raw data from a streaming source into the Lakehouse.

Which of the following techniques does Structured Streaming use to ensure end-to-end, exactly-once semantics under any failure condition?

- Write-ahead logging and watermarking.
- Checkpointing, write-ahead logs, and idempotent sinks.
- Write-ahead logging and idempotent sinks.
- Checkpointing and write-ahead logging.
- Idempotent sinks and watermarking.

What is the name of the column that Auto Loader automatically adds to capture any data that might be malformed and not fit into the table during incremental data ingestion?

- `rescued_data`
- `rescued_data_`
- `_rescued_data`
- `_rescued_data_`
- `rescuedData`

As a data engineer, you are working on a complex project using the Databricks Lakehouse Platform. You want the ability to break down your code into simpler, reusable components.
For this, you have designed a module containing all the custom functions in a separate notebook so that it can be used from any other notebook. Which of the following magic commands can be used to import the custom-functions notebook into your current notebook?

- `%import`
- `%execute`
- `%run`
- `%load`
- `%include`

A data analysis team of 5 members was running queries on a SQL endpoint, and query performance was acceptable. The number of concurrent users has now increased from 5 to 50. Because of this, the performance of the SQL endpoint has degraded and queries are running too slowly. The cluster size has already been set to the maximum, but the queries are still slow. Which of the following approaches can the data analysis team use so that the queries run faster for all concurrent users?

- Turn on the Auto Stop feature for the SQL endpoint.
- Turn on the Serverless feature for the SQL endpoint.
- Increase the cluster size of the SQL endpoint.
- Increase the maximum bound of the SQL endpoint's scaling range.
- Turn on the Serverless feature for the SQL endpoint and change the Spot Instance Policy to "Reliability Optimized".

How many days of historical job runs does Databricks preserve?

- 30
- 45
- 60
- 90
- 120

A data engineer has set up two Jobs called A and B. Job A runs before Job B, and the output of Job A is used as the input of Job B. If Job A fails for some reason, then Job B cannot run until Job A is fixed and rerun successfully. Which of the following statements is logically correct about the two Jobs A and B?

- Job A has a dependency on Job B to get completed.
- Both jobs are dependent on each other to get completed.
- Job B has a dependency on Job A to get completed.
- Both jobs are independent of each other and can be run independently.
- None of the above.

Which of the following views cannot be referenced outside of the notebook in which they are declared?

- View and Temporary View.
- Global Temporary View and Temporary View.
- View.
- Global Temporary View.
- Temporary View.

In which of the following scenarios would you want to use a job cluster instead of an all-purpose cluster?

- A team of data scientists needs to collaborate on testing a machine learning model.
- A data engineer needs to manually debug and fix a production error.
- An ad-hoc analytics report needs to be developed using some data visualizations.
- An automated workflow needs to run every hour.
- A data analyst needs to schedule a Databricks SQL query for upward reporting.

Which of the following statements correctly describes the Databricks Lakehouse Platform?

- The Databricks Lakehouse Platform is the replacement for data warehouses.
- The Databricks Lakehouse Platform is only suitable for streaming workloads that process data in real time.
- The Databricks Lakehouse Platform does not support BI use cases.
- The Databricks Lakehouse Platform combines the best elements of data lakes and data warehouses to deliver reliability, strong governance, and performance.
- The Databricks Lakehouse Platform is designed only for batch workloads with huge data volumes.

A data engineer is in the process of converting their existing data pipeline to use Auto Loader for incremental processing in the ingestion of JSON files.

```python
streaming_df = (
    spark.readStream
        .format("cloudFiles")
        .______________________
        .option("cloudFiles.schemaLocation", schemaLocation)
        .load(sourcePath)
)
```

Which of the following code snippets should be used to fill in the blank above so that the data engineer can use Auto Loader for ingesting the data?

- `option("format", "json")`
- `option("cloudFiles.format", "json")`
- `option("cloudFiles", "json")`
- `option(cloudFiles.format, json)`
- `option("json")`

As a data engineer, you are working on a task to load external CSV files into a Delta table using the COPY INTO command given below:

```sql
COPY INTO my_table
FROM '/path/to/files'
FILEFORMAT = CSV
```

After running the command once, the schema of the incoming data changed in one of the files due to the addition of a new column, and the data load failed.
Which of the following SQL commands will help you overcome this issue?

- `COPY INTO my_table FROM '/path/to/files' FILEFORMAT = CSV FORMAT_OPTIONS ('overWriteSchema' = 'true') COPY_OPTIONS ('overWriteSchema' = 'true');`
- `COPY INTO my_table FROM '/path/to/files' FILEFORMAT = CSV FORMAT_OPTIONS ('mergeSchema' = 'true') COPY_OPTIONS ('mergeSchema' = 'true');`
- `COPY INTO my_table FROM '/path/to/files' FILEFORMAT = CSV FORMAT_OPTIONS ('mergeSchema' = 'true') COPY_OPTIONS ('overWriteSchema' = 'true');`
- `COPY INTO my_table FROM '/path/to/files' FILEFORMAT = CSV FORMAT_OPTIONS ('overWriteSchema' = 'true') COPY_OPTIONS ('mergeSchema' = 'true');`
- None of the above.

A junior data engineer was typing up a query to manually delete some records from the Delta table products, but he accidentally executed the following query and deleted all records from the table: `DELETE FROM products`. You ran `DESCRIBE HISTORY products` and figured out that the latest version is 21. You need to roll back the delete operation and get the table back to its original version with all the data. Which of the following SQL statements will help you accomplish this task?

- `ROLLBACK TABLE products TO VERSION 20`
- `RESTORE products TO VERSION AS OF 19`
- `ROLLBACK TABLE products TO VERSION 20`
- `RESTORE products TO VERSION 19`
- `RESTORE TABLE products TO VERSION AS OF 20`

You want to execute a block of code only when the value of the variable isValid is 1 and the value of the variable isRunnable is True. Which of the following Python control flow statements will help you achieve this?

- `if isValid = 1 and isRunnable = True:`
- `if isValid == 1 and isRunnable:`
- `if isValid == 1 and isRunnable = "True":`
- `if isValid = 1 and isRunnable:`
- `if isValid == 1 && isRunnable = True:`

A data engineering team has noticed that one of their ETL jobs, which runs every midnight, fails intermittently due to one of its tasks.
Every time, they have to manually rerun the job in the morning to complete the process, causing overhead and trouble. Which of the following approaches can the data engineering team use to ensure that the job completes every night while minimizing compute costs?

- They can observe the task as it runs to try to determine why it is failing.
- They can institute a retry policy for the task that periodically fails.
- They can institute a retry policy for the entire job.
- They can set up the job to run multiple times, ensuring that at least one run will complete.
- They can use a jobs cluster for each of the tasks in the job.

In a Databricks notebook, all the cells contain Python code, and you want to add another cell with a SQL statement in it. You changed the default language of the notebook to accomplish this task. What changes (if any) will you see in the already existing Python cells?

- The magic command %python will be added at the end of all cells that contain Python code.
- Python cells will be grayed out and won't run until you change the default language back to Python.
- The magic command %python will be added at the beginning of all cells that contain Python code.
- The magic command %sql will be added at the beginning of all cells that contain Python code.
- There will be no change in any cell, and the notebook will remain the same.

You have a view employees_updates that shows employees who joined or resigned last month. The column type indicates whether the employee is "new" or "former". You want to update the employees Delta table by inserting new employees and deleting former employees. Which query accomplishes this CDC task?

- `MERGE INTO employees e USING employees_updates eu ON e.id = eu.id WHEN MATCHED AND eu.type = 'former' THEN DELETE * WHEN NOT MATCHED AND eu.type = 'new' THEN INSERT ALL`
- `UPSERT INTO employees e USING employees_updates eu ON e.id = eu.id WHEN MATCHED AND eu.type = 'former' THEN DELETE WHEN NOT MATCHED AND eu.type = 'new' THEN INSERT *`
- `MERGE INTO employees e USING employees_updates eu ON e.id = eu.id WHEN MATCHED AND eu.type = 'former' THEN DELETE WHEN NOT MATCHED AND eu.type = 'new' THEN INSERT *`
- `MERGE INTO employees e USING employees_updates eu ON e.id = eu.id IF MATCHED AND eu.type = 'former' THEN DELETE IF NOT MATCHED AND eu.type = 'new' THEN INSERT *`
- `MERGE INTO employees e USING employees_updates eu ON e.id = eu.id WHEN MATCHED AND eu.type = 'former' THEN DELETE ALL WHEN NOT MATCHED AND eu.type = 'new' THEN INSERT *`

A data engineer wants to make a new table from different data sources and change the data a bit. They want to do this in one step without changing the original data. Which SQL command is the best for making a new table and changing the data at the same time?

- `INSERT INTO`
- `CREATE TABLE AS SELECT (CTAS)`
- `MERGE INTO`
- `UPDATE`
- `ALTER TABLE`

A data engineer notices that the data files linked to a Delta table are unusually small, leading to suboptimal performance. They decide to consolidate these small files into fewer, larger files to enhance query efficiency. Which keyword should be used to achieve the consolidation of small files?

- `REDUCE`
- `OPTIMIZE`
- `REPARTITION`
- `COMPACTION`
- `VACUUM`

A data engineer is implementing a data pipeline that requires updates, deletions, and merges into a large dataset stored on a distributed file system. The engineer needs to ensure that these operations are performed reliably and consistently, with the ability to handle concurrent modifications without data corruption or loss. Which feature should the data engineer leverage to guarantee atomic, consistent, isolated, and durable (ACID) transactions within their data pipeline?

- Delta Lake
- File compaction
- Spark Streaming
- Data partitioning
- Dataframe caching

You have a scheduled weekly job that compiles analytics reports. Stakeholders depend on these reports for their Monday morning meetings.
It is crucial that any job failures are addressed promptly and that stakeholders are updated on the status of their reports. How should you set up the job to ensure stakeholders are promptly informed about the status of the analytics report job runs?

- Manually check the job status every Monday and send emails to stakeholders.
- Schedule the job run for Monday morning and hope it completes on time.
- Disable all alerts to avoid flooding stakeholders with notifications.
- Configure the job to send completion alerts to the stakeholders' email addresses.

A data engineering team has two distinct tables: march_transactions, which contains all retail transactions for March, and april_transactions, which contains all transactions for April. There are no duplicate records between the tables. To compile a new table, all_transactions, that merges records from both march_transactions and april_transactions without introducing duplicates, which command should be executed?

- `CREATE TABLE all_transactions AS SELECT * FROM march_transactions OUTER JOIN SELECT * FROM april_transactions;`
- `CREATE TABLE all_transactions AS SELECT * FROM march_transactions INTERSECT SELECT * FROM april_transactions;`
- `CREATE TABLE all_transactions AS SELECT * FROM march_transactions INNER JOIN SELECT * FROM april_transactions;`
- `CREATE TABLE all_transactions AS SELECT * FROM march_transactions MERGE SELECT * FROM april_transactions;`
- `CREATE TABLE all_transactions AS SELECT * FROM march_transactions UNION SELECT * FROM april_transactions;`

In the context of Databricks workflows, which pattern would you use to handle data at different stages of refinement, from raw to ready for use?

- Fan-out
- Sequence
- Loop
- Funnel

You are a data engineer overseeing a critical daily data processing pipeline. This pipeline includes tasks for data ingestion, transformation, and, finally, loading the data into a data warehouse.
One morning, you receive an alert that the job has failed during the transformation task due to an unexpected outage in the cluster. What action should you take to minimize downtime and ensure data integrity in your daily pipeline?

- Use the repair feature to rerun only the failed transformation task after resolving the cluster issue.
- Manually restart the entire job from the beginning.
- Ignore the alert and wait for the next scheduled run.
- Permanently remove the transformation task from the pipeline to avoid future failures.

A dataset has been defined using Delta Live Tables and includes an expectation clause: `CONSTRAINT valid_timestamp EXPECT (timestamp > '2022-01-01') ON VIOLATION DROP ROW`. What is the expected behavior when a batch of data containing records that violate this constraint is processed?

- Records that violate the expectation cause the job to fail.
- Records that violate the expectation are added to the target dataset and flagged as invalid in a field added to the target dataset.
- Records that violate the expectation are dropped from the target dataset and loaded into a quarantine table.
- Records that violate the expectation are added to the target dataset and recorded as invalid in the event log.
- Records that violate the expectation are dropped from the target dataset and recorded as invalid in the event log.

In which scenario should a predecessor task be set up?

- When all tasks need to run in parallel.
- When a task must wait for the completion of another task before it starts.
- When tasks are scheduled at different times.
- When tasks are completely independent of each other.

How does Databricks' workflow feature "Simple Workflow Authoring" cater to teams?

- It is exclusively available for users with administrative privileges.
- It requires advanced programming skills for all data team members.
- It provides a point-and-click authoring experience accessible to all team members.
- It only allows workflow editing in command-line interfaces.
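As a rough mental model for the `ON VIOLATION DROP ROW` expectation above (plain Python with hypothetical data, not the DLT API): dropping on violation behaves like a filter that keeps only passing rows in the target, while the number of failures is still tracked for the event log. Compare this with `FAIL UPDATE`, which would abort the whole update instead.

```python
rows = [
    {"order_id": 1, "timestamp": "2022-03-05"},
    {"order_id": 2, "timestamp": "2021-11-30"},  # violates timestamp > '2022-01-01'
    {"order_id": 3, "timestamp": "2022-07-19"},
]

# EXPECT (timestamp > '2022-01-01') ON VIOLATION DROP ROW:
# violating rows never reach the target dataset...
# (ISO-8601 date strings compare correctly as plain strings)
target = [r for r in rows if r["timestamp"] > "2022-01-01"]

# ...but the violation count is still recorded (DLT surfaces it in the event log).
dropped = len(rows) - len(target)

print([r["order_id"] for r in target], dropped)  # [1, 3] 1
```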
What is a key difference between Delta Live Tables and Workflow Jobs in Databricks regarding the source of the tasks?

- Delta Live Tables can run tasks from any source, while Workflow Jobs are limited to notebooks only.
- Both Delta Live Tables and Workflow Jobs support tasks from JARs, notebooks, and DLT, without support for other programming languages.
- Delta Live Tables can only run tasks from notebooks, while Workflow Jobs can run tasks from a variety of sources, including JARs, notebooks, and applications written in multiple languages.
- Both Delta Live Tables and Workflow Jobs are restricted to running tasks from JAR files only.

A task in your Databricks job fails unexpectedly, and you need to find out why in order to fix the issue and prevent it from happening again. What is the first step you should take to debug this failed task?

- Immediately escalate the issue to Databricks support.
- Check the task's logs for error messages and clues about the failure.
- Restart the job and hope it works this time.
- Increase the resources allocated to the job without further investigation.

A data engineer is designing a data pipeline. The source system generates files in a shared directory that is also used by other processes. As a result, the files should be kept as is and will accumulate in the directory. The data engineer needs to identify which files are new since the previous run of the pipeline, and set up the pipeline to ingest only those new files with each run. Which of the following tools can the data engineer use to solve this problem?

- Data Explorer
- Databricks SQL
- Auto Loader
- Delta Lake
- Unity Catalog

A data engineer discovered an error in a daily update to a table and decided to use Delta Lake's time travel feature to revert the table to its state from three days ago. However, the attempt to time travel failed because the data files had been deleted. What is the reason for the missing data files?

- The table was cleaned using the VACUUM command.
- The HISTORY command was run on the table.
- The DELETE HISTORY command was executed on the table.
- The TIME TRAVEL feature was utilized on the table.
- The OPTIMIZE command was applied to the table.

A data engineer uses a Databricks SQL dashboard to oversee the accuracy of data for an ELT job. This job includes a Databricks SQL query that identifies orders with a quantity of 0. The engineer aims to have the whole team alerted through a messaging webhook when any order is recorded with a quantity of 0. Which method should the data engineer employ to ensure the team receives a notification via a messaging webhook whenever an order's quantity is reported as 0?

- They can set up an Alert with a custom template.
- They can set up an Alert with a new webhook alert destination.
- They can set up an Alert with one-time notifications.
- They can set up an Alert with a new email alert destination.
- They can set up an Alert without notifications.

A data engineer needs to add a new data record to an existing Delta table named my_table. The record details are as follows: id STRING = 'c01', rank INTEGER = 9, rating FLOAT = 6.2. Which SQL command should be used to append this new record to my_table?

- `my_table UNION VALUES ('c01', 9, 6.2)`
- `INSERT VALUES ('c01', 9, 6.2) INTO my_table`
- `INSERT INTO my_table VALUES ('c01', 9, 6.2)`
- `UPDATE my_table VALUES ('c01', 9, 6.2)`
- `UPDATE VALUES ('c01', 9, 6.2) my_table`

Which of the following Structured Streaming queries is performing a hop from a Silver table to a Gold table?

- `(spark.table("sales").groupBy("store").agg(sum("sales")).writeStream.option("checkpointLocation", checkpointPath).outputMode("complete").table("newSales"))`
- `(spark.table("sales").withColumn("avgPrice", col("sales") / col("units")).writeStream.option("checkpointLocation", checkpointPath).outputMode("append").table("newSales"))`
- `(spark.read.load(rawSalesLocation).writeStream.option("checkpointLocation", checkpointPath).outputMode("append").table("newSales"))`
- `(spark.table("sales").filter(col("units") > 0).writeStream.option("checkpointLocation", checkpointPath).outputMode("append").table("newSales"))`
- `(spark.readStream.load(rawSalesLocation).writeStream.option("checkpointLocation", checkpointPath).outputMode("append").table("newSales"))`

Which of the following must be specified when creating a new Delta Live Tables pipeline?

- A key-value pair configuration.
- The preferred DBU/hour cost.
- A path to a cloud storage location for the written data.
- A location of a target database for the written data.
- At least one notebook library to be executed.

In which of the following scenarios should a data engineer use the MERGE INTO command instead of the INSERT INTO command?

- When the source table can be deleted.
- When the location of the data needs to be changed.
- When the source is not a Delta table.
- When the target table cannot contain duplicate records.
- When the target table is an external table.

A data engineer has configured a Structured Streaming job to read from a table, manipulate the data, and then perform a streaming write into a new table. The code block used by the data engineer is below:

```python
(spark.readStream
    .table("sales")
    .withColumn("avg_price", col("sales") / col("units"))
    .writeStream
    .option("checkpointLocation", checkpointPath)
    .outputMode("complete")
    ._______
    .table("new_sales")
)
```

If the data engineer only wants the query to process all of the available data in as many batches as required, which of the following lines of code should the data engineer use to fill in the blank?

- `trigger(parallelBatch=True)`
- `trigger(availableNow=True)`
- `processingTime(1)`
- `trigger(processingTime="once")`
- `trigger(continuous="once")`

For what reason would a data engineer specify a task in the "Depends On" field while configuring a new task in a Databricks Job?

- When another task has the same dependency libraries as the new task.
- When another task needs to fail before the new task begins.
- When another task needs to complete successfully before the new task begins.
- When another task needs to be replaced by the new task.
- When another task needs to use as little compute resources as possible.

How can a data engineer perform audit logging and examine the lineage of datasets within Delta Live Tables pipelines?

- Reviewing the Delta Lake transaction log for each table to trace back the lineage and audit changes.
- Utilizing the DLT pipeline's settings to automatically generate audit logs for review.
- Querying the event log for user_action events and flow_definition entries to audit user actions and understand dataset lineage.
- By enabling detailed logging at the cluster level and examining system logs.

Which command allows data to be inserted into a Delta table while ensuring duplicates are not written?

- `INSERT`
- `UPDATE`
- `MERGE`
- `APPEND`
- `DROP`

What tool does Auto Loader utilize to process data in increments?

- Checkpointing
- Spark Structured Streaming
- Unity Catalog
- Databricks SQL
- Data Explorer

A data engineer notices that one of the notebooks within a Job, which runs two notebooks as separate tasks, is performing slowly during the Job's current execution. They seek assistance from a tech lead to uncover the reason behind this slow performance. Which approach can the tech lead take to determine why the notebook is running slowly as part of the Job?

- They can navigate to the Runs tab in the Jobs UI and click on the active run to review the processing notebook.
- They can navigate to the Tasks tab in the Jobs UI to immediately review the processing notebook.
- They can navigate to the Runs tab in the Jobs UI to immediately review the processing notebook.
- They can navigate to the Tasks tab in the Jobs UI and click on the active run to review the processing notebook.
- There is no way to determine why a Job task is running slowly.

A data engineer needs to apply a complex run schedule from one Job to others without manually setting each time in Databricks.
Which tool allows them to define and apply this schedule programmatically, through code?

- Cron syntax
- `pyspark.sql.types.DateType`
- `pyspark.sql.types.TimestampType`
- `datetime`
- There is no way to represent and submit this information programmatically.

A data engineer needs to update an existing Delta table with fresh data from a recent batch process. The new data should replace all existing records in the table to ensure that only the most current data is available for analysis. The engineer wants to accomplish this in a single operation to maintain efficiency. Which SQL command should the data engineer use to update the table with the new data, replacing all old records?

- `INSERT OVERWRITE`
- `UPDATE`
- `INSERT INTO`
- `ALTER TABLE`
- `MERGE INTO`

What is a critical step in orchestrating a task within a Databricks Workflow Job?

- Assigning a unique identifier to each task manually.
- Choosing a color scheme for task visualization.
- Configuring the maximum allowable downtime for the job.
- Setting up task dependencies to control execution order.
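For the scheduling question above: Databricks job schedules are expressed in Quartz-style cron syntax, so a schedule can be built and reused across jobs as a plain string. A minimal sketch, assuming the six-field seconds-first Quartz layout; the helper name `nightly_at` is hypothetical, and actually attaching the string to a job (e.g. via the Jobs API schedule settings) is outside this sketch.

```python
def nightly_at(hour: int, minute: int) -> str:
    # Quartz cron fields: seconds minutes hours day-of-month month day-of-week.
    # '?' means "no specific value" for day-of-week when day-of-month is '*'.
    return f"0 {minute} {hour} * * ?"

schedule = nightly_at(2, 30)
print(schedule)  # 0 30 2 * * ?
```

Generating the expression in code like this is what lets one complex schedule be applied to many jobs without re-entering times by hand in the UI.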




