Databricks professional data engineer



A data engineer has created a 'transactions' Delta table on Databricks that should be used by the analytics team. The analytics team wants to use the table with another tool which requires Apache Iceberg format. What should the data engineer do?
A. Require the analytics team to use a tool which supports Delta tables.
B. Create an Iceberg copy of the 'transactions' Delta table which can be used by the analytics team.
C. Convert the 'transactions' Delta table to Iceberg and enable UniForm so that the table can be read as a Delta table.
D. Enable UniForm on the 'transactions' table, set to 'iceberg', so that the table can be read as an Iceberg table.

A junior data engineer is working to implement logic for a Lakehouse table named silver_device_recordings. The source data contains 100 unique fields in a highly nested JSON structure. The silver_device_recordings table will be used downstream for highly selective joins on a number of fields, and will also be leveraged by the machine learning team to filter on a handful of relevant fields. In total, 15 fields have been identified that will often be used for filter and join logic. The data engineer is trying to determine the best approach for dealing with these nested fields before declaring the table schema. Which of the following accurately presents information about Delta Lake and Databricks that may impact their decision-making process?
A. Because Delta Lake uses Parquet for data storage, Dremel encoding information for nesting can be directly referenced by the Delta transaction log.
B. Schema inference and evolution on Databricks ensure that inferred types will always accurately match the data types used by downstream systems.
C. The Tungsten encoding used by Databricks is optimized for storing string data; newly-added native support for querying JSON strings means that string types are always most efficient.
D. By default, Delta Lake collects statistics on the first 32 columns in a table; these statistics are leveraged for data skipping when executing selective queries.

A platform engineer is creating catalogs and schemas for the development team to use. The engineer has created an initial catalog, Catalog_A, and initial schema, Schema_A. The engineer has also granted USE CATALOG, USE SCHEMA, and CREATE TABLE to the development team so that the engineer can begin populating the schema with new tables. Despite being the owner of the catalog and schema, the engineer noticed that they do not have access to the underlying tables in Schema_A. What explains the engineer's lack of access to the underlying tables?
A. The owner of the schema does not automatically have permission to tables within the schema, but can grant them to themselves at any point.
B. Users granted with USE CATALOG can modify the owner's permissions to downstream tables.
C. Permissions explicitly given by the table creator are the only way the platform engineer could access the underlying tables in their schema.
D. The platform engineer needs to execute a REFRESH statement as the table permissions did not automatically update for owners.

A data engineer has created a new cluster using shared access mode with default configurations. The data engineer needs to allow the development team access to view the driver logs if needed. What are the minimal cluster permissions that allow the development team to accomplish this?
A. CAN VIEW
B. CAN RESTART
C. CAN ATTACH TO
D. CAN MANAGE

A data engineer wants to create a cluster using the Databricks CLI for a big ETL pipeline. The cluster should have five workers and one driver of type i3.xlarge and should use the '14.3.x-scala2.12' runtime. Which command should the data engineer use?
A. databricks compute add 14.3.x-scala2.12 --num-workers 5 --node-type-id i3.xlarge --cluster-name Data Engineer_cluster
B. databricks clusters create 14.3.x-scala2.12 --num-workers 5 --node-type-id i3.xlarge --cluster-name Data Engineer_cluster
C. databricks compute create 14.3.x-scala2.12 --num-workers 5 --node-type-id i3.xlarge --cluster-name Data Engineer_cluster
D. databricks clusters add 14.3.x-scala2.12 --num-workers 5 --node-type-id i3.xlarge --cluster-name Data Engineer_cluster

A 'transactions' table has been liquid clustered on the columns 'product_id', 'user_id' and 'event_date'. Which operation lacks support for cluster on write?
A. CTAS and RTAS statements
B. spark.writeStream.format('delta').mode('append')
C. spark.write.format('delta').mode('append')
D. INSERT INTO operations

The data governance team has instituted a requirement that the "user" table containing Personally Identifiable Information (PII) must have the appropriate masking on the SSN column. This means that anyone outside of the HRAdminGroup should see masked social security numbers as ***-**-****. The team created a masking function: What does the data governance team need to do next to achieve this goal?
A. CREATE TABLE users (name STRING); ALTER TABLE users CREATE COLUMN ssn CREATE MASK ssn_mask;
B. CREATE TABLE users (name STRING, int STRING); ALTER TABLE users ALTER COLUMN ssn CREATE MASK if is_member('HRAdminGroup');
C. CREATE TABLE users (name STRING, ssn INT MASKED ssn_mask);
D. CREATE TABLE users (name STRING, ssn STRING); ALTER TABLE users ALTER COLUMN ssn SET MASK ssn_mask;

A data engineer needs to create an application that will collect information about the latest job run including the repair history. How should the data engineer format the request?
A. Call /api/2.1/jobs/runs/list with the run_id and include_history parameters.
B. Call /api/2.1/jobs/runs/get with the run_id and include_history parameters.
C. Call /api/2.1/jobs/runs/get with the job_id and include_history parameters.
D. Call /api/2.1/jobs/runs/list with the job_id and include_history parameters.
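
As a rough illustration of the job-run question above, the sketch below assembles a request URL for a single run with its repair history. It assumes the Jobs API 2.1 `runs/get` endpoint accepts `run_id` and `include_history` as query parameters and that the run id is already known. The helper name `build_runs_get_url`, the workspace host and the run id are hypothetical placeholders, and no request is actually sent.

```python
from urllib.parse import urlencode


def build_runs_get_url(host: str, run_id: int, include_history: bool = True) -> str:
    """Build a Jobs API 2.1 runs/get URL (hypothetical helper, for illustration)."""
    # Booleans are sent as lowercase strings in the query string.
    query = urlencode({
        "run_id": run_id,
        "include_history": str(include_history).lower(),
    })
    return f"{host.rstrip('/')}/api/2.1/jobs/runs/get?{query}"


# Hypothetical workspace URL and run id; a real call would also need an
# Authorization header carrying a bearer token.
url = build_runs_get_url("https://example.cloud.databricks.com", 12345)
print(url)
```

Keeping URL construction in a small pure function makes it easy to unit test separately from the HTTP call.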
A data engineer is working in an interactive notebook with many transformations before outputting the result from display(df.collect()). The notebook includes wide transformations and a cross join. The data engineer is getting the following error: "The spark driver has stopped unexpectedly and is restarting. Your notebook will be automatically reattached." Which action should the data engineer take?
A. Run the notebook on a single node cluster to keep the driver from failing.
B. Rewrite their code to avoid putting memory pressure on the driver node.
C. Check into the Spark UI to see how many jobs are assigned to each stage as they are employing fewer executors.
D. Look at the compute metrics UI to see if the executors have higher than 90% memory utilization.

An analytics team wants to run an experiment in the short term on the customer transaction Delta table (with 20 billion records) created by the data engineering team in Databricks SQL. Which strategy should the data engineering team use to ensure minimal downtime and no impact on the ongoing ETL processes?
A. Deep clone the table for the analytics team.
B. Create a new table for the analytics team using a CTAS statement.
C. Shallow clone the table for the analytics team.
D. Give access to the table for the analytics team.

A data team is working to optimize an existing large, fast-growing table 'orders' with high cardinality columns, which experiences significant data skew and requires frequent concurrent writes. The team notices that the columns 'user_id', 'event_timestamp' and 'product_id' are heavily used in analytical queries and filters, although those keys may be subject to change in the future due to different business requirements. Which partitioning strategy should the team choose to optimize the table for immediate data skipping, incremental management over time, and flexibility?
A. Partition the table with: ALTER TABLE orders PARTITION BY user_id, product_id, event_timestamp
B. Use Z-order after partitioning the table: OPTIMIZE orders ZORDER BY (user_id, product_id) WHERE event_timestamp = current_date() - 1 DAY
C. Cluster the table with: ALTER TABLE orders CLUSTER BY user_id, product_id, event_timestamp
D. Z-order the table with: OPTIMIZE orders ZORDER BY (user_id, product_id, event_timestamp)

A faulty IoT sensor in a factory reports a temperature of -500, causing the LDP pipeline to fail the expectation, which only allows values between -100 and 200 degrees Celsius. The data engineer would like to further analyze the faulty data to better understand the reason behind this. How should the data engineer resolve the faulty data while ensuring data quality standards are maintained?
A. Remove all expectations from the pipeline to prevent any future failures, regardless of data quality.
B. Ignore the error and simply re-run the pipeline, as Databricks will automatically skip the problematic record on the next run.
C. Fix the pipeline code and implement a quarantine logic to isolate the faulty data before re-running the pipeline.
D. Change the expectation action from fail to warn so that invalid records are included in the output and the pipeline does not fail.

A data engineer is optimizing a managed table that suffers from data skew and frequently changing query filter columns. The engineer needs to avoid costly data rewrites when query patterns evolve. The table size is under 1TB. How should the data engineer meet this requirement?
A. Use Hive-style partitioning, as it provides efficient data skipping and is easy to change partition columns at any time.
B. Combine partitioning and Z-ordering to maximize flexibility and minimize maintenance as query patterns change.
C. Enable liquid clustering, as it efficiently handles data skew, allows clustering keys to be changed without rewriting existing data, and adapts to evolving query patterns.
D. Apply Z-ordering, since it allows flexible reorganization of data layout without rewriting existing data and adapts easily to new filter columns.

A security team wants to enforce data protection for a customer table containing customer PII data. To comply with local policies, sales team members should only see customers from their region, while non-admin users should have email addresses masked. Which implementation approach should be used when using Unity Catalog row filters and column masks?
A. Create SQL UDFs for row filtering based on user region and column masking based on group membership, then apply them using ALTER TABLE SET ROW FILTER and ALTER COLUMN SET MASK commands.
B. Use table ACLs to restrict access using tags with GRANT SELECT ON table_name WITH TAG command, and rely on application-level filtering for sensitive data based on user region.
C. Create a view with dynamic WHERE clauses for region filtering and use string replacement functions for email masking using ALTER COLUMN SET MASK command.
D. Implement row filters with SQL UDFs based on user region only since column masks cannot be combined with row filters on the same table, then apply them by recreating the table with DROP TABLE and CREATE TABLE SET ROW FILTER commands.

A data engineer is evaluating tools to build a production-grade data pipeline. The team must process change data from cloud object storage, filter out or isolate invalid records, and ensure the timely delivery of clean data to downstream consumers. The team is small, under tight deadlines, and wants to minimize operational overhead while keeping pipelines auditable and maintainable. Which approach should the data engineer implement?
A. Ingest data directly into Delta tables via Spark jobs, apply data quality filters using UDFs, and use LDP for creating Materialized Views.
B. Use a hybrid approach: Ingest with Auto Loader into Bronze tables, then process using SQL queries in Databricks Workflows to generate cleaned Silver and Gold tables on a schedule.
C. Implement ingestion using Auto Loader with Structured Streaming, and manage invalid data handling and table updates using checkpointing and merge logic.
D. Use LDP to build declarative pipelines with Streaming Tables and Materialized Views, leveraging built-in support for data expectations and incremental processing.

A data engineer is using Structured Streaming to read in transaction data from a bronze Delta table. It was discovered that the data has quality issues where sometimes the transaction value is negative, and when that occurs, the rows need to be routed to a separate quarantine table. They have low latency requirements for the good data since it is used by downstream systems, but the bad data will only be analyzed periodically and has no production dependencies. The quarantine job needs to be implemented so that it cannot affect the production processes that depend on the good data, and the cost of the job needs to be minimized. How should the quarantine process be implemented in order to satisfy these requirements?
A. The streaming job for the good data needs to be modified to filter out records with a transaction value less than 0 before writing. The streaming job for the quarantine data needs to filter out records with a transaction value greater than or equal to 0 before writing. Both should run as separate streams on the same cluster to minimize cost.
B. The existing streaming job for the good data should be updated to incorporate the quarantining of the bad data. A new boolean column called “quarantine” should be added to the dataframe, and its value should be set to true if the transaction value is less than 0 and false if the transaction value is greater than or equal to 0. Processing and storing all the data together will save costs.
C. The existing streaming job for the good data should be updated to incorporate the quarantining of the bad data. Inside a foreachBatch function, the dataframe should be filtered so that records with a transaction value greater than or equal to 0 are written to the good data table and records with a transaction value less than 0 are written to a quarantine table. Try/Catch can be added around the writes in the foreachBatch function so that the stream can’t fail.
D. The streaming job for the good data needs to be modified to filter out records with a transaction value less than 0 before writing, and should not share compute with other processes. The streaming job for the quarantine data needs to filter out records with a transaction value greater than or equal to 0 before writing, and should be implemented on a separate small cluster and only run once a day to minimize cost.

When monitoring a complex workload, being able to see the query plan is critical to understanding what the workload is doing. Where can the visualization of the query plan be found?
A. In the Spark UI, under the Jobs tab.
B. In the Query Profiler, under Query Source.
C. In the Spark UI, under the SQL/DataFrame tab.
D. In the Query Profiler, under the Stages tab.

A data engineer is troubleshooting a slow-running Delta Lake query on Databricks SQL that involves complex joins and large datasets. They need to identify whether the root cause is related to poor data skipping, inefficient join strategies, or excessive data shuffling. Which approach will identify the specific bottlenecks using native Databricks tools?
A. Analyze the Top Operators panel in the Query Profile to identify high-cost operations like BroadcastNestedLoopJoin.
B. Check the query’s execution time in the Jobs UI and correlate it with cluster resource utilization metrics.
C. Enable the EXPLAIN command to review the parsed logical plan and manually estimate shuffle sizes.
D. Use the LIMIT clause to run a subset of the query and compare execution times with the full dataset.

A data engineer is designing a secure data sharing strategy for their organization. The company needs to share sensitive customer analytics data with two different partners. Partner A uses Databricks with Unity Catalog enabled, while Partner B uses Apache Spark on AWS without Databricks. How should the company implement secure data sharing for these scenarios?
A. For Partner A, implement Databricks-to-Databricks sharing (D2D) with Unity Catalog integration and no-token exchange system. For Partner B, use open sharing protocol (D2O) with either bearer tokens or OIDC federation for authentication, ensuring both approaches maintain robust security and governance.
B. Both partners should use the same Delta Sharing approach since security requirements are identical. You should create bearer tokens for both partners and use the open sharing protocol (D2O) for maximum compatibility.
C. Open sharing protocol (D2O) should be used for both partners because it provides better security than D2D sharing. The bearer token approach is always more secure than Unity Catalog’s native authentication.
D. Databricks-to-Databricks sharing (D2D) can only be used within the same cloud provider, so you must use open sharing (D2O) for any cross-cloud scenarios. Unity Catalog governance is not available when sharing with external platforms.

Which approach demonstrates a modular and testable way to use DataFrame transform for ETL code in PySpark?
A.
B.
C.
D.

A company stores account transactions in a Delta Lake table. The company needs to apply frequent account-level corrections (e.g., UPDATE statements) but wants to avoid rewriting entire Parquet files for each change to reduce file churn and improve write performance. Which Delta Lake feature should they enable?
A. Enable automatic file compaction on writes.
B. Enable change data feed on the Delta table.
C. Partition the Delta table by account_id.
D. Enable deletion vectors on the Delta table.

In a Databricks Asset Bundle project, in the file resources/app.yml, the data engineer would like to deploy a Databricks App, databricks_app_deployed, and a Volume, volume_deployed, and grant the Service Principal behind the Databricks App permissions to READ and WRITE to the Volume. How should the data engineer achieve the deployment?
A.
B.
C.
D.

A data engineering team needs to implement a tagging system for their tables as part of an automated ETL process, and needs to apply tags programmatically to tables in Unity Catalog. Which SQL command adds tags to a table programmatically?
A. ALTER TABLE table_name SET TAGS ('key1' = 'value1', 'key2' = 'value2');
B. APPLY TAGS ON table_name VALUES ('key1' = 'value1', 'key2' = 'value2')
C. COMMENT ON TABLE table_name TAGS ('key1' = 'value1', 'key2' = 'value2')
D. SET TAGS FOR table_name AS ('key1' = 'value1', 'key2' = 'value2')

A platform engineer needs to report the resource consumption, categorized by SKU tier, across all workspaces. The engineer decides to use the system.billing.usage system table to create a query. Which SQL query will accurately return the daily usage by product?
A.
B.
C.
D.

A data engineer is running a groupBy aggregation on a massive user activity log grouped by user_id. A few users have millions of records, causing task skew and long runtimes. Which technique will fix the skew in this aggregation?
A. Increase the Spark driver memory and retry.
B. Filter out the skewed users before the aggregation.
C. Use salting by adding a random prefix to skewed keys before aggregation, then aggregate again after removing the prefix.
D. Use reduceByKey instead of groupBy to avoid shuffles.

A job runs four independent tasks (X, Y, Z, W) in parallel to process regional sales data. The Data Engineering team recently updated its cluster policy to ban cost-prohibitive instance types. Task Y now fails due to the newly enforced cluster policy restricting the use of a specific instance type. A data engineer needs to resolve the failure quickly without disrupting the other tasks. How should the data engineer resolve the task failure?
A. Delete the failed run, disable the cluster policy, and re-execute all tasks.
B. Manually create a new cluster for Task Y, update the job configuration, and trigger a full re-run.
C. Use “Repair run”, override the cluster configuration for Task Y to use a permitted instance type, and let Databricks re-run only Task Y.
D. Edit the global cluster policy to allow the restricted instance type, then re-run the entire job.

A data engineer is analyzing a large, partitioned retail dataset in Databricks, where each row represents a sale made by a salesperson. The dataset contains millions of records with the following schema:
sales_df: [salesperson_id: string, region: string, sale_amount: double, sale_date: date]
The data engineer needs to generate a DataFrame that ranks salespeople within each region based on their total cumulative sales, with the highest seller ranked as 1. If multiple salespeople have the same total sales, they should share the same rank. The data engineer wants to implement this logic using a PySpark window function and the dense_rank() function. Which code snippet will perform this ranking?
A.
B.
C.
D.

What describes a primary technical challenge in ensuring consistent PII masking across all nodes in large-scale, distributed Databricks batch and streaming pipelines?
A. PII masking is only required for direct identifiers.
B. Dynamic data masking is applied only at rest, so it does not affect query performance.
C. Masking functions must be standardized and managed through Unity Catalog, with enforcement applied across all relevant datasets to avoid any data inconsistency.
D. Native masking in Databricks automatically synchronizes with all downstream external Databricks systems.
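
The ranking requirement in the sales question above (total sales per salesperson, ranked within each region, ties sharing a rank and no gaps after a tie) can be illustrated without Spark. The following is a plain-Python sketch of dense-rank semantics only, not one of the PySpark answer options. The function name `dense_rank_by_region` and the sample rows are hypothetical.

```python
from collections import defaultdict


def dense_rank_by_region(sales):
    """Dense-rank salespeople per region by total sales (highest total = rank 1).

    `sales` is an iterable of (salesperson_id, region, sale_amount) tuples.
    Returns a dict mapping (region, salesperson_id) to its rank.
    """
    # Step 1: total sales per (region, salesperson), like a groupBy + sum.
    totals = defaultdict(float)
    for person, region, amount in sales:
        totals[(region, person)] += amount

    # Step 2: collect the totals per region, like a window partition.
    by_region = defaultdict(list)
    for (region, person), total in totals.items():
        by_region[region].append((person, total))

    # Step 3: rank by distinct totals in descending order, so ties share a
    # rank and the next distinct total gets the next consecutive rank.
    ranks = {}
    for region, rows in by_region.items():
        distinct_totals = sorted({total for _, total in rows}, reverse=True)
        rank_of = {total: i + 1 for i, total in enumerate(distinct_totals)}
        for person, total in rows:
            ranks[(region, person)] = rank_of[total]
    return ranks


# Hypothetical sample: s1 and s2 tie on total sales in region "EU".
sample = [
    ("s1", "EU", 100.0),
    ("s2", "EU", 60.0),
    ("s2", "EU", 40.0),
    ("s3", "EU", 50.0),
    ("s4", "US", 10.0),
]
ranks = dense_rank_by_region(sample)
```

In this sample the tied salespeople share one rank and the next salesperson receives the following consecutive rank, which is what distinguishes dense ranking from plain ranking.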
A data engineer is working on a Databricks notebook that requires several third-party Python libraries. Some of these are available on PyPI, while others are custom-developed and stored as local.wheel (.whl) and source (.tar.gz) files in an S3 bucket. The goal is to ensure all dependencies are installed and correctly available across multiple jobs running on any automated cluster in a Unity Catalog-enabled workspace. The engineer needs to install the required dependencies in a way that ensures a consistent environment setup across interactive notebooks and jobs and complies with workspace security policies (no internet access). Which approach should the engineer use to install and manage these dependencies while also ensuring reproducibility and compliance?. A. Use an init script on the cluster to install all dependencies using pip, referencing the local file system. B. Install all dependencies manually in the driver node of an interactive cluster, then export the environment and reimport on job clusters using %conda. C. Create a Python wheel file for the entire project, upload it to the Databricks Workspace Files or Volumes, and install it using a Cluster Library or pip install in a requirements.txt declared within a Databricks Asset Bundle. D. Use %pip install in every notebook and job to install packages directly from PyPl and custom S3 paths. A data engineer is using Auto Loader to read in JSON data as it arrives. They have configured Auto Loader to quarantine invalid JSON records. They are noticing that over time, some records are being quarantined even though they are well-formed JSON. The snippet of code is: What is the cause of the missing data?. A. The source data is valid JSON, but doesn’t conform to their defined schema in some way. B. The badRecordsPath location is accumulating many small files. C. The engineer forgot to set the option “cloudFiles.quarantineMode”, “rescue”. D. 
At some point, the upstream data provider switched everything to multi-line JSON. A data engineer wants to enforce the principle of least privilege when configuring ACLs for Databricks jobs in a collaborative workspace. Which approach should the data engineer use?. A. Assign users only the minimum permission level (e.g., CAN RUN or CAN VIEW) required for their role on each job. B. Use only folder-level permissions and avoid setting permissions on individual jobs. C. Grant CAN RUN permission to everyone and CAN MANAGE to a single admin group. D. Grant all users CAN MANAGE permission on all jobs to avoid access issues. A data engineering workspace was automatically enabled for Unity Catalog, creating a workspace catalog. New team members report they can create tables in the default schema but cannot access table in other schemas within the same workspace catalog. Why are the new team members unable to access tables in other schemas?. A. Workspace catalog permissions are not subject to inheritance rules. B. Workspace users receive USE CATALOG and specific privileges on default schema only. C. Tables in other schemas require additional BROWSE privileges that new users don’t receive automatically. D. New users only receive CREATE TABLE privileges on the default schema. A data engineer is implementing a job to download multiple PDF files from a third-party provided REST API endpoint by specifying different report types. The REST API is time-consuming and encounters intermittent errors, so the engineer wants to track each download activity to know when it fails and to retry partially, while providing scalable throughput. The engineer needs to download ten report types, and the list can be changed over time. How should the data engineer achieve this?. A. Use a foreach task with a list of report types as its inputs. B. Define ten Notebook tasks to clearly track which report download failed. C. 
Use a Delta Lake table to track each report download status as 10 rows, and use it as a source table to execute the download function as a Pandas UDF. D. Define a list variable within a Notebook to loop through the report types to download them, and print the download results. Execute it as a Notebook tasks. A data engineer and a platform engineer are working together to automate their system tasks. A script needs to be executed outside of Databricks only if a particular daily Databricks job finishes successfully for the day. Databricks CLI command was used to check the last execution of the job. What are the required command options for that task?. A. databricks jobs list-runs --job-id JOB_ID --start-time-to TODAY_MIDNIGHT_EPOCH_MS --completed-only. B. databricks jobs list-runs --job-id JOB_ID --start-time-from TODAY_MIDNIGHT_EPOCH_MS --active-only. C. databricks jobs list-runs --job-id JOB_ID --start-time-to TODAY_MIDNIGHT_EPOCH_MS --active-only. D. databricks jobs list-runs --job-id JOB_ID --start-time-from TODAY_MIDNIGHT_EPOCH_MS --completed-only. A data engineering team needs to create a SQL Alert that monitors data quality across multiple columns in their customer table. They want to trigger an alert when both the percentage of customers with missing email addresses exceeds 15% AND the percentage of customers with invalid phone number formats exceeds 10%. Which SQL query pattern is appropriate for implementing this multi-column alert condition?. A. SELECT COUNT (*) FROM customers WHERE email IS NULL OR phone_format_invalid = true. B. SELECT email, phone FROM customers WHERE email IS NULL AND phone NOT RLIKE ‘ˆ[0-9-+()\\s]+$’. C. SELECT email_null_pct, phone_invalid_pct FROM (SELECT (COUNT(CASE WHEN email IS NULL THEN 1 END) * 100.0/COUNT (*)) as email_null_pct, (COUNT(CASE WHEN phone NOT RLIKE ‘ˆ[0-9-+()\\s]+$’ THEN 1 END)* 100.0/COUNT (*)) as phone_invalid_pct FROM customers). D. 
SELECT CASE WHEN email_null_pct >15 AND phone_invalid_pct> 10 THEN 1 ELSE 0 END FROM (SELECT (COUNT (CASE WHEN email IS NULL THEN 1 END) * 100.0 / COUNT (*)) as phone_invalid_pct FROM customers) metrics. A data engineer is brining an existing production Databricks job under asset bundle management and wants to ensure that: • The job’s current configuration is captured as YAML, and all referenced files are included in their bundle project. • Future changes to the bundle’s YAML will update the existing job in-place (not create a new job) How should the data engineer successfully move the production job under asset bundle management?. A. Run Databricks bundle generate job --existing-job-id to generate the YAML and download referenced files. Then, run Databricks bundle deploy to deploy the bundle, which will always update the existing job automatically. B. Export the job definition as JSON, convert it to YAML, and place it in your bundle. Then, run Databricks bundle deploy to update the existing job. C. Manually create the YAML configuration for the job in your bundle project, ensuring all settings match the existing job. Then, run Databricks bundle deploy the bundle, which will update the existing job in your workspace. D. Run databricks bundle generate job --existing-job-id to generate the YAML and download referenced files. Then, run Databricks bundle deployment, bind to link the bundle’s job resource to the existing job in Databricks. A data architect is implementing Delta Sharing as part of their data governance strategy to enable secure data collaboration with external partners and internal business units. The architect must establish a permission framework that allows designated data stewards to create shares for their respective domains while maintaining security boundaries and audit compliance. 
Which specific permissions and roles must be assigned to enable users to create, configure, and manage Delta Shares while maintaining proper security governance and access controls?. A. Only workspace admins can create and manage shares. B. Users need the MANAGE SHARES permission on the workspace. C. Users need to be metastore admins or have CREATE SHARE privilege for the metastore. D. Any user with USE_CATALOG privilege can create shares. A data engineer manages a production Lakeflow Spark Declarative Pipeline that processes customer transaction data. The pipeline includes several data quality expectations, such as: transaction_amount > 0 and customer_id IS NOT NULL. These expectations are defined using the EXPECT clause in SQL. The engineer aims to monitor the pipeline’s data quality by analyzing the number of records that passed or failed each expectation during the latest pipeline update. The Lakeflow Spark Declarative Pipelines event logs are stored in a Delta table named event_log_table. For the most recent pipeline update, determine a programmatically approach to extract information like the name of each expectation, associated dataset, count of records that passed the expectation, and count of records that failed the expectation. Which method retrieves the desired data quality metrics from the Lakeflow Spark Declarative Pipelines event log?. A. Access the event_log_table, filter for events where event_type = ‘flow progress’, and parse the details.flow_progress.data_quality.expectations field to extract the required metrics. B. Use the Lakeflow Spark Declarative Pipelines UI to navigate to the specific pipeline, select the dataset, and view the Data Quality tab to manually retrieve the expectation metrics. C. Query the event_log_table for events with event_type = ‘data_quality’ and directly select the passed_records and failed_records fields. D. 
Access the event_log_table, filter for events where event_type = ‘expectatation_result’, and extract the expectation metrics from the details field. Predictive Optimization is an automated Databricks service enabled by default for Unity Catalog Managed tables. It helps maintain Delta tables by continuously optimizing them to ensure optimal performance and costs. Which two operations does Predictive Optimization run to maintain the Delta tables? (Choose two.). A. PARTITION BY. B. COMPACT. C. ANALYZE. D. OPTIMIZE. E. BUCKETING. A data engineer is building a customer data pipeline in Lakeflow Spark Declarative Pipelines. The source is a cloud-based event stream with limited retention containing inserts, updates, and deletes for customer records. These changes are being applied using the AUTO CDC INTO syntax to maintain an SCD Type 1 table as the target table, customer_dim. How should the data engineer build a downstream job that streams from the customer_dim table to only act on updates and delete events, processing data incrementally?. A. Use ignoreChanges flag while streaming from customer_dim to avoid breaking the pipeline during updates and deletes. B. Read change data feed from customer_dim table and apply filters to incrementally act on the change events. C. Streaming from customer_dim table would only be possible in the case of SCD 2 retention. D. When stored as SCD 1, the target of AUTO CDC INTO includes updates and deletes. Streaming from customer_dim can fail due to these operations. Instead, build another stream from the original source. A data engineer is designing a system leveraging Lakeflow Declarative Pipeline technology to process real-time truck telemetry data ingested from JSON files in S3 using Auto Loader. The data includes truck_id, timestamp, location, speed, and fuel_level. The system must support two use cases: 1. Near-real-time monitoring of the latest location, speed, and fuel_level per truck_id for the operations team. 2. 
Daily aggregated reports of total distance traveled and average fuel efficiency per truck_id for the management team. Which approach should the data engineer use for streaming tables and materialized views in the Lakeflow Declarative Pipeline to meet these requirements? A. Define a streaming table to ingest and store the raw telemetry data, and create a streaming table to compute the daily aggregated distance and fuel efficiency per truck_id for reporting. Create a materialized view to compute the latest location, speed, and fuel_level per truck_id for real-time monitoring. B. Define a streaming table to ingest and store the raw telemetry data, and create a materialized view to compute the latest location, speed, and fuel_level per truck_id for real-time monitoring. Create another materialized view to compute the daily aggregated distance and fuel efficiency per truck_id for reporting. C. Define a streaming table to ingest and store the raw telemetry data, and create a streaming table to incrementally compute the latest location, speed, and fuel_level per truck_id for real-time monitoring. Create a materialized view to compute the daily aggregated distance and fuel efficiency per truck_id for reporting. D. Define a materialized view to ingest and store the raw telemetry data, and create a streaming table to compute the latest location, speed, and fuel_level per truck_id for real-time monitoring. Create another materialized view to compute the daily aggregated distance and fuel efficiency per truck_id for reporting.

A platform team lead is responsible for automating the attribution of SQL warehouse usage to individual teams. The requirement is to identify SQL warehouse usage at the individual user level and generate a daily report to be shared with an executive team that includes leaders from all business units. How should the platform lead generate an automated report that can be shared daily? A.
Use the system tables to capture the audit and billing usage data and share the queries with the executive team. This enables the executives to execute the query and see the latest results at any time. B. Use the system tables to capture the audit and billing usage data, create a dashboard with a daily refresh schedule, and share it with the executive team. C. Restrict users from running any SQL query unless they provide all the query details so that the attribution can be calculated and shared with the executive team. D. Let the users run the SQL query and then directly report the usage to the executives. The ownership of the SQL warehouse usage will be with the individual teams.

A data engineering team is collaborating on a Databricks project where each team member needs to develop and test code independently before merging changes into the main branch. They want to avoid accidental overwrites or branch switching issues while ensuring that all work is version-controlled and can be integrated into their CI/CD pipeline. How should the data engineer achieve collaboration? A. Each team member creates their own Databricks Git folder, mapped to the same remote Git repository, and works in their own development branch within their personal folder. B. All team members work in the same Databricks Git folder and perform Git operations (pull, push, commit, branch switching) directly in that shared folder. C. Team members edit notebooks directly in the workspace’s shared folder and periodically copy changes into a Git folder for version control. D. Team members use the Databricks CLI to clone the Git repository and perform Git operations from a cluster’s web terminal.

A data engineer is using the AUTO CDC API in a Lakeflow Spark Declarative Pipeline to propagate deletions from a source table (orders_source) to a target table (orders_target). The source has Change Data Feed (CDF) enabled, but some delete events arrive out of order due to upstream delays.
How does the AUTO CDC API internally ensure deletions are applied correctly despite out-of-order events? A. It ignores deletions if they arrive after updates for the same key. B. It manually sorts incoming events by timestamp before applying changes. C. It runs VACUUM on the target table to purge conflicting records. D. It uses sequence_by to order events and retains tombstones for deleted rows until older sequences are processed.

A data governance team at a large enterprise is improving data discoverability across its organization. The team has hundreds of tables in their Databricks Lakehouse with thousands of columns that lack proper documentation. Many of these tables were created by different teams over several years, with missing context about column meanings and business logic. The data governance team needs to quickly generate comprehensive column descriptions for all existing tables to meet compliance requirements and improve data literacy across the organization. They want to leverage modern capabilities to automatically generate meaningful descriptions rather than manually documenting each column, which would take months to complete. The team is looking for a solution that can understand data patterns, column names, and sample values to create intelligent descriptions. Which approach should the team use in Databricks to automatically generate column comments and descriptions for existing tables? A. Write custom PySpark code using df.describe() and df.schema to programmatically generate basic statistical descriptions for each column. B. Navigate to the table in Databricks Catalog Explorer, select the table schema view, and use the “AI Generate” option, which leverages artificial intelligence to automatically create meaningful column descriptions based on column names, data types, sample values, and data patterns. C. Use the DESCRIBE TABLE command to extract existing schema information and manually write descriptions based on column names and data types.
D. Use Delta Lake’s DESCRIBE HISTORY command to analyze table evolution and infer column purposes from historical changes.

A company processes semi-structured JSON files from an external source using Auto Loader in a classic Databricks job. Occasionally, records arrive with null critical fields, invalid types, or unexpected nested schema variations. The engineer must ensure that malformed or non-conforming records are not dropped silently and are captured in a separate quarantine table. The pipeline should continue processing good records into the Bronze layer without failing the job, and the approach must support both batch and streaming ingestion. The data engineer needs to build a robust ingestion pattern that automatically routes bad records to a quarantine Delta table, while still ingesting good records into the Bronze layer for further processing. Which approach fulfills the quarantine mechanism in this ingestion architecture? A. Create a notebook job with inferSchema = True, write a streaming query with .foreachBatch(), and catch exceptions using try/except to redirect failed batches to quarantine. B. Use Auto Loader with failFast mode set to false and enable schema evolution; invalid records will be silently ignored during ingestion. C. Use Lakeflow Spark Declarative Pipelines with a SQL pipeline; configure it to drop rows with nulls using WHERE critical_fields IS NOT NULL, and rely on audit logs for malformed data. D. Use Auto Loader with LDP and implement an EXPECT() constraint with record audit logic to route bad records.

A data engineer is analyzing transactional data in a PySpark DataFrame df containing customer_id, transaction_timestamp (precise to milliseconds), and amount_spent. The objective is to compute a cumulative sum of amount_spent per customer, strictly ordered by transaction_timestamp.
The cumulative sum must include all transactions from the earliest timestamp up to and including the current row, respecting temporal ordering within each customer partition. Which PySpark code snippet most accurately constructs the appropriate window specification and applies the aggregation to yield the correct cumulative expenditure per customer? A. B. C. D.

A data team is implementing an append-only Delta Lake pipeline that needs to handle both batch and streaming data. They want to ensure that schema changes in the source data can be automatically incorporated without breaking the pipeline. Which configuration should the team use when writing data to the Delta table? A. ignoreChanges = false. B. validateSchema = false. C. overwriteSchema = true. D. mergeSchema = true.

A senior data engineer is planning large-scale data workflows. The current task is to identify the considerations that form a foundation for creating scalable data models, which are essential for effective management of large datasets. The data engineering team has identified the core capabilities of a scalable data model needed to build a modern data platform and provided their reasoning for considering Delta Lake for review. The senior data engineer is responsible for identifying the recommendations that are not valid. Which key features can be ignored while evaluating Delta Lake? A. Delta Lake works with various data formats (e.g., Parquet, JSON, CSV) and integrates well with Spark and Databricks tools. B. Delta Lake optimizes metadata handling, efficiently managing billions of files and facilitating scalability to petabyte-scale datasets. C. Delta Lake’s capability to process data in both batch and streaming modes seamlessly, providing flexibility in data ingestion and processing. D. Delta Lake provides limited support for monitoring and troubleshooting data pipelines, so relevant partner tools have to be identified and set up for enhanced operational efficiency.
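The answer options for the cumulative-sum question above did not survive extraction, so the following is only a plain-Python sketch of the result such a window is expected to produce, not a reconstruction of any option. In PySpark the equivalent specification would typically be Window.partitionBy("customer_id").orderBy("transaction_timestamp").rowsBetween(Window.unboundedPreceding, Window.currentRow). The function name cumulative_spend and the sample rows are hypothetical.

```python
from collections import defaultdict
from datetime import datetime


def cumulative_spend(rows):
    """Return (customer_id, timestamp, amount, running_total) tuples.

    Mimics a window partitioned by customer, ordered by timestamp, with a
    frame running from the first row of the partition to the current row.
    """
    running = defaultdict(float)
    result = []
    # Sorting by (customer, timestamp) stands in for partitionBy + orderBy.
    for customer_id, ts, amount in sorted(rows, key=lambda r: (r[0], r[1])):
        running[customer_id] += amount
        result.append((customer_id, ts, amount, running[customer_id]))
    return result


# Hypothetical sample data with sub-second timestamps, deliberately unordered.
transactions = [
    ("c1", datetime(2024, 1, 1, 9, 0, 0, 500000), 10.0),
    ("c2", datetime(2024, 1, 1, 9, 0, 0, 100000), 5.0),
    ("c1", datetime(2024, 1, 1, 9, 0, 0, 100000), 2.5),
    ("c1", datetime(2024, 1, 2, 8, 30, 0), 7.5),
]
totals = cumulative_spend(transactions)
```

Each customer's running total restarts at zero, and rows are accumulated strictly in timestamp order regardless of arrival order.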
A data engineering team is implementing an append-only data pipeline using Delta Lake, and wants to ensure that data is never modified or deleted once written. Which Delta Lake feature should the data engineer enable to prevent modifications to existing data? A. Delta APPEND_ONLY. B. Delta VACUUM. C. Delta OPTIMIZE. D. Delta Time Travel.

A data engineering team is setting up a Git project to automate integration tests using Databricks Asset Bundles and the Git provider’s CI/CD functionalities. When a pull request containing changes to their pipeline is sent, they need to run a Job to test their data pipeline. What is the correct databricks bundle command sequence to be executed from the Git provider’s CI/CD automation for this task? A. init, deploy, run, validate. B. validate, deploy, run. C. init, validate, deploy, run. D. deploy, run, validate.

A data engineer is attempting to execute the following PySpark code:
df = spark.read.table("sales")
result = df.groupBy("region").agg(sum("revenue"))
However, upon inspecting the execution plan and profiling the Spark job, they observe excessive data shuffling during the aggregation phase. Which technique should be applied to reduce shuffling during the groupBy aggregation operation? A. Use .coalesce(1) after the aggregation. B. Cache the DataFrame df. C. Use a broadcast join. D. Repartition by region before aggregation.

While reviewing a query’s execution in the Databricks Query Profile, a data engineer observes that the “Top operators” panel shows a sort operator with high “Time spent” and “Memory peak” metrics, and the Spark UI reports frequent data spilling. How should the data engineer address this issue? A. Repartition the DataFrame to a single partition before sorting. B. Convert the sort operation to a filter operation. C. Switch to a broadcast join to reduce memory usage. D. Increase the number of shuffle partitions to better distribute data.
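The spilling scenario above comes down to how many rows each task must hold while sorting. The toy model below is a sketch under simplifying assumptions (evenly spread integer keys and a modulo partitioner standing in for whatever partitioner Spark actually uses). It shows how the largest partition shrinks as the partition count grows; in Spark, the partition count after a shuffle is governed by the spark.sql.shuffle.partitions setting. The function name largest_partition is hypothetical.

```python
from collections import Counter


def largest_partition(keys, num_partitions):
    """Row count of the biggest partition under a simple modulo partitioner."""
    sizes = Counter(k % num_partitions for k in keys)
    return max(sizes.values())


# Hypothetical workload: one million evenly spread integer sort keys.
keys = range(1_000_000)

few = largest_partition(keys, 8)     # few, large partitions
many = largest_partition(keys, 200)  # many, small partitions
```

With more partitions, each task sorts a smaller slice, which is why a higher shuffle partition count tends to reduce memory pressure and spilling. Skewed keys would weaken this effect, which the toy model does not capture.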
A data engineer manages a Unity Catalog table customer_data in schema finance that includes sensitive fields like ssn and credit_score. Intern Group should only see masked values, while Analyst Group should only access rows for their assigned region. The data engineer needs to restrict access based on user role and region without duplicating data. How should the data engineer enforce this security policy? A. Use Unity Catalog’s row filters based on the region and column masks based on user roles. B. Create views using current_user() and is_account_group_member() functions, and apply masking logic inside the SQL SELECT clause for each sensitive column. C. Create dynamic views for each user role and manage access with ACLs. D. Use Unity Catalog’s row filters based on the user roles and column masks based on the region.

A data engineer is ingesting JSON files from cloud object storage using Databricks Auto Loader. The source folder may occasionally receive large files of data, which risks overwhelming the stream. To ensure predictable micro-batch sizes, the team wants to throttle ingestion based on the volume of data scanned at 1 GB, regardless of the number of files. Which Auto Loader configuration should the data engineer use to achieve this? A. Configure cloudFiles.maxBytesPerTrigger with 1 GB to place a limit. B. Configure cloudFiles.maxSizePerTrigger with 1 GB to place a limit. C. Configure cloudFiles.maxFilesPerTrigger and estimate the average file size to approximate a size-based throttle of 1 GB. D. Configure cloudFiles.maxPartitionBytes with 1 GB to limit data in each partition.

A data company uses Databricks Unity Catalog and has multiple enterprise data sources, including PostgreSQL, Snowflake, and SQL Server. The central data platform team wants to configure Lakehouse Federation so analysts can query external tables directly in Databricks using Databricks SQL, without duplicating data.
Which steps are necessary to configure Lakehouse Federation in a secure and governed manner? A. Mirror the external datasets into Delta Lake using Auto Loader, and govern them using Data Lineage and System Tables. B. Configure connections and foreign catalogs in Unity Catalog, then grant access to foreign catalogs, schemas, and tables using Unity Catalog permissions. C. Use Partner Connect to create linked datasets, and apply table ACLs at the source system to govern access through Databricks. D. Create external locations and storage credentials to connect to each database, then register foreign tables in Unity Catalog.

Why are Pandas UDFs often preferred over traditional PySpark UDFs in performance-critical applications involving large datasets? A. They minimize memory usage by streaming each row individually through a lightweight Python wrapper, avoiding batch processing overhead. B. They leverage Apache Arrow to enable vectorized operations between the JVM and Python runtimes, reducing serialization costs and improving computational efficiency. C. They allow row-level execution of functions in Python with native Spark optimization, removing the need for columnar execution. D. They eliminate the JVM-Python boundary by bypassing serialization entirely, thereby avoiding data conversion overhead.

A data engineer is setting up a pipeline to ingest data from a message bus system that occasionally delivers duplicate messages. The duplicate messages can be a week apart. The target is a Databricks Delta Lake table where each record should appear exactly once. Which Databricks ingestion pattern should be implemented to handle potential duplicates where events can arrive outside of the configured watermark? A. Use Delta Lake’s change data feed to filter duplicate records. B. Use Delta Lake time travel to identify and remove duplicates. C. Configure Structured Streaming with the dropDuplicates transformation. D. Implement a write operation using MERGE INTO with a unique key.
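The idea behind a key-based merge in the duplicate-message question above can be sketched without Spark. The snippet models the target table as a dictionary keyed by a unique identifier and inserts a row only when its key is not already present, which is roughly what a MERGE INTO with a WHEN NOT MATCHED THEN INSERT clause does. The column name event_id, the helper merge_insert_only, and the sample rows are assumptions made for illustration.

```python
def merge_insert_only(target, batch, key="event_id"):
    """Insert rows whose key is absent from target; return how many were added."""
    inserted = 0
    for row in batch:
        if row[key] not in target:   # corresponds to WHEN NOT MATCHED
            target[row[key]] = row   # corresponds to THEN INSERT
            inserted += 1
    return inserted


# Hypothetical batches: event 1 is redelivered long after its first arrival.
target = {}
first = merge_insert_only(
    target, [{"event_id": 1, "v": "a"}, {"event_id": 2, "v": "b"}]
)
later = merge_insert_only(
    target, [{"event_id": 1, "v": "a"}, {"event_id": 3, "v": "c"}]
)
```

Because the check is made against the whole target rather than against a bounded streaming state, a duplicate is rejected no matter how late it arrives.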
A data engineer needs to design an efficient pipeline that automatically processes new CSV files as they arrive in S3 storage. Which Databricks feature should the data engineer use to meet these requirements? A. Streaming from cloud storage using standard Spark readStream with format("csv") and format("json"). B. COPY INTO SQL command with parameters to track processed files. C. Traditional batch processing with scheduled Databricks Jobs. D. Auto Loader with schema inference and evolution enabled.

A data engineer is implementing liquid clustering on a Delta Lake table and needs to understand how it affects data management operations. The table will be updated frequently with new data. The table is an external table and not managed by Unity Catalog. How does liquid clustering in Delta Lake handle new data that is inserted after the initial table creation? A. New data is rejected if it doesn’t match the clustering pattern. B. New data is automatically clustered during write operations. C. New data is written to a staging area and clustered during scheduled maintenance. D. New data remains unclustered until the next OPTIMIZE operation.

A data engineer is optimizing a MERGE operation on an 800 GB UC-managed table that experiences frequent updates and deletions. Which two actions should the engineer prioritize to improve MERGE performance? (Choose two.) A. Apply liquid clustering using the merge join keys. B. Enable deletion vectors on the table if not already enabled. C. Partition the table by date. D. Use ZORDER on high-cardinality columns. E. Overwrite the table instead of using MERGE.

A data engineer is building a streaming data pipeline to ingest JSON files from cloud storage into a Delta Lake table. The pipeline must process files incrementally, handle schema evolution automatically, ensure exactly-once processing, and minimize manual infrastructure management. How should the data engineer fulfill these requirements? A.
Use Lakeflow Spark Declarative Pipelines with a static DataFrame read, and merge schema with spark.conf.set("spark.databricks.delta.schema.autoMerge.enabled", "true"). B. Use Auto Loader in batch mode with a daily job to overwrite the Delta table. C. Use Lakeflow Spark Declarative Pipelines with Auto Loader, enabling schema inference with "cloudFiles.schemaEvolutionMode" = "addNewColumns". D. Use traditional Spark Structured Streaming with Auto Loader, manually configuring the checkpoint location and enabling schema inference with "mergeSchema" = "true".

Given the following PySpark code snippet in a Databricks notebook:
filtered_df = spark.read.format("delta").load("/mnt/daya/large_table") \
    .filter("event_date > '2024-01-01'")
filtered_df.count()
The data engineer notices from the Query Profile that the scan operator for filtered_df is reading almost all files, despite a filter being applied. What is the probable reason for poor data skipping? A. The Delta table lacks optimization that enables dynamic file pruning. B. The filter condition involves a data type that is excluded from data skipping support. C. The filter is executed only after the full data scan, which prevents data skipping from taking place. D. The event_date column is outside the table’s partitioning and Z-ordering scheme.

When a new Databricks project starts, the central IT team provisions the required infrastructure using Terraform and a Service Principal. This includes creating a Databricks workspace, a Unity Catalog linked to an External Location, and a Databricks group containing all project team members. Project teams must store all assets (e.g., tables and volumes) as managed assets in Unity Catalog. This model hides infrastructure complexity while giving teams autonomy within their catalog. They can create and manage schemas, tables, volumes, and related objects but cannot rename, delete, or change catalog permissions; those remain under IT’s control.
Which rights should the project group be granted to enable this model? A. The group needs to have USE CATALOG and USE SCHEMA on the catalog. B. The group needs to have ALL PRIVILEGES and MANAGE on the catalog. C. The group needs to have ALL PRIVILEGES on the catalog. D. The group should be made OWNER of the catalog.

A data engineer needs to productionize a new Spark application written by a teammate. This application has numerous external dependencies, including libraries, and requires custom environment variables and Spark configuration parameters to be set. Which two methods will help the data engineer accomplish the task? (Choose two.) A. Install libraries on DBFS. B. Add libraries to compute policies. C. Use secrets in init scripts to store configuration data. D. Use compute policies to set system properties, environment variables, and Spark configuration parameters. E. Create init scripts on DBFS.

A healthcare analytics team is implementing a dimensional model in Delta Lake for patient care analysis. They have a date dimension table and are evaluating design options to ensure it supports a wide range of time-based analyses. Which design approach for the date dimension will support efficient time-based querying and aggregation? A. Store the date as a string in ISO format (YYYY-MM-DD) for readability. B. Pre-calculate attributes like fiscal_period, quarter, month_name, day_of_week, and holiday. C. Store only the date value and calculate all time attributes in queries. D. Create separate dimension tables for different calendar systems (fiscal, academic, etc.).

A data engineer is tasked with ensuring that a Delta table in Databricks continuously retains deleted files for 15 days (instead of the default 7 days), in order to permanently comply with the organization’s data retention policy. Which code snippet correctly sets this retention period for deleted files? A.
spark.sql("ALTER TABLE my_table SET TBLPROPERTIES ('delta.deletedFileRetentionDuration' = 'interval 15 days')"). B. from delta.tables import *; deltaTable = DeltaTable.forPath(spark, "/mnt/data/my_table"); deltaTable.deletedFileRetentionDuration = "interval 15 days". C. spark.sql("VACUUM my_table RETAIN HOURS"). D. spark.conf.set("spark.databricks.delta.deletedFileRetentionDuration", "15 days").

A data engineer is building a fraud detection pipeline that calls out to OpenAI via a Python library and needs to include an access token when using the API. Which Databricks CLI command should the data engineer use to create the secret? A. databricks secrets put-secret KEY SCOPE; dbutils.secrets.get(KEY, SCOPE). B. databricks tokens put-token SCOPE KEY; dbutils.tokens.get(SCOPE, KEY). C. databricks secrets put-secret SCOPE KEY; dbutils.secrets.get(SCOPE, KEY). D. databricks tokens put-token KEY SCOPE; dbutils.secrets.get(KEY, SCOPE).

A company has a task management system that tracks the most recent status of tasks. The system takes task events as input and processes events in near real-time using Lakeflow Spark Declarative Pipelines. A new task event is ingested into the system when a task is created or its status is changed. Lakeflow Spark Declarative Pipelines provides a streaming table (table name: tasks_status) for the BI users to query. The table represents the latest status of all tasks and includes 5 columns: task_id (unique for each task), task_name, task_owner, task_status, task_event_time. The table has 3 properties enabled: deletion vectors, row tracking, and change data feed. A data engineer is asked to create a new Lakeflow Spark Declarative Pipeline to enrich the “tasks_status” table in near real-time by adding one additional column representing the task_owner’s department, which can be looked up from a static dimension table (table name: employee). How should this enrichment be implemented? A.
Create a new Lakeflow Spark Declarative Pipeline: use the readStream() function with the option readChangeFeed to read the tasks_status table CDF; enrich with the employee table; create a new streaming table as the result table and use the apply_changes() function to process the changes from the enriched CDF. B. Create a new Lakeflow Spark Declarative Pipeline: use the readStream() function to read the tasks_status table; enrich with the employee table; store the result in a new streaming table. C. Create a new Lakeflow Spark Declarative Pipeline: use the readStream() function with the option skipChangeCommits to read the tasks_status table; enrich with the employee table; store the result in a new streaming table. D. Create a new Lakeflow Spark Declarative Pipeline: use the read() function to read the tasks_status table; enrich with the employee table; store the result in a materialized view.

A data engineer is reviewing the PySpark code to copy a part of the production dataset to the sandbox environment, and needs to be sure that no PII (Personally Identifiable Information) data is being copied. After checking the sales table, the data engineer notices that user emails are the only PII data included and also the only column that identifies the user. Which anonymizing code should be used to achieve the required outcome? A. df.withColumn("user_email", F.expr("uuid()")). B. df.withColumn("user_email", F.sha2("user_email")). C. df.withColumn("hashed_email", sha2("user_email")). D. df.withColumn("user_email", F.regexp_replace("user_email", "@*", "@anonymized.com")). |
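For the PII question above, the property that matters is that the replacement value hides the email yet still identifies the same user consistently. The sketch below illustrates that property with Python's standard hashlib rather than Spark. In PySpark, pyspark.sql.functions.sha2 takes the column and a bit length (for example 256), which the quiz options omit. The helper pseudonymize_email, the lowercasing and trimming step, and the optional salt are assumptions added for illustration, and a hash like this is pseudonymization rather than full anonymization.

```python
import hashlib


def pseudonymize_email(email, salt=""):
    """Return a SHA-256 hex digest that stands in for the email address."""
    # Normalizing case and whitespace is an assumption of this sketch, so that
    # trivially different spellings of one address map to the same value.
    normalized = email.strip().lower()
    return hashlib.sha256((salt + normalized).encode("utf-8")).hexdigest()


# Hypothetical addresses.
a = pseudonymize_email("Alice@example.com")
b = pseudonymize_email(" alice@example.com ")
c = pseudonymize_email("bob@example.com")
```

The same user always maps to the same digest, so joins and per-user aggregations still work in the sandbox, while the original address does not appear in the copied data.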





