Test title:
Databricks_Data_Engineer_2

Description:
Exam no. 2, Databricks - Data_Engineer

Author:
David Torres López

Creation date:
31/10/2022

Category:
Computing

Number of questions: 45
Syllabus:
In which of the following scenarios would a data engineer use an all-purpose cluster? When the cluster needs to be shared between multiple users When the data engineer needs to save costs When the data engineer needs to schedule the job to run every hour When the job contains multiple languages When the data engineer needs to terminate the cluster as soon as the job ends.
Which of the following defines a managed table correctly? Managed tables are those which are created by the Databricks admin Managed tables are those for which both metadata and data are managed by Databricks Managed tables are those for which only metadata is managed by Databricks Managed tables are those which are dropped automatically after use Managed tables are those for which only data is managed by Databricks.
Which of the following commands can be used to combine small files into larger files to achieve better performance? VACUUM DELETE OPTIMIZE COMBINE RESTORE.
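For reference, a minimal sketch of Delta file compaction, assuming a Databricks notebook where spark (the SparkSession) is predefined; the table name events and the column event_date are hypothetical.

# Compact the small files backing the Delta table "events" into larger ones.
spark.sql("OPTIMIZE events")
# Optionally co-locate related data while compacting (Delta Z-ordering).
spark.sql("OPTIMIZE events ZORDER BY (event_date)")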
In a notebook, all the cells contain code in the Python language and you want to add another cell with a SQL statement in it. You changed the default language of the notebook to accomplish this task. What changes (if any) can be seen in the already existing Python cells? The Python cells will be grayed out and won't run until you change the default language back to Python The magic command %python will be added at the beginning of all the cells that contain Python code The magic command %python will be added at the end of all the cells that contain Python code The magic command %sql will be added at the beginning of all the cells that contain Python code There will be no change in any cell and the notebook will remain the same.
Which of the following statements is INCORRECT for Delta Lake? Delta Lake is open-source Delta Lake is ACID compliant Delta Lake can run on an existing data lake Delta Lake is compatible with the Spark API Delta Lake stores data with a .dl extension.
A data engineer needs to create a Delta table employee only if a table with the same name does not exist. Which of the following SQL statements can be used to create the table? CREATE OR REPLACE TABLE employee (emp_id int, name string) USING DELTA; CREATE TABLE employee (emp_id int, name string) ; CREATE TABLE employee IF NOT EXISTS (emp_id int, name string) USING DELTA; CREATE TABLE IF NOT EXISTS employee (emp_id int, name string); CREATE OR REPLACE TABLE employee emp_id int, name string;.
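For reference, a minimal sketch of conditional table creation, assuming a Databricks notebook where spark is predefined; Delta is the default table format in Databricks, so USING DELTA is optional here.

spark.sql("""
    CREATE TABLE IF NOT EXISTS employee (emp_id INT, name STRING)
    USING DELTA
""")
# Re-running the statement is a no-op when the table already exists; no error is raised.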
Which of the following is the correct order of git commands to save the data in the central repository? git add -> git pull git add -> git push -> git commit git push -> git add -> git commit git add -> git commit -> git push git pull -> git commit -> git push.
A junior data engineer has just joined your team and has accidentally deleted all the records from the Delta table currency_exchange using the command DELETE FROM currency_exchange. You ran the following query: DESCRIBE HISTORY currency_exchange and found out that the latest version is 6. You need to roll back the delete operation and get the contents back. Which of the following SQL statements will accomplish this task? RESTORE TABLE Currency_exchange TO VERSION AS OF 6 RESTORE TABLE currency_exchange TO VERSION 5 ROLLBACK currency_exchange TO VERSION AS OF 5 RESTORE TABLE currency_exchange TO VERSION AS OF 5 RESTORE currency_exchange TO VERSION 6.
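For reference, a minimal sketch of a Delta time-travel rollback, assuming spark is predefined; since the DELETE created version 6, version 5 still holds the data.

# Inspect the table history to confirm which version to restore.
spark.sql("DESCRIBE HISTORY currency_exchange").show()
# Restore the table to the state it had at version 5 (before the DELETE at version 6).
spark.sql("RESTORE TABLE currency_exchange TO VERSION AS OF 5")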
As a data engineer, you dropped a table by using DROP command and noticed that the table has been removed but the underlying data is still present. Which of the following is true about the situation? The DROP command has not been completed successfully The table is a managed table The data of the table is delete-protected The table is an external or unmanaged table The table is a semi-managed table.
Which of the following pages will allow you to view, update or delete a cluster? Data Page Compute Page Workflows Page Experiments Page Functions Page.
Databricks allows users to change the default language of their notebooks. Which of the following languages cannot be set as a default language? Scala Java Python SQL R.
Which of the following commands can be used to rotate a table on one of its axes? Rotate Pivot Filter Exists Reduce.
A data engineer is working on multiple Databricks notebooks. The next notebook to be run depends on the value of the Python variable next_notebook, which can range from 0 to 7. Which of the following can be used by the data engineer to run the notebooks depending on the value of next_notebook? AutoLoader Multi-hop architecture SQL endpoint Python control flow Delta Lake.
Which of the following statements will increase the salary by 10000 for all the employees that have a rating greater than 3 in the employees table? UPDATE TABLE employees SET salary = salary + 10000 WHERE rating > 3; UPDATE employees SET salary = salary + 10000 WHERE rating > 3; UPDATE employees SET salary = salary + 10000 IF rating > 3; UPDATE employees TABLE SET salary = salary + 10000 WHERE rating > 3; UPDATE employees salary = salary + 10000 WHERE rating > 3;.
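For reference, a minimal sketch of the intended update on the Delta table, assuming spark is predefined.

spark.sql("""
    UPDATE employees
    SET salary = salary + 10000
    WHERE rating > 3
""")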
Find the error in the following SQL statement, which intends to create a new database as per the following requirements; if the database already exists, an error message should be returned. CREATE SCHEMA should be replaced with CREATE DATABASE COMMENT is not a valid parameter, DESCRIPTION should be used IF NOT EXISTS should be removed from the statement Name of the database should always be capitalized, i.e. coMPANY CREATE should be appended with OR REPLACE.
The view new_employees contains the set of employees that joined the company in the past week. Details for some of those employees have already been added to the employees table. Also, the view new_employees and the table employees have the same schema. Which of the following SQL statements would insert only those records into the employees table which are not already present, checking on the basis of the emp_id column? UPSERT INTO employees e USING new_employees n ON e.emp_id = n.emp_id WHEN MATCHED THEN INSERT * INSERT INTO employees VALUES (SELECT * FROM new_employees) MERGE INTO employees e USING new_employees n ON e.emp_id = n.emp_id WHEN MATCHED THEN INSERT ALL MERGE INTO employees e USING new_employees n ON e.emp_id = n.emp_id WHEN MATCHED THEN INSERT MERGE INTO employees e USING new_employees n ON e.emp_id = n.emp_id IF MATCHED THEN INSERT.
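For reference, a minimal sketch of the standard insert-only merge pattern for this requirement, assuming spark is predefined; WHEN NOT MATCHED targets rows in new_employees that have no counterpart in employees.

spark.sql("""
    MERGE INTO employees e
    USING new_employees n
    ON e.emp_id = n.emp_id
    WHEN NOT MATCHED THEN INSERT *
""")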
Which of the following views will be persisted through multiple sessions? Only View View and Global Temporary View View and Temporary View Global Temporary View All of them.
The following Python function show_str() intends to print the string passed to the function. Find the error. def show_str(string_to_show): print('string_to_show') Python function is defined using define keyword and not def The colon should be removed show() function should be used instead of print() The quotes around the argument passed to print() function should be removed The function has no errors, it will print the desired string.
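For reference, a minimal sketch of the corrected function; quoting the argument name prints the literal text string_to_show instead of the variable's value.

def show_str(string_to_show):
    # Pass the variable itself, not a quoted literal.
    print(string_to_show)

show_str("hello")  # prints: hello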
You have recently got to know about a directory that contains 108 parquet files. As a data engineer, you are asked to look at the first hundred rows from the data to check the quality of the data stored in the files. Which of the following SQL statements can be used? SELECT * FROM parquet.path LIMIT 100 You need to convert the files to a table as reading data directly from a directory is not supported in Databricks SELECT * FROM parquet.`path` LIMIT 100 SELECT * FROM parquet.path FIRST 100 SELECT * FROM path LIMIT 100.
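For reference, a minimal sketch of querying Parquet files in place, assuming spark is predefined; /mnt/raw/events is a hypothetical directory path.

# Query the directory directly by backtick-quoting the path after the format name.
spark.sql("SELECT * FROM parquet.`/mnt/raw/events` LIMIT 100").show()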
The question is in the image SELECT emp_code, EXPLODE(details) SELECT emp_code, details.* SELECT emp_code, details:* SELECT emp_code, details.name, details.age, details.salary SELECT emp_code, details:name, details:age, details:salary.
A data analyst needs to count the number of NULL values in column secondary_mobile from the data present in the personal_details table. Which of the following SQL statements, when executed, will fetch the required result? SELECT count(NULL = secondary_mobile) FROM personal_details SELECT count_if(secondary_mobile = NULL) FROM personal_details SELECT count_if(secondary_mobile IS NULL) FROM personal_details SELECT count_null(secondary_mobile) FROM personal_details SELECT count_if(NULL = secondary_mobile) FROM personal_details.
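For reference, a minimal sketch using the count_if aggregate, assuming spark is predefined.

spark.sql("""
    SELECT count_if(secondary_mobile IS NULL) AS null_count
    FROM personal_details
""").show()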
Which of the following keywords should be used to create a UDF in SQL? UDF FUNC FUNCTION USER DEFINED FUNCTION DEF.
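For reference, a minimal sketch of a SQL UDF, assuming spark is predefined; the function name and body are hypothetical.

spark.sql("""
    CREATE FUNCTION IF NOT EXISTS yearly_salary(monthly DOUBLE)
    RETURNS DOUBLE
    RETURN monthly * 12
""")
spark.sql("SELECT yearly_salary(5000)").show()  # 60000.0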
The following SQL statement intends to create a Delta table company_zones using a SQLite table named zones. Which of the following can replace the blank? CREATE TABLE company_zones ______________ OPTIONS ( url = "jdbc:sqlite:/companyDB", dbtable = "zones" ) USING SQLITE USING DELTA USING JDBC USING DATABASE USING SQL.
The following is a snapshot of the workers table where the details column is of array type. A data engineer needs to create a new column filtered_workers which contains values only for those records for which the salary of the worker is greater than 10000. They decide to go with a higher-order function. Which of the following statements should be used by the data engineer to complete the task? FILTER (details, i -> i.salary 10000) AS filtered_workers UDF (details, i -> i.salary 10000) AS filtered_workers REDUCE (details, i -> 10000) AS filtered_workers FILTER (details, salary 10000) AS filtered_workers FILTER (details, i -> i.salary 10000) AS filtered_workers.
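For reference, a minimal sketch of the higher-order FILTER function, assuming spark is predefined and details is an array of structs that contains a salary field.

spark.sql("""
    SELECT *, FILTER(details, i -> i.salary > 10000) AS filtered_workers
    FROM workers
""").show()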
Which of the following is not a feature of AutoLoader? AutoLoader provides incremental processing of new files which are added to cloud storage. AutoLoader can ingest files from AWS S3 and Google Cloud Storage incrementally AutoLoader can ingest data in various formats including but not limited to JSON and CSV AutoLoader uses rescued_columns column for capturing the incompatible data AutoLoader converts all the columns to STRING data type when reading from JSON data source.
Which of the following are the commonly used naming conventions for the three tables of the Incremental multi-hop architecture in Databricks? Bronze -> Silver -> Gold Raw -> Silver -> Gold Silver-> Gold -> Dashboard Silver-> Gold -> Platinum.
A data engineer is working on a PySpark DataFrame silverDF as part of a multi-hop architecture. A data analyst needs to perform a one-time query on the DataFrame using SQL in the same session. They have written the following query to make the DataFrame available to the data analyst. silverDF.createOrReplaceTempView('silver_table') Now, the data analyst starts working on the provided temporary view and selects all the data from the view using the following SQL statement, but they are not able to view the result. SELECT * FROM silverDF; What could be the reason the code does not work? The data engineer has registered the view, which cannot be used in SQL as the name of the view and the DataFrame should be identical The data analyst has used the DataFrame name instead of the name of the view The data engineer has used the wrong function, createTable() should have been used There is no way a PySpark DataFrame can be used in Spark SQL The data analyst should run the query inside the spark.sql() function.
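For reference, a minimal sketch of the hand-off, assuming spark is predefined; the SQL query must reference the registered view name, not the DataFrame variable. The DataFrame below is a hypothetical stand-in for silverDF.

silverDF = spark.range(5)                          # stand-in for the real Silver DataFrame
silverDF.createOrReplaceTempView("silver_table")   # registers the name "silver_table" for SQL
spark.sql("SELECT * FROM silver_table").show()     # query the view name, not "silverDF"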
A data engineer needs to control the schema for some of the columns while reading the raw JSON data to the Bronze table using AutoLoader. Which of the following is the most efficient way of controlling the schema? Use trigger It is not possible to define schema Use schemaHints Contact the Databricks Administrator Use outputMode as append.
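For reference, a minimal sketch of schema hints with Auto Loader, assuming spark is predefined; the source path, schema location, and hinted columns are hypothetical.

df = (spark.readStream
      .format("cloudFiles")
      .option("cloudFiles.format", "json")
      # Pin the types of selected columns; the remaining columns are still inferred.
      .option("cloudFiles.schemaHints", "id BIGINT, event_ts TIMESTAMP")
      .option("cloudFiles.schemaLocation", "/mnt/checkpoints/bronze_schema")
      .load("/mnt/raw/json"))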
A data engineer, who is working on Bronze to Silver hop in medallion architecture, wants to join performers and events tables on location column and retrieve all the records from the performers table but only matching records from the events table. What type of join can be used to accomplish the task? The performers table contains details like name, genre, performer_id etc. of all the performers associated with the company and the events table contains details of all the events like concerts, standup shows etc. and performer_id of the performers who performed at the event. LEFT JOIN RIGHT JOIN INNER JOIN OUTER JOIN Any one from A (LEFT JOIN) and B (RIGHT JOIN).
The following Python code block intends to perform the Silver-Gold transition as a part of multi-hop architecture to find out the maximum products from each country. The source table is warehouse whereas the target table is max_products. What should replace the blank to run the code correctly? spark.table("warehouse") .groupBy("country") ____________________ .write .table("max_products") agg(max(products)) agg.max.products agg.max("products") agg(max("products")) max(agg("products")).
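For reference, a minimal sketch of the aggregation step, assuming spark is predefined; saveAsTable is used for the write here, since the writer call in the original snippet is abbreviated.

from pyspark.sql.functions import max as max_  # alias to avoid shadowing the built-in max

(spark.table("warehouse")
 .groupBy("country")
 .agg(max_("products"))          # maximum products per country
 .write
 .mode("overwrite")
 .saveAsTable("max_products"))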
Which keyword(s) should be added to make a table or a view a Delta Live Table? DELTA LIVE DELTA LIVE STREAMING No keyword is required as every table in Databricks is DLT, by default.
Which of the following about checkpointing is true for a streaming job in Databricks? Checkpointing is used for checking the faulty data in a table Checkpointing helps in making a job fault tolerant Checkpointing can increase the risk of failure Checkpointing helps in parallel processing of jobs Checkpointing is used for scheduling a job in CRON syntax.
Which of the following hops is known for transitioning timestamps into a human-readable format? Raw to Bronze Bronze to Silver Silver to Gold Raw to Gold Gold to Silver.
A data engineer gets a lot of files in an AWS S3 directory specified by a Python variable loc. The frequency of files is not uniform. Some days only 3-4 files are received while on other days it is more than 100. They need to load the new data received into the users table every hour. Which of the following is the most efficient and time-saving technique? Use AutoLoader with the default processingTime trigger Use AutoLoader with processingTime as 60 minutes to trigger the job Manually run the job every hour Use AutoLoader with Multi-Hop Architecture Use AutoRun to run the job every 60 minutes.
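For reference, a minimal sketch of an Auto Loader stream that processes newly arrived files once per hour, assuming spark is predefined; the value of loc, the file format, and the checkpoint/schema paths are hypothetical.

loc = "s3://my-bucket/users/"  # hypothetical value of the loc variable

(spark.readStream
 .format("cloudFiles")
 .option("cloudFiles.format", "json")
 .option("cloudFiles.schemaLocation", "/mnt/checkpoints/users_schema")
 .load(loc)
 .writeStream
 .option("checkpointLocation", "/mnt/checkpoints/users")
 .trigger(processingTime="60 minutes")  # fire once per hour, processing whatever arrived
 .toTable("users"))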
Which of the following is not one of the features of SQL endpoint? SQL endpoint supports multiple cluster size options SQL endpoint comes with an Auto-stop feature which can be used to shut down the endpoint after a specified time SQL endpoint can be scaled to multiple clusters SQL endpoint can be used to run Java applications while preserving the SQL queries SQL endpoint can be connected with tools like Tableau and Power BI using the connection details provided by Databricks for each SQL endpoint.
Which of the following correctly depicts the relationship between a job and a task in a Databricks Workflow? A task consists of one or more jobs which can run linearly or in parallel Jobs and tasks are always equal in any Databricks Workflow A job can consist of a number of tasks The number of jobs in a Workflow is always greater than the number of tasks There is no relationship between a job and a task in a Databricks Workflow as they are independent of each other.
A team of data analysts is working on a DLT pipeline using SQL which updates the real-time weather conditions in different parts of the country. One of the data analysts needs to look at the quality of the data loaded to the system. They need to drop the rows which have a NULL value in the temperature column. Which of the following SQL statements should be added by them to accomplish this task? CONSTRAINT temp_in_range EXPECT (temperature is NOT NULL) CONSTRAINT temp_in_range EXPECT (temperature is NOT NULL) ON VIOLATION DROP ROW CONSTRAINT temp_in_range EXPECT (temperature is NOT NULL) ON VIOLATION FAIL UPDATE CONSTRAINT temp_in_range EXPECT (temperature is NULL) ON VIOLATION DROP ROW CONSTRAINT temp_in_range EXPECT (temperature is NOT NULL) ON VIOLATION DELETE ROW.
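For comparison, a minimal sketch of the same drop-on-violation expectation expressed with the Python DLT API; the table name and upstream source are hypothetical, and this code only runs inside a DLT pipeline.

import dlt

@dlt.table
@dlt.expect_or_drop("temp_in_range", "temperature IS NOT NULL")  # drop rows where temperature is NULL
def weather_clean():
    return dlt.read("weather_raw")  # hypothetical upstream DLT table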
Which of the following about the cron scheduling of jobs in Databricks is correct? Cron scheduling can be used for Manual schedule type Cron scheduling can be used for creating a new cluster Cron scheduling should be set in Data tab from left side menu Cron scheduler supports time zone selection Cron scheduled jobs cannot be edited.
As a data engineer you need to create a football match scorecard for all the users who log into your website. The team has decided to go with a DLT pipeline and is debating which type of pipeline to use for this project, i.e. continuous or triggered. Which of the following arguments made by the members of your team is incorrect? Member A states - A continuous pipeline assures that the end users will get the latest score as they log in to the website Member B says - By using a continuous pipeline our cost will decrease Member C thinks - The triggered pipeline will make the score refresh rate very slow Member D writes - For live events like a soccer match we should use a continuous pipeline Member E proposes - Triggered pipelines can be scheduled or run manually.
An organization wants to decrease the cost of a SQL endpoint which is currently used by only one of their data analysts to query the data once or twice daily. The queries are not complex and can afford latency. Which steps can be taken in order to decrease the cost incurred? Select the minimum cluster size Turn on the Auto-stop feature Select minimum values for Scaling the endpoint All of the above can help in reducing the cost None of the above.
Which of the following can be called as the unit of computation cost while creating a DLT pipeline? DPU/hour DPU/second DBU/hour Number of seconds Number of hours.
Your colleague needs to provide select and metadata read permissions on database app_test to a user named abc. They have written the following query to grant the permission. GRANT SELECT, READ_METADATA TO DATABASE app_test ON `abc` What should be changed in the above statement to make this work? GRANT should be replaced with GRANTS as there are multiple grants in one statement READ_METADATA should be omitted as all the users have metadata read permission READ_METADATA should be omitted as granting SELECT permission implies that the user can read the metadata The position of ON and TO should be swapped The DATABASE keyword should be replaced by CATALOG.
A data engineer has created a request table. Another data engineer has joined the team and needs all the privileges to that table. The data engineer does not remember the SQL command and wants to use the UI to grant permissions. Which of the following can be used by them to accomplish the task? AutoLoader Data Explorer Git Multi-Hop architecture Data Lakehouse.
Which of the following is true for the features of Unity Catalog in Databricks? Unity Catalog is based on ANSI SQL It allows fine-grained access to specific rows and columns Unity Catalog can be used to govern data on different clouds Unity Catalog can control your existing catalogs All of the above.
Which of the following SQL statements can be used by the Databricks admin to change the owner of database error_logs to user dan@nad.adn? ALTER DATABASE error_logs OWNERSHIP TO dan@nad.adn GRANT OWNER FOR SCHEMA error_logs TO dan@nad.adn CHANGE OWNER TO dan@nad.adn FOR SCHEMA error_logs ALTER DATABASE error_logs GRANT OWNER TO dan@nad.adn ALTER SCHEMA error_logs OWNER TO dan@nad.adn.
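For reference, a minimal sketch of transferring ownership, assuming spark is predefined; DATABASE and SCHEMA are interchangeable in Databricks SQL.

# Make dan@nad.adn the owner of the error_logs schema.
spark.sql("ALTER SCHEMA error_logs OWNER TO `dan@nad.adn`")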