Cloudera Data Engineering
Test title: Cloudera Data Engineering
Description: Exam preparation



1. What resource schedulers can be used to manage Spark clusters? (Choose 4)
   - Kubernetes
   - Standalone
   - YARN
   - Mesos
   - Slurm

2. Which code reads text with a comma line delimiter in Spark?
   - `spark.read.text(path).option("lineSeparator", ",")`
   - `spark.read.format("text").option("sep", ",").load(path)`
   - `spark.read.text(path, lineSep=",")`
   - `spark.read.text(path).option("sep", ",")`
   - `spark.read.textFile(path).option("delimiter", ",")`

3. Show all customers eligible for a senior discount. (Choose 2)
   - `nameAgeDF = customersDF.select("name", "age"); nameAgeOver64DF = nameAgeDF.where("age > 64"); nameAgeOver64DF.show()`
   - `customersDF.filter("age > 64").select("name", "age").show()`
   - `customersDF.where("age >= 65").show()`
   - `nameAgeDF = customersDF.select("name", "age").where("age > 64").show()`
   - `customersDF.selectExpr("name", "age").where("age >= 65").show()`

4. Which of the following are TRUE about DataFrames? (Choose 3)
   - DataFrames require a database connection to exist.
   - They consist of a collection of loosely typed Row objects.
   - Rows are organized into columns described by a schema.
   - DataFrames are always untyped and have no schema.
   - They model data similarly to tables in an RDBMS.

5. Return the first 2 rows of a JSON file using a DataFrame.
   - `usersDF = spark.read.json("users.json"); users = usersDF.head(2)`
   - `spark.read.format("json").load("users.json").take(2)`
   - `userDF = spark.read.json("users.json"); users = usersDF.take(3)`
   - `usersDF = spark.read.json("users.json"); usersDF.show(2)`
   - `spark.read.json("users.json").limit(2).collect()`

6. How does Catalyst improve performance? (Choose 2)
   - Minimizes wide (shuffle) operations.
   - Disables whole-stage code generation.
   - Generates additional shuffle stages.
   - Minimizes data transfer between executors.
   - Rewrites all joins as broadcast joins.

7. Advantages of Tungsten. (Choose 3)
   - Supports off-heap allocation.
   - Avoids garbage collection for off-heap data.
   - Its format is much smaller than Java serialization.
   - Requires no schema.
   - Auto-compacts all data.

8. Types of DataFrame operations. (Choose 2)
   - Transformations create new DataFrames.
   - Mutations modify DataFrames in place.
   - Actions return values or save data.
   - DataFrames contain transactional updates.
   - All operations are actions.

9. Examples of chaining transformations. (Choose 2)
   - `df.write.save("/tmp/output")`
   - Scala: `val df = usersDF.select("name", "age").where("age > 20").show`
   - `df.select("col1").where("col1 > 10").groupBy("col1").count()`
   - `df.select("x").cache(); df`
   - Python: `df = usersDF.select("name", "age").where("age > 20").show()`

10. What determines default parallelism in Spark?
    - Number of executors × 2.
    - Fixed at 200.
    - Number of partitions of the parent RDD.
    - Number of columns.
    - Total CPU cores.

11. How to improve performance on large-scale transformations?
    - Use the DataFrame API.
    - Disable code generation.
    - Avoid caching.
    - Increase partitions randomly.
    - Force sort-merge joins.

12. Best practices in CDE Spark applications. (Choose 2)
    - Always run in local mode.
    - Use one `select()` instead of many `withColumn()` calls.
    - Distribute dependencies manually.
    - Build one assembly JAR.
    - Avoid unit tests.

13. Which are wide RDD transformations?
    - `groupByKey()`, `reduceByKey()`, `join()`
    - `collect()`, `count()`, `take()`
    - `map()`, `filter()`, `flatMap()`
    - `repartition()`, `coalesce()`, `mapPartitions()`
    - `union()`, `distinct()`, `sample()`

14. The number of parallel tasks equals the number of…
    - Jobs.
    - Executors.
    - Input partitions.
    - Cores.
    - Output files.

15. Which statements about the Spark RDD execution plan are true? (Choose 2)
    - It uses the Hive metastore.
    - It is based on lineage.
    - It uses the cost-based optimizer (CBO).
    - It requires AQE.
    - It is less optimized than the DataFrame API.

16. What kinds of tasks does Spark perform? (Choose 3)
    - Render
    - Transform
    - Extract
    - Load
    - Compile

17. How does Spark SQL interact with Hive?
    - It reads and writes Hive tables.
    - It only reads.
    - It only writes.
    - It requires RDD conversion.
    - It supports only non-ACID tables.

18. Best practices for real-time Hive. (Choose 3)
    - Use ORC/Parquet.
    - Partition tables.
    - Cache hot tables.
    - Disable vectorization.
    - Broadcast all tables.

19. When does Spark determine partitioning? (Choose 4)
    - On a wide transformation that triggers a shuffle.
    - When an action runs.
    - During query optimization.
    - When creating the SparkSession.
    - When calling `repartition()`/`coalesce()`.

20. How does Catalyst improve queries? (Choose 2)
    - Minimizes data transfer.
    - Minimizes shuffles.
    - Forces Tungsten off.
    - Disables code generation.
    - Increases GC.

21. Which statement about persisting data is true?
    - It always writes to disk.
    - It prevents failures.
    - It avoids recomputation.
    - It is only for RDDs.
    - It removes lineage.

22. When should you persist? (Choose 2)
    - When data is used multiple times.
    - Only for broadcast joins.
    - When data is expensive to recreate.
    - Only in local mode.
    - Always.

23. How to stop an Airflow DAG from running for past dates?
    - `dag.catchup = False`
    - `depends_on_past=True`
    - `max_active_runs=1`
    - `schedule=None`
    - Remove `start_date`.

24. Which HTTP method does Airflow use to trigger events?
    - DELETE
    - PUT
    - POST
    - PATCH
    - GET

25. Which components process DAG files? (Choose 2)
    - DagFileProcessor
    - DagFileProcessorManager
    - TaskWatcher
    - SchedulerLoop
    - DagBagController

26. What does the Airflow `@task` decorator do?
    - Simplifies DAG creation.
    - Requires SubDagOperator.
    - Works only with the KubernetesExecutor.
    - Disables XComs.
    - Takes no parameters.

27. Advantages of Airflow.
    - It has no UI.
    - It supports only cron.
    - It requires Spark.
    - It supports complex pipelines, many operators, and CDE integration.
    - It has no dependencies.

28. Benefits of proper partition sizing. (Choose 2)
    - Reduces stages to one.
    - Avoids spills and maximizes parallelism.
    - Improves scheduling and utilization.
    - Eliminates caching.
    - Eliminates shuffles.

29. Which scenario would AQE improve?
    - Raising shuffle partitions to 10,000.
    - Reordering joins by size.
    - Converting joins to Cartesian products.
    - Disabling coalesce.
    - Forcing broadcast.

30. How to avoid out-of-memory errors? (Choose 2)
    - Use `collect()` often.
    - Use columnar caching with MEMORY_AND_DISK.
    - Run on the driver.
    - Increase executor memory.
    - Disable Tungsten.

31. Conditions for an efficient fact/dimension join. (Choose 2)
    - Always use bucketed joins.
    - The filter on the dimension table must propagate.
    - Disable dynamic partition pruning (DPP).
    - Avoid partition columns.
    - The fact table is partitioned by the filtering column.

32. How to optimize complex Spark SQL? (Choose 3)
    - Automatic query rewriting.
    - Use CSV.
    - Predicate pushdown.
    - Whole-stage code generation.
    - Disable Catalyst.

33. Hive ACID tables and Spark. (Choose 2)
    - Spark writes directly to ACID files.
    - Use insert overwrite on ACID tables.
    - Enable `hive.acid.read`.
    - Disable transactions.
    - Use the HiveWarehouseConnector.

34. Best practice for Spark SQL performance.
    - Avoid partition indexes.
    - Partition data to reduce shuffles.
    - Use JSON.
    - Disable broadcast.
    - Add more columns.

35. How to cache tables in memory? (Choose 2)
    - `spark.catalog.cacheTable("tableName")`
    - `df.persist(MEMORY_ONLY_SER)`
    - `table.cache()`
    - `spark.sql("UNCACHE TABLE t")`
    - `CACHE LAZY TABLE tableName`

36. Which optimizers are used when a DataFrame plan is compiled down to RDDs?
    - Catalyst and Tungsten.
    - CBO only.
    - Tungsten + Parquet.
    - AQE + Kryo.
    - Kryo + Catalyst.

37. Which statements about bucketed tables are true? (Choose 3)
    - Bucketed tables can also be partitioned.
    - A bucket is a directory.
    - They require Hive ACID.
    - A bucket is a file.
    - Rows are assigned to buckets by a hash function.

38. Which statements about the CDE CLI are true? (Choose 2)
    - It can access a CDE service.
    - It is Windows-only.
    - It uses `~/.cde/config.yaml`.
    - It needs a YARN client.
    - It has no authentication.

39. What is a CDE job?
    - A service without a scheduler.
    - A dashboard.
    - A bash script.
    - An Airflow DAG or Spark application with configs and resources that produces job runs.
    - A notebook.

40. What is the difference between a CDE job and a Spark job?
    - A CDE job is a kubectl wrapper.
    - A CDE job can be an Airflow or a Spark job.
    - There is no difference.
    - A Spark job runs only locally.
    - A CDE job is always an Airflow job.

41. At which levels does CDE autoscale?
    - Pod only.
    - Virtual Cluster and Job.
    - Physical cluster.
    - Job only.
    - None.

42. Supported CDE job resources. (Choose 3)
    - File resources (libraries/code).
    - Docker images.
    - Self-signed certificates.
    - A single `.class` file.
    - Python virtual environments.

43. Which update strategies does Iceberg support?
    - Overwrite/Stream.
    - Merge-on-Read and Copy-on-Write.
    - Upsert/CDC.
    - Append/Snapshot.
    - Lambda.

44. What is the relationship between manifests and snapshots?
    - 1:1.
    - A snapshot points to many manifests.
    - Snapshots don't use manifests.
    - Metastore only.
    - A manifest points to many snapshots.

45. Iceberg performance features. (Choose 2)
    - Partition pruning.
    - Optimized metadata.
    - Disabled indexes.
    - JSON format.
    - Removed statistics.

46. What are Iceberg manifest files?
    - The warehouse.
    - User records.
    - Metadata files listing data files.
    - The catalog.
    - A streaming checkpoint.

47. How does a `SELECT *` flow through Iceberg?
    - It directly reads the data files.
    - Manifest list → catalog → manifests → data files.
    - It searches driver storage.
    - It broadcasts metadata.
    - It uses only the Hive metastore.

48. Processing partitioned tables in Spark.
    - Only row-based formats are supported.
    - Nested partitions can contain multiple files.
    - There is one file per partition.
    - No materialization occurs.
    - Partitions cannot be nested.

49. What does the Spark UI show? (Choose 2)
    - Stage metrics.
    - GPU metrics.
    - Number of stages.
    - Network latency.
    - Executor cost.

50. Broadcast vs. shuffle join. (Choose 2)
    - A shuffle join never adds stages.
    - A broadcast join shows no edges.
    - A shuffle join sends both inputs into a third stage.
    - A broadcast join always has 4 stages.
    - A broadcast join sends its data into the other stage.

51. How to make Spark Streaming on Kubernetes highly available?
    - `restartPolicy=Always`
    - `restartPolicy=Never`
    - Disable checkpoints.
    - Restart manually.
    - Put the driver on hostPath storage.

52. What is a drawback of caching DataFrames?
    - It takes executor memory.
    - It requires HDFS HA.
    - It always reduces memory.
    - It forces out-of-core processing.
    - It disables Tungsten.

53. What makes up the Airflow architecture?
    - It is monolithic.
    - Agents/workers with no UI.
    - DAG, Executor, Webserver, DAG folder, Metadata DB.
    - Scheduler + metastore only.
    - No database.

54. How to avoid scanning all partitions? (Choose 3)
    - Predicate pushdown.
    - Filter using partition columns.
    - Dynamic partition pruning.
    - Disable statistics.
    - Always use JSON.

55. Best ways to cache a DataFrame. (Choose 2)
    - `collect()` and store it in the driver.
    - A temp table.
    - Convert to pandas.
    - `persist(MEMORY_AND_DISK)`
    - Save as CSV.

56. Which statement about partitioned tables is true?
    - Nested partitions can store multiple files.
    - Only one layer of partitioning is allowed.
    - Partitions cannot be nested.
    - There is a single file per partition.
    - Partitioned tables cannot have multiple files.

57. How to enable Iceberg in CDE? (Choose 3)
    - Include the Iceberg JARs.
    - Install Hive 1.x.
    - Use a Virtual Cluster with Iceberg enabled.
    - Enable the Iceberg configurations.
    - Convert data to CSV.

58. Best description of a CDE job.
    - An Airflow DAG with configs/resources that produces job runs.
    - A static pod.
    - No scheduler.
    - A script.
    - A UI service.

59. How to include resources in CDE?
    - Upload them to a CDE Resource and reference it in the job.
    - Mount NFS.
    - Copy them manually to the nodes.
    - Environment variables only.
    - Bake them into the driver image.

60. What does `.explain(True)` show? (Choose 3)
    - WholeStageCodeGen only.
    - The physical plan.
    - The parsed logical plan.
    - The optimized plan only.
    - The analyzed logical plan.
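The senior-discount question contrasts `where("age >= 65")` with `where("age > 64")`. A plain-Python sketch (no Spark; the customer rows are invented for illustration) shows the two predicates agree only while ages are integers:

```python
# Plain-Python stand-in for customersDF.where(...); data is hypothetical.
customers = [
    {"name": "Ana", "age": 64},
    {"name": "Luis", "age": 65},
    {"name": "Sara", "age": 70},
    {"name": "Iris", "age": 64.5},  # fractional age: the predicates diverge
]

ge_65 = [c["name"] for c in customers if c["age"] >= 65]  # where("age >= 65")
gt_64 = [c["name"] for c in customers if c["age"] > 64]   # where("age > 64")

print(ge_65)  # ['Luis', 'Sara']
print(gt_64)  # ['Luis', 'Sara', 'Iris'] — 64.5 slips in
```

With an integer age column both filters return the same rows, which is why the quiz can accept either form.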
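The wide-vs-narrow question can be sketched without Spark: `map()` is narrow because it touches each partition independently, while `groupByKey()` is wide because every record must be routed to the partition that owns its key. A minimal model (partition contents invented; integer keys so hashing is deterministic):

```python
# Two input partitions of (key, value) pairs.
partitions = [
    [(1, "a"), (2, "b")],
    [(1, "c"), (3, "d")],
]

# Narrow: map() runs per partition — no data movement.
mapped = [[(k, v.upper()) for k, v in part] for part in partitions]

# Wide: groupByKey() must shuffle — each record goes to hash(key) % numPartitions.
num_out = 2
shuffled = [[] for _ in range(num_out)]
for part in mapped:
    for k, v in part:
        shuffled[hash(k) % num_out].append((k, v))

grouped = {}
for part in shuffled:
    for k, v in part:
        grouped.setdefault(k, []).append(v)

print(grouped)  # key 1's values came from BOTH input partitions
```

The routing step is what shows up as a stage boundary in the Spark UI, which also connects to the shuffle-join question above.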
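The persistence questions hinge on one fact: an unpersisted result is recomputed from lineage on every action, while a persisted one is computed once and reused. A counter makes that visible (plain Python, not Spark's actual caching machinery):

```python
compute_count = 0

def expensive_transform(x):
    """Stand-in for a costly transformation; counts how often it runs."""
    global compute_count
    compute_count += 1
    return x * x

data = [1, 2, 3]

# Without persistence: two "actions" each re-run the whole transformation.
_ = [expensive_transform(x) for x in data]  # action 1
_ = [expensive_transform(x) for x in data]  # action 2
print(compute_count)  # 6 — everything computed twice

# With persistence: materialize once, then both actions reuse the result.
compute_count = 0
cached = [expensive_transform(x) for x in data]  # like persist() + first action
_ = list(cached)                                 # action 1
_ = list(cached)                                 # action 2
print(compute_count)  # 3 — computed once
```

This is why the quiz's correct persist criteria are "used multiple times" and "expensive to recreate": if neither holds, caching only costs executor memory (the stated drawback).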
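The bucketed-table question states that rows are assigned to a fixed number of buckets by a hash function and that each bucket is a file. A rough plain-Python model of that layout (bucket count, keys, and the CRC32 stand-in hash are all illustrative — real engines use their own hash functions):

```python
import zlib

NUM_BUCKETS = 4  # fixed when the table is created (illustrative value)

def bucket_for(key: str) -> int:
    # Stand-in hash; Hive and Spark each use their own hashing scheme.
    return zlib.crc32(key.encode()) % NUM_BUCKETS

rows = ["alice", "bob", "carol", "alice"]
files = {}
for key in rows:
    files.setdefault(f"bucket_{bucket_for(key)}", []).append(key)

# Equal keys always land in the same bucket file, which is what lets two
# tables bucketed the same way join without a full shuffle.
print(files)
```

Because the key-to-bucket mapping is deterministic, matching buckets of two co-bucketed tables can be joined pairwise.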
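Several questions above (partition columns, predicate pushdown, dynamic partition pruning, nested partitions) rest on the same idea: when a table is laid out as `column=value` directories, a filter on a partition column can skip whole directories without opening a single file. A filesystem-free sketch with made-up paths:

```python
# Hypothetical file paths for a table partitioned by `date`, then `region`
# (nested partitions, and a partition may hold several files).
paths = [
    "sales/date=2024-01-01/region=eu/part-0.parquet",
    "sales/date=2024-01-01/region=us/part-0.parquet",
    "sales/date=2024-01-02/region=eu/part-0.parquet",
]

def prune(paths, column, value):
    # Keep only files under the matching partition directory;
    # no file content is ever read.
    return [p for p in paths if f"{column}={value}/" in p]

survivors = prune(paths, "date", "2024-01-02")
print(survivors)  # only the date=2024-01-02 directory is scanned
```

Dynamic partition pruning applies the same trick at runtime: a filter on the dimension side of a join is propagated to the fact table's partition column, which is why the fact/dimension question wants the fact table partitioned by the filtering column.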
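The Iceberg questions describe a snapshot pointing at a manifest list, which in turn points at many manifest files, each listing data files. A toy walk of that tree (all names invented; real Iceberg additionally involves a catalog and a table-metadata file above the snapshot):

```python
# Toy Iceberg-style metadata: snapshot -> manifest list -> manifests -> data files.
manifests = {
    "manifest-a.avro": ["data/f1.parquet", "data/f2.parquet"],
    "manifest-b.avro": ["data/f3.parquet"],
}
snapshot = {"id": 42, "manifest_list": ["manifest-a.avro", "manifest-b.avro"]}

def plan_select_star(snapshot):
    # A SELECT * plans by walking metadata — never by listing directories.
    files = []
    for m in snapshot["manifest_list"]:
        files.extend(manifests[m])
    return files

print(plan_select_star(snapshot))
```

This is also why "optimized metadata" counts as an Iceberg performance feature: per-file statistics in the manifests let the planner drop files before any data is read.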




