Test Big Data

Test title:
Test Big Data

Description:
Test Big Data

Creation date: 2024/12/20

Category: Other

Number of questions: 21

Questions:

Which Spark SQL programming interface would you recommend, in terms of performance, for most general-purpose data processing scenarios?
a) All are acceptable and offer similar performance in general terms.
b) All are acceptable, but Scala is preferred since it is Spark’s native language.
c) All are acceptable, but Java offers better performance, since Spark runs on top of the JVM (Java Virtual Machine).
d) All are acceptable, but the Python and R interfaces are better suited for interactive data analysis.
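
Background for this question: queries written through the Spark SQL DataFrame API are compiled by the Catalyst optimizer into an execution plan that does not depend on the front-end language. A minimal Scala sketch (data and column names are illustrative):

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder().appName("plan-demo").getOrCreate()
    import spark.implicits._

    val df = Seq(("Alice", 45), ("Bob", 31)).toDF("name", "age")

    // explain() prints the plan produced by Catalyst; equivalent PySpark or R
    // code would be compiled into the same optimized plan.
    df.filter($"age" > 40).select("name").explain()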

What is the difference between RDDs, Datasets and DataFrames in Spark?
a) An RDD is a distributed collection of objects. A Dataset enhances an RDD, incorporating encoding mechanisms that improve performance. A DataFrame is a Dataset of objects of class Row.
b) A DataFrame is a distributed collection of objects. An RDD enhances a Dataset, incorporating encoding mechanisms that improve performance. A Dataset is a DataFrame of objects of class Row.
c) An RDD is a distributed collection of objects. A DataFrame enhances an RDD, incorporating encoding mechanisms that improve performance. A Dataset is a DataFrame of objects of class Row.
d) A Dataset is a distributed collection of objects. A DataFrame enhances a Dataset, incorporating encoding mechanisms that improve performance. An RDD is a Dataset of objects of class Row.
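
For reference, a minimal Scala sketch that builds the three abstractions from the same data (assuming a SparkSession in spark, as in the later questions; the names are illustrative):

    import org.apache.spark.sql.{DataFrame, Dataset}
    import spark.implicits._

    case class Person(name: String, age: Long)

    // RDD: a distributed collection of arbitrary JVM objects.
    val rdd = spark.sparkContext.parallelize(Seq(Person("Alice", 45), Person("Bob", 31)))

    // Dataset: typed, with encoders that improve serialization and enable
    // query optimization.
    val ds: Dataset[Person] = rdd.toDS()

    // DataFrame: in Scala it is simply an alias for Dataset[Row].
    val df: DataFrame = ds.toDF()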

Given the following Spark code written in Scala, what piece of code will give us the list of names of clients that have purchased an iPhone, and the total amount of money each one has spent including the iPhone and other possible purchases?
a) purchaseDF.filter(purchaseDF("product") === "iPhone").select("client_id").join(purchaseDF, "client_id").groupBy("client_id").agg(sum("price").as("total_price")).join(clientDF, "client_id").show
b) purchaseDF.groupBy("client_id").agg(sum("price").as("total_price")).filter(purchaseDF("product") === "iPhone").select("client_id").join(clientDF, "client_id").show
c) clientDF.select("client_id").join(purchaseDF, "client_id").groupBy("client_id").agg(sum("price").as("total_price")).filter(purchaseDF("product") === "iPhone").show
d) clientDF.select("client_id").join(purchaseDF, "client_id").filter(purchaseDF("product") === "iPhone").groupBy("client_id").agg(sum("price").as("total_price")).join(clientDF, "client_id").show
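
This question refers to clientDF and purchaseDF DataFrames whose definitions are not reproduced here; below is a sketch of a setup consistent with the similar question later in this test (assuming sc, spark and import spark.implicits._ are available):

    import org.apache.spark.sql.functions._

    // Data taken from the analogous question further down this test.
    case class Client(client_id: Long, name: String, age: Long)
    case class Purchase(product: String, client_id: Long, price: Double)

    val clientDF = sc.parallelize(Seq(
      Client(1, "Alice", 45), Client(2, "Bob", 31),
      Client(3, "Gill", 62), Client(4, "Joe", 15))).toDF()

    val purchaseDF = sc.parallelize(Seq(
      Purchase("iPhone", 2, 1159.99), Purchase("Galaxy", 1, 759),
      Purchase("iPhone", 4, 448.00), Purchase("Xiaomi", 3, 269.95),
      Purchase("Fairphone", 2, 639.41))).toDF()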

The two basic operations that can be performed on a Spark RDD are...
a) Transformations and actions. Transformations change the contents of an input RDD. Actions perform calculations with the RDD data. Actions are incorporated in the execution plan. This plan is triggered by a Transformation.
b) Transformations and actions. Transformations change the contents of an input RDD. Actions perform calculations with the RDD data. Transformations are incorporated in the execution plan. This plan is triggered by an Action.
c) Transformations and actions. Transformations process an input RDD and produce a new one as a result. Actions perform calculations with the RDD data. Transformations are incorporated in the execution plan. This plan is triggered by an Action.
d) Transformations and actions. Transformations process an input RDD and produce a new one as a result. Actions perform calculations with the RDD data. Actions are incorporated in the execution plan. This plan is triggered by a Transformation.
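
A minimal Scala sketch of the distinction, assuming a SparkContext in sc:

    // Transformation: lazily declares a new RDD; nothing is executed yet.
    val doubled = sc.parallelize(1 to 4).map(_ * 2)

    // Action: triggers execution of the accumulated plan and returns a result.
    val total = doubled.reduce(_ + _)   // 2 + 4 + 6 + 8 = 20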

Which one of the following statements about Apache Impala is TRUE?
a) It provides low latency but also low concurrency.
b) It utilizes its own file formats, not compatible with Hive’s metadata.
c) Instead of MapReduce, it implements its own specialized distributed query engine.
d) It is slower than Hive, but fault-tolerant and well suited for time-consuming batch processing.

What is the relationship between Big Data and Data Science?
a) Data science is the process of extracting knowledge or insights from data. When a data science problem cannot be addressed using traditional data techniques, then it becomes a Big Data problem.
b) Big Data is the process of extracting knowledge or insights from data. Data science incorporates the scientific method to this process, improving it.
c) Data science focuses on traditional statistical methods and techniques. Big Data is a more modern approach that relies on the value of large datasets to extract knowledge. Big Data will eventually replace data science as the preferred approach for learning from data.
d) Big data and data science are terms that refer to the same family of techniques and can be interchanged. The term "Big Data" is more commonly used in engineering and computer science, whereas "Data science" is more accepted in scientific and academic environments.

Indicate which of the following classes of Spark MLlib are Transformers and which are Estimators: LogisticRegression, VectorAssembler, Normalizer, CrossValidator.
a) Estimator, Transformer, Estimator, Transformer.
b) Transformer, Estimator, Transformer, Estimator.
c) Estimator, Transformer, Transformer, Estimator.
d) Transformer, Transformer, Estimator, Estimator.
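
As a reminder of the distinction: a Transformer maps a DataFrame to a new DataFrame through transform(), while an Estimator is fitted on a DataFrame through fit() and produces a Transformer (a model). A Scala sketch using two of the classes named above (trainingDF and its columns are illustrative assumptions):

    import org.apache.spark.ml.classification.LogisticRegression
    import org.apache.spark.ml.feature.VectorAssembler

    // Transformer: adds a "features" column; nothing is learned from the data.
    val assembler = new VectorAssembler()
      .setInputCols(Array("age", "income"))
      .setOutputCol("features")
    val assembled = assembler.transform(trainingDF)   // trainingDF is assumed to exist

    // Estimator: fit() learns a model, and the resulting model is a Transformer.
    val lr = new LogisticRegression().setFeaturesCol("features").setLabelCol("label")
    val model = lr.fit(assembled)
    val predictions = model.transform(assembled)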

What are the main characteristics and ideal uses for Hive and Impala?
a) Hive is a data warehouse solution ideal for large scale batch processing. Impala is an analytics database optimized for low latency and high concurrency queries.
b) Hive is a data warehouse solution optimized for low latency and high concurrency queries. Impala is an analytics database ideal for large scale batch processing.
c) Hive is an analytics database optimized for low latency and high concurrency queries. Impala is a data warehouse solution ideal for large scale batch processing.
d) Hive is an analytics database ideal for large scale batch processing. Impala is a data warehouse solution optimized for low latency and high concurrency queries.

In MapReduce...
a) First the Map tasks process partial solutions of the problem. Then the shuffle-and-sort redistributes these partial solutions throughout the cluster. Finally the Reduce tasks combine them into the final result.
b) Multiple Reduce stages are executed to reduce the problem size. Then the Map stage computes the sub-problem partial solutions. Finally the shuffle-and-sort sorts the final results.
c) First the Reduce stage processes different partial solutions of the problem. These are then combined in the shuffle-and-sort stage, and mapped to the final output by the Map stage.
d) First the shuffle-and-sort stage distributes the data throughout the cluster. Then the Map tasks reduce the problem size, by computing partial solutions of the problem. Finally the Reduce tasks combine all these partial solutions into the final result.
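
The map / shuffle-and-sort / reduce flow is easiest to see on the classic word count. The sketch below expresses it with Spark RDDs rather than the Hadoop MapReduce Java API, purely to keep one language across this test; the input path is illustrative:

    // Map: emit a (word, 1) pair for every word in the input.
    val pairs = sc.textFile("hdfs:///tmp/input.txt")   // illustrative path
      .flatMap(_.split("\\s+"))
      .map(word => (word, 1))

    // Shuffle-and-sort + Reduce: pairs with the same key are brought together
    // across the cluster and their partial counts are combined into totals.
    val counts = pairs.reduceByKey(_ + _)
    counts.take(10).foreach(println)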

Which one of the following statements about Apache Pig is TRUE?
a) Pig was originally designed as a high-level interface for Spark.
b) Pig capabilities can be extended with User-Defined Functions (UDFs).
c) Pig operates exclusively on Hive tables. Plain HDFS files or other sources are not compatible.
d) Pig provides a high-level programming language (Pig Latin) that makes it easy to implement the Map and Reduce functions of a MapReduce job.

Which one of the following statements about running Hadoop in the Cloud is FALSE?
a) Most cloud vendors provide IaaS and PaaS solutions that make the creation of Hadoop clusters easier.
b) The scalability and elasticity of most modern clouds can improve the capabilities of a Hadoop ecosystem setup. The cluster can be easily adapted to evolving needs.
c) Hadoop (and many applications from its ecosystem) can be executed in virtualized resources such as virtual machines and storage.
d) Cloud services can be used to allocate infrastructure resources (like machines and networks). Then the ecosystem must be deployed using external tools like Hadoop distributions such as Cloudera.

Given the following Spark code written in Scala (assume the SparkContext is stored in object sc and the SparkSession in spark):

    import org.apache.spark.sql.functions._
    case class Client(name: String, city: String, age: Long)
    val clientSeq = Seq(Client("Alice", "New York", 45), Client("Bob", "Boston", 31),
                        Client("Gill", "Boston", 62), Client("Joe", "Cairo", 15))
    val clientDF = sc.parallelize(clientSeq).toDF()

We want to calculate the maximum age for each possible value of city. What line of code will give us that?
a) clientDF.select(max("age").as("maximum")).groupBy("city")
b) clientDF.groupBy("city").agg(max("age").as("maximum"))
c) clientDF.agg(max("age").as("maximum")).groupBy("city")
d) clientDF.groupBy("city").select(max("age").as("maximum"))
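
For reference, per-group aggregation in the DataFrame API follows the groupBy(...).agg(...) pattern; a small sketch with the data above:

    // Group rows by city, then compute one aggregate per group.
    // With the data above this yields Boston -> 62, New York -> 45, Cairo -> 15.
    clientDF.groupBy("city").agg(max("age").as("maximum")).show()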

Given the following Spark code written in Scala (assume the SparkContext is stored in object sc and the SparkSession in spark):

    case class Client(name: String, age: Long)
    val clientSeq = Seq(Client("Sue", 45), Client("Matt", 31), Client("Gill", 62))
    val clientDF = sc.parallelize(clientSeq).toDF()

We want to select those clients younger than 50 and sort their names alphabetically. What line of code will give us that?
a) clientDF.collect().filter(age < 50).sort()
b) clientDF.filter(clientDF("age") < 50).orderBy("name")
c) clientDF.select(clientDF("age") < 50).orderBy("name")
d) clientDF.orderBy("name").filter(clientDF("age") <= 50)
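
A detail worth keeping in mind for this family of questions: filter() keeps the rows that satisfy a condition, whereas select() projects columns, so selecting a boolean expression yields a column of true/false values rather than fewer rows. A quick sketch with the data above:

    // Keeps Sue and Matt (age < 50) and sorts them by name.
    clientDF.filter(clientDF("age") < 50).orderBy("name").show()

    // By contrast, this returns one boolean column with a value per row.
    clientDF.select(clientDF("age") < 50).show()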

Which one of the following statements about Apache Hive is FALSE?
a) It can be accessed via several different interfaces, like JDBC and Apache Thrift.
b) It is an ACID relational database on top of MapReduce, with its own SQL interface.
c) It can be directly accessed from Spark.
d) It allows the data stored in HDFS to be organized into schemas (databases) and tables.

Given the following pyspark code:

    tuple_array = [('a', 2), ('b', 0), ('c', 1), ('d', 3)]
    my_rdd = sc.parallelize(tuple_array)

We want to obtain an RDD that contains a collection of the characters 'a', 'b', 'c' and 'd', each repeated the number of times indicated in the initial array (that is, 'a' would be repeated 2 times, 'c' one time, etc.). What line of code will give us that?
a) my_rdd.flatMap(lambda p: [ p[0] for i in range(p[1]) ])
b) my_rdd.reduce(lambda p: [ p[0] for i in range(p[1]) ])
c) my_rdd.reduceByKey(lambda p: [ p[0] for i in range(p[1]) ])
d) my_rdd.map(lambda p: [ p[0] for i in range(p[1]) ])
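
The point being tested is the difference between map (one output element per input element, so per-element lists stay nested) and flatMap (each produced list is flattened into the resulting RDD). The same idea expressed in Scala, to stay consistent with the rest of the code in this test:

    val myRdd = sc.parallelize(Seq(("a", 2), ("b", 0), ("c", 1), ("d", 3)))

    // flatMap flattens the per-element sequences: "a", "a", "c", "d", "d", "d".
    val flat = myRdd.flatMap { case (ch, n) => Seq.fill(n)(ch) }

    // map would instead keep one (possibly empty) sequence per input tuple.
    val nested = myRdd.map { case (ch, n) => Seq.fill(n)(ch) }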

Select the correct definition of a SparkML Pipeline.
a) A Pipeline chains multiple Transformers and Estimators together to specify an ML workflow. To create a Pipeline we need to define a PipelineModel object, which is an Estimator. Once the PipelineModel is fitted, it produces a Pipeline object, which is a Transformer.
b) A Pipeline chains multiple Transformers and Estimators together to specify an ML workflow. To create a Pipeline we need to define a PipelineModel object, which is an Estimator. Once the PipelineModel is fitted, it produces a Pipeline object, which is an Estimator.
c) A Pipeline chains multiple Transformers and Estimators together to specify an ML workflow. A Pipeline is itself a Transformer that can be fitted. The result is an Estimator called a PipelineModel.
d) A Pipeline chains multiple Transformers and Estimators together to specify an ML workflow. A Pipeline is itself an Estimator that can be fitted. The result is a Transformer called a PipelineModel.
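
A minimal Scala sketch of the Pipeline API (the stages, column names and trainingDF are illustrative assumptions):

    import org.apache.spark.ml.Pipeline
    import org.apache.spark.ml.classification.LogisticRegression
    import org.apache.spark.ml.feature.VectorAssembler

    val assembler = new VectorAssembler()
      .setInputCols(Array("age", "income")).setOutputCol("features")
    val lr = new LogisticRegression()

    // A Pipeline chains the stages; fitting it runs them in order...
    val pipeline = new Pipeline().setStages(Array(assembler, lr))
    val model = pipeline.fit(trainingDF)   // trainingDF is assumed to exist

    // ...and the fitted result (a PipelineModel) can transform new data.
    val predictions = model.transform(trainingDF)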

We already have statistical inference tools. Why do we need Big Data?
a) Big Data and statistical inference are not in conflict. They address different types of problems.
b) Big Data is more efficient but statistically limited by design, so it can only be used for non-critical applications.
c) Traditional statistical tools are more complicated. Big Data is easier since we do not need to sample or pre-process our data.
d) Big Data is a more modern and advanced alternative to traditional statistics. With time, it will replace statistical inference.

Of the following, which one is the most relevant feature of HDFS?
a) It incorporates file and service redundancy, to improve fault-tolerance and high availability.
b) It provides information about the distributed nature of the data, so that data processing applications (like MapReduce and Spark) can take advantage of data locality.
c) It implements a high-performance, massively parallel data access interface, ideal for HPC problems and specialized hardware.
d) It provides a POSIX-like API to access files in a standard fashion.

Which of the following statements about YARN is FALSE?
a) It will provide optional speculative execution in the future, but it is not yet available in Hadoop 3.x.
b) YARN incorporates several fault-tolerance and high availability features, like redundant ResourceManagers and task failure detection.
c) YARN supports resource allocation restrictions, such as user quotas, execution queues, etc.
d) The YARN Scheduler has a pluggable policy plug-in.

Given the following Spark code written in Scala (assume the SparkContext is stored in object sc and the SparkSession in spark):

    import org.apache.spark.sql.functions._

    case class Client(client_id: Long, name: String, age: Long)
    val clientSeq = Seq(Client(1, "Alice", 45), Client(2, "Bob", 31),
                        Client(3, "Gill", 62), Client(4, "Joe", 15))

    case class Purchase(product: String, client_id: Long, price: Double)
    val purchaseSeq = Seq(Purchase("iPhone", 2, 1159.99), Purchase("Galaxy", 1, 759),
                          Purchase("iPhone", 4, 448.00), Purchase("Xiaomi", 3, 269.95),
                          Purchase("Fairphone", 2, 639.41))

    val clientDF = sc.parallelize(clientSeq).toDF()
    val purchaseDF = sc.parallelize(purchaseSeq).toDF()

We want to show the list of names of clients that have purchased an iPhone, and the total amount of money each one has spent including the iPhone and other possible purchases. What piece of code will give us that?
a) clientDF.select("client_id").join(purchaseDF, "client_id").groupBy("client_id").agg(sum("price").as("total_price")).filter(purchaseDF("product") === "iPhone").show()
b) purchaseDF.filter(purchaseDF("product") === "iPhone").select("client_id").join(purchaseDF, "client_id").groupBy("client_id").agg(sum("price").as("total_price")).join(clientDF, "client_id").show()
c) clientDF.select("client_id").join(purchaseDF, "client_id").filter(purchaseDF("product") === "iPhone").groupBy("client_id").agg(sum("price").as("total_price")).filter(clientDF, "client_id").show()
d) purchaseDF.groupBy("client_id").agg(sum("price").as("total_price")).filter(purchaseDF("product") === "iPhone").select("client_id").join(clientDF, "client_id").show()
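
Whatever option is picked, it helps to know what the expected result looks like for this sample data: only clients 2 (Bob) and 4 (Joe) bought an iPhone, and their totals across all purchases are 1159.99 + 639.41 = 1799.40 and 448.00 respectively. A quick sanity check:

    // Clients that bought an iPhone in the sample data: client_id 2 and 4.
    purchaseDF.filter(purchaseDF("product") === "iPhone").select("client_id").show()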

Having a variable myRDD, a Spark RDD of objects of class Integer, which line of code will return the result of multiplying all of them by 2 and adding them together?
a) myRDD.map((x1: Int) => x1+2)
b) myRDD.map(*2).collect(_+_)
c) myRDD.reduce((x1: Int, x2: Int) => (x1*2)+(x2*2))
d) myRDD.map((x: Int) => x*2).reduce(_+_)
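
A quick sketch of the map-then-reduce pattern on a small RDD, assuming a SparkContext in sc:

    val myRDD = sc.parallelize(Seq(1, 2, 3, 4))

    // Multiply every element by 2, then add them all together: 2 + 4 + 6 + 8 = 20.
    val total = myRDD.map((x: Int) => x * 2).reduce(_ + _)

    // Note: reduce((x1, x2) => (x1*2)+(x2*2)) is not equivalent, because partial
    // sums produced in earlier steps would be doubled again in later steps.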
