data science deusto
Test title: data science deusto
Description: test questions




Which of the following is not a phase of the CRISP-DM methodology?
a) Data formatting.
b) Data preparation.
c) Business understanding.

Dummy variables:
a) Can generate columns with any integer number.
b) Will generate columns with numbers in the range [0, 1].
c) Will generate columns with 0s or 1s.

The ANOVA test:
a) Measures the relation between 1 qualitative and 1 quantitative attribute.
b) Measures the relation between 2 qualitative attributes.
c) Measures the relation between 2 quantitative attributes.

The most relevant “V” of Big Data for us is:
a) Volume.
b) Variety.
c) Value.

The data science skillset is composed (mainly) of:
a) Programming skills, domain knowledge, math and statistics.
b) Statistics and programming skills.
c) Machine learning and data visualization.

The correlation coefficient:
a) Is always in the range [-1, +1].
b) Is always greater than 0.
c) Can be any real value.

An association rule:
a) Being uni- or bidirectional depends on the data used.
b) Is unidirectional: X->Y or Y->X.
c) Is bidirectional: X<->Y.

The regular expression “\d(10)0”:
a) Triggers on strings of at least 11 digits ending with a 0.
b) Triggers on strings of 11 digits ending with a 0.
c) Is triggered by a 4-digit string ending with 100.

If we discretize a column with 10 numerical values (not repeated) into 2 bins using the equal-depth procedure:
a) Each of the bins will have exactly 5 values.
b) The distribution of the values in the result cannot be known beforehand.
c) Equal-depth is not a discretization procedure.

Imputation of missing data with the most frequent value:
a) Can reduce data variance, and this is a drawback of the method.
b) Can increase data variance, and this is an advantage of the method.
c) Has no effect on the data variance.

The Accuracy dimension of Data Quality:
a) Is focused on the degree to which the data is available and up to date at the time it is needed.
b) Is focused on the degree to which the data represents reality.
c) Is focused on the degree to which necessary data is available for use.

In a cross table with margin = 1:
a) All the values in the table will sum to 1.
b) All the rows will sum to 1.
c) All the columns will sum to 1.

Parametric methods:
a) Will always return the same result when fed with similar data.
b) Will not always return the same result when fed with similar data.
c) Can control the complexity of the resulting structure independently of the data used.

If we observe that data is only collected during daytime, and never at night:
a) It is an example of Missing Not at Random (MNAR).
b) It is an example of Missing Completely at Random (MCAR).
c) It is an example of Missing at Random (MAR).

Statistically, a CITY attribute in a table should be considered as:
a) An ordinal attribute.
b) A continuous attribute.
c) A nominal attribute.

An API will (usually) return:
a) A CSV file.
b) Data coded in JSON format.
c) Data coded in HTML format.

MAE, MSE and MAPE stand for:
a) Measures used within classification methods.
b) Measures used within clustering algorithms.
c) Measures used within regression models.

If, after using min-max normalization (0-1), we get a value near 0, it means that:
a) It is not possible to know.
b) The original value was near the average value of the data.
c) The original value was near the minimum value of the data.

Outliers are usually classified into:
a) Global, Local and Contextual.
b) Global, Contextual and Collective.
c) Individual and Collective.

The “Lag” and “Lead” operations:
a) Allow calculating arithmetic operations (addition and subtraction) with dates and times.
b) Allow accessing a value stored in a row above/below the current row.
c) Calculate the moving average forwards or backwards, respectively.

If we need to predict customer leakage based on past customer actions, we need:
a) A supervised model of classification.
b) An unsupervised model of clustering.
c) A supervised model of regression.

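As a worked illustration for the min-max normalization question above, here is a minimal Python sketch. The function name `min_max_normalize` and the sample values are assumptions made for this example; they are not part of the test.

```python
def min_max_normalize(values):
    """Rescale numeric values linearly into the range [0, 1]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # All values are identical, so the range is zero; map everything to 0.0.
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


# Illustrative data only (hypothetical ages, not taken from the test).
ages = [18, 20, 35, 60, 78]
scaled = min_max_normalize(ages)

for original, normalized in zip(ages, scaled):
    print(original, round(normalized, 3))
```

Each value is mapped to its relative position between the smallest and the largest value in the column, so the result depends only on those two extremes.
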
Within regular expressions, the character “\”:
a) Allows taking one option or another.
b) Denotes that the previous characters are optional.
c) Cancels a metacharacter.

SMART stands for:
a) Strong, Measurable, Attractive, Relevant and Time-bound.
b) Social, Modern, Achievable, Rigid and Time-bound.
c) Specific, Measurable, Achievable, Relevant and Time-bound.

Which are the 3 (main) tests for measuring relations between variables?
a) Chi-square, ANOVA and standard deviation.
b) Chi-square, ANOVA and Tukey.
c) Chi-square, ANOVA and correlation.

The no free lunch theorem states:
a) No learning algorithm has an inherent superiority over other learning algorithms for all problems.
b) For the same input data and the same initial configuration, a method will always get the same result (or not).
c) The “complexity” or structure of the model can be controlled independently of the data used.

Which are the possible SCALES of the data?
a) Nominal, Ordinal, Discrete and Continuous.
b) Nominal, Ordinal, Interval and Ratio.
c) Factor, Integer and Numeric.

Stopwords are:
a) Each one of the words appearing in a text attribute.
b) Those words that do not provide information about the content of the text.
c) The words at the end of a phrase.

Rolling operations are usually applied when:
a) All the attributes of the data are numeric.
b) There is an attribute for the date.
c) Data can be ordered with respect to time and over a numerical attribute.

Overfitting occurs when:
a) The observed error within the training data increases as the complexity of the model does.
b) The observed error within the validation data is lower than the error within the training data.
c) The observed error within the validation data increases as the complexity of the model does.

If we have a table with ids (1,2,3) and another with ids (3,4,5), the resulting table after an outer join:
a) Will have 1 row.
b) Will have 5 rows.
c) Will have 3 rows.

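The outer join in the last question can be tried out with a short Python sketch. It uses plain dictionaries rather than a dataframe library, assumes ids are unique within each table, and the helper name `full_outer_join` and the value columns are hypothetical.

```python
def full_outer_join(left, right):
    """Full outer join of two {id: value} mappings on their keys.

    Assumes each id appears at most once per table. Ids missing on one
    side get None for that side's value.
    """
    all_ids = sorted(set(left) | set(right))
    return [(key, left.get(key), right.get(key)) for key in all_ids]


# Ids follow the question; the values are made up for illustration.
left = {1: "a", 2: "b", 3: "c"}
right = {3: "x", 4: "y", 5: "z"}

rows = full_outer_join(left, right)
for row in rows:
    print(row)
print("number of rows:", len(rows))
```

The join keeps every id that appears in either table, so only the shared id has values on both sides.
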
The Consistency dimension of Data Quality stands for:
a) The degree to which necessary data is available for use.
b) The degree to which the data is equal within and between datasets.
c) The degree to which the data represents reality.

Parameter tuning is a task within which phase of the CRISP-DM methodology?
a) Modelling.
b) Deployment.
c) Data preparation.

A distance matrix:
a) Will have as many rows and columns as INSTANCES are in the original data.
b) Will have as many rows and columns as the original data.
c) Will have as many rows and columns as ATTRIBUTES are in the original data.

Which of the following terms is NOT associated with CLUSTERING algorithms?
a) Support.
b) Dendrogram.
c) Centroid.

Using MSE for the evaluation of a regression model implies that:
a) Bigger errors penalize more.
b) Errors about larger amounts penalize less.
c) Positive errors penalize more.

The TF-IDF measure increases if:
a) A word appears in MANY instances of a text attribute.
b) A word appears in LONG instances of a text attribute.
c) A word appears in FEW instances of a text attribute.

The use of “<tag>” and “</tag>” is common in:
a) The EXCEL formats.
b) The JSON formats.
c) The HTML formats.

The Apriori algorithm:
a) Can be affected if outliers are present in the data.
b) Can process only numerical attributes.
c) Can process only categorical attributes.

Hot-deck imputation:
a) Generates a new instance for each possible value of the attribute with the missing value.
b) Replaces the missing value with one present in the same attribute, in a sample of similar instances, selected at random.
c) Uses the previous non-missing value for imputation.

For which supervised method do we need a definition of distance?
a) K-Nearest Neighbours.
b) Linear regression.
c) Decision tree.
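
For the distance matrix question above, a minimal Python sketch may help to see the shape of the result. It assumes Euclidean distance (the test does not name a metric), uses `math.dist` from the standard library (Python 3.8 or later), and the helper name `distance_matrix` and the sample points are hypothetical.

```python
import math


def distance_matrix(instances):
    """Pairwise Euclidean distances between all instances (rows)."""
    return [[math.dist(a, b) for b in instances] for a in instances]


# Three instances described by two numeric attributes (illustrative data).
data = [(0.0, 0.0), (3.0, 4.0), (6.0, 8.0)]
matrix = distance_matrix(data)

for row in matrix:
    print(row)
print("shape:", len(matrix), "x", len(matrix[0]))
```

Every cell compares one instance with another, which is why the diagonal is zero and the matrix is symmetric.
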