RL Deusto
Test title: RL Deusto
Description: For testing



1. Who performs the actions on the environment?
   - The agent.
   - The policy.
   - The environment itself.

2. Should the definition of state in an RL system contain all the information of a particular moment?
   - No.
   - Yes.
   - It depends on the state.
   - It depends on the task.

3. The policy of an RL system is always…
   - A mapping between states and actions.
   - A complex system that may involve neural networks and underlying algorithms that simulate how the environment works.
   - A table.

4. Imagine a policy that contains an autoencoder that simulates the next possible states for each action and evaluates them. It is a:
   - Model-based system.
   - Model-free system.

5. Imagine an RL system where the agent cannot access every piece of information of the environment. Is it still an MDP (Markov property)?
   - No.
   - Yes.

6. Imagine an adversarial game where each agent does not know the other's action beforehand but has a clear view of the full game state, including enemy resources. Does the MDP still hold?
   - No.
   - Yes.

7. In Monte Carlo control (with ε-based exploration), the action taken is always the action that has the maximum value.
   - True.
   - False.

8. Monte Carlo control (with ε-based exploration) follows a:
   - Deterministic policy.
   - Random policy.
   - Stochastic policy.

9. In Q-learning, the new Q-value only depends on the highest-valued action for the new state.
   - True.
   - False.

10. In SARSA, the new Q-value only depends on the highest-estimated action for the new state.
    - True.
    - False.

11. A min-max tree is a:
    - Policy-based method.
    - Value-based method.

12. What is the best way to tackle a long-term reward problem? (Assume all options are viable.)
    - Reshape the long-term reward into a dense, continuous reward function.
    - Increase the discount factor.
    - Curriculum learning.

13. If the reward is increasing but the entropy loss remains low, it means:
    - High exploitation by the policy.
    - High exploration by the policy.
    - Optimal policy achieved.

14. Considering a PPO model where, during training, the reward does not increase after a while, the explained variance remains low, the loss is increasing, and the entropy loss is relatively high, we can conclude that:
    - If the approx_kl is relatively high, the learning rate is too high and the gradient is exploding.
    - If the approx_kl is relatively low, the learning rate is too high and the gradient is exploding.
    - If the value loss is relatively high, the reward function is too complex to be understood.
    - If the value loss is relatively low, the reward function is too complex to be understood.

15. Considering a PPO model, if the approx_kl is relatively low during training, the reward function is a dense and well-defined signal, the value loss is low, and the entropy loss is moderate, but the reward does not increase at all, what could be the problem?
    - An insufficient state definition.
    - Insufficient exploration.
    - A high learning rate.
    - An overcomplicated neural network.

16. Given a situation where defining a proper reward function is nearly impossible, what would be the correct approach?
    - Curriculum learning.
    - Imitation learning.
    - Initialize the model with a huge exploration factor.

17. We want to design a boss that learns to play against a player. Which option is better?
    - Policy-based model.
    - Value-based model.

18. We want to design a boss that learns to play against a player. Which option is better?
    - On-policy.
    - Off-policy.

19. We want to design a boss that learns to play against a player. Which option is better?
    - PPO.
    - SARSA.
    - TD3.
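
For the Monte Carlo control questions (7 and 8): with ε-based (ε-greedy) exploration, the agent usually takes the greedy action but occasionally samples a random one, so the action taken is not always the maximum-valued one and the resulting behaviour is stochastic. A minimal sketch, assuming tabular Q-values indexed as `Q[state][action]`; the function name and signature are illustrative, not from the test material.

```python
import numpy as np

def epsilon_greedy_action(Q, state, epsilon=0.1, rng=None):
    """Sample an action from an epsilon-greedy policy over tabular Q-values.

    With probability epsilon a uniformly random action is taken (exploration);
    otherwise the greedy, highest-valued action is taken (exploitation).
    Because of the random branch, the policy is stochastic and the chosen
    action is not always the one with the maximum value.
    """
    rng = rng or np.random.default_rng()
    n_actions = len(Q[state])
    if rng.random() < epsilon:
        return int(rng.integers(n_actions))  # exploratory (random) action
    return int(np.argmax(Q[state]))          # greedy action
```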

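For the Q-learning and SARSA questions (9 and 10): the two algorithms differ only in how they bootstrap the target for the next state. A minimal sketch, assuming the same tabular `Q[state][action]` layout as above (function names are illustrative):

```python
def q_learning_target(Q, reward, next_state, gamma=0.99):
    # Off-policy (Q-learning): bootstrap from the highest-valued action in the
    # next state, regardless of which action the behaviour policy takes next.
    return reward + gamma * max(Q[next_state])

def sarsa_target(Q, reward, next_state, next_action, gamma=0.99):
    # On-policy (SARSA): bootstrap from the action actually taken next
    # (e.g. sampled epsilon-greedily), not necessarily the highest-valued one.
    return reward + gamma * Q[next_state][next_action]

def td_update(Q, state, action, target, alpha=0.1):
    # Both methods then move Q(s, a) a step of size alpha towards their target.
    Q[state][action] += alpha * (target - Q[state][action])
```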

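For the long-term reward question (12): increasing the discount factor γ makes delayed rewards weigh more in the return G = Σ γ^t r_t. A small worked example with illustrative numbers (not from the test):

```python
def discounted_return(rewards, gamma):
    """Compute G = r_0 + gamma * r_1 + gamma^2 * r_2 + ... for one episode."""
    g = 0.0
    for r in reversed(rewards):
        g = r + gamma * g
    return g

# Sparse long-term reward: nothing for 9 steps, then +1 at the very end.
rewards = [0.0] * 9 + [1.0]
print(discounted_return(rewards, gamma=0.90))  # ~0.387: the delayed reward is heavily discounted
print(discounted_return(rewards, gamma=0.99))  # ~0.914: the delayed reward stays visible to the agent
```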


