Introduction to reinforcement learning, with the Epsilon-Greedy (Bandit game) algorithm

Rana singh
Sep 29, 2019

In deep NLP and unsupervised deep learning, we saw that unsupervised techniques can be used to pre-train supervised models. Reinforcement learning, in contrast, is very different.

Both supervised and unsupervised learning share a common basis: training on data. In supervised learning, we train on labeled data and then predict. In unsupervised learning, on the other hand, we fit a model to find structure in the data, for example by clustering similar points (k-means, GMM) or reducing dimensionality (PCA).

Reinforcement learning, by contrast, guides an agent on how to act in the world. The interface is much broader than just training a model: it is an entire environment, which can be the real world or a simulated one. For example: a robot that vacuums your house, or a bot that fights, defuses bombs, or makes important decisions.

Supervised learning objective: maximize accuracy or likelihood, or minimize a cost. The labels need to be made by humans, which is time-consuming and costly.

A reinforcement learning agent gets feedback as it interacts with the environment. The feedback signals (rewards) are given to the agent automatically by the environment, and the agent's objective is expressed as a goal.

Timing is also very important in reinforcement learning.

Overall: our aim is to program the agent to be intelligent. The agent interacts with the environment by being in a state and taking an action based on that state, which brings it to a new state. The environment gives the agent a reward, which can be positive or negative (but it must be a number). The agent receives the reward at the next step. The goal of the agent is to maximize the reward.
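To make this loop concrete, here is a minimal sketch in Python. The environment `CoinFlipEnv`, its `reset` and `step` methods, and the `run_episode` helper are all hypothetical names made up for illustration; they are not from any particular library. The state is just a step counter, and the reward is +1 or -1.

```python
import random


class CoinFlipEnv:
    """Hypothetical toy environment: the state is a step counter."""

    def __init__(self, n_steps=10, seed=0):
        self.n_steps = n_steps
        self.rng = random.Random(seed)
        self.state = 0

    def reset(self):
        self.state = 0
        return self.state

    def step(self, action):
        # Reward is +1 if the action matches a hidden coin flip, otherwise -1.
        coin = self.rng.randint(0, 1)
        reward = 1.0 if action == coin else -1.0
        self.state += 1                      # move to the next state
        done = self.state >= self.n_steps    # episode ends after n_steps
        return self.state, reward, done


def run_episode(env, policy):
    """Run one episode and return the total reward the agent collected."""
    state = env.reset()
    total_reward = 0.0
    done = False
    while not done:
        action = policy(state)                  # act based on the current state
        state, reward, done = env.step(action)  # environment answers with new state and reward
        total_reward += reward
    return total_reward


env = CoinFlipEnv(n_steps=10)
total = run_episode(env, policy=lambda state: 0)  # a trivial policy: always action 0
print(total)
```

The policy here ignores the state on purpose. A learning agent would replace it with something that improves as rewards come in.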

Epsilon-Greedy (Bandit game)

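It is a simple solution to the exploration-exploitation dilemma. We choose a small number "epsilon" as the probability of exploration; typical values are 5% or 10%. With probability epsilon we pull a random arm (explore), and otherwise we pull the arm with the best current estimate (exploit).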

Eventually, we will discover which arm is the best, since exploring lets us keep updating every arm's estimate.
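The selection rule can be sketched in a few lines of Python. The function name `choose_arm` and the example estimates are assumptions made for illustration.

```python
import random


def choose_arm(estimates, epsilon, rng=random):
    """Return the index of the arm to pull under epsilon-greedy."""
    if rng.random() < epsilon:
        # Explore: pick any arm uniformly at random.
        return rng.randrange(len(estimates))
    # Exploit: pick the arm with the highest current estimate.
    return max(range(len(estimates)), key=lambda a: estimates[a])


estimates = [0.2, 0.5, 0.1]   # made-up current estimates for three arms
rng = random.Random(0)
picks = [choose_arm(estimates, epsilon=0.1, rng=rng) for _ in range(1000)]
print(picks.count(1) / len(picks))  # fraction of pulls that went to the best-looking arm
```

With epsilon = 0.1, most pulls go to the arm that currently looks best, while the other arms still get sampled from time to time.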

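How to estimate the bandit reward:

The natural estimate is the sample mean of the rewards x observed from an arm. The problem is that computing it directly requires keeping track of every x.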
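One common way to avoid storing every x is to update the sample mean incrementally. As a sketch, write \bar{x}_N for the mean of the first N rewards from an arm and x_N for the latest reward (this notation is assumed here, not taken from the original):

```latex
\bar{x}_N
  = \frac{1}{N} \sum_{i=1}^{N} x_i
  = \frac{1}{N} \left( (N-1)\,\bar{x}_{N-1} + x_N \right)
  = \bar{x}_{N-1} + \frac{1}{N} \left( x_N - \bar{x}_{N-1} \right)
```

So each arm only needs to remember two numbers: its current mean and how many times it has been pulled.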

Code: comparing_epsilons.py
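The script itself is not reproduced here. The following is only a minimal sketch of what such a comparison might look like, not the original file. The names `Bandit` and `run_experiment`, the Gaussian rewards with unit variance, the true means (1.0, 2.0, 3.0) and the epsilon values are all assumptions chosen for illustration.

```python
import random


class Bandit:
    """One arm with Gaussian rewards (an assumption of this sketch)."""

    def __init__(self, true_mean, rng):
        self.true_mean = true_mean
        self.rng = rng
        self.estimate = 0.0  # running mean of the rewards seen so far
        self.n = 0           # number of pulls so far

    def pull(self):
        return self.rng.gauss(self.true_mean, 1.0)

    def update(self, x):
        # Incremental mean: no need to store past rewards.
        self.n += 1
        self.estimate += (x - self.estimate) / self.n


def run_experiment(true_means, epsilon, n_pulls, seed=0):
    """Play epsilon-greedy and return the cumulative average reward per step."""
    rng = random.Random(seed)
    bandits = [Bandit(m, rng) for m in true_means]
    cumulative_average = []
    total = 0.0
    for t in range(1, n_pulls + 1):
        if rng.random() < epsilon:
            arm = rng.randrange(len(bandits))  # explore
        else:
            arm = max(range(len(bandits)), key=lambda a: bandits[a].estimate)  # exploit
        x = bandits[arm].pull()
        bandits[arm].update(x)
        total += x
        cumulative_average.append(total / t)
    return cumulative_average


for eps in (0.1, 0.05, 0.01):
    averages = run_experiment([1.0, 2.0, 3.0], eps, n_pulls=10000)
    print(eps, averages[-1])
```

Each list returned by `run_experiment` is one curve of cumulative average reward, which could then be plotted against the step number for each epsilon.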

Plot:

The plot above shows the cumulative average reward for each value of epsilon.
