Welcome to this lecture series on reinforcement learning in the brain. In this video we will look at three-factor learning rules. Three-factor learning rules arise from reinforcement learning. What is the difference between three-factor rules and two-factor rules? We will see that neuromodulators in biology can act as a third factor, and we will also see experimental support for three-factor learning rules.

A learning rule with two factors is Hebbian learning. It is essentially unsupervised learning: there is no notion of success, no notion of reward. The weight change depends on the state of the presynaptic neuron, for example spike arrival at the synapse, on the state of the postsynaptic neuron, for example its spiking activity, and on the weight itself. With three factors, we have in addition a notion of reward or of success.

Let's have a look at this. Here is a situation where a mouse is in a maze. Suppose the mouse moves forward; now it is here and has to decide whether it goes left or right. It decides to go to the right. At the location where it sits right now, a certain subset of cells switches on which represent the place, and then the mouse decides to move right, which means that in the representation of the action plan, for example in the striatum, a subset of neurons is also active. What we have in three-factor rules is that the co-activation of the sending neuron (the presynaptic neuron) and the receiving neuron (the postsynaptic neuron) leads to a synaptic eligibility trace. It sets a flag, it marks the synapse, but by itself it is not yet a weight change. This synaptic eligibility trace keeps a memory over a second, or two, or three, but then it decays. So when the mouse moves to the right, there is no success signal, and the trace will just decay and go away.

Later the mouse will again be at the same location, and this time it decides to turn left. Now it is a different combination of active neurons: the same set of place cells, but different active neurons, which again are marked by an eligibility trace. And this time there is indeed a success signal. The success signal is broadly distributed across the whole network, and it comes with a delay of one second, because the mouse actually has to move towards the Swiss cheese.
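The maze scenario can be sketched as a small Python simulation. This is a toy illustration under stated assumptions, not a model from the lecture: one place cell projects to two hypothetical action neurons (`RIGHT` and `LEFT`), time runs in 0.1 s steps (`DT`), the trace decays with an assumed time constant of 2 s (`TAU_E`), the learning rate `ETA` is 0.5, and the success signal is a single pulse 1 s after the decision. The helper `run_trial` is a hypothetical name. What the sketch shows is that the pulse is broadcast to every synapse, but only the flagged one changes.

```python
import math

# Toy illustration: all constants below are assumptions, not values from the lecture.
DT = 0.1                        # time step in seconds
TAU_E = 2.0                     # eligibility-trace time constant in seconds
LAM = math.exp(-DT / TAU_E)     # per-step decay factor, 0 < LAM < 1
ETA = 0.5                       # learning rate
RIGHT, LEFT = 0, 1              # indices of two hypothetical action neurons


def run_trial(w, action, reward, reward_delay=1.0, duration=4.0):
    """One visit to the junction; returns new weights place cell -> action neurons."""
    w = list(w)
    e = [0.0] * len(w)                      # one eligibility trace per synapse
    reward_step = round(reward_delay / DT)  # time step at which the success pulse arrives
    for step in range(round(duration / DT)):
        pre = 1.0 if step == 0 else 0.0     # place cell fires at the junction
        for i in range(len(w)):
            post = 1.0 if (step == 0 and i == action) else 0.0
            # Co-activation sets the flag; afterwards it only decays by LAM.
            e[i] = LAM * e[i] + pre * post
        # Third factor: one scalar, sent to every synapse, arriving with a delay.
        m = reward if step == reward_step else 0.0
        for i in range(len(w)):
            # Only flagged synapses turn the broadcast signal into a weight change.
            w[i] += ETA * m * e[i]
    return w


w = [0.5, 0.5]
w = run_trial(w, action=RIGHT, reward=0.0)  # turned right: no success, trace just decays
w = run_trial(w, action=LEFT, reward=1.0)   # turned left: cheese found about 1 s later
print(w)  # RIGHT weight unchanged, LEFT weight increased
```

Increasing `reward_delay` (while keeping it below `duration`) makes the change of the `LEFT` synapse smaller, because the trace has decayed further by the time the pulse arrives.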
And it likes the sweet Swiss cheese, so the cheese acts as a reward signal, as a success signal. In this case the eligibility trace is now transformed into an actual weight change. Inside the brain, the dopamine signal is able to transmit reward or success.

Reinforcement learning uses such success signals, defined as reward minus expected reward. In TD learning, the expected reward is the difference between two values, the value of the current state minus the discounted value of the next state; the same holds in SARSA, with action values. In policy gradient, it could be the reward compared to a running average, which serves as the expected reward. So the different reinforcement learning algorithms that we have seen in the book of Sutton and Barto all give rise to three-factor learning rules.

Three-factor learning rules have a presynaptic part, a postsynaptic part, a weight dependence and, in addition, the success part. Pre and post together form a condition; in addition we need the success signal, and only then will the weight change. The two signals pre and post are local, in the sense that they are synapse-specific: they pick out the synapse. The success signal, in contrast, is a broadly diffused, or broadcast, signal, and in the brain neuromodulators can take over this role.

So, again, a comparison. On the left-hand side we have Hebbian learning. Hebbian learning is unsupervised learning; the changes are sort of passive. There is no idea of action, no idea of success; it exploits statistical correlations between the two neurons. With it you can do PCA or ICA, and it is good for development: for setting up a state representation and developing good filters. With reinforcement learning we have, in addition, the success signal, which is useful for learning a new behavior, and that is indeed
the topic of reinforcement learning. So, again, the neuromodulator is the new signal. The neuromodulator can transmit success in the sense of reward minus expected reward, but it could also signal interestingness, surprise, attention or novelty. Whatever its specific role, we need it as a third factor to implement the weight change.

Now, what are these neuromodulators? A first, famous candidate is dopamine, as I mentioned earlier. Dopamine neurons, the cells that send out dopamine, sit somewhere deep down in the brain, and they make projections basically all over the brain. It is really a broadcast signal; only this region here, which is related to visual cortex, is not reached. So we should think of dopamine as a broadcast signal that is sent all over the network, with nearly global action. Dopamine is one example; another one is noradrenaline, also called norepinephrine. Again, it comes from special neurons that sit very deep in the brain and send out signals nearly everywhere.

The classic idea, established by Wolfram Schultz and colleagues a long time ago, is that dopamine is related to success, where success is defined as reward minus expected reward, and we will see evidence in a minute. Dopamine also reacts a little to novelty and surprise, so the dopamine axis need not be perfectly aligned with the success axis. Other neuromodulators, like noradrenaline, could then signal surprise, and in the end different neuromodulators span different axes: novelty, surprise, success, reward, stress, attention. These are often emotional signals that are spread out all over the brain.

The formalism in a general framework is
as follows. First, I have a signal from the activity of the presynaptic neuron, so it could be some function of x_j. Second, I have a signal arising from the activity of the postsynaptic neuron, again some function of v_i. This co-activation sets the eligibility trace; that is step one. The eligibility trace then decays over time, because the decay factor λ (lambda) is smaller than one. But if, in time, we have a success signal, this success signal is transmitted by a neuromodulator that acts as a third factor in the learning rule, and this, in the end, implements the change of the connection: the connection from neuron j to neuron i will be strengthened.

What I showed in an earlier video is that Hebbian changes can be induced by co-activation of the pre- and postsynaptic neurons. A strong signal is pre-post-post. I would also call it a Hebbian signal if the postsynaptic neuron does not emit spikes but its voltage, the membrane potential, is elevated above its resting value: then I still have a kind of coincidence between presynaptic and postsynaptic activity.

Now the new thing is the three-factor rule. Yes, we have pre and post together, and pre-post-post is a very strong signal, but then we have the modulator. The modulator could be dopamine, and, importantly, it could come with a delay of a second: the mouse has to walk a little before it finds the cheese. First I take the action, then I get my reward.

Neuroscience as an experimental community has for a long time focused on different Hebbian paradigms for inducing synaptic changes; more recently the focus has turned to three-factor rules. Again, here is the idea. I have presynaptic activity:
the green neuron is active. I have postsynaptic activity: this neuron is active. The green neuron also makes a contact here, but that postsynaptic cell is not active. Hence it is only this synapse here that sets an eligibility trace, because this is the synapse that has seen joint activity of the pre- and postsynaptic neurons. The synaptic flag plays the role of the mathematical eligibility trace. If later a success signal comes, and this success signal, dopamine, will be distributed all over the brain, then this one specific synapse is strengthened. The other synapses, which only receive the success signal, will not change.

So: step one, co-activation sets the eligibility trace; step two, the trace decays; step three, if a neuromodulator comes in before the trace has decayed, the eligibility trace is translated into a weight change, which means that the connection is strengthened.

And this is how the experiments were done, in 2014 by Yagishita et al. They used the pre-post-post stimulus, repeated several times, which is a very strong induction protocol, and in addition they gave a dopamine signal, here denoted by DA. This dopamine signal could, for example, come with a delay of one second, and you see that with a delay of one second I get a positive change of this synapse. If it comes immediately afterwards, the change is slightly stronger, and a slight overlap is even better. But if the signal comes way too early or way too late, it will not induce a weight change.

These experiments were done in the striatum. That is interesting, because the striatum is involved in action selection. Moreover, the experimentalists could check that after this short protocol, which lasts a second, plus maybe another second of delay for
dopamine, the weight change remains for at least 50 minutes. This is seen in these figures, which you may have seen in an earlier video: in the little plots, where the red triangle points, you see the spine that carries the synapse. At this point here they apply the stimulation protocol, and you see that two minutes later this synapse is stronger, and 50 minutes later the same synapse is still strong. So yes, the change is induced rapidly, and yes, it is stable: it persists over 50 minutes.

So according to this experiment, three factors are needed for synaptic changes: a presynaptic factor, which could be spikes of the presynaptic neuron or the effect of spike arrival at the synapse, some low-pass filter of the spikes; a postsynaptic factor, which could be the spikes of the postsynaptic neuron, an increased voltage, or spikes minus the expected number of spikes for this situation; and a third factor, which could be a neuromodulator such as dopamine. The pre- and postsynaptic part is standard long-term potentiation.

Learning rules in the brain: Hebbian learning depends on pre- and postsynaptic activity; reinforcement learning depends, in addition, on neuromodulators such as dopamine that indicate reward. The three-factor rule needs a presynaptic signal, a postsynaptic signal and a neuromodulator signal, but they do not have to occur at the same time: the neuromodulator can come with a delay.
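The general formalism from the lecture (a presynaptic term, a postsynaptic term, an eligibility trace that decays with a factor λ < 1, and a broadcast third factor) can be sketched in Python. The lecture states the rule only in words; writing it as e_ij ← λ e_ij + f(x_j) g(v_i) and Δw_ij = η M e_ij is an assumption here, as are the following choices: f and g are the identity on 0/1 activity, the weight dependence is omitted, the values of `eta`, `lam` and `alpha` are arbitrary, and `success_signal`, `three_factor_step` and `weight_change_for_delay` are hypothetical helper names. The third factor M is taken as reward minus a running average of past rewards, the policy-gradient-style success signal mentioned in the lecture.

```python
# Sketch of a generic three-factor rule. Assumptions (not from the lecture):
# f and g are identity on 0/1 activity, the weight dependence is omitted,
# and eta, lam, alpha are arbitrary illustrative values.


def success_signal(reward, r_bar, alpha=0.1):
    """Third factor M = reward - expected reward, with a running average r_bar
    as the expected reward (policy-gradient style). Returns (M, new r_bar)."""
    m = reward - r_bar
    return m, r_bar + alpha * (reward - r_bar)


def three_factor_step(w, e, pre, post, m, eta=0.5, lam=0.9):
    """One time step, updating w and e in place.

    w[i][j], e[i][j]: weight and eligibility trace of the synapse j -> i.
    pre[j] stands for f(x_j), post[i] for g(v_i); m is broadcast to all synapses.
    """
    for i in range(len(post)):
        for j in range(len(pre)):
            # Steps 1 and 2: co-activation sets the trace, which decays by lam < 1.
            e[i][j] = lam * e[i][j] + pre[j] * post[i]
            # Step 3: the shared modulator turns the local trace into a weight change.
            w[i][j] += eta * m * e[i][j]


def weight_change_for_delay(delay_steps, n_steps=60, t_pair=10):
    """One synapse: pre/post co-activation at t_pair, one unit modulator pulse
    at t_pair + delay_steps (negative = modulator arrives first)."""
    w, e = [[0.0]], [[0.0]]
    for t in range(n_steps):
        active = 1.0 if t == t_pair else 0.0
        m = 1.0 if t == t_pair + delay_steps else 0.0
        three_factor_step(w, e, [active], [active], m)
    return w[0][0]


if __name__ == "__main__":
    # Modulator before, together with, and increasingly after co-activation.
    for d in (-5, 0, 5, 20, 40):
        print(d, round(weight_change_for_delay(d), 4))
    # As the same reward keeps coming, it becomes expected and M shrinks.
    r_bar = 0.0
    for trial in range(3):
        m, r_bar = success_signal(1.0, r_bar)
        print(trial, round(m, 3))
```

With a plain exponential trace, the change is largest at zero delay, zero when the modulator arrives before co-activation, and ever smaller for longer delays. This toy is not fitted to the measured timing curve of the dopamine experiments.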