Hello again. So now we are moving on to calculating information in spike trains. In this section of the lecture, we're going to be talking about two methods: one is how to compute information in spike patterns, and the other is how to compute information in single spikes. So let's go back to our grandma's information recipe. Remember that we're calculating the mutual information, which is the difference between the total response entropy and the mean noise entropy. So what was the strategy? We're going to take a single stimulus s, repeat it many times to obtain the probability of the responses given s, and from that response distribution, the noise entropy. We're going to repeat that for all s, and then average over s. Finally, we'll compute the probability of response, and from that the total response entropy. So now, let's go ahead and compute information in spike patterns. So far we've really only dealt with single spikes or firing rates, so what we'd like to ask here is: what information is carried by patterns of spikes, by these interesting sequences of 0s and 1s that occur here in the code? This allows us to analyze patterns in the code and ask how informative they are. The way we're going to turn our spike train into a pattern code is that we're going to chop up segments of these responses: we take our voltage trace and divide it into time bins of size delta t. If there's a spike in that bin, we'll put a one; if there's no spike, we'll put a zero. And now we'll chunk up these zeros and ones into words of some length, big T. So now that we've defined these binary words, with letter size delta t and length T, we can walk through our data. So here's a raster plot produced by a stimulus that was randomly chosen on every trial.
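To make the word construction concrete, here is a minimal sketch in Python. This is not code from the lecture or the papers it discusses; the function name, the 2 ms bin size, the 8-letter word length, and the toy spike times are all illustrative choices:

```python
import numpy as np

def spikes_to_words(spike_times, dt=0.002, word_len=8, t_max=1.0):
    """Binarize a spike train at resolution dt, then chop the 0/1
    sequence into consecutive binary words of word_len letters."""
    n_bins = int(t_max / dt)
    binary = np.zeros(n_bins, dtype=int)
    idx = (np.asarray(spike_times) / dt).astype(int)
    binary[idx[idx < n_bins]] = 1          # 1 = at least one spike in the bin
    n_words = n_bins // word_len
    return binary[:n_words * word_len].reshape(n_words, word_len)

# Three toy spike times in a 64 ms stretch give four consecutive 8-letter words.
words = spikes_to_words([0.003, 0.010, 0.031], dt=0.002, word_len=8, t_max=0.064)
print(words.shape)  # (4, 8)
```

Each row of `words` is one pattern-code symbol; sliding through a long recording this way yields the many word samples the next step needs.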
And so, if one converts such a raster plot into sequences of zeros and ones, you can look through it and pull out many, many examples of these words, again of length T and bin size delta t. So now one can form a distribution over these words. Here, the most common word was silence: there was no spike in this set of eight consecutive time bins. The next most common was that one spike appeared, and of course that spike can appear at different locations throughout the word; these are the next most common set of words. Then one starts to get combinations of spikes occurring at different locations throughout the word. So now we can walk through our data, calculate these probabilities, and then calculate the entropy of that word distribution. Now, the information is the difference between that entropy and the variability due to noise, averaged over stimuli. So that was our total entropy; here's how we're going to compute our noise entropy. In this case, the same stimulus was given every time, and what one sees, over many repetitions of that stimulus, is that on the first trial you see a word: zero, zero, one, zero, zero, zero, zero. On the next trial, you have roughly the same word, but now you see that there are some times when there was no spike, and some times when that spike appeared in a different bin. What that's going to do is generate a distribution of different words. Now, that distribution is going to be considerably narrower than the total distribution. And it's exactly this reduction in entropy, from knowing nothing about the stimulus to knowing something about the stimulus, that the information captures. Alright, so let's go ahead and apply grandma's recipe. We'll take a stimulus sequence and repeat it many times; that is how we're sampling the probability of the stimulus.
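The entropy of the word distribution can be sketched in a few lines. This is a hedged illustration on toy data (silence dominating, as in the distribution just described), not the recorded numbers:

```python
import numpy as np
from collections import Counter

def word_entropy(words):
    """Entropy in bits of the empirical distribution over binary words."""
    counts = Counter(tuple(w) for w in words)
    p = np.array(list(counts.values()), dtype=float)
    p /= p.sum()
    return -np.sum(p * np.log2(p))

# Toy sample: six silent words and two words with a single spike.
words = [(0, 0, 0, 0)] * 6 + [(0, 1, 0, 0)] * 2
print(round(word_entropy(words), 3))  # 0.811
```

The same function applied to the words observed under a repeated stimulus gives the noise entropy at each time; the difference of the two averages is the information.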
We're going to use a bit of a trick, which is that instead of averaging over all possible stimuli, we're going to take a long random stimulus and average over time. So now, time is standing in for the average over stimuli. For each time in the repeated stimulus, we're going to get a set of words, P(w | s(t)). And our average noise entropy is now going to be averaged over those different time points. So if we choose repeated sequences long enough, that will allow us to sample the noise entropy adequately. So let's have a look at the application of this idea to data from the LGN, in a classic paper by Pam Reinagel and Clay Reid. They carried out this exact procedure: as you saw before, they ran a random stimulus over many trials. Then they ran a fixed stimulus, call it frozen white noise, which has some structure; in fact, here it is. It's the stimulus as a function of time, and you can see that in response to the stimulus, spikes appeared in a time-locked sequence. And if one averages across those repeats, one finds a PSTH, that is, a post-stimulus time histogram, where these events show large modulations in the time-varying firing rate produced in response to that stimulus. Now, if one zooms in on a tiny piece of these responses, you'll see something like this. At very fine time scales, there's quite a bit of jitter in those responses. Our goal in computing the information, and what the authors examined in this paper, was to ask: on what time scale do these responses continue to convey information about the stimulus? One can see by looking at this picture that there's quite a bit of variability in the spike train, and that defines some kind of window within which a spike can jitter and still signal the same information about the input.
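The whole recipe, with time standing in for the stimulus average, can be sketched as follows. This is a schematic, not the authors' analysis code; the nested-list data layout (`words_by_time[i]` holding the words seen across trials at time i of the repeat) is an assumption for illustration:

```python
import numpy as np
from collections import Counter

def entropy(samples):
    """Entropy in bits of a list of hashable samples."""
    counts = Counter(samples)
    p = np.array(list(counts.values()), dtype=float) / len(samples)
    return -np.sum(p * np.log2(p))

def word_information(words_by_time):
    """Total word entropy minus the noise entropy averaged over the
    time points of the repeated stimulus (time stands in for stimuli)."""
    all_words = [w for trial_words in words_by_time for w in trial_words]
    h_total = entropy(all_words)
    h_noise = np.mean([entropy(tw) for tw in words_by_time])
    return h_total - h_noise

# Toy case: the word is fully determined by the time in the stimulus,
# so the noise entropy is zero and the information equals the total
# entropy, here 1 bit.
wbt = [[(0, 1)] * 4, [(1, 0)] * 4]
print(word_information(wbt))  # 1.0
```

With real data the per-time distributions are narrower than the total distribution but not deterministic, and the information falls between zero and the total entropy.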
So the question we'd like to understand is: how finely do we have to bin our spike train, and pay attention to the individual timing of spikes, in order to extract all that the neural code has to tell us about the stimulus? One can do that by exploring the information carried by the spike train as a function of these two parameters: delta t, the time-bin width, and also the length of the word. As the word gets longer, our coding symbol is able to capture more and more of the correlations in the input. And so, to what extent does increasing the word length L continue to capture more and more information about the stimulus? Here's what the authors found in the LGN. They varied both delta t, the temporal resolution of their words, and the total word length, drawn here as a function of 1/L, and plotted the information they calculated for different choices of those parameters defining the word. Clearly, there's going to be a problem in going to the limit of very large word lengths. As the word gets longer and longer, for a finite amount of data, you're going to have very few samples of a word of that length. And so, when one tries to estimate the entropy of the distribution of words of that length, it's very unlikely that you will have seen them all. Not surprisingly, if you now look at the entropy plotted against one over the word length, the entropy drops off in this limit, indicating that the information is not completely sampled. What can be done is to compute the entropy for different lengths of words, and you can see that these form almost a line. So one can simply extrapolate the trend of this line back toward infinite word length and extract an estimated value for the entropy in that limit. That's not what was done in this figure; this was purely the information directly captured.
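The extrapolation step amounts to fitting a line to the entropy rate plotted against 1/L and reading off the intercept at 1/L = 0. A minimal sketch, with made-up entropy-rate values rather than the Reinagel and Reid data:

```python
import numpy as np

# Hypothetical entropy-rate estimates (bits/s) at several word lengths L.
# These toy values lie exactly on a line in 1/L; real estimates only
# approximately do, which is why one fits a line rather than a curve.
L = np.array([2.0, 4.0, 6.0, 8.0, 10.0])
h_rate = 84.0 + 16.0 / L

# Fit entropy rate against 1/L and extrapolate to 1/L -> 0,
# i.e. infinite word length.
slope, intercept = np.polyfit(1.0 / L, h_rate, 1)
print(round(intercept, 1))  # 84.0, the infinite-word-length estimate
```

The intercept is the entropy-rate estimate corrected for the finite word length, the quantity that the undersampled direct estimate falls below.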
And so one can look over different delta t's and different word lengths to see how the information depends on these parameters. What you should notice is that there is some limit to delta t beyond which the information doesn't grow anymore as one looks at the words at higher and higher temporal resolution. Taking into account finer and finer details of how those spike patterns are generated is what's being quantified as we move down this axis: as the time discretization of the word, the bin size, gets smaller and smaller, the word is able to capture more and more of the variability in the spike train that's actually signaling something different about the stimulus. But at some point, that information stops increasing. So this red curve, at between about 80 and 100 bits per second, is the information rate, and you see that it stops increasing at a delta t of about 2 milliseconds. Hopefully you'll remember, from the jitter in the spike trains that we looked at, that they seemed to be repeatable on a time scale of about 1 or 2 milliseconds. So that time scale delta t corresponds to the time scale on which the jitter in the spike train still allows one to read it off as an encoding of the same stimulus. It quantifies approximately the temporal width at which one can discretize the spike train and still extract all the information about the stimulus that distinguishes it from other stimuli. So in this example, we've seen one case where we didn't have enough data to be able to sample, say, very long words. In general, this is always true: when one is trying to calculate information-theoretic quantities, one needs to know the full distribution of responses and the full distribution of stimuli, and there's simply never enough data to come up with really reliable estimates for information, unless one has very simple experimental setups.
And so a lot of effort has been put into finding ways to correct the sampled distributions for the fact that there is a finite amount of data. There's been some very interesting work by a number of groups over the last fifteen years or so that has made significant advances in computing information-theoretic quantities from finite amounts of data. Now we're going to turn to a different approach, this one proposed by Brenner and colleagues: how much does the observation of a single spike tell us about the stimulus? This is similar to the case that we started with at the beginning of this lecture, but now we're going to address the question that we noted then: what if we don't know exactly what it is about the stimulus that triggered the spike? It turns out that, as in the case we just went through, it is straightforward to compute information without explicit knowledge of what exactly in the input is being encoded. This is because mutual information gives us a way to quantify the relationship between input and output without needing to make any particular model of that relationship. So the paradigm is exactly the same as before: we're going to compute the entropy of responses when the stimulus is random, and the entropy given a specific stimulus. Here, things are a little simpler than in the case of words. Without knowing the stimulus, the probability that a single spike occurred in a bin is given by the average firing rate times the bin size; similarly, the probability of no spike is just one minus that. Now, the probability of a spike at a given time during the presentation of a stimulus is r(t) times the bin size, where r(t) is the time-varying rate caused by the changing stimulus. We can get an estimate of that time-varying rate by repeating the input over and over again.
The variability in these responses means that these events show a continuous variation and have some width, as we saw before, depending on the jitter in the spike times. So let's go ahead and compute the entropy. We'll define, for the moment, p to be r-bar times delta t, and p(t) to be r(t) times delta t. The information will simply be the difference between the total entropy, which we've already computed at the beginning of the lecture for this binomial case, minus p log p minus (1 minus p) log (1 minus p), and the noise entropy. Now, the noise entropy takes on a value at every time t, depending on the time-varying firing rate. Again, every time t represents a sample of the stimulus s, and averaging over time is equivalent to averaging over the distribution of s. This ability to swap an average over the ensemble of stimuli for an average over time is known as ergodicity: the different values of s are visited in time with a frequency equivalent to their probability. So now that we have our expression for the information between response and stimulus, we can do some manipulations on it. We replace p by r-bar delta t. We can take the time-averaged firing rate to be equal to the mean firing rate, so that the integral over the probability as a function of time goes toward that mean firing rate. And getting rid of some small terms (there are a couple of extra pieces here that turn out to be small), we end up with a rather neat expression for the information per spike. Let's take a closer look at this expression. As we've emphasized already, this method of computing information has no explicit stimulus dependence, meaning no need for any explicit coding or decoding model. It relies on the repeated part of the stimulus being a good sample of the distribution of possible stimuli.
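The resulting single-spike information, the time average of (r(t)/r-bar) log2(r(t)/r-bar), can be evaluated numerically. This is a toy sketch, with made-up Gaussian rate profiles rather than recorded data:

```python
import numpy as np

def info_per_spike(r):
    """Single-spike information in bits: the time average of
    (r/rbar) * log2(r/rbar), with x*log(x) -> 0 as x -> 0."""
    rbar = np.mean(r)
    ratio = r / rbar
    safe = np.maximum(ratio, 1e-300)      # avoid log2(0) warnings
    return np.mean(np.where(ratio > 0, ratio * np.log2(safe), 0.0))

t = np.linspace(0.0, 1.0, 1000)
flat = np.ones_like(t)                                # unmodulated rate
broad = np.exp(-(t - 0.5) ** 2 / (2 * 0.2 ** 2))      # weakly modulated
sharp = np.exp(-(t - 0.5) ** 2 / (2 * 0.02 ** 2))     # sharply modulated

print(info_per_spike(flat))                           # 0.0: no information
print(info_per_spike(sharp) > info_per_spike(broad))  # True
```

The comparison previews the two factors discussed next: an unmodulated rate carries no information per spike, while a rate that is sharply and strongly modulated, and therefore low on average, carries much more.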
Note also that although we computed this for the arrival or not of a single spike, this formalism can be applied to the rate of any event, for example the occurrence of a specific symbol in the code. So this is a way to evaluate how much information might be conveyed by a particular pattern of spikes, for example a certain interspike interval. We can also examine what determines the amount of information in the spike train. Looking again at this expression, we can see that it's going to be determined by two things. One is timing precision: imprecision is going to blur this function r(t). If events are blurred so that r(t) increases and decreases slowly, without reaching large values, this will reduce the information. At the extreme, let's imagine that the response is barely modulated at all by this particular stimulus. In that case, r(t) goes toward the average firing rate, and one gets no information. The more sharply and strongly modulated r(t) is, the more information it contains. The other factor is the mean firing rate. If the spike rate is very low, then the average firing rate is small, and the information per spike is likely to be large. The intuition is that a low firing rate signifies that the neuron responds to a very small number of possible stimuli, so that when it does spike, it's extremely informative about the stimulus. Note that this is the information per spike; the information transmitted is a function of time, so the information rate is going to be small for such a neuron. So let's look at some hypothetical examples. Rat hippocampal neurons have what's known as a place field, such that when the rat runs through that region in space, the cell fires. Let's imagine the place field looks like this. As the rat runs around, it's going to pass through that place field, and what's the firing rate going to look like? Here, as it moves through the field, the rate is going to go from zero, ramp up kind of slowly, and go down again.
Because that place field is quite large, the rat is likely to pass through it fairly often, so we're going to get some r(t) of that form. Now let's imagine that the place field is very small. Now the rat runs around and very, very rarely passes through that place field. So now we're going to get almost no firing, and then some blip of firing as it passes through that field. Now, what if the edges of the place field are very sharp? So again, the rat runs around and passes through that place field very rarely, but when it does, the firing rate increases very sharply toward its maximum. So that's going to increase the information we get from such a receptive field. Okay, so now we're done with computing information in spike trains. Next up, we'll be talking about information and coding efficiency. We'll be looking at natural stimuli. What are the challenges posed to our nervous systems by natural stimuli? What do information-theoretic concepts suggest that neural systems should do when they encode such stimuli? And finally, what principles seem to be at work in shaping the neural code?