Sensory Processing in the Retina

EE376A (Winter 2019)

By Saarthak Sarup, Raman Vilkhu, and Louis Blankemeier

Modeling the Vertebrate Retina

“To suppose that the eye with all its inimitable contrivances for adjusting the focus to different distances, for admitting different amounts of light, and for the correction of spherical and chromatic aberration, could have been formed by natural selection, seems, I confess, absurd in the highest degree… The difficulty of believing that a perfect and complex eye could be formed by natural selection, though insuperable by our imagination, should not be considered subversive of the theory.” – Charles Darwin

The sentiment behind this quote truly captures the miracle that is the human eye… even Darwin had to specifically address its unique case in his argument for natural selection. In this project, we aim to capture some of this complexity in the structure and function of the eye and study it through the lens of information theory.

A short biology lesson…

Figure 1: Anatomy of the human eye, with exploded view of the retina. [6]

The human eye is composed of various complex structures; however, the one studied and modeled throughout this blog is the retina. The retina is responsible for converting light entering the eye through the cornea into electrical signals sent to the visual cortex in the brain via the optic nerve. As seen in Figure 1, the retina lies in the back of the eye and can be further decomposed into layers of neurons and photoreceptors. As a brief introduction to the biology, light passes through the following layers (leftmost to rightmost in Figure 1): nerve fiber layer (RNFL), retinal ganglion cells (RGC), amacrine cells (AC), bipolar cells (BC), horizontal cells (HC), photoreceptors (PR), and the retinal pigment epithelium (RPE). For the scope of this blog, it is sufficient to have a high-level understanding that the retina translates varying light responses into electrical spikes (known as action potentials), which are transmitted along the neurons in the optic nerve to the visual cortex, where perception takes place. For those interested in these specific retina layers and their unique contributions to this encoding process, please refer to [4].

Communicating through the eye

The primary objective of this project was to model the eye through the lens of information theory and see what analytical results could be derived to answer questions such as: What is the maximum rate achievable by the eye? What is the best model for the encoding/decoding done in the human visual system? To study these overarching questions, we had to start by modeling the system as an information processing channel.

Figure 2: Vision mapped as an information processing problem.

First, the eye takes in a light response captured at the cornea and encodes it into electrical spikes defined by a spiking rate. In an information-theoretic sense, this is equivalent to encoding a source with an encoder that maps intensity to discrete spikes. Here, as seen in Figure 2, we assume that the eye performs rate coding, where the majority of the useful “information” is captured by the spiking rate rather than by the precise spatio-temporal timing of each individual spike. This assumption is heavily debated in the literature, and the true encoding scheme used by the eye remains a mystery; it is nonetheless a good starting point. Additionally, for the scope of our project, we generated the visual inputs with known statistics matching those of natural scenes, as discussed in later sections. Therefore, mapping the biology to the information theory, the encoder in our system models the retina, mapping light responses to spikes.

In the visual system, the communication channel is represented by the optic nerve, which runs from the retina to the visual cortex in the brain. In our model, the optic nerve is treated as a set of parallel, inhomogeneous Poisson channels with additive Poisson noise. Specifically, the number of parallel channels corresponds to the number of retinal ganglion cells in the optic nerve bundle, each transmitting some spike rate on its channel.
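To make this concrete, here is a minimal sketch of the channel model in Python; the rate waveforms, bin width, and background intensity below are illustrative values rather than fitted retinal parameters.

```python
import numpy as np

def poisson_channel(rates, lambda_0, dt, rng=None):
    """Simulate parallel Poisson channels over one time window.

    rates    : (n_neurons, n_bins) array of nonnegative spike rates (spikes/s)
    lambda_0 : additive background (noise) rate shared by every channel
    dt       : width of each time bin in seconds
    Returns an (n_neurons, n_bins) array of spike counts.
    """
    rng = np.random.default_rng() if rng is None else rng
    # Each bin's count is Poisson with mean (rate + noise) * dt,
    # independent across neurons and across bins.
    return rng.poisson((rates + lambda_0) * dt)

# Example: 4 retinal ganglion cells, 100 ms of activity in 1 ms bins.
t = np.arange(0, 0.1, 1e-3)
rates = np.stack([50 + 40 * np.sin(2 * np.pi * f * t) for f in (5, 10, 20, 40)])
spikes = poisson_channel(rates, lambda_0=5.0, dt=1e-3)
print(spikes.shape, spikes.sum(axis=1))  # total spike count per channel
```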

Finally, in our model, the communication channel feeds into a decoder, which estimates the spike rates and reconstructs the perceived visual input from these estimates. In biology, this corresponds to the visual cortex of the brain, which perceives the spikes as some visual image.

Overall, this general framework of modeling the visual system as an information theory problem is particularly interesting because we can apply well-known information theory techniques to gain insights into otherwise mysterious processes.

Joint Source-Channel Coding: A Review

Our model of the early visual system depicts both compression via neural downsampling and communication via Poisson spiking. Altogether, our work fits into the joint source-channel coding framework. According to Shannon’s separation theorem, the optimal strategy in this framework is to optimize the source coding (compression) scheme and the channel coding (communication) protocol independently.

Source coding for optimal compression

In source coding, we deal with a sequence of inputs X_1,X_2,\dots,X_n \sim p(x), x\in \mathcal{X}, which is encoded into an index f_n(X^n)\in \{1,2,\dots,2^{nR}\} and then decoded into an estimate \hat{X}^n from the output alphabet \hat{\mathcal{X}}. This procedure is outlined in Figure 3 below. An important quantity in this framework is the compression rate R, defined as the average number of bits per input symbol. According to Shannon’s source coding theorem, it is impossible to losslessly compress input data at a compression rate less than the entropy of the source. In the case where lossy compression is tolerated, the tradeoff between minimizing the compression rate and minimizing the loss is given by the rate-distortion curve. Points along this curve R(D) define the smallest achievable compression rates (largest compression factors) for a given distortion D.

Figure 3: Standard source coding framework

Channel coding for reliable communication

Channel coding consists of finding a set of channel input symbol sequences such that each sequence is sufficiently “far” from the other sequences and each input sequence maps to a disjoint set of output sequences. In this case, the channel input sequence can be decoded without error. Shannon showed that for every channel, there exists some channel code such that \max_{p(x)} I(X;Y) bits can be sent per channel use; this quantity is known as the channel capacity. This idea can be generalized to a channel which takes a continuous waveform as input. In this case, channel coding consists of finding waveforms, each lasting for a duration T, that are spaced sufficiently far apart so that they can be decoded without error. This is shown in Figure 4 for the case of four independent Poisson channels. Here, the codebook consists of waveforms which take the form of pulse-modulated signals. As discussed in the sections “Channel Capacity Under Poisson Noise” and “A Mathematical Aside”, a code generated with such waveforms can achieve capacity. Thus, each continuous input is mapped to a pulse-modulated signal which is then transmitted across the Poisson channel.

Figure 4: Channel coding framework

Compressing a Gaussian Source with Finite Bandwidth

How can we statistically characterize our visual inputs? In the space of continuous and unbounded alphabets, Gaussian distributions maximize differential entropy under a bounded power (technically, second-moment) constraint. As a result, we take a Gaussian source to be a general description of our 2D image data. To model a general bandwidth assumption on our images, we employ a shift-invariant autocorrelation function given by the squared-exponential kernel K(\textbf{x}-\textbf{x'}) = \exp\Big(\frac{-|\textbf{x}-\textbf{x'}|^2}{2l^2}\Big). This Gaussian process allows us to draw random S\times S images as S^2-length vectors from \mathcal{N}\big(\textbf{0}, \Sigma_l\big), where the covariance matrix \Sigma_l is parameterized by the length constant l. This process gives a non-uniform power spectrum with the analytic form below:

S(f) = \sqrt{2\pi l^2}\exp(-2\pi^2 l^2 f^2)

To further simplify our analysis, we consider images in gray-scale, as color representation in the retinal code can be handled somewhat orthogonally to intensity by color-tuned photoreceptors and retinal ganglion cells. Example frames from this procedure can be seen in Fig. 5 below.

Figure 5: Typical frames for this isotropic Gaussian process with l=0.05, l=0.1, and l=0.2 from left to right
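For readers who want to reproduce frames like these, the sketch below samples images directly from \mathcal{N}(\textbf{0}, \Sigma_l) with the squared-exponential covariance above; the grid size and the small diagonal jitter are implementation conveniences, not part of the model.

```python
import numpy as np

def sample_gp_image(S=32, l=0.1, rng=None):
    """Draw one S-by-S gray-scale frame from N(0, Sigma_l), where Sigma_l is the
    squared-exponential covariance K(x, x') = exp(-|x - x'|^2 / (2 l^2))."""
    rng = np.random.default_rng() if rng is None else rng
    # Pixel coordinates on the unit square, flattened into S^2 points.
    g = np.linspace(0, 1, S)
    xx, yy = np.meshgrid(g, g)
    pts = np.column_stack([xx.ravel(), yy.ravel()])
    # Pairwise squared distances -> squared-exponential covariance matrix.
    d2 = ((pts[:, None, :] - pts[None, :, :]) ** 2).sum(-1)
    Sigma = np.exp(-d2 / (2 * l ** 2)) + 1e-8 * np.eye(S * S)  # jitter for stability
    # Sample via the Cholesky factor: f = L z with z ~ N(0, I).
    L = np.linalg.cholesky(Sigma)
    return (L @ rng.standard_normal(S * S)).reshape(S, S)

frames = [sample_gp_image(l=l) for l in (0.05, 0.1, 0.2)]  # length scales from Fig. 5
```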

How well can we theoretically do?

When compressing a memoryless Gaussian source with a uniform power spectrum \sigma^2, the maximum achievable compression rate for a minimum mean-squared distortion D is given by the well-known expression, R(D) = \frac{1}{2}\log\frac{\sigma^2}{D}.

For our source with memory, the optimal compression strategy is clearest in the spectral domain. Here, the single spatially-correlated source can be decomposed into infinitely many parallel independent sources, each with variance given by the power S(f). This transformation inspires a reverse water-filling solution [7], where we use a given distortion budget D to determine a threshold \theta, such that compression is done by only encoding the components where S(f)>\theta. In this parametric form, the new rate-distortion curve is given by the following equations:

D(\theta) = \int_{-\frac{1}{2}}^{\frac{1}{2}}\min(\theta,S(f)) \ df

R(\theta) = \int_{-\frac{1}{2}}^{\frac{1}{2}}\max\Big(0,\frac{1}{2}\log\frac{S(f)}{\theta}\Big) \ df
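A small numerical sketch of this parametric construction is given below, using a simple Riemann approximation of the two integrals over f \in [-1/2, 1/2]; rates are reported in bits.

```python
import numpy as np

def rate_distortion_curve(l, n_theta=200, n_f=2001):
    """Trace (D(theta), R(theta)) by reverse water-filling on the spectrum
    S(f) = sqrt(2*pi*l^2) * exp(-2*pi^2*l^2*f^2) over f in [-1/2, 1/2]."""
    f = np.linspace(-0.5, 0.5, n_f)
    S = np.sqrt(2 * np.pi * l ** 2) * np.exp(-2 * np.pi ** 2 * l ** 2 * f ** 2)
    thetas = np.geomspace(S.max(), 1e-3 * S.min(), n_theta)
    # The interval has unit length, so the integrals reduce to means over f.
    D = np.array([np.minimum(th, S).mean() for th in thetas])
    R = np.array([np.maximum(0.0, 0.5 * np.log2(S / th)).mean() for th in thetas])
    return D, R

curves = {l: rate_distortion_curve(l) for l in (0.05, 0.1, 0.2)}  # as in Fig. 6a
```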

To reiterate an important point, these R(D) curves depend on the length scale l of the source. In the limit where l \rightarrow \infty, our image becomes uniform and its power spectrum is concentrated at f=0. This source distribution is the easiest to compress, since all the input pixels can be encoded with a single uniform intensity and all the distortion is concentrated in the quantization of that value. In the limit where l\rightarrow 0, our bandwidth becomes unbounded and we recover the white Gaussian source, with a constant power spectrum S(f) = \sigma^2. At this extreme, each input pixel is uncorrelated with the others, and to compress under the same fixed distortion D, each symbol must share this budget, requiring a more aggressive quantization and a higher rate. This intuition is validated in Fig. 6a, where the rate-distortion curves are computed for a variety of length scales.

Figure 6: (a) Rate-distortion curve for various length scales l. (b) Shaded region describes achievable compression rates

How well do we do in practice?

Given these fundamental limits, we now describe our neural compression scheme and compare our results. In the pioneering work by Dr. Haldan Hartline in 1938 [3], researchers discovered evidence of a receptive field in retinal ganglion cells, corresponding to a particular region of sensory space in which a stimulus would modify the firing properties of the neuron. In more mathematical terms, this property is modeled through a convolutional spatial filter which the neuron applies to its input to determine its response.

We thus interpret each pixel value as the encoded firing intensity of the primary photoreceptors, which we compress by applying spatial filters k(x-x_c,y-y_c) at each neuron location (x_c,y_c). Our model uses an isotropic Gaussian for each neuron’s spatial filter, producing an output r(x_c,y_c) = k(x-x_c,y-y_c) * I(x,y), which is used as the neuron’s firing rate. If the neurons are spaced \Delta x pixels apart, an S\times S image is compressed into a \frac{S}{\Delta x}\times\frac{S}{\Delta x} output, corresponding to a compression rate of \frac{1}{\Delta x^2}.
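The following sketch implements this encoder with a small isotropic Gaussian receptive field; the filter width, kernel size, and neuron spacing are illustrative choices rather than the exact values used in our experiments.

```python
import numpy as np
from scipy.signal import convolve2d

def encode_rates(image, sigma=2.0, dx=4, ksize=9):
    """Compress an S x S image into roughly (S/dx) x (S/dx) firing rates by
    filtering with an isotropic Gaussian receptive field and keeping one
    neuron every dx pixels (compression rate 1/dx^2)."""
    ax = np.arange(ksize) - ksize // 2
    k = np.exp(-(ax[:, None] ** 2 + ax[None, :] ** 2) / (2 * sigma ** 2))
    k /= k.sum()                                    # normalized receptive field
    filtered = convolve2d(image, k, mode='same', boundary='symm')
    return filtered[::dx, ::dx], k                  # neuron rates on the dx grid
```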

To measure the distortion of this compression scheme, we propose decoding schemes for reconstructing the image. These schemes are not biologically motivated, as it’s unclear what neural process, if any, attempts to reconstruct the visual inputs to the photoreceptors. It is nevertheless instructive to consider how our biologically-inspired encoding scheme could be reversed. A simple procedure is one which deconvolves a neuron’s firing rate r with its receptive field, producing an estimate \hat{I}(x,y) = \sum_{(x_c,y_c)} r(x_c,y_c)k(x-x_c,y-y_c). This decoding procedure is nearly optimal under the fewest assumptions on the input source distribution, as shown in our appendix below. A more complex procedure, which is allowed perfect knowledge of the distribution’s covariance and global firing rate information, can decode with much lower distortion, though its estimate \hat{I}(x,y) is difficult to express analytically and is instead the solution to a convex optimization problem. Its derivation is tied closely to the minimization of the Mahalanobis distance d_M(\textbf{x};\mu, \Sigma)^2 = (\textbf{x}-\mu)^T\Sigma^{-1}(\textbf{x}-\mu), and is also shown more explicitly in the appendix below. An example compression and decompression is shown in Fig. 7.

Figure 7: Example compression and reconstruction using the naive and optimal decoders
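For completeness, here is a sketch of the simple deconvolution decoder described above: each neuron paints its receptive field back onto the canvas, scaled by its firing rate. It pairs with the hypothetical encode_rates helper from the previous snippet.

```python
import numpy as np

def decode_naive(rates, k, S, dx=4):
    """Reconstruct an S x S estimate as I_hat = sum_c r_c * k centered at c."""
    ksize = k.shape[0]
    pad = ksize // 2
    I_hat = np.zeros((S + 2 * pad, S + 2 * pad))    # padded canvas for edge neurons
    for i in range(rates.shape[0]):
        for j in range(rates.shape[1]):
            xc, yc = i * dx, j * dx                 # neuron center in pixel coordinates
            I_hat[xc:xc + ksize, yc:yc + ksize] += rates[i, j] * k
    return I_hat[pad:pad + S, pad:pad + S]
```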

Are we even measuring this right?

Commonly used image fidelity metrics, such as mean-squared error (MSE), are simple to calculate and have clear physical meaning; however, these metrics often do not reflect perceived visual quality. Therefore, modern research has focused on fidelity metrics that incorporate high-level properties of the human visual system. One such metric is the structural similarity (SSIM) metric [8]. For the scope of this blog post, it is enough to know that SSIM incorporates image properties such as averages, variances, covariances, and dynamic range in order to better gauge perceived similarity between images.

Figure 8: SSIM vs MSE. [8]
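As an aside, both metrics are easy to compute with off-the-shelf tools; a sketch using scikit-image (assuming it is installed) is shown below. Two reconstructions with similar MSE can receive very different SSIM scores, which is the point Fig. 8 makes.

```python
from skimage.metrics import mean_squared_error, structural_similarity

def compare_fidelity(original, reconstruction):
    """Return (MSE, SSIM) between a reference image and its reconstruction."""
    data_range = original.max() - original.min()
    mse = mean_squared_error(original, reconstruction)
    ssim = structural_similarity(original, reconstruction, data_range=data_range)
    return mse, ssim
```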

Channel Capacity under Poisson Noise

After compression, how well can this information be transmitted?

Here we outline some key results from Wyner [1][2]. The Poisson channel takes as input a continuous waveform \lambda(t) with domain 0\leq t < \infty. We assume the following constraints: 0\leq \lambda(t) \leq A and (1/T)\int_0^T \lambda(t)dt \leq \sigma A, where 0<\sigma \leq 1. The channel output is a Poisson process with intensity \lambda(t)+\lambda_0, where \lambda_0 represents additive Poisson noise. \nu(t) denotes a Poisson counting function which gives the number of spikes received before time t. More specifically, \nu(t) is a staircase function where each step represents the detection of a spike at the output. We then have that Pr\{\nu(t+\tau)-\nu(t)=j\}=\frac{e^{-\Lambda}\Lambda^j}{j!}, where \Lambda=\int_t^{t+\tau}(\lambda(t')+\lambda_0)dt'.

We now describe a general Poisson channel code. The input codebook consists of a set of M waveforms \lambda_m(t), 0\leq t \leq T, which satisfy 0\leq \lambda_m(t) \leq A and (1/T)\int_0^T \lambda_m(t)dt \leq \sigma A. The output alphabet is the set S(T) of all counting functions \nu(t) for 0\leq t \leq T. The channel decoder is a mapping D: S(T) \rightarrow \{1,2,...,M\}. The output of the channel takes on a value \nu_0^T from the output alphabet S(T); \nu_0^T is a staircase function with domain [0,T). The error probability is then given by P_e = \frac{1}{M}\sum_{m=1}^M Pr\{D(\nu_0^T)\neq m \mid \text{message } m \text{ sent}\}. The rate R of the code, in bits per second, is therefore \frac{\log_2 M}{T}.

We say that R is achievable if for all \epsilon>0 there exists a code such that M\geq 2^{RT} and P_e \leq \epsilon. The channel capacity is the supremum of all achievable rates. Under these constraints, the capacity is:

C=A[q^*(1+s)\log (1+s)+(1-q^*)s \log s- (q^* + s) \log (q^* + s)]

Here, s=\lambda_0/A, q^*=\min\{\sigma,q_0(s)\}, and q_0(s)=\frac{(1+s)^{1+s}}{e\,s^s}-s.

Figure 9: (a) Channel capacity v. A, (b) Channel capacity v. \sigma, (c) Channel capacity v. \lambda_0, (d) Channel capacity v. number of neurons
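The sweeps in Fig. 9 come directly from this closed-form expression; a sketch of the computation is below (in nats per second, with an example conversion to bits). The numerical values in the example are illustrative.

```python
import numpy as np

def poisson_capacity(A, lambda_0, sigma):
    """Capacity (nats/s) of a single Poisson channel with peak rate A,
    background rate lambda_0 > 0, and average-rate constraint sigma * A."""
    s = lambda_0 / A
    q0 = (1 + s) ** (1 + s) / (np.e * s ** s) - s      # unconstrained optimum q_0(s)
    q = min(sigma, q0)                                  # q* = min(sigma, q_0(s))
    return A * (q * (1 + s) * np.log(1 + s)
                + (1 - q) * s * np.log(s)
                - (q + s) * np.log(q + s))

# Example: capacity per neuron and for a bundle of parallel channels.
C1 = poisson_capacity(A=100.0, lambda_0=10.0, sigma=0.5)
print(C1 / np.log(2), "bits/s per neuron;", 10 * C1 / np.log(2), "bits/s for 10 neurons")
```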

But what if we had more noise?

Interestingly, work by Frey [5] has shown that in the presence of random, nondeterministic background noise \lambda_0, feedback can improve channel capacity (under mild assumptions on the distribution of the noise). This is surprising because the Poisson channel is still memoryless, and in general, memoryless channels cannot increase their capacity with the addition of feedback. Nevertheless, biologically speaking, this is a more realistic setting, given that seemingly stochastic vesicle release appears to inject random noise into the propagation of spikes within the nervous system. This is therefore a particularly encouraging result from information theory, as recurrent connections and backpropagating dendritic action potentials are common features within biological neural networks, and may jointly be facilitating this feedback mechanism to counter the intrinsic background noise.

A Mathematical Aside

For the more mathematically inclined:

Optimal decoders

We first set up the problem. Each S\times S image frame I is vectorized into an S^2-length vector f. The receptive field of neuron i is a w\times w patch which, when centered at its location (x_c,y_c) within the S\times S frame, is also vectorized into an S^2-length vector k_i. In this framework, we can compute the rates of all N neurons by arranging the receptive field vectors k_1, \dots, k_N as the rows of a matrix C \in \mathbb{R}^{N\times S^2}, such that the rates r are computed as r = Cf.

This operation necessarily loses information, as the system r=C\hat{f} is underdetermined. A common technique in probabilistic inference is to compute the maximum likelihood estimate (MLE). Since our source is Gaussian with \mu =0 and covariance \Sigma_l (where the subscript l designates its parametrization by the length constant l), we write out our log-likelihood function L(\mu,\Sigma):

L(0,\Sigma_l) = -\frac{1}{2}\log(|\Sigma_l|) - \frac{1}{2}x^T\Sigma_l^{-1}x - \frac{S^2}{2}\log(2\pi)

Within this expression we recognize the square of the Mahalanobis distance d, which can be considered a measure of the distance from a point x to the distribution \mathcal{N}(0, \Sigma_l). Rewriting this expression along with a constant term c = \frac{1}{2}\log(|\Sigma_l|) + \frac{S^2}{2}\log(2\pi), we get:

L(0,\Sigma_l) = -\frac{1}{2}d^2 - c

With this, we see we can maximize the log-likelihood by minimizing the squared Mahalanobis distance d^2(\hat{f};0,\Sigma_l) = \hat{f}^T\Sigma_l^{-1}\hat{f} subject to the constraint C\hat{f} = r. Moreover, this is a simple convex optimization problem, since the matrix \Sigma_l^{-1} is a positive definite precision matrix.

min_f \quad f^T\Sigma_l^{-1}f

s.t. \quad Cf = r
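As an aside, this equality-constrained quadratic program has a closed-form solution obtained from its Lagrangian: f^* = \Sigma_l C^T (C \Sigma_l C^T)^{-1} r. A minimal sketch, assuming \Sigma_l is positive definite and C has full row rank:

```python
import numpy as np

def decode_mle(C, r, Sigma):
    """Solve min_f f^T Sigma^{-1} f subject to C f = r in closed form:
    f* = Sigma C^T (C Sigma C^T)^{-1} r."""
    G = C @ Sigma @ C.T                 # N x N Gram matrix of receptive fields
    return Sigma @ C.T @ np.linalg.solve(G, r)
```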

This decoder is optimal under the assumption that the source is a stationary Gaussian distribution with \mu=0, covariance \Sigma_l, and the decoder has perfect knowledge of the distribution. We also consider the case where the decoder makes no assumptions on a finite bandwidth of the source, i.e., \Sigma_l \rightarrow \mathbb{I}. The resulting objective corresponds to a least-norm problem on f:

min_f \quad f^T f

s.t. \quad Cf = r

This optimization problem has the explicit solution f = C^T(CC^T)^{-1}r, which we now compare to our naive deconvolution solution \hat{I}(x,y) = \sum_{(x_c,y_c)} r(x_c,y_c) k(x-x_c,y-y_c). When vectorized into the format of this framework, the latter is equivalent to f = C^Tr. The comparison now becomes clearer: the least-norm solution and the deconvolution solution are equivalent if CC^T = \mathbb{I}. This is the case when each neuron’s receptive field k_i has unit length and the fields are pairwise orthogonal (k_i^Tk_j = 0), or in other words, do not overlap on the image.
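The equivalence is easy to check numerically; the toy example below builds non-overlapping, unit-norm receptive fields (so CC^T = \mathbb{I}) and confirms that the least-norm estimate reduces to the deconvolution estimate C^T r.

```python
import numpy as np

def decode_least_norm(C, r):
    """Least-norm solution f* = C^T (C C^T)^{-1} r of min_f f^T f s.t. C f = r."""
    return C.T @ np.linalg.solve(C @ C.T, r)

# Four disjoint, unit-norm receptive fields over a 16-pixel "image".
C = np.kron(np.eye(4), np.ones((1, 4)) / 2.0)   # each row has norm 1, so C @ C.T = I
r = np.arange(4.0)
assert np.allclose(decode_least_norm(C, r), C.T @ r)
```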

It is important to note that this decoder is optimal only in the noiseless channel case. In the case where the receiver’s estimate \hat{r} \neq r, the decompression process minimizes the objective over the affine set Cf = \hat{r} and fails to find the appropriate reconstruction. We thus modify our constraints to respect the noisy channel by assuming the true rate r falls in the set [\hat{r}-\sigma_{\hat{r}}, \hat{r}+\sigma_{\hat{r}}]. The new optimization problem is reformulated below, where the channel noise is known to follow a Poisson distribution, so \sigma_{\hat{r}} = \sqrt{\hat{r}}.

min_f \quad f^T\Sigma_l^{-1}f

s.t. \quad Cf \leq \hat{r}+\sqrt{\hat{r}}, Cf \geq \hat{r}-\sqrt{\hat{r}}
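One way to solve this box-constrained variant is with a generic convex solver; a sketch using CVXPY (assumed available) follows, where Sigma_inv stands for the precision matrix \Sigma_l^{-1} and r_hat for the received rates.

```python
import numpy as np
import cvxpy as cp

def decode_noisy(C, r_hat, Sigma_inv):
    """Minimize f^T Sigma_inv f subject to the true rate lying within one
    Poisson standard deviation (sqrt(r_hat)) of the received rate r_hat."""
    f = cp.Variable(C.shape[1])
    slack = np.sqrt(np.maximum(r_hat, 0.0))
    constraints = [C @ f <= r_hat + slack, C @ f >= r_hat - slack]
    # Sigma_inv must be symmetric positive semidefinite for quad_form.
    problem = cp.Problem(cp.Minimize(cp.quad_form(f, Sigma_inv)), constraints)
    problem.solve()
    return f.value
```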

Channel capacity lower bound

We now derive an expression for a lower bound on the Poisson channel capacity [1][2]. The main idea of this proof is that we can analyze \max_{p(x)} I(X;Y) when X is constrained to a subset of all possible inputs to the Poisson channel. Clearly, if this constraint is lifted, the same distribution over the inputs can be achieved by setting p(x)=0 for any x which is not a member of the constrained subset. Therefore, \max_{p(x)} I(X;Y) with X constrained to a subset of inputs is a lower bound on the capacity of the Poisson channel.

Specifically, we impose the constraint that \lambda(t) is piecewise constant, taking one of two values, 0 or A, over each interval ((n-1)\Delta,n\Delta]. We then define the discrete signal x_n such that x_n = 1 if \lambda(t)=A in the interval ((n-1)\Delta,n\Delta], and x_n=0 otherwise. Additionally, we assume that the detector sees only the values \nu(n\Delta). Thus, it sees the number of spikes received in the interval ((n-1)\Delta,n\Delta], which we define as y_n=\nu(n\Delta)-\nu((n-1)\Delta). Although there is a nonzero probability that \nu(n\Delta)-\nu((n-1)\Delta)\geq 2, we take y_n = 0 in all cases where \nu(n\Delta)-\nu((n-1)\Delta)\neq 1. This is a reasonable approximation for small enough \Delta. Thus, y_n is binary valued. We have now reduced the channel to a binary-input, binary-output channel characterized by P(y_n|x_n). Any rate achieved by this channel is a lower bound on the capacity of the general Poisson channel, as we are dealing with a restriction of the input and output spaces. However, it will turn out that this bound is tight.

From the Poisson distribution, we have that P(y_n=1|x_n=0)=\lambda_0\Delta e^{-\lambda_0\Delta}. Writing \lambda_0=sA, this becomes P(y_n=1|x_n=0)=sA\Delta e^{-sA\Delta}. Likewise, P(y_n=1|x_n=1)=(1+s)A\Delta e^{-(1+s)A\Delta}. The average power constraint now takes the form \frac{1}{N}\sum_{n=1}^N x_{mn} \leq \sigma, where m indexes the codeword. Therefore, the lower bound on capacity is given by \max I(x_n;y_n)/\Delta, where the maximum is taken over all distributions of x_n; clearly, p(x_n) must satisfy E[x_n]\leq \sigma. Now define q=Pr\{x_n=1\}, a=sA\Delta e^{-sA\Delta}, and b=(1+s)A\Delta e^{-(1+s)A\Delta}. We then have I(x_n;y_n)=h(qb+(1-q)a)-qh(b)-(1-q)h(a), where h(\cdot) is the binary entropy function, and \Delta C\geq \max_{0\leq q \leq \sigma} I(x_n;y_n). After making a few simplifying approximations (taking \Delta small), we obtain an expression which is concave in q, which we maximize subject to the constraint q\leq \sigma. The unconstrained maximum occurs at q=\frac{(1+s)^{1+s}}{e\,s^s}-s, so we define q_0(s)=\frac{(1+s)^{1+s}}{e\,s^s}-s. The capacity is then lower bounded by A[q^*(1+s)\log(1+s)+(1-q^*)s \log s-(q^* + s)\log(q^* + s)], where q^*=\min\{\sigma,q_0(s)\}, matching the capacity expression above.

Taking our channel to consist of m neurons, each modeled as an independent Poisson channel, we have the total capacity m A[q^*(1+s)\log (1+s)+(1-q^*)s \log s- (q^* + s) \log (q^* + s)].
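To see the tightness of the bound numerically, the sketch below maximizes I(x_n;y_n)/\Delta over q \leq \sigma for the binarized channel and lets \Delta shrink; the result climbs toward the closed-form capacity computed earlier (the parameter values are again illustrative).

```python
import numpy as np

def h(p):
    """Binary entropy in nats."""
    p = np.clip(p, 1e-12, 1 - 1e-12)
    return -p * np.log(p) - (1 - p) * np.log(1 - p)

def capacity_lower_bound(A, lambda_0, sigma, delta, n_q=2000):
    """Numerically maximize I(x_n; y_n) / delta over 0 < q <= sigma."""
    s = lambda_0 / A
    a = s * A * delta * np.exp(-s * A * delta)              # P(y=1 | x=0)
    b = (1 + s) * A * delta * np.exp(-(1 + s) * A * delta)  # P(y=1 | x=1)
    q = np.linspace(1e-6, sigma, n_q)
    I = h(q * b + (1 - q) * a) - q * h(b) - (1 - q) * h(a)
    return I.max() / delta

for delta in (1e-2, 1e-3, 1e-4):    # the bound tightens as delta -> 0
    print(delta, capacity_lower_bound(100.0, 10.0, 0.5, delta))
```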

Outreach Activity

For the outreach activity at Nixon Elementary school, our group decided to create a board game called A Polar (codes) Expedition. The purpose of this outreach activity was to help the kids walk away with an ability to XOR, and, furthermore, use that ability to decode a simple 4-bit polar code.

The overarching theme of the board game is that the player is trapped in a polar storm and needs to use polar codes to figure out where the rescue team is coming from. The outreach activity was a huge success (see some pictures from the event below). It was especially amazing to see kids even younger than our anticipated target age (3rd and 4th graders) pick up XOR-ing really fast and work through the game. Additionally, it was inspiring to see the kids so excited after solving the polar code and then realizing that this decoding scheme is one actually used today. Overall, our group believes the outreach component of the course was truly a wholesome experience, and we all came out of it hoping to continue taking part in STEM outreach in some capacity throughout the rest of our PhDs. It was special to see the genuine curiosity that the kids had while learning about polar codes through our game, and at the end of the day it really reminded us why we are here to do a PhD in the first place: it boils down to the simple fact that we find these topics cool and interesting, and we derive happiness from learning more and more about EE.

The transcript for our board game can be found at the link below; please take a look if you would like to read the story arcs that the kids got to take on their polar expedition: https://docs.google.com/document/d/1gQk1USQOqbW8xdxMENY7QQ4kr4LFIazgFS149aJIhuQ/edit?usp=sharing


References

[1] A.D. Wyner, “Capacity and error exponent for the direct detection photon channel–Part 1,” in IEEE Trans. Inform. Theory, vol. 34, pp. 1449-1461, 1988.

[2] A.D. Wyner, “Capacity and error exponent for the direct detection photon channel–Part 2,” in IEEE Trans. Inform. Theory, vol. 34, pp. 1462-1471, 1988.

[3] H.K. Hartline, “The response of single optic nerve fibers to illumination of the retina,” in American Journal of Physiology, vol. 121, no. 2, pp. 400-415, 1938.

[4] Kolb H, Fernandez E, Nelson R, editors. Webvision: The Organization of the Retina and Visual System [Internet]. Salt Lake City (UT): University of Utah Health Sciences Center; 1995-.

[5] M. Frey, “Information capacity of the Poisson channel,” in IEEE Trans. Inform. Theory, vol. 37, no. 2, pp. 244-256, 1991.

[6] Palanker D – Own work, CC BY-SA 4.0 https://commons.wikimedia.org/w/index.php?curid=49156032.

[7] R. Zamir, Y. Kochman, U. Erez, “Achieving the Gaussian rate-distortion function by prediction,” in IEEE Trans. Inform. Theory, vol. 54, no. 7, pp. 3354-3364, 2008.

[8] Z. Wang et al., “Image Quality Assessment: From Error Visibility to Structural Similarity,” in IEEE Transactions on Image Processing, vol. 13, no. 4, pp. 600-612, 2004.
