Statistics of Natural Scenes

CNS*95 Meeting Workshops, Monterey CA

The following are my notes from this very informal workshop. The purpose is to record as much as possible of the discussion. An introduction was added for the benefit of people who can't imagine why we met to have such a discussion.

Why this is interesting

The responses of sensory pathway neurons to external stimuli are said to encode information about those stimuli. Researchers who are interested in the structure of such neural codes are becoming increasingly aware of the fact that the description of a code must include a description of the signal that is being encoded. Using a different stimulus ensemble, we can potentially get different answers to many key questions, such as: is the information encoded in a linear or nonlinear form? Are the signals from multiple cells redundant, independent, or synergistic in encoding the sensory stimulus? Is the neural representation efficient (does it exploit the maximum information capacity of the channel)?

This is because a "code" is essentially a way of discriminating between different possible messages. A good code may be great at finely discriminating many subtly different messages that it expects to see frequently, while it may be abysmal at discriminating among many messages that it expects never to see, and which it therefore hasn't been designed to handle well. So to test how the neural code works we need to know what kinds of messages to use. In the case of sensory stimuli, we would like to know to what extent the neural code has taken advantage of the fact that certain stimuli never occur in nature while others are exceedingly common or at least fairly predictable. In other words, are the statistics of natural scenes reflected in sensory neural codes? This discussion makes more sense if you have some familiarity with the concepts of Entropy, Information, and Coding.

Workshop Participants

Tony Bell (Salk), David Field (Cornell), Iris Ginzburg (Salk), Bruno Olshausen (Cornell), Leslie Osborne (Berkeley), Clay Spence (Sarnoff). Workshop organized by Pam Reinagel (Caltech).

Workshop Discussion

Some definitions in this context

stationary: over the population of images the statistics of the image are the same everywhere in space.

ergodic: the statistics of one image are a good estimate of the statistics of the whole ensemble of images.

homogeneous: in each image, the statistics at one location predict the statistics at other locations in space.

amplitude spectrum vs. power spectrum: The power spectrum is just the amplitude spectrum squared. Thus when the spatial frequency distribution of natural visual scenes is said to be "1/f" the amplitude spectrum is meant; the power spectrum is (1/f)^2.

correlation vs. statistical dependence: correlation refers only to the linear dependence of x on y, not the entire statistical relationship between x and y. It is possible to entirely decorrelate x and y (e.g. by Principal Component Analysis) and still have statistical dependency between them.
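A minimal sketch of this distinction, using a made-up toy example rather than anything shown at the workshop: y is completely determined by x (y = x^2), yet the linear correlation between them is zero. The helper name pearson is hypothetical.

```python
import math

def pearson(xs, ys):
    """Pearson (linear) correlation coefficient of two equal-length lists."""
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

# Toy data (an assumption for illustration): x symmetric about zero, y = x^2.
xs = [-2, -1, 0, 1, 2]
ys = [x * x for x in xs]

# Linear correlation between x and y vanishes...
r_linear = pearson(xs, ys)
# ...but a nonlinear function of x predicts y perfectly.
r_squared = pearson([x * x for x in xs], ys)
```

Here r_linear is zero while r_squared is one, so a purely linear measure reports "no relationship" for a pair of variables that are in fact fully dependent.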

Statistics of natural visual scenes

First, the statistical properties of natural scenes are not entirely stationary. For example the upper parts of natural scenes are brighter on average than the lower parts. The upper field also has less variance. Second, natural scenes are not particularly homogeneous. Third, natural scenes are fairly ergodic, at least for lower order statistical measures.
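One way such a stationarity check might look in practice is sketched here. The pixel values are invented purely for illustration (two tiny two-row "images"), and the function name row_statistics is hypothetical; a real analysis would use a large ensemble of calibrated photographs.

```python
from statistics import mean, pvariance

def row_statistics(images):
    """Per-row (mean, variance), pooled over all images and all columns."""
    n_rows = len(images[0])
    stats = []
    for r in range(n_rows):
        pixels = [p for img in images for p in img[r]]
        stats.append((mean(pixels), pvariance(pixels)))
    return stats

# Invented toy ensemble: row 0 is the "upper field", row 1 the "lower field".
toy_ensemble = [
    [[9, 9, 8], [2, 6, 1]],
    [[8, 9, 9], [5, 1, 7]],
]

stats = row_statistics(toy_ensemble)
upper_mean, upper_var = stats[0]
lower_mean, lower_var = stats[1]
```

If the ensemble were stationary, the per-row means and variances would agree across rows up to sampling error. In this toy ensemble the upper row is brighter and less variable by construction.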

The spatial frequencies in natural visual scenes are not equally represented; the amplitude spectrum of such scenes goes as 1/f. Intuitively what this means is that there are objects at all scales, and that we view them from all distances. It was pointed out that "1/f is everywhere": even the level of the Nile goes as 1/f in time, and this just means that things happen at all scales. What is more specific is the exponent, which tells you with what relative likelihood things at different scales happen. In this case, (1/f)^1 means that things happen at all scales with equal amplitude (there is equal energy in each octave).

However, a visual image constructed to have 1/f amplitude spectrum does not look at all natural, because this statistic fails to capture phase information in the image. Nonetheless from the point of view of coding, the entropy of the ensemble of [all possible 1/f images] is incredibly tiny compared to the entropy of the ensemble of [all possible images], so this statistic helps a lot.
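The role of phase can be illustrated with a one-dimensional sketch (an assumption made for brevity; the discussion concerned two-dimensional images). A step edge is transformed, its phases are replaced by random ones, and the result is transformed back. The function names are hypothetical, and the naive transform is used only to keep the example within the standard library.

```python
import cmath
import math
import random

def dft(x):
    """Naive discrete Fourier transform (O(n^2), fine for short signals)."""
    n = len(x)
    return [sum(x[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
            for k in range(n)]

def idft(X):
    """Naive inverse discrete Fourier transform."""
    n = len(X)
    return [sum(X[k] * cmath.exp(2j * math.pi * k * t / n) for k in range(n)) / n
            for t in range(n)]

def scramble_phases(x, rng):
    """Keep every Fourier amplitude of x but draw new random phases."""
    n = len(x)
    X = dft(x)
    Y = list(X)  # DC (and Nyquist, if n is even) are left unchanged
    for k in range(1, (n + 1) // 2):
        Y[k] = cmath.rect(abs(X[k]), rng.uniform(0.0, 2.0 * math.pi))
        Y[n - k] = Y[k].conjugate()  # conjugate symmetry keeps the signal real
    return [v.real for v in idft(Y)]

edge = [0.0] * 32 + [1.0] * 32  # a single sharp edge
scrambled = scramble_phases(edge, random.Random(0))

amp_edge = [abs(c) for c in dft(edge)]
amp_scrambled = [abs(c) for c in dft(scrambled)]
max_amp_diff = max(abs(a - b) for a, b in zip(amp_edge, amp_scrambled))
max_signal_diff = max(abs(a - b) for a, b in zip(edge, scrambled))
```

The two signals share the same amplitude spectrum (max_amp_diff is at rounding-error level), but the scrambled one no longer contains a clean edge. The amplitude spectrum alone therefore cannot distinguish them.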

The phase problem

An argument was presented that global phase structure in the image is not going to help very much. The visual system is operating locally and this may reflect the fact that in the world, there are objects that have location in space. The discussion here was an elaboration of part of the material presented in Bruno's talk earlier in the meeting. If visual scenes are encoded as the strength in each "channel," where each channel encodes some specific orientation at some specific scale (spatial frequency) over some local section of the image, then one finds that random scenes constructed to have the same strength on each channel do a good job at reproducing overall image textures. Such a measurement over an ensemble of natural scenes would constitute a statistical description of natural scenes that would implicitly carry phase information (because the channels are local in space). In this respect, wavelets are argued to be a better basis function set for visual scenes than Fourier components. There remained some confusion as to whether using the known properties of visual processing to motivate our description of the statistics of natural scenes makes it circular to then ask whether the nervous system codes that information efficiently.

Sampling

We discussed at length the problem of how natural scenes are sampled nonrandomly, both by experimenters (accidentally) and by other animals (deliberately). For example if you take a camera into the woods you will not get an accurate, random sample of all scenes because:

you tend to center things, creating a nonstationary low-frequency peak.

the camera's depth of field is limited.

the film has much less dynamic range than reality. Real scenes have a dynamic range of 2000:1 while photos have a range of 30:1. David points out that linear film just truncates the range, which is much worse than using nonlinear film and a high resolution scanner and then inverting the nonlinearity later.
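David's point about film can be sketched numerically. The 2000:1 and 30:1 ratios come from the discussion; the logarithmic response curve, the function names, and the sample luminances are assumptions chosen for illustration, and the quantization introduced by a real scanner is ignored.

```python
import math

SCENE_RANGE = 2000.0  # dynamic range of real scenes (from the notes)
FILM_RANGE = 30.0     # dynamic range of photos (from the notes)

def linear_film(lum):
    """Linear response: anything above the film's range is truncated."""
    return min(lum, FILM_RANGE)

def log_film(lum):
    """Hypothetical compressive response mapping 1..2000 onto 1..30."""
    return FILM_RANGE ** (math.log(lum) / math.log(SCENE_RANGE))

def invert_log_film(rec):
    """Undo the compressive response to recover scene luminance."""
    return SCENE_RANGE ** (math.log(rec) / math.log(FILM_RANGE))

scene = [1.0, 10.0, 100.0, 2000.0]  # made-up luminances, relative units
clipped = [linear_film(v) for v in scene]
recovered = [invert_log_film(log_film(v)) for v in scene]
```

With the linear response, the luminances 100 and 2000 both record as 30 and can no longer be told apart. The compressive response keeps every value inside the film's range, so inverting it recovers the scene up to rounding error.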

However, even if you get a true random sample of visual scenes, is this really what you want? We discussed the need for additional measurements of the statistics of the subset of scenes that animals actually see. Animals orient their eyes by saccading or moving their bodies, and thereby very nonrandomly sample the ensemble of natural scenes. In this respect we noted the neuroethological experiments of Bob Barlow with Limulus as an example of using an animal-selected scene ensemble. In connection with this problem, we also discussed the temporal statistics of natural visual scenes (cf. the work of Atick and Dong), and noted that it would be particularly interesting to look at how these statistics are filtered by the animal's behaviour.

Application to image compression

Clay explained some of his work on video compression for TV. He pointed out that Wiener Filtering is widely assumed to be optimal but that this only holds if everything is gaussian and independent, which turns out never to be the case in real images. On the other hand the fact that the PDF of the signal isn't gaussian could be helpful, because one can judge whether a given signal is more likely to have come from the (gaussian) noise PDF or the (non-gaussian) signal PDF. There were several more points made on this topic which I just didn't follow well enough to write down.
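The idea of judging which PDF a sample more likely came from can be sketched as a log-likelihood ratio. This is not Clay's method, which these notes do not record; the Laplacian stand-in for the non-gaussian signal PDF, the equal variances, and the function names are all assumptions.

```python
import math

def gaussian_logpdf(x, sigma=1.0):
    """Log density of a zero-mean gaussian with standard deviation sigma."""
    return -0.5 * (x / sigma) ** 2 - math.log(sigma * math.sqrt(2.0 * math.pi))

def laplace_logpdf(x, sigma=1.0):
    """Log density of a zero-mean Laplacian with standard deviation sigma."""
    b = sigma / math.sqrt(2.0)  # Laplace scale; its variance is 2 * b**2
    return -abs(x) / b - math.log(2.0 * b)

def log_likelihood_ratio(x):
    """Positive when x is more likely under the heavy-tailed 'signal' PDF."""
    return laplace_logpdf(x) - gaussian_logpdf(x)

llr_near_zero = log_likelihood_ratio(0.0)
llr_moderate = log_likelihood_ratio(1.0)
llr_large = log_likelihood_ratio(4.0)
```

Under these assumptions, values near zero and values far out in the tails favor the signal PDF, while moderate values favor the gaussian noise PDF, even though both distributions have the same variance.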

What about other modalities?

Bruno and David have used the natural sound library at Cornell to analyze the statistics of natural sounds. They found that here the (temporal) frequency amplitude spectrum was (1/f)^(1/2) i.e. the power spectrum was 1/f. One intuitive slant on this was that big things tend to resonate a long time.

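This fits with the visual case once dimensionality is taken into account. For two-dimensional images, the power in an octave is the integral over f of

2*pi*f * (1/f)^2

where the 2*pi*f comes from going around all orientations at that frequency f. So we end up integrating 1/f, which gives a log (which is what we need for scale invariance). Power over an octave in sound is just the integral of 1/f, leading to the same result.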
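The octave argument can be checked numerically. This is a rough sketch using a midpoint-rule integral; the function names and the particular octaves are arbitrary choices made for illustration.

```python
import math

def band_power(power_density, f_lo, f_hi, steps=10000):
    """Midpoint-rule integral of power_density(f) from f_lo to f_hi."""
    df = (f_hi - f_lo) / steps
    return sum(power_density(f_lo + (i + 0.5) * df) for i in range(steps)) * df

def image_density(f):
    # 2D image: power (1/f)^2 times the ring circumference 2*pi*f.
    return 2.0 * math.pi * f * (1.0 / f) ** 2

def sound_density(f):
    # 1D sound: power spectrum 1/f, i.e. amplitude spectrum (1/f)^(1/2).
    return 1.0 / f

octaves = [(1.0, 2.0), (8.0, 16.0), (64.0, 128.0)]
image_power = [band_power(image_density, lo, hi) for lo, hi in octaves]
sound_power = [band_power(sound_density, lo, hi) for lo, hi in octaves]
```

Every entry of image_power comes out close to 2*pi*ln(2) and every entry of sound_power close to ln(2), whichever octave is chosen. That is the equal-energy-per-octave property in both modalities.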

It was pointed out that the natural sounds are dominated by sounds made by animals for the purpose of being perceived by (or not perceived by) other animals. This makes natural auditory scenes very different from natural visual scenes, which are dominated by images that happen to be formed by rocks, plants, clouds, etc. with no regard to the signal processing needs of the animals perceiving them.

Throughout the workshop we considered how the concepts we were discussing for the statistics of natural visual scenes might be generalized to the study of other sensory modalities. In particular we used the example that Leslie raised of the cricket cercal system, which senses wind velocity and which is used by the cricket, for example, in distinguishing approaching predators (e.g. wasps) from conspecifics. In general we found it not at all obvious how to generalize from the visual stimulus situation to issues such as locality and phase information for this modality. On the question of how to actually measure natural stimulus ensembles, there were technical problems analogous to those we discussed for visual scenes, but apparently harder. While it is possible to generate arbitrary wind patterns, there aren't good sensors for recording wind patterns. The suggestions were either to use a cricket as a microphone (record from neurons whose response to different wind directions is known), or simply to generate a random wind pattern and vary its properties until it elicits a specific behaviour like wasp-evasion.