Statistics of Natural Scenes
CNS*95 Meeting Workshops, Monterey CA
The following are my notes from this very informal workshop. The
purpose is to record as much as possible of the discussion. An
introduction was added for the benefit of people who can't imagine why
we met to have such a discussion.
Why this is interesting
The responses of sensory pathway neurons to external stimuli are said
to encode information about those stimuli. Researchers who are
interested in the structure of such neural codes are becoming
increasingly aware of the fact that the description of a code must
include a description of the signal that is being encoded. Using a
different stimulus ensemble, we can potentially get different answers
to many key questions, such as: is the information encoded in a linear
or nonlinear form? Are the signals from multiple cells redundant,
independent, or synergistic in encoding the sensory stimulus? Is the
neural representation efficient (does it exploit the maximum
information capacity of the channel)?
This is because a "code" is essentially a way of discriminating between
different possible messages. A good code may be great at finely
discriminating many subtly different messages that it expects to see
frequently, while it may be abysmal at discriminating among many
messages that it expects never to see, and which it therefore hasn't
been designed to handle well. So to test how the neural code works we
need to know what kinds of messages to use. In the case of sensory
stimuli, we would like to know to what extent the neural code has
taken advantage of the fact that certain stimuli never occur in nature
while others are exceedingly common or at least fairly predictable. In
other words, are the statistics of natural scenes reflected in sensory
neural codes?
This discussion makes more sense if you have some familiarity with the
concepts of Entropy, Information, and Coding.
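As a rough illustration of why the stimulus ensemble matters, here is a minimal sketch (not something presented at the workshop) that computes the average message length of an ideal code when the messages it receives do or do not match the ensemble it was designed for. The two four-message ensembles and the function names are hypothetical, chosen only to make the arithmetic easy to follow.

```python
import numpy as np

def entropy_bits(p):
    """Shannon entropy of a discrete distribution, in bits."""
    p = np.asarray(p, dtype=float)
    nz = p > 0  # terms with p = 0 contribute nothing
    return float(-np.sum(p[nz] * np.log2(p[nz])))

def mean_code_length_bits(p, q):
    """Average bits per message when messages follow p but the code
    is the ideal one for q (this is the cross-entropy of p and q)."""
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    nz = p > 0
    return float(-np.sum(p[nz] * np.log2(q[nz])))

# Hypothetical ensembles over four messages (illustrative numbers only).
skewed = [0.5, 0.25, 0.125, 0.125]   # some messages far more common
flat = [0.25, 0.25, 0.25, 0.25]      # all messages equally likely

print(entropy_bits(skewed))                    # 1.75
print(mean_code_length_bits(skewed, skewed))   # 1.75
print(mean_code_length_bits(skewed, flat))     # 2.0
print(mean_code_length_bits(flat, skewed))     # 2.25
```

A code matched to its ensemble reaches the entropy; either kind of mismatch costs extra bits, which is why the test ensemble affects how efficient a code appears.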
Workshop Participants
Tony Bell (Salk), David Field
(Cornell), Iris Ginzburg (Salk), Bruno Olshausen (Cornell), Leslie
Osborne (Berkeley), Clay Spence
(Sarnoff). Workshop organized by Pam Reinagel (Caltech).
Workshop Discussion
Some definitions in this context
stationary: over the population of images the statistics of the
image are the same everywhere in space.
ergodic: the statistics of one image are a good estimate of the
statistics of the whole ensemble of images.
homogeneous: in each image, the statistics at one location predict
the statistics at other locations in space.
amplitude spectrum vs. power spectrum: The power spectrum is just
the amplitude spectrum squared. Thus when the spatial frequency
distribution of natural visual scenes is said to be "1/f" the
amplitude spectrum is meant; the power spectrum is (1/f)^2.
correlation vs. statistical dependence: correlation refers only
to the linear dependence of x on y, not the entire statistical
relationship between x and y. It is possible to entirely decorrelate
x and y (e.g. by Principal Component Analysis) and still have
statistical dependency between them.
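The last definition can be checked numerically. The sketch below is a toy example of my own rather than a PCA of image data: it draws x uniformly on [-1, 1] and sets y = x^2, so y is completely determined by x and yet the linear correlation between them is close to zero. The variable names and sample size are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(-1.0, 1.0, size=100_000)
y = x**2  # y is completely determined by x

# Linear correlation between x and y: close to zero by symmetry.
r_linear = np.corrcoef(x, y)[0, 1]

# Correlation between |x| and y: large, exposing the dependence.
r_abs = np.corrcoef(np.abs(x), y)[0, 1]

print(f"corr(x, y)   = {r_linear:+.3f}")
print(f"corr(|x|, y) = {r_abs:+.3f}")
```

The dependence only shows up once a nonlinear function of x is considered, which is the sense in which decorrelated signals can still be statistically dependent.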
Statistics of natural visual scenes
First, the statistical properties of natural scenes are not entirely
stationary. For example, the upper parts of natural scenes are brighter
on average than the lower parts. The upper field also has less
variance. Second, natural scenes are not particularly homogeneous. Third,
natural scenes are fairly ergodic, at least for lower order
statistical measures.
The spatial frequencies in natural visual scenes are not equally
represented; the amplitude spectrum of such scenes goes as
1/f. Intuitively, what this means is that there are objects at all
scales, and that we view them from all distances. It was pointed out
that "1/f is everywhere": even the level of the Nile goes as 1/f in
time, and this just means that things happen at all scales. What
is more specific is the exponent which tells you with what relative
likelihood things at different scales happen. In this case, (1/f)^1
means that things happen at all scales with equal amplitude
(there is equal energy in each octave).
However, a visual image constructed to have a 1/f amplitude spectrum
does not look at all natural, because this statistic fails to capture
phase information in the image. Nonetheless from the point of view of
coding, the entropy of the ensemble of [all possible 1/f images] is
incredibly tiny compared to the entropy of the ensemble of [all
possible images], so this statistic helps a lot.
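To see what such a constructed image involves, here is a minimal sketch, assuming a square image and NumPy's FFT conventions, that builds a random-phase image with an exact 1/f amplitude spectrum and checks that its power spectrum is (1/f)^2. The function name, image size and seed are hypothetical, and this is only one of several ways to synthesize such an image.

```python
import numpy as np

def one_over_f_image(n=128, seed=0):
    """Random-phase image whose amplitude spectrum is exactly 1/f."""
    rng = np.random.default_rng(seed)
    # Radial spatial frequency of every FFT coefficient (cycles/pixel).
    freqs = np.fft.fftfreq(n)
    f = np.hypot(freqs[None, :], freqs[:, None])
    # Target amplitude 1/f; the DC term is set to zero (zero-mean image).
    amplitude = np.zeros_like(f)
    amplitude[f > 0] = 1.0 / f[f > 0]
    # Borrow phases from real white noise so the spectrum keeps the
    # conjugate symmetry needed for a real-valued image.
    phase = np.angle(np.fft.fft2(rng.standard_normal((n, n))))
    image = np.fft.ifft2(amplitude * np.exp(1j * phase)).real
    return image, f

image, f = one_over_f_image()
amp = np.abs(np.fft.fft2(image))   # amplitude spectrum
power = amp**2                     # power spectrum = amplitude squared
nonzero = f > 0
print(np.allclose(amp[nonzero] * f[nonzero], 1.0))       # True
print(np.allclose(power[nonzero] * f[nonzero]**2, 1.0))  # True
```

Displayed as a picture, the result looks like cloudy noise rather than a scene, since the phases carry no structure.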
The phase problem
An argument was presented that
global phase structure in the image is not going to help very
much. The visual system is operating locally and this may
reflect the fact that in the world, there are objects that have
location in space. The discussion here was an elaboration of part of
the material presented in Bruno's talk earlier in the meeting. If
visual scenes are encoded as the strength in each "channel," where
each channel encodes some specific orientation at some specific scale
(spatial frequency) over some local section of the image, then one
finds that random scenes constructed to have the same strength on each
channel do a good job at reproducing overall image textures. Such a
measurement over an ensemble of natural scenes would constitute a
statistical description of natural scenes that would implicitly carry
phase information (because the channels are local in space). In this
respect, wavelets are argued to be a better basis function set for
visual scenes than Fourier components. There remained some confusion
as to whether using the known properties of visual processing to
motivate our description of the statistics of natural scenes makes it
circular to then ask whether the nervous system codes that
information efficiently.
Sampling
We discussed at length the problem of how natural
scenes are sampled nonrandomly, both by experimenters (accidentally)
and by other animals (deliberately). For example, if you take a camera
into the woods you will not get an accurate, random sample of all
scenes because (1) you tend to center things, creating a low
frequency peak nonstationarily; (2) the camera's depth of field is
limited; and (3) the film has much less dynamic range than reality.
Real scenes have a dynamic range of 2000:1 while photos have a range
of 30:1. David points out that linear film just truncates the range,
which is much worse than using nonlinear film and a high resolution
scanner, and then inverting the nonlinearity later. However, even if
you get a true random sample of visual scenes, is this really what you
want? We discussed the need for additional measurements of the
statistics of the subset of scenes that animals actually see. Animals
orient their eyes by saccading or moving their bodies, and thereby
very nonrandomly sample the ensemble of natural scenes. In this
respect we noted the neuroethological experiments of Bob Barlow with
Limulus as an example of using an animal-selected scene
ensemble. In connection with this problem, we also discussed the
temporal statistics of natural visual scenes (cf. the work of
Atick and Dong), and noted that it would be particularly interesting
to look at how these statistics are filtered by the animal's behaviour.
Application to image compression
Clay explained some of his
work on video compression for TV. He pointed out that Wiener
filtering is widely assumed to be optimal but that this only holds if
everything is Gaussian and independent, which turns out never to be
the case in real images. On the other hand, the fact that the PDF of
the signal isn't Gaussian could be helpful, because one can judge
whether a given signal is more likely to have come from the (Gaussian)
noise PDF or the (non-Gaussian) signal PDF. There were several more
points made on this topic which I just didn't follow well enough to
write down.
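The idea of judging which PDF a value came from can be sketched with a log-likelihood ratio. This is my own illustration, not Clay's method: it assumes, purely for the example, that the signal PDF is Laplacian (sharply peaked with heavy tails) and that the noise is Gaussian with the same standard deviation. The function names are hypothetical.

```python
import numpy as np

def log_gaussian(x, sigma=1.0):
    """Log density of a zero-mean Gaussian (assumed noise model)."""
    return -0.5 * np.log(2 * np.pi * sigma**2) - x**2 / (2 * sigma**2)

def log_laplace(x, sigma=1.0):
    """Log density of a zero-mean Laplacian with standard deviation
    sigma (an assumed stand-in for a non-Gaussian signal PDF)."""
    b = sigma / np.sqrt(2.0)  # Laplace scale; variance is 2 * b**2
    return -np.log(2 * b) - np.abs(x) / b

def log_likelihood_ratio(x):
    """Positive: 'signal' more likely. Negative: 'noise' more likely."""
    return log_laplace(x) - log_gaussian(x)

for value in (0.0, 1.0, 4.0):
    print(value, float(log_likelihood_ratio(value)))
# 0.0 -> positive, 1.0 -> negative, 4.0 -> positive
```

Under these assumptions, values very near zero and values far out in the tails are attributed to the signal, while intermediate values look more like noise. If both PDFs were Gaussian with the same variance the ratio would be zero everywhere, so the non-Gaussian shape is what makes the judgment possible.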
What about other modalities?
Bruno and David have used the natural
sound library at Cornell to analyze the statistics of natural
sounds. They found that here the (temporal) frequency amplitude
spectrum was (1/f)^(1/2) i.e. the power spectrum was 1/f. One
intuitive slant on this was that big things tend to resonate a long time.
Aside: at a later date Carlos Brody (Caltech) suggested
another way of looking at the result:
The power spectrum going like 1/f as opposed to (1/f)^2 as in images
can also be interpreted as simply meaning that there is the same power
in all octaves, i.e. scale invariance in that sense.
This is because sound occurs in only one dimension (time) whereas
vision occurs in two (up/down and right/left). Thus, when we integrate
power over an octave in vision, we integrate
2*pi*f * (1/f)^2
where the 2*pi*f comes from going around all orientations at that
frequency f. The integrand is therefore 1/f, whose integral is a
logarithm, so the power in an octave is the same at every scale
(which is what we need for scale invariance).
Power over an octave in sound is just the integral of 1/f, leading to
the same result.
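The argument in this aside is easy to check numerically. The sketch below integrates the power over one octave, from f_lo to 2*f_lo, for a 1/f power spectrum in one dimension and a (1/f)^2 power spectrum in two dimensions. The function names, the grid size and the hand-written trapezoid rule are my own choices for the illustration.

```python
import numpy as np

def trapezoid(y, x):
    """Plain trapezoid-rule integral of samples y over grid x."""
    return float(np.sum(0.5 * (y[1:] + y[:-1]) * np.diff(x)))

def octave_power_sound(f_lo, n=10_001):
    """Power in the octave [f_lo, 2*f_lo] for a 1/f power spectrum
    in one dimension (time)."""
    f = np.linspace(f_lo, 2 * f_lo, n)
    return trapezoid(1.0 / f, f)

def octave_power_image(f_lo, n=10_001):
    """Power in the octave [f_lo, 2*f_lo] for a (1/f)^2 power spectrum
    in two dimensions; 2*pi*f sums over all orientations."""
    f = np.linspace(f_lo, 2 * f_lo, n)
    return trapezoid(2 * np.pi * f * (1.0 / f) ** 2, f)

for f_lo in (0.01, 1.0, 100.0):
    print(f_lo, round(octave_power_sound(f_lo), 6),
          round(octave_power_image(f_lo), 6))
```

In both cases the result does not depend on f_lo: each octave of sound carries ln 2 (about 0.693) and each octave of the image carries 2*pi*ln 2 (about 4.355), in the arbitrary units of the spectra.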
It was pointed out that natural sounds are dominated by sounds
made by animals for the purpose of being perceived by (or not
perceived by) other animals. This makes natural auditory scenes very
different from natural visual scenes, which are dominated by images
that happen to be formed by rocks, plants, clouds, etc. with no regard
to the signal processing needs of the animals perceiving them.
Throughout the workshop we considered how the concepts we were
discussing for the statistics of natural visual scenes might be
generalized to the study of other sensory modalities. In particular we
used the example that Leslie raised of the cricket cercal system,
which senses wind velocity and which the cricket uses, for
example, in distinguishing approaching predators (e.g., wasps)
from conspecifics. In general we found it not at all obvious how to
generalize from the visual stimulus situation to issues such as
locality and phase information for this modality.
On the question of how to actually measure natural stimulus ensembles,
there were technical problems analogous to those we discussed for
visual scenes, but apparently harder. While it is possible to
generate arbitrary wind patterns, there aren't good sensors for
recording wind patterns. The suggestions were either to use a cricket
as a microphone (record from neurons whose response to different wind
directions is known), or simply generate a random wind pattern and
vary its properties until it elicits a specific behaviour like
wasp-evasion.
copyright 1995 Pam Reinagel