gmin specification (first draft)

Goal

Image segmentation without relying on stored representations of familiar objects. Here, image segmentation is understood as parsing a scene into (complete) objects and a background with an appropriate depth ordering. Note that this definition differs from the common one in the computer vision literature, where segmentation is treated as a “process of partitioning a digital image into multiple segments”.

Scope/Assumptions

Only the perceptual part of the visual system is considered (no “vision for action”).
No saccades, since we can correctly parse a given scene without moving the eyes, no matter where we are fixating.
Uniform receptive field sizes, an assumption that holds approximately for central vision within a 4 deg radius (Fig. 1a in Freeman & Simoncelli, 2011). Importantly, this assumption reduces the problem of dealing with cluttered scenes to one where only a few (possibly incomplete) objects are present at a time.
Grayscale processing (i.e., magnocellular pathway only). Color can be included later by choosing an appropriate transformation (color-opponent channels or the DKL space; milestone for version 2.0).
No motion for now (milestone for version 3.0).
No 3D reconstructions, only depth ordering of surfaces.
No object recognition (since we’re avoiding top-down effects).
No attention. Coarse segmentation is assumed to occur pre-attentively. Finer segmentation might involve incremental grouping (Roelfsema, 2006) but the goal of this model is to provide the initial segmentation.
No explicit task. Segmentation should happen pre-attentively.

Proposed processing sequence

Similar, though less complete, approaches have been taken by Regan (2000), Self & Roelfsema (2013), Geisler & Super (2000), and Shi & Malik (2000).

1. Feature detection

Features are detected at every pixel by convolving the image with appropriate filters. Currently, this step is limited to a convolution with odd and even Gabors at multiple scales to detect orientation, polarity, and contrast magnitude. A maximum is then computed over the resulting responses so that each feature is extracted at its best scale.
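A minimal sketch of this step, under assumptions that are not part of the specification: the function names (gabor_pair, response, detect_feature), the two scales, the four orientations, the wavelength of four times sigma, and the normalization by the envelope sum are all illustrative choices. Contrast magnitude is taken here as the energy of the even/odd pair and polarity as the sign of the odd response.

```python
import math

def gabor_pair(sigma, theta, wavelength):
    """Build an (even, odd) Gabor kernel pair as square lists of lists."""
    half = int(math.ceil(3 * sigma))
    coords = range(-half, half + 1)
    even, odd, norm = [], [], 0.0
    for y in coords:
        row_even, row_odd = [], []
        for x in coords:
            # Position along the carrier, i.e. along the direction theta.
            u = x * math.cos(theta) + y * math.sin(theta)
            envelope = math.exp(-(x * x + y * y) / (2.0 * sigma * sigma))
            row_even.append(envelope * math.cos(2.0 * math.pi * u / wavelength))
            row_odd.append(envelope * math.sin(2.0 * math.pi * u / wavelength))
            norm += envelope
        even.append(row_even)
        odd.append(row_odd)
    # Subtract the mean so the even kernel ignores uniform luminance, then
    # divide by the envelope sum so responses are comparable across scales
    # (an assumed normalization, not prescribed by the specification).
    mean = sum(map(sum, even)) / len(coords) ** 2
    even = [[(v - mean) / norm for v in row] for row in even]
    odd = [[v / norm for v in row] for row in odd]
    return even, odd

def response(image, kernel, cx, cy):
    """Correlate one kernel with the image at (cx, cy), replicating edge pixels."""
    half = len(kernel) // 2
    h, w = len(image), len(image[0])
    total = 0.0
    for dy in range(-half, half + 1):
        row = image[min(max(cy + dy, 0), h - 1)]
        for dx in range(-half, half + 1):
            total += kernel[dy + half][dx + half] * row[min(max(cx + dx, 0), w - 1)]
    return total

def detect_feature(image, cx, cy, sigmas=(1.0, 2.0), n_orientations=4):
    """Return (magnitude, theta, polarity, sigma) of the strongest response.

    Kernels are rebuilt on every call for brevity; a real implementation
    would build the filter bank once.
    """
    best = (0.0, None, 0, None)
    for sigma in sigmas:
        for k in range(n_orientations):
            theta = k * math.pi / n_orientations
            even, odd = gabor_pair(sigma, theta, wavelength=4.0 * sigma)
            r_even = response(image, even, cx, cy)
            r_odd = response(image, odd, cx, cy)
            magnitude = math.hypot(r_even, r_odd)  # energy of the quadrature pair
            if magnitude > best[0]:
                polarity = 1 if r_odd > 0 else -1 if r_odd < 0 else 0
                best = (magnitude, theta, polarity, sigma)
    return best
```

In this sketch theta is the direction of the luminance change (normal to the edge), and a polarity of +1 means the image is brighter in the direction of theta. A pixel in a perfectly uniform region returns a magnitude of 0 and no orientation.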

2. Center-surround suppression

Since features usually span more than a single pixel, many nearly identical features are detected in a local neighborhood (possibly equivalent to the classical receptive field). This information is redundant and can be coarsened by computing a maximum within a local neighborhood and suppressing the remaining locations (see Sharon et al. (2006) for a similar approach).
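The coarsening can be sketched as a plain non-maximum suppression over a map of contrast magnitudes. The function name suppress_non_maxima and the square neighborhood with a radius parameter are assumptions made for illustration; the specification does not fix the neighborhood shape or size.

```python
def suppress_non_maxima(energy, radius=1):
    """Keep a location only if it holds the maximum of its local neighborhood.

    `energy` is a 2D list of non-negative feature magnitudes. Every other
    location is set to 0. Ties (plateaus) are all kept in this sketch.
    """
    h, w = len(energy), len(energy[0])
    out = [[0.0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            value = energy[y][x]
            if value <= 0:
                continue
            local_max = max(
                energy[j][i]
                for j in range(max(0, y - radius), min(h, y + radius + 1))
                for i in range(max(0, x - radius), min(w, x + radius + 1))
            )
            if value >= local_max:
                out[y][x] = value
    return out
```

A larger radius gives a coarser set of surviving features, which is one way to trade redundancy against spatial resolution.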

3. Pooling over features (similarity grouping)

Various statistics over the extracted features are computed in the extra-classical receptive field. For example, features could be grouped using the proximity assumption: things close in spacetime / feature space are more likely to belong to the same collection (Földiák, 1991), giving rise to Gestalt grouping principles. Portilla & Simoncelli (2000), Rosenholtz et al. (2009) and Balas et al. (2009) explored other possible pooling statistics. The outcome of this step is the grouping strength, i.e., a probability of features belonging to the same collection.
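As one possible reading of the proximity assumption, the grouping strength of two features could be a product of Gaussian falloffs in image distance and in orientation difference. This is only an assumed form for illustration: the function name grouping_strength, the feature layout (x, y, theta), and both sigma values are hypothetical, and the pooling statistics cited above are considerably richer.

```python
import math

def grouping_strength(a, b, sigma_pos=3.0, sigma_ori=math.pi / 8):
    """Return a value in (0, 1]: how strongly features a and b group.

    Each feature is (x, y, theta), with theta an orientation in radians.
    Orientation is treated as periodic with period pi, so 0 and pi are
    the same orientation.
    """
    (xa, ya, ta), (xb, yb, tb) = a, b
    dist_sq = (xa - xb) ** 2 + (ya - yb) ** 2
    d_ori = abs(ta - tb) % math.pi
    d_ori = min(d_ori, math.pi - d_ori)
    spatial = math.exp(-dist_sq / (2.0 * sigma_pos ** 2))
    featural = math.exp(-d_ori ** 2 / (2.0 * sigma_ori ** 2))
    return spatial * featural
```

Identical features get a strength of 1, and the strength decays smoothly as features move apart in either space. Other feature dimensions (contrast, polarity, time) could be added as further factors in the same way.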

4. More complex feature detection and pooling

The above-described steps could be performed again, at multiple levels of the hierarchy, using more complex features (such as curved contours). These steps might be necessary for first- (e.g., orientation-defined) and second-order grouping displays. For example, in orientation-defined displays, boundaries at texture discontinuities are detected at this stage of processing.

5. Segmentation into objects / collections

So far, grouping has occurred only locally. In the final step, we compute which elements go with which ones globally: if A and B go together, and B and C go together, then A and C go together even if they do not group well directly. This computation could be done by connecting all pairs that group above a certain threshold and treating each connected set as one collection. Belonging to a collection can mean an increase in the firing rate (Roelfsema et al., 2004), an activation of collection units (similar to the “grandmother cell” concept), firing synchronization (von der Malsburg, 1981), or an aligned temporal sequence (Wehr & Laurent, 1996).
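The transitive step can be sketched as finding connected components over the thresholded pairwise strengths, here with a small union-find structure. The function name segment, the dictionary layout of the strengths, and the default threshold of 0.5 are assumptions for illustration only.

```python
def segment(elements, strengths, threshold=0.5):
    """Group elements into collections by linking pairs above a threshold.

    `strengths` maps (a, b) pairs to a grouping strength. Pairs that are
    missing from the mapping are treated as not grouping.
    """
    parent = {e: e for e in elements}

    def find(e):
        # Follow parent links to the representative, shortening the path.
        while parent[e] != e:
            parent[e] = parent[parent[e]]
            e = parent[e]
        return e

    for (a, b), strength in strengths.items():
        if strength > threshold:
            parent[find(a)] = find(b)

    groups = {}
    for e in elements:
        groups.setdefault(find(e), []).append(e)
    return sorted(groups.values())
```

With strengths A-B = 0.9, B-C = 0.8 and A-C = 0.1, the elements A, B and C end up in one collection even though A and C do not group directly, and an unconnected element D forms a collection of its own.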

6. Border ownership

Once collections of features are found, border ownership computation can proceed using the convexity bias assumption: objects tend to be convex (Kogo et al., 2010).
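A deliberately simplified sketch of the convexity bias, and not the model of Kogo et al. (2010): for an ordered border fragment, sum the signed turning angles and assign ownership to the side toward which the border bends. The function name convex_side and the choice of mathematical coordinates (y pointing up) are assumptions; with image coordinates (y pointing down) the two labels swap.

```python
import math

def convex_side(contour):
    """Return 'left', 'right', or None for an ordered list of (x, y) points.

    'left' and 'right' are relative to the direction of travel along the
    contour. The side toward which the contour bends is the convex side,
    which under the convexity bias is taken to own the border.
    """
    total_turn = 0.0
    for p0, p1, p2 in zip(contour, contour[1:], contour[2:]):
        ax, ay = p1[0] - p0[0], p1[1] - p0[1]
        bx, by = p2[0] - p1[0], p2[1] - p1[1]
        cross = ax * by - ay * bx
        dot = ax * bx + ay * by
        total_turn += math.atan2(cross, dot)  # signed turning angle at p1
    if abs(total_turn) < 1e-9:
        return None  # straight border: convexity gives no evidence
    return "left" if total_turn > 0 else "right"
```

Traversing the same fragment in the opposite direction flips the label, so the answer refers to the same physical side either way. A straight border returns None, in which case other cues would have to decide ownership.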

7. Depth assignment / occlusion computation

Once border ownership has been computed, parts of objects might appear to be missing. These missing parts can inform the depth ordering and are relevant for predicting the possible shape behind the occluder. These predictions might be used later on to refine the segmentation.
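If this step yields pairwise relations of the form "surface X occludes surface Y", a depth ordering could be derived by layering the surfaces so that every surface lies behind all of its occluders. The sketch below assumes such relations are already available; the function name depth_layers and the input format are hypothetical, and how the relations are obtained from missing parts is left open.

```python
def depth_layers(surfaces, occludes):
    """Order surfaces into depth layers, frontmost layer first.

    `occludes` is a list of (front, back) pairs. A surface is placed in the
    first layer in which all surfaces in front of it have been placed.
    """
    in_front_of = {s: set() for s in surfaces}
    for front, back in occludes:
        in_front_of[back].add(front)

    layers, placed, remaining = [], set(), set(surfaces)
    while remaining:
        layer = sorted(s for s in remaining if in_front_of[s] <= placed)
        if not layer:
            # Mutual occlusion (e.g. interleaved surfaces) has no single ordering.
            raise ValueError("cyclic occlusion relations")
        layers.append(layer)
        placed.update(layer)
        remaining.difference_update(layer)
    return layers
```

Surfaces with no known occlusion relation between them share a layer, which matches the scope of this model: only a depth ordering is produced, not a 3D reconstruction.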