We present a cortical-like strategy to obtain reliable estimates of the motions of objects in a scene toward/away from the observer (motion in depth), from local measurements of binocular parameters derived from direct comparison of the results of monocular spatiotemporal filtering operations performed on stereo image pairs. This approach is suitable for a hardware implementation, in which such parameters can be gained via a feedforward computation (i.e., collection, comparison, and punctual operations) on the outputs of the nodes of recurrent VLSI lattice networks, performing local computations. These networks act as efficient computational structures for embedded analog filtering operations in smart vision sensors. Extensive simulations on both synthetic and real-world image sequences prove the validity of the approach that allows to gain high-level information about the 3D structure of the scene, directly from sensorial data, without resorting to explicit scene reconstruction.