We present an efficient circulant-approximation-based MIMO equalizer architecture for the CDMA downlink. The approximation reduces the direct matrix inverse (DMI) of size $(NF \times NF)$, which has $O((NF)^3)$ complexity, to a set of FFT operations with $O(NF \log_2 F)$ complexity plus the inversion of $(N \times N)$ submatrices. We then propose parallel and pipelined VLSI architectures that use Hermitian optimization and a reduced-state FFT to lower the complexity further. Generic VLSI architectures for the high-order $(4 \times 4)$ receiver are derived from partitioned $(2 \times 2)$ submatrices, which leads to a more parallel VLSI design with a further $3\times$ complexity reduction. A comparative study against both the conjugate-gradient and DMI algorithms shows a very promising performance/complexity tradeoff. The VLSI design space is explored extensively in terms of area/time efficiency for layered parallelism and pipelining, using a Catapult C high-level-synthesis methodology.
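The reduction from one large inverse to FFTs plus small inverses can be illustrated with a minimal NumPy sketch. It assumes the $(NF \times NF)$ matrix is exactly block circulant with $F$ blocks of size $(N \times N)$. The function names `block_circulant` and `solve_block_circulant` are hypothetical and chosen for illustration only; the sketch does not reproduce the proposed VLSI architecture, the Hermitian optimization, or the reduced-state FFT.

```python
import numpy as np


def block_circulant(blocks):
    """Build the dense (N*F x N*F) block-circulant matrix from F blocks of size N x N.

    Block (i, j) of the result is blocks[(i - j) mod F].
    """
    F, N, _ = blocks.shape
    C = np.zeros((N * F, N * F), dtype=blocks.dtype)
    for i in range(F):
        for j in range(F):
            C[i * N:(i + 1) * N, j * N:(j + 1) * N] = blocks[(i - j) % F]
    return C


def solve_block_circulant(blocks, b):
    """Solve C x = b for block-circulant C without forming or inverting C.

    An FFT along the block index block-diagonalizes C, so the large system
    splits into F independent N x N systems, one per frequency bin.
    """
    F, N, _ = blocks.shape
    lam = np.fft.fft(blocks, axis=0)               # F frequency-domain N x N blocks
    b_hat = np.fft.fft(b.reshape(F, N), axis=0)    # right-hand side per frequency bin
    # Batched solve of the F small systems; the trailing axis makes each
    # right-hand side an explicit N x 1 column.
    x_hat = np.linalg.solve(lam, b_hat[..., None])[..., 0]
    return np.fft.ifft(x_hat, axis=0).reshape(F * N)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    N, F = 4, 8                                    # illustrative sizes (assumption)
    blocks = rng.standard_normal((F, N, N)) + 1j * rng.standard_normal((F, N, N))
    blocks[0] += 100.0 * np.eye(N)                 # keep the system well conditioned
    b = rng.standard_normal(N * F) + 1j * rng.standard_normal(N * F)

    x_fft = solve_block_circulant(blocks, b)
    x_dmi = np.linalg.solve(block_circulant(blocks), b)   # dense reference solve
    print("max abs difference:", np.max(np.abs(x_fft - x_dmi)))
```

The dense reference solve stands in for the DMI and is included only to check the FFT-based path. In this sketch the large inverse never appears: the work is the FFTs along the block index plus $F$ independent $(N \times N)$ solves.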