An approach to speech watermarking based on the time-frequency signal analysis is proposed. As a time-frequency representation suitable for speech analysis, the S-method is used. The time-frequency characteristics of watermark are modeled by using speech components in the selected region. The modeling procedure is based on the concept of time-varying filtering. A detector form that includes cross-terms in the Wigner distribution is proposed. Theoretical considerations are illustrated by the examples. Efficiency of the proposed procedure has been tested for several signals and under various attacks.