Hybrid architecture for a single-precision arithmetic processor.
Jurca, Lucian; Gontean, Aurel; Alexa, Florin et al.
1. INTRODUCTION
Rising integration density has made logarithmic-number-system (LNS)
processors practical, among them (Coleman et al., 2000) and (Arnold,
2001). In these designs, however, the main difficulty was implementing
the addition and subtraction operations. This drawback can be avoided,
while keeping the qualities of both floating-point (FP) and LNS
arithmetic, by designing a hybrid unit that combines the attributes of
an FP processor with logarithmic arithmetic. A solution in this
direction was already proposed in (Lai, 1993), where addition and
subtraction were performed in FP, and multiplication, division, square
root and all the other operations in LNS. However, the FP-LNS and LNS-FP
format conversions were slow because the linear-interpolation algorithm
used only non-redundant adders.
To speed up the format conversions, a new architecture was proposed
in (Jurca et al., 2007). Because the conversion algorithm consists of a
multiplication followed by a series of additions, the partial products
were summed redundantly together with the other inputs. Moreover,
corrections of one or two LSBs were applied to the contents of some
locations of the log and antilog look-up tables. In this way, the
conversions became twice as accurate as in (Lai, 1993): the conversion
error was kept under $1.5 \times 10^{-7}$, while the FP single-precision
error is $1.2 \times 10^{-7}$.
The aim of this paper is to offer an alternative to classical FP
units because, for some particular applications and with new
improvements, the logarithmic and hybrid units can run faster than the
floating-point ones.
Thus, in section 2 we present the pipelined architecture of the
logarithmic subunit and the conversion algorithms FP-LNS and LNS-FP.
For implementing the floating-point addition/subtraction, a
classical 3-stage pipelined subunit, synchronized with the logarithmic
one (4 ns per stage), was designed. This means that in our case we do
not need a high level of parallelism in the data paths, and thus we can
save area. These aspects are discussed in section 3. To facilitate
comparison with related work, the gate propagation delays in our design
correspond to 0.5-µm CMOS technology, and all the PSpice models in our
digital simulations were set accordingly.
In section 4 we give an example of how a simple DSP algorithm can be
implemented on a hybrid processor, outline the main directions of our
future work and conclude the paper.
2. LOGARITHMIC SUBUNIT
The stages of the logarithmic subunit and the conversion algorithms
will be briefly presented in this section because a more detailed
description can be found in (Jurca et al., 2007).
With the same hardware, the 6-stage pipelined structure presented
in Fig.1 can perform either multiplication or division for two FP
operands, A and B, depending on the bit-line SOP that acts on the third
stage, as we can see in equations (1):
$$
\begin{aligned}
A \times B &= \operatorname{antilog}(\log A + \log B) \\
A / B &= \operatorname{antilog}(\log A - \log B)
\end{aligned}
\tag{1}
$$
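Equations (1) can be illustrated with a minimal software sketch. It uses double-precision library logarithms rather than the table-based converters of the paper, handles positive operands only, and the function name `log_mul_div` and its `divide` flag (standing in for the SOP line) are hypothetical names chosen for illustration.

```python
import math


def log_mul_div(a: float, b: float, divide: bool = False) -> float:
    """Multiply or divide two positive numbers through base-2 logarithms."""
    # Assumption: signs and zero are treated outside the logarithmic path,
    # so this sketch only accepts strictly positive operands.
    if a <= 0 or b <= 0:
        raise ValueError("sketch handles positive operands only")
    log_a, log_b = math.log2(a), math.log2(b)
    # The flag plays the role of the SOP line: subtract for division,
    # add for multiplication.
    log_result = log_a - log_b if divide else log_a + log_b
    return 2.0 ** log_result  # antilog


print(log_mul_div(6.0, 1.5))
print(log_mul_div(6.0, 1.5, divide=True))
```

The only difference between the two operations is the sign applied to the second logarithm, which is why the same hardware can serve both.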
The linear interpolation algorithm for the direct conversion is
based on equation (2):
$$
\log(1+y) \cong y + E_y \pm \Delta E_y \times y_2 \tag{2}
$$
where y is the fractional part of the 23-b significand, of which the
least significant 12 bits represent $y_2$. The values $E_y$ of the
function $\log(1+y) - y$ at $2^{23-N_{y2}} = 2^{N_{y1}} = 2^{11}$ points
are stored in an internal ROM, and the values $\Delta E_y$ of its
derivative function are stored in a second internal ROM, ROM'. For the
LNS-FP conversion the algorithm is similar and only the ROM content
changes.
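The interpolation of equation (2) can be sketched in software as follows. This is an idealised model, not the authors' ROM contents: the tables are filled with unquantised double-precision values, the slope is taken as the finite difference between neighbouring table points, and the names `ROM_E`, `ROM_DE` and `log2_1p_approx` are assumptions made for the example.

```python
import math

N_Y1 = 11             # high fraction bits that index the tables
N_Y2 = 23 - N_Y1      # low 12 bits used for the interpolation
STEP = 2.0 ** -N_Y1   # spacing between table points


def err(y: float) -> float:
    """E_y = log2(1 + y) - y, the quantity the text stores in ROM."""
    return math.log2(1.0 + y) - y


# Assumed table contents: E_y at each of the 2^11 points, and the signed
# slope towards the next point (positive or negative, hence the +/- in (2)).
ROM_E = [err(i * STEP) for i in range(2 ** N_Y1)]
ROM_DE = [(err((i + 1) * STEP) - err(i * STEP)) / STEP
          for i in range(2 ** N_Y1)]


def log2_1p_approx(frac_bits: int) -> float:
    """Approximate log2(1 + y) for a 23-bit fraction given as an integer."""
    index = frac_bits >> N_Y2                           # selects the table point
    y2 = (frac_bits & ((1 << N_Y2) - 1)) / 2.0 ** 23    # offset inside the interval
    y = frac_bits / 2.0 ** 23
    return y + ROM_E[index] + ROM_DE[index] * y2


# Sample the 2^23 possible fractions and record the largest deviation.
worst = max(
    abs(log2_1p_approx(f) - math.log2(1.0 + f / 2.0 ** 23))
    for f in range(0, 2 ** 23, 1021)
)
print(f"max sampled error: {worst:.3e}")
```

Because the tables here are not rounded to a fixed word length, the sampled error reflects only the interpolation itself and not the quantisation effects that the LSB corrections mentioned in the introduction address.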
The unit includes two FP-LNS format converters for the two operands
and one LNS-FP converter which produces the operation result in
floating-point format. In fact, the direct conversions are fused with
the ALU operation precisely in order to reduce the non-redundant
additions to the one performed in the ALU. The Wallace trees are built
with 4:2 compressors.
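The idea of keeping sums in redundant form until a single final addition can be sketched as below. This is a behavioural model on non-negative Python integers, not the authors' circuit: a 4:2 compressor is modelled as two chained 3:2 (full-adder) rows, and all function names are hypothetical.

```python
from typing import List, Tuple


def compress_3_2(a: int, b: int, c: int) -> Tuple[int, int]:
    """Full-adder row: three operands in, a (sum, carry) pair out, no ripple."""
    s = a ^ b ^ c
    carry = ((a & b) | (a & c) | (b & c)) << 1
    return s, carry


def compress_4_2(a: int, b: int, c: int, d: int) -> Tuple[int, int]:
    """4:2 compression modelled as two chained 3:2 rows."""
    s1, c1 = compress_3_2(a, b, c)
    return compress_3_2(s1, c1, d)


def reduce_to_carry_save(rows: List[int]) -> Tuple[int, int]:
    """Reduce any number of operands to two words whose sum is the total."""
    rows = list(rows)
    while len(rows) > 2:
        reduced: List[int] = []
        for i in range(0, len(rows), 4):
            group = rows[i:i + 4]
            if len(group) >= 3:
                group += [0] * (4 - len(group))   # pad a group of three
                reduced.extend(compress_4_2(*group))
            else:
                reduced.extend(group)             # pass leftovers through
        rows = reduced
    rows += [0] * (2 - len(rows))
    return rows[0], rows[1]


def partial_products(a: int, b: int, bits: int) -> List[int]:
    """Shifted copies of a, one for each set bit of b."""
    return [a << i for i in range(bits) if (b >> i) & 1]


a, b, extra = 0b101101, 0b110011, 1000
s, c = reduce_to_carry_save(partial_products(a, b, 6) + [extra])
# Only this last step needs a carry-propagating (non-redundant) adder.
print(s + c == a * b + extra)
```

The extra operand is folded into the same tree as the partial products, so a multiplication followed by additions costs one carry-propagating addition instead of several.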
[FIGURE 1 OMITTED]
Of course, at the output of the second stage the latched data are
in carry-save form ($A_1$, $A_2$, $B_1$, $B_2$), so the binary
logarithms are never produced explicitly anywhere in the pipelined
structure. The exponents $E_A$ and $E_B$ are concatenated as the
integer part of the data ($A_1$, $B_1$), which is then processed in
fixed-point in the ALU. The digital simulations showed that the
propagation delay across each stage was 4 ns.
3. FLOATING-POINT SUBUNIT
The first stage of the floating-point subunit (the alignment of
mantissas) is shown in Fig.2. The adders "ExpA-ExpB" and "ExpB-ExpA"
work simultaneously, and the MSB of "ExpA-ExpB" (whose fan-out is
increased by the block Mlt) selects the bigger exponent and the
associated significand (mantissa) at the output of the stage. It also
selects the positive exponent difference, which controls the
barrel shifter BS that shifts the other significand to the right.
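The selection logic of this stage can be sketched behaviourally as follows. The sketch assumes 24-bit integer significands with the implicit 1 already attached and simply drops the bits shifted out (no guard or rounding bits); the function name `align` and its return order are hypothetical.

```python
from typing import Tuple


def align(exp_a: int, sig_a: int, exp_b: int, sig_b: int) -> Tuple[int, int, int]:
    """Return (bigger exponent, its significand, the other significand shifted)."""
    diff_ab = exp_a - exp_b   # both differences are formed in parallel
    diff_ba = exp_b - exp_a
    a_is_smaller = diff_ab < 0   # the sign (MSB) of ExpA-ExpB drives the selection
    if a_is_smaller:
        # B has the bigger exponent; A is shifted right by the positive difference.
        return exp_b, sig_b, sig_a >> diff_ba
    return exp_a, sig_a, sig_b >> diff_ab


print(align(130, 0xC00000, 127, 0x800000))
```

Computing both differences at once means the positive one is available as soon as the sign bit is known, without a separate negation step.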
The second stage is an adder/subtracter circuit (Fig.3). Adder1 and
Adder2 are 25-b adders because the sign bit and the implicit 1 are
attached to the 23-b significand. In our design we used very fast
2-level hybrid adders (Jurca & Maranescu, 2004): the first level
consists of five (1+2x2) 8-b carry look-ahead adders (CLA, with input
carry 0 and 1 respectively) plus two 9-b CLAs in the most significant
position, and the second level is a carry-select mechanism. The same
type of adder was used in the third stage of the logarithmic subunit.
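The carry-select principle behind such a two-level adder can be sketched as below. The block partition `(8, 8, 9)` is only an illustrative assumption that adds up to 25 bits, not the authors' exact arrangement, and each block sum is computed with ordinary integer addition instead of a modelled CLA.

```python
from typing import Sequence, Tuple


def carry_select_add(a: int, b: int,
                     widths: Sequence[int] = (8, 8, 9),
                     carry_in: int = 0) -> Tuple[int, int]:
    """Add two unsigned words block by block; return (sum, carry out)."""
    result, shift, carry = 0, 0, carry_in
    for index, width in enumerate(widths):
        mask = (1 << width) - 1
        block_a = (a >> shift) & mask
        block_b = (b >> shift) & mask
        if index == 0:
            # The lowest block knows its carry-in, so one adder is enough.
            total = block_a + block_b + carry
        else:
            # Higher blocks compute both candidates in parallel ...
            sum_if_0 = block_a + block_b
            sum_if_1 = block_a + block_b + 1
            # ... and the incoming carry merely selects one of them.
            total = sum_if_1 if carry else sum_if_0
        result |= (total & mask) << shift
        carry = total >> width
        shift += width
    return result, carry


print(carry_select_add(0x0ABCDEF, 0x1000011))
```

All block sums are ready at the same time, so the critical path is one block addition plus a short chain of selections rather than a full-width carry ripple.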
[FIGURE 2 OMITTED]
[FIGURE 3 OMITTED]
The normalization of the final result and the adjustment of the
exponent are done in the third stage (Fig.4). In the case of
subtraction, the number of leading zeros, counted with LZC, is
subtracted from the transmitted exponent and the output of the
barrel shifter BS is selected. In the case of addition, the data
are either shifted one bit to the right or left unshifted.
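A behavioural sketch of this normalization step is given below. It assumes an unsigned magnitude of at most 25 bits (a 24-bit significand plus a possible carry-out), omits rounding, and returns a zero exponent for a zero result; these choices and the name `normalize` are assumptions made for illustration only.

```python
from typing import Tuple


def normalize(exp: int, sig: int, width: int = 24) -> Tuple[int, int]:
    """Return (exponent, significand) with the leading 1 at bit width-1."""
    if sig == 0:
        return 0, 0   # assumed convention for an exact-zero result
    if sig >> width:
        # Addition overflowed by one bit: shift right once, bump the exponent.
        return exp + 1, sig >> 1
    # Subtraction may cancel leading bits: count them and shift left.
    leading_zeros = width - sig.bit_length()
    return exp - leading_zeros, sig << leading_zeros


print(normalize(127, 0x1800000))
print(normalize(127, 0x000001))
```

The two branches correspond to the two cases in the text: a right shift of at most one position after addition, and a left shift by the leading-zero count after subtraction.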
[FIGURE 4 OMITTED]
4. APPLICATIONS. CONCLUSIONS
A DSP algorithm can be implemented very easily on the hybrid unit,
as shown in Fig.5, where the 20th and 29th steps of a 20-term
Livermore-loop computation are presented. The total time is only 140 ns
(35 cycles), but our aim is to improve the hybrid unit further by
reducing its latency and by adding 3-operand multiplication and a shift
capability to the logarithmic ALU, for square root and arbitrary powers.
A remarkable feature of this unit is that it performs a division faster
than the most recent FPUs (Nikmehr et al., 2007).
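The timing figures quoted in the paper follow from the 4 ns stage delay, assuming one pipeline stage per clock cycle. The symbols $T_{\text{LNS}}$, $T_{\text{FP}}$ and $T_{\text{total}}$ are introduced here only for this worked check:

```latex
\begin{align*}
T_{\text{LNS}}   &= 6 \times 4\,\text{ns} = 24\,\text{ns}
  && \text{(6-stage logarithmic subunit)} \\
T_{\text{FP}}    &= 3 \times 4\,\text{ns} = 12\,\text{ns}
  && \text{(3-stage floating-point subunit)} \\
T_{\text{total}} &= 35 \times 4\,\text{ns} = 140\,\text{ns}
  && \text{(35 cycles of the example)}
\end{align*}
```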
[FIGURE 5 OMITTED]
5. REFERENCES
Arnold, M. (2001). A Pipelined LNS ALU, Workshop on VLSI.
Proceedings, pp. 155-161, ISBN: 0-7695-1056-6, Orlando, FL., USA, April
19-20, 2001, IEEE Computer Society.
Coleman, J.N.; Chester, J.E.; Softley, C. & Kadlec, J. (2000).
Arithmetic on the European Logarithmic Microprocessor, IEEE Transactions
on Computers, Special Edition on Computer Arithmetic, Vol. 49, No. 7,
pp.702-715, July 2000, ISSN: 0018-9340.
Jurca, L. & Maranescu, V. (2004). A New Way to Build a Very
Fast Binary Adder, Scientific Bulletin of the Politehnica University of
Timisoara. Transactions on Electronics and Communications, Tom 49 (63),
Fascicola 1, 2004, pp. 193-198, ISSN: 1583-3380.
Jurca, L.; Gontean, A.; Alexa, F. & Curiac, D.I. (2007).
Proposal to Improve Data Format Conversions for a Hybrid Number System
Processor, Proceedings of the 11th WSEAS International Conference on
COMPUTERS, Agios Nikolaos, Crete Island, Greece, July 26-28, 2007, ISSN:
1790-5117, ISBN: 978-960-8457-95-9.
Lai, F. (1993). The Efficient Implementation and Analysis of a
Hybrid Number System Processor, IEEE Transactions on Circuits and
Systems, Part II, Vol. 40, No. 6, June 1993, pp. 382-392, ISSN
1057-7130.
Nikmehr, H.; Phillips, B. & Lim, C.C. (2007). A Fast Radix-4
Floating-Point Divider with Quotient Digit Selection by Comparison
Multiples, The Computer Journal, Vol. 50 Issue 1, pp.81-92, Jan. 2007,
Oxford University Press. ISSN: 0010-4620.