Signal driven partitioning large circuits.
Bucur, Ion ; Cupcea, Nicolae ; Stefanescu, Costin 等
1. INTRODUCTION
Field Programmable Gate Arrays (FPGAs), providing both large-scale
integration and user-programmability, are essential architectures. FPGA packages, as a general feature, have maximum size Combinational Logic Blocks (CLBs) constraints much larger than the number of input-output
pins IOBs. Therefore, realization of a large logic network into working
FPGA involves network division into a near balanced packing of CLBs and
Input-Output Blocks (IOBs). Partitioning is a modus operandi of
separating a circuit or system into a set of smaller sub-circuits with
almost equal sizes targeting to lessen the amount of links between the
sub-circuits. Market trend pressure yields larger systems, both to make
manufacture cheaper and to take advantage of the optimization probable
of the entire structure. In this work is presented new multilevel
multi-resource partitioning algorithm for partitioning large
combinational circuits in order to efficiently use existing and
commercially available FPGAs packages.
2. SPECIFIC ISSUES
FPGA circuit realization has two main phases. Placement phase, the
first one, is dedicated to assign useful locations within the FPGA
structure, to the obtained blocks. This could be, however, an iterative
process. Routing phase, the last one, provides the links between these
blocks. Circuit partitioning is used, however, twice in FPGA
implementation. First usage concerns too large designs to fit available
FPGA packages. A less obvious usage of network partitioning is used in
the blocks' placement phase. Placement procedures based on circuit
partitioning yields amazing results.
3. PREVIOUS WORK
Typical partitioning objectives such as minimum-width bisection and
minimum ratio cut are NP-complete and require such heuristics as
simulated annealing, greedy k-opt interchange or quadratic optimization--via relaxation or spectral methods (Bucur, 2006).
In (Brasen & Saucier, 1998) two new partitioning algorithms are
presented that use cone structures to partition large hierarchical
blocks into FPGA's. Cone structures are minimum cut partitioning
structures for net lists with low fan-out, and clustering structures for
partitioning net lists with high fan-out. Cone structures also allow for
full containment of critical paths.
The authors of the paper (Kwon & Kyung, 2004) proposed a
circuit partitioning algorithm called SCheduling driven Algorithm for
TOMi (SCATOMi) for multi-FPGA systems with interconnection architecture
called Time-multiplexed, Off-chip, Multi-casting interconnection (TOMi).
SCATOMi improves the performance of the TOMi architecture by limiting
the number of inter-FPGA signal transfers on the critical path and
considering the scheduling of inter-FPGA signal transfer.
The paper (Muthukumar & Selvaraj, 2003) deals with the problem
of determining the set of best free and bound variables (variable
partitioning problem) for disjoint (disjoint serial) decomposition; such
that the decomposed circuits are smaller in size and its truth table
representation have maximal don't cares. The proposed approach has
been successfully implemented and tested with MCNC and Espresso
benchmarks.
4. PROBLEM FORMULATION
In this section, is introduced new multilevel partitioning
algorithm for combinational Boolean networks. A combinational Boolean
network C can be represented traditionally as a directed acyclic graph G
= (V, E) where each node n (n [member of] V) represents a logic gate and
a directed edge (i, j), ((i, j) [member of] E) exists if the output of
gate i is an input of gate j. A primary input (PI) node has no incoming
edge and a primary output (PO) node has no outgoing edge (De Micheli,
1994). A disjoint Q-way partitioning solution S = ([A.sub.1], [A.sub.2],
... [A.sub.K]) satisfies the following conditions:
[A.sub.i] [intersection] [A.sub.j] = [empty set], i [not equal to]
j and [U.sub.o<i<Q+1] [A.sub.i] contains all the gates in the
network, than [A.sub.1], [A.sub.2], ... [A.sub.K] are known as clusters
of G(C).
Each node in C has only one output line and limited number of input
lines. It is used input(v) to denote the set of fanins of gate. Given a
subgraph H of the Boolean network, let input(H) denotes the set of
distinct nodes outside H, which supply inputs to the nodes in H (fanins
of H). For a node n in the network, a w-feasible cone at n, denoted
[K.sub.n], is a subgraph consisting of node n and its predecessors (u is
a predecessor of n if there is a directed path from u to n), such that
[absolute value of input([K.sub.n])] [less than or equal to] w and any
path connecting a node in [K.sub.n] and n lies entirely in [K.sub.n].
The level of a node is the length of the longest path from any PI node
to n. The level of a PI node is zero. The depth of a network is the
largest node level in the network (Hachtel & Somenzi, 1996). A
Boolean network is p-bounded if [absolute value of input(n)] [less than
or equal to] p for each node n in the network.
5. SIGNAL DRIVEN PARTITIONING
Signal driven partitioning procedure was implemented atop of
SIS-1.2 (***, 1994). Implemented algorithm is splitting-up circuit C,
before mapping it. Combinational circuits could be very large and
cluster partitioning helps obtaining more technological compliant
mapping over the initial circuit. Each internal node structure has
additional array denoted po_label, mapping all POs nodes of the circuit.
First traversal, depth first search from outputs, establish nodes
affiliation with respect to the PO nodes. An internal node having more
than one element not zero in its po_label array belongs to more than one
PO transitive cluster, and it's said to be multiple dominated. If
node n belongs to the transitive clusters of P[O.sub.1], P[O.sub.2] and
P[O.sub.3], as an example, than po_label(1) = po_label(2) = po_label(3)
= 1. Such nodes are defining sub-cluster(1,2,3) as the intersection of
the three mentioned clusters. Mapping for k-LUT is made over homogenous dominated clusters. It means that nodes dominated only by PO z, or by PO
z and PO t, as an example, will be mapped in separate mapping sessions.
This strategy separates nodes having fan-outs in more than one single
output cluster avoiding interactions during mappings in different
primary output clusters. Additionally, clusters with multiple domination
identification are making simpler the task of mapping for critical
performance.
6. EXPERIMENTAL RESULTS
Implemented cluster partitioning algorithm working with
minLevelMapIVv3 (cluster mapping tool) was tested against minDepth, our
reference mapping tool, used without cluster partitioning. Results are
presented in Tab. 1. Circuits, listed in Tab. 1, are taken from MCNC91
multilevel examples benchmark; are selected the most representative ones
(used in mostly similar works). Cluster partitioning procedure was
designed first to minimize the critical path delay. This was implemented
by merging those clusters containing nodes belonging to the critical
path but having enough slack delay in order to avoid introducing other
costs to the partitioning objective.
Actual algorithm is computing all clusters Clusters(n) rooted on
each internal node n of the network and having less inputs than k (k =
number of input lines of each LUT in FPGA) practically. Comparing
results, in Tab. 1, for minDepth and minLevelMapIVv3 it is obvious that
almost all results are a little bit less competitive for minLevelMapIVv3
with cluster partitioning. However, these results are better than those
previously reported in (Bucur, 2006), because were introduced several
new and powerful heuristics. These heuristics are working progressively
and are searching best clusters based on distance among clusters having
the same slack delay.
Results obtained, using actual algorithm minLevelMapIVv3 with
cluster partitioning, are on average only 0.7% larger than those
computed with minDepth (no cluster partitioning). Close examination of
circuits are showing that several nodes are still avoidable duplicated.
LUT count values are pointing out that, keeping the same depth of the
homologues circuits, the area is mostly the same in the partitioned
mapped circuit and in the not-partitioned mapped one. However, actual
model is using the simplest model of the depth (level) and no references
are made to the real interconnections and to their delay (Bucur et al.,
2008). Implemented algorithms are mainly oriented to find best
performance solution, using the depth model.
7. CONCLUSION AND FUTURE WORK
Actual heuristic guided cluster-based partitioning approach is
showing improvements, both in terms of the cut size and in the run time,
compared with initial approach (partitioning the given circuit before
mapping).
Since wholly automatic partitioning is essential for fast
iterations in the design and optimization steps, substantial study is
made in university circles as well as in industry to make possible and
get better the not so easy decisions on functional stage.
Both mapping algorithms are, actually, under research and progress
in order to be able to allow a variety of delay models simultaneously
with new mapping routines in order to gain also enhanced area results.
Cluster partitioning algorithm will be enhanced with better fast
cost estimators (comparing with the actual distance). This will improve
efficiency mainly on non-critical path cones process.
8. REFERENCES
Brasen, D. R. & Saucier, G. (1998). Using cone structures for
circuit partitioning into FPGA packages. IEEE Trans. on Computer-Aided
Design of Integrated Circuits and Systems, Vol. 17, No7, pp. 592-600,
ISSN: 0278-0070
Bucur, I. (2006). Partitioning combinational circuits for k-LUT
FPGA mapping. University Politehnica of Bucharest Scientific Bulletin,
Series C: Electrical Engineering, Vol. 68, No. 2, (2006),pp. 91-100,
ISSN: 1454-234x
Bucur, I.; Fagarasan, I.; Popescu, C.; Boiangiu, C.-A. & Culea,
G. (2008). On K-LUT Based FPGA Optimum Delay and Optimal Area Mapping.
Proc. of the 10th WSEAS Intnl. Conf. on Mathematical and Computational
Methods in Science and Engineering, pts I and II, Iliescu M et al.
(Eds), pp.137-142,ISBN:978-960-474-019-2,Bucharest,Nov.2008, World
Scientific and Engineering ACAD and SOC, Athens
De Micheli, G. (1994). Synthesis and Optimization of Digital
Circuits, McGraw-Hill, Inc., ISBN 0-07-016333-2, USA
Hachtel, G. D. & Somenzi, F. (1996). Logic Synthesis and
Verification Algorithms, Springer Science+Business Media, Inc., ISBN
978--0-387-31004-6, USA
Kwon, Y.-S. & Kyung, C.-M. (2004). Scheduling driven circuit
partitioning algorithm for multiple FPGAs using time-multiplexed,
off-chip, multi-casting interconnection architecture. Microprocessors
and Microsystems Vol. 28, Issues 5-6, 2 August 2004, pp. 341-350, ISSN:
0141-9331
Muthukumar, V. & Selvaraj, H. (2003). Comparison of Heuristic
Algorithms for Variable Partitioning in Circuit Implementation.
Proceedings of 16th International Conference on VLSI Design, pp. 51-57,
New Delhi, India, January 04-08 2003, IEEE Computer Science, USA NJ
*** (1994). SIS-1.2, Synthesis of both synchronous and asynchronous sequential circuits, Available from:
http://embedded.eecs.berkeley.edu/pubs/downloads/sis/inde x.htm,
Accessed on: 2009-07-12
Tab. 1. Comparative experimental results
Circuit Depth minDepth minLevelMapIVv3
LUT count LUT count
9sym 5 7 8
C499 4 67 68
C5315 8 500 501
C880 7 130 133
alu2 5 129 130
alu4 5 549 550
apex2 5 150 152
apex4 5 875 881
apex6 4 222 222
apex7 4 67 71
b9 3 37 39
clip 3 44 45
count 3 74 74
des 5 1014 1020
duke2 4 151 152
e64 3 338 339
f51m 3 51 51
rd84 3 13 13
rot 6 204 206
sao2 4 57 57
vg2 3 35 35
4714 4747