文章基本信息

标题：Signal driven partitioning large circuits.
作者：Bucur, Ion ; Cupcea, Nicolae ; Stefanescu, Costin 等
期刊名称：Annals of DAAAM & Proceedings
印刷版ISSN：1726-9679
出版年度：2009
期号：January
语种：English
出版社：DAAAM International Vienna
摘要：Field Programmable Gate Arrays (FPGAs), providing both large-scale integration and user-programmability, are essential architectures. FPGA packages, as a general feature, have maximum size Combinational Logic Blocks (CLBs) constraints much larger than the number of input-output pins IOBs. Therefore, realization of a large logic network into working FPGA involves network division into a near balanced packing of CLBs and Input-Output Blocks (IOBs). Partitioning is a modus operandi of separating a circuit or system into a set of smaller sub-circuits with almost equal sizes targeting to lessen the amount of links between the sub-circuits. Market trend pressure yields larger systems, both to make manufacture cheaper and to take advantage of the optimization probable of the entire structure. In this work is presented new multilevel multi-resource partitioning algorithm for partitioning large combinational circuits in order to efficiently use existing and commercially available FPGAs packages.
关键词：Circuit design;Integrated circuits;Semiconductor chips;Very large scale integration;Very-large-scale integration

Signal driven partitioning large circuits.

Bucur, Ion ; Cupcea, Nicolae ; Stefanescu, Costin 等

1. INTRODUCTION

Field Programmable Gate Arrays (FPGAs), providing both large-scale integration and user-programmability, are essential architectures. FPGA packages, as a general feature, have maximum size Combinational Logic Blocks (CLBs) constraints much larger than the number of input-output pins IOBs. Therefore, realization of a large logic network into working FPGA involves network division into a near balanced packing of CLBs and Input-Output Blocks (IOBs). Partitioning is a modus operandi of separating a circuit or system into a set of smaller sub-circuits with almost equal sizes targeting to lessen the amount of links between the sub-circuits. Market trend pressure yields larger systems, both to make manufacture cheaper and to take advantage of the optimization probable of the entire structure. In this work is presented new multilevel multi-resource partitioning algorithm for partitioning large combinational circuits in order to efficiently use existing and commercially available FPGAs packages.

2. SPECIFIC ISSUES

FPGA circuit realization has two main phases. Placement phase, the first one, is dedicated to assign useful locations within the FPGA structure, to the obtained blocks. This could be, however, an iterative process. Routing phase, the last one, provides the links between these blocks. Circuit partitioning is used, however, twice in FPGA implementation. First usage concerns too large designs to fit available FPGA packages. A less obvious usage of network partitioning is used in the blocks' placement phase. Placement procedures based on circuit partitioning yields amazing results.

3. PREVIOUS WORK

Typical partitioning objectives such as minimum-width bisection and minimum ratio cut are NP-complete and require such heuristics as simulated annealing, greedy k-opt interchange or quadratic optimization--via relaxation or spectral methods (Bucur, 2006).

In (Brasen & Saucier, 1998) two new partitioning algorithms are presented that use cone structures to partition large hierarchical blocks into FPGA's. Cone structures are minimum cut partitioning structures for net lists with low fan-out, and clustering structures for partitioning net lists with high fan-out. Cone structures also allow for full containment of critical paths.

The authors of the paper (Kwon & Kyung, 2004) proposed a circuit partitioning algorithm called SCheduling driven Algorithm for TOMi (SCATOMi) for multi-FPGA systems with interconnection architecture called Time-multiplexed, Off-chip, Multi-casting interconnection (TOMi). SCATOMi improves the performance of the TOMi architecture by limiting the number of inter-FPGA signal transfers on the critical path and considering the scheduling of inter-FPGA signal transfer.

The paper (Muthukumar & Selvaraj, 2003) deals with the problem of determining the set of best free and bound variables (variable partitioning problem) for disjoint (disjoint serial) decomposition; such that the decomposed circuits are smaller in size and its truth table representation have maximal don't cares. The proposed approach has been successfully implemented and tested with MCNC and Espresso benchmarks.

4. PROBLEM FORMULATION

In this section, is introduced new multilevel partitioning algorithm for combinational Boolean networks. A combinational Boolean network C can be represented traditionally as a directed acyclic graph G = (V, E) where each node n (n [member of] V) represents a logic gate and a directed edge (i, j), ((i, j) [member of] E) exists if the output of gate i is an input of gate j. A primary input (PI) node has no incoming edge and a primary output (PO) node has no outgoing edge (De Micheli, 1994). A disjoint Q-way partitioning solution S = ([A.sub.1], [A.sub.2], ... [A.sub.K]) satisfies the following conditions:

[A.sub.i] [intersection] [A.sub.j] = [empty set], i [not equal to] j and [U.sub.o<i<Q+1] [A.sub.i] contains all the gates in the network, than [A.sub.1], [A.sub.2], ... [A.sub.K] are known as clusters of G(C).

Each node in C has only one output line and limited number of input lines. It is used input(v) to denote the set of fanins of gate. Given a subgraph H of the Boolean network, let input(H) denotes the set of distinct nodes outside H, which supply inputs to the nodes in H (fanins of H). For a node n in the network, a w-feasible cone at n, denoted [K.sub.n], is a subgraph consisting of node n and its predecessors (u is a predecessor of n if there is a directed path from u to n), such that [absolute value of input([K.sub.n])] [less than or equal to] w and any path connecting a node in [K.sub.n] and n lies entirely in [K.sub.n]. The level of a node is the length of the longest path from any PI node to n. The level of a PI node is zero. The depth of a network is the largest node level in the network (Hachtel & Somenzi, 1996). A Boolean network is p-bounded if [absolute value of input(n)] [less than or equal to] p for each node n in the network.

5. SIGNAL DRIVEN PARTITIONING

Signal driven partitioning procedure was implemented atop of SIS-1.2 (***, 1994). Implemented algorithm is splitting-up circuit C, before mapping it. Combinational circuits could be very large and cluster partitioning helps obtaining more technological compliant mapping over the initial circuit. Each internal node structure has additional array denoted po_label, mapping all POs nodes of the circuit. First traversal, depth first search from outputs, establish nodes affiliation with respect to the PO nodes. An internal node having more than one element not zero in its po_label array belongs to more than one PO transitive cluster, and it's said to be multiple dominated. If node n belongs to the transitive clusters of P[O.sub.1], P[O.sub.2] and P[O.sub.3], as an example, than po_label(1) = po_label(2) = po_label(3) = 1. Such nodes are defining sub-cluster(1,2,3) as the intersection of the three mentioned clusters. Mapping for k-LUT is made over homogenous dominated clusters. It means that nodes dominated only by PO z, or by PO z and PO t, as an example, will be mapped in separate mapping sessions. This strategy separates nodes having fan-outs in more than one single output cluster avoiding interactions during mappings in different primary output clusters. Additionally, clusters with multiple domination identification are making simpler the task of mapping for critical performance.

6. EXPERIMENTAL RESULTS

Implemented cluster partitioning algorithm working with minLevelMapIVv3 (cluster mapping tool) was tested against minDepth, our reference mapping tool, used without cluster partitioning. Results are presented in Tab. 1. Circuits, listed in Tab. 1, are taken from MCNC91 multilevel examples benchmark; are selected the most representative ones (used in mostly similar works). Cluster partitioning procedure was designed first to minimize the critical path delay. This was implemented by merging those clusters containing nodes belonging to the critical path but having enough slack delay in order to avoid introducing other costs to the partitioning objective.

Actual algorithm is computing all clusters Clusters(n) rooted on each internal node n of the network and having less inputs than k (k = number of input lines of each LUT in FPGA) practically. Comparing results, in Tab. 1, for minDepth and minLevelMapIVv3 it is obvious that almost all results are a little bit less competitive for minLevelMapIVv3 with cluster partitioning. However, these results are better than those previously reported in (Bucur, 2006), because were introduced several new and powerful heuristics. These heuristics are working progressively and are searching best clusters based on distance among clusters having the same slack delay.

Results obtained, using actual algorithm minLevelMapIVv3 with cluster partitioning, are on average only 0.7% larger than those computed with minDepth (no cluster partitioning). Close examination of circuits are showing that several nodes are still avoidable duplicated. LUT count values are pointing out that, keeping the same depth of the homologues circuits, the area is mostly the same in the partitioned mapped circuit and in the not-partitioned mapped one. However, actual model is using the simplest model of the depth (level) and no references are made to the real interconnections and to their delay (Bucur et al., 2008). Implemented algorithms are mainly oriented to find best performance solution, using the depth model.

7. CONCLUSION AND FUTURE WORK

Actual heuristic guided cluster-based partitioning approach is showing improvements, both in terms of the cut size and in the run time, compared with initial approach (partitioning the given circuit before mapping).

Since wholly automatic partitioning is essential for fast iterations in the design and optimization steps, substantial study is made in university circles as well as in industry to make possible and get better the not so easy decisions on functional stage.

Both mapping algorithms are, actually, under research and progress in order to be able to allow a variety of delay models simultaneously with new mapping routines in order to gain also enhanced area results.

Cluster partitioning algorithm will be enhanced with better fast cost estimators (comparing with the actual distance). This will improve efficiency mainly on non-critical path cones process.

8. REFERENCES

Brasen, D. R. & Saucier, G. (1998). Using cone structures for circuit partitioning into FPGA packages. IEEE Trans. on Computer-Aided Design of Integrated Circuits and Systems, Vol. 17, No7, pp. 592-600, ISSN: 0278-0070

Bucur, I. (2006). Partitioning combinational circuits for k-LUT FPGA mapping. University Politehnica of Bucharest Scientific Bulletin, Series C: Electrical Engineering, Vol. 68, No. 2, (2006),pp. 91-100, ISSN: 1454-234x

Bucur, I.; Fagarasan, I.; Popescu, C.; Boiangiu, C.-A. & Culea, G. (2008). On K-LUT Based FPGA Optimum Delay and Optimal Area Mapping. Proc. of the 10th WSEAS Intnl. Conf. on Mathematical and Computational Methods in Science and Engineering, pts I and II, Iliescu M et al. (Eds), pp.137-142,ISBN:978-960-474-019-2,Bucharest,Nov.2008, World Scientific and Engineering ACAD and SOC, Athens

De Micheli, G. (1994). Synthesis and Optimization of Digital Circuits, McGraw-Hill, Inc., ISBN 0-07-016333-2, USA

Hachtel, G. D. & Somenzi, F. (1996). Logic Synthesis and Verification Algorithms, Springer Science+Business Media, Inc., ISBN 978--0-387-31004-6, USA

Kwon, Y.-S. & Kyung, C.-M. (2004). Scheduling driven circuit partitioning algorithm for multiple FPGAs using time-multiplexed, off-chip, multi-casting interconnection architecture. Microprocessors and Microsystems Vol. 28, Issues 5-6, 2 August 2004, pp. 341-350, ISSN: 0141-9331

Muthukumar, V. & Selvaraj, H. (2003). Comparison of Heuristic Algorithms for Variable Partitioning in Circuit Implementation. Proceedings of 16th International Conference on VLSI Design, pp. 51-57, New Delhi, India, January 04-08 2003, IEEE Computer Science, USA NJ

*** (1994). SIS-1.2, Synthesis of both synchronous and asynchronous sequential circuits, Available from: http://embedded.eecs.berkeley.edu/pubs/downloads/sis/inde x.htm, Accessed on: 2009-07-12

Tab. 1. Comparative experimental results

Circuit Depth minDepth minLevelMapIVv3
 LUT count LUT count

9sym 5 7 8
C499 4 67 68
C5315 8 500 501
C880 7 130 133
alu2 5 129 130
alu4 5 549 550
apex2 5 150 152
apex4 5 875 881
apex6 4 222 222
apex7 4 67 71
b9 3 37 39
clip 3 44 45
count 3 74 74
des 5 1014 1020
duke2 4 151 152
e64 3 338 339
f51m 3 51 51
rd84 3 13 13
rot 6 204 206
sao2 4 57 57
vg2 3 35 35
 4714 4747