



# **Delay Optimized Matrix Multiplication using 3:2 Compressors for Signal Processing Applications**

S.Subathradevi<sup>1</sup> | C.Vennila<sup>2</sup> | S.Balavinayagam<sup>3</sup> | M.Bhagavan<sup>4</sup> | V.Muruganantham<sup>5</sup> | M.Pradeep<sup>6</sup>

<sup>1</sup>Assistant Professor, Department of ECE, Anna University-BIT Campus, Trichy. <sup>2</sup>Professor, Department of ECE, Saranathan College of Engineering, Trichy. <sup>3-6</sup>UG Scholar, Department of ECE, Anna University-BIT Campus, Trichy.

## To Cite this Article

S.Subathradevi, C.Vennila, S.Balavinayagam, M.Bhagavan, V.Muruganantham and M.Pradeep, "Delay Optimized Matrix Multiplication using 3:2 Compressors for Signal Processing Applications", *International Journal for Modern Trends in Science and Technology*, Vol. 03, Issue 04, April 2017, pp. 146-149.

# ABSTRACT

Matrix multiplication is a binary operation that produces a product matrix from two matrices. In this work, a normal Matrix Multiplication was implemented using a 3:2 compressors for the realization of 3-bit addition as a technique to result lower delay. And also proposed that method as a modified method from conventional matrix multiplication technique to result in delay efficient Matrix Multiplication. By using MAC operation the computation time of Matrix Multiplication is reduced and at the same time delay has got reduced. Finally High speed and less computation time was achieved for Matrix Multiplication using VERILOG HDL in Xilinx ISE (version14.5) to obtain the expected results and it was synthesized using the tool XST and the simulation was being done using ISim. To identify the high speed in terms of number of computation and path delay, the veilog code was implemented on FPGA by having the target device as Spartan 3E. This work shows the increased in speed by computing the frequency it was realized.

KEYWORDS: Matrix Multiplication, VERILOG HDL code, 3:2 compressor, FPGA.

Copyright © 2017 International Journal for Modern Trends in Science and Technology All rights reserved.

# I. INTRODUCTION

Integration (VLSI) technology, the real time realization with Signal Processing with less hardware requirement, less power consumption and less latency has become more and more important nowadays to use the signal processing system design in real time applications.

For any system design, the arithmetic units are much important to make it as a successful design. In arithmetic unit adders took much more contribution to play a role since subtraction and multiplication all computed by addition and subtraction computation. So, if the speed is achieved by minimizing the delay in the architecture of adder and multiplier surely it leads to better performance in the speed of the design in the system. So, various multiplier based matrix multiplication architectures are realized and their delay is compared.

A combinational multiplier is a good example of how simple logic functions (gates, half adders and full adders) can be combined to construct a much more complex function. In particular, it is possible to construct a 4x4 combinational multiplier from an array of AND gates, half-adders and full-adders, taking what was learned recently and extending it to a more complex circuit. The purpose of this work is to introduce how a relatively complex arithmetic function, such as binary multiplication, can be realized using simple logic building blocks. It is useful to consider how binary multiplication can be performed. S.Subathradevi, C.Vennila, S.Balavinayagam, M.Bhagavan, V.Muruganantham and M.Pradeep : Delay Optimized Matrix Multiplication using 3:2 Compressors for Signal Processing Applications



Fig.1 Four bit binary array multiplier.

In order to achieve the high speed and low power demand in DSP applications, parallel array multipliers are widely used. In DSP applications, most of the power is consumed by the multipliers. Hence low power multipliers must be designed in order to reduce the power dissipation in DSP applications. IO path delay specifies the cell delay. Interconnect delay specifies the point to point delay



Fig.2. Partial Products and Interconnect path length

The flow of the paper is organized in the following way as, Section - I gives about the Introduction. Section - II gives out the details about Literature Survey on the matrix multiplication in delay reduction and Section –III gives out the proposed work and Section IV- narrates experimental results and discussion and Sections says the conclusion.

#### **II. LITERATURE SURVEY**

Various delay targeted architecture of Matrix multiplication with their sub-modules in structural form is being surveyed, implemented, compared and their contribution was viewed in connection with delay. These architectures was briefed in various literature papers and they are listed with their work related to the proposed work.

In paper [1], they explained the Reconfigurable architecture of Matrix Multiplication of VLIW processor and to analyse the operation characteristics and implementation of symmetric algorithm. Whereas in paper [2], they explained parellization of sparse Matrix vector and Matrix transpose vector multiplication on many core processors. And identify the five quality criteria for finding the local aware1D matrix.

In paper [3], they explained Blocking and scheduling of Floating point Matrix Multiplication emphasing the role of On-chip Memory.

In paper [4], they proposes the Matrix Multiplication includes many Algorithms. This paper reveals the PARELLEL STRASSEN Algorithm becomes more efficient for Matrix Multiplication.

In paper [5] the authors explained Polynomial Matrix Multiplication computed by Novel reconfigurable Architecture. The hardware implementation was achieved by FPGA. The fast convolution technique to MIMO method is the main algorithm.

In paper [6], they proposes the Medical Sensor application and low energy classification systems. This system instrumentation is followed by ADC Conversion .System mapped to the MMADC as an ECG based cardiac arrhythmia

In paper [7],they explained Sparse Matrix multiplication on an associative processor enables high level of parallelism where a row of one Matrix Multiplied in Parallel with the entire Matrix vector dot product does not depend on vector single.

#### III. PROPOSED ARCHITECTURE

The conventional architecture of 3:2 compressor shown in fig.2 has two XOR gates in its critical path. The sum is generated by the second XOR and carry output is generated by the multiplexer. The equations governing the conventional 3:2 compressor output is shown in fig.4.



The 3-2 compressor architecture shown in fig. 3 has less delay as compared to other architectures, as some of the xor circuits are replaced the multiplexer circuits. In this compressor the select bt=it at multiplexer present before the input arrives, so reduces the delay.

S.Subathradevi, C.Vennila, S.Balavinayagam, M.Bhagavan, V.Muruganantham and M.Pradeep : Delay Optimized Matrix Multiplication using 3:2 Compressors for Signal Processing Applications



Fig.4. 3:2 compressor and modified 3:2 compressor

Thus the switching time of the transistors in the critical path is decreased. This minimizes the delay to a significant amount. This architecture shows a critical path delay of xor+mux. Output functions of modified 3:2 compressors are shown by the equations in fig.5.

 $Sum = X1 \oplus X2 \oplus X3$ 

 $Carry = (X1 \oplus X2) \bullet X3 + \overline{(X1 \oplus X2)} \bullet X1$ 

Fig.5. rearranged expression for 3:2 compressor

So, these two compressors are used in the 3X3 matrix multiplication as the sub-module and it was implemented in this proposed work to achieve the target parameter as delay reduction in trade of with area by a significant amount.

#### IV. RESULTS AND DISCUSSION

In this proposed modified method of matrix multiplication, the delay has got reduced than the conventional method of matrix multiplication. In the conventional method the array multiplier and ripple carry adder was used for matrix multiplication. In the modified method the 3:2 compressor and the modified 3:2 compressors are used in multipliers and adders to obtain the delay efficient architecture for matrix multiplication than the conventional method.

| Property Name                | Value               |   |
|------------------------------|---------------------|---|
| Top-Level Source Type        | HDL                 | ~ |
| Evaluation Development Board | None Specified      | ~ |
| Product Category             | All                 | ~ |
| Family                       | Spartan3E           | ~ |
| Device                       | XC3S100E            | ~ |
| Package                      | VQ100               | ~ |
| Speed                        | -4                  | ~ |
| Synthesis Tool               | XST (VHDL/Verilog)  | ~ |
| Simulator                    | ISim (VHDL/Verilog) | ~ |

#### Fig.5 Target device

Verilog code is used for modeling in the proposed modified method which leads to the optimized delay in trade of with area.

And the implementation is done on FPGA. Thus the target parameters Low delay, High speed is obtained.

By doing analysis on survey of various literatures, it was concluded that the reduced delay can be obtained for matrix multiplication. Due to this architecture reduced delav will find applications in high speed system design like system on-chip and Network on chip along with signal processing applications. Also found applications in high speed system design in making decision in the medical field by analyzing various parameters which are related. In future these architectures may be implemented using FPGA can be done as in the form of micro chip.

These architectures of matrix multiplication for 4-bit and 8-bit was implemented on the FPGA using Spartan 3E with the tool Xilinx.



Fig. 7. Simulation output for first architecture with 4-bit

| File Edit View Simu       | lation Window L        | evout Help |            | ISim (P.         | ISim (P.581) - [Detault.wcfg] |                                     |                           |  |  |
|---------------------------|------------------------|------------|------------|------------------|-------------------------------|-------------------------------------|---------------------------|--|--|
|                           |                        |            |            |                  | / 🗟 🖻                         | 2 1 1 1 1 <b>0</b> + 1 <sup>2</sup> | LODus 🔽 🕼 🔲 📿 Re-launch   |  |  |
| nstances and Pr. + D & X  | Objects                |            | *08×       | 1                |                               |                                     |                           |  |  |
| Instance and Process Name | Simulation Objects for |            |            | Anne Name        | Value                         | 1,999,995 ps 1,999,996 ps           | 1,999,997,05 1,999,998,05 |  |  |
| Instance and Process Name | Object Name            | Value      | Data Typ A | 🔎 🕨 📲 (21() S () | 1000000110                    |                                     | 000000110010110           |  |  |
| gibl                      | b. 24 a112.0           | 01010101   | ACTIN      |                  | 11000001001                   |                                     | 1 200000 2001111120       |  |  |
| C 995                     | a127d                  | 10000100   | han h      | A 12 (2305.0)    | 0010111110                    |                                     | 0101111011110             |  |  |
|                           | b a1377.0              | 11100001   | ACTIN      | 0 . 4 01050      | 1010100001                    |                                     | 01010000101010100         |  |  |
|                           | A2170                  | 11111100   | Actay      | 12 1 14 (33150)  | 1111101100                    |                                     | 111101100000100           |  |  |
|                           | b 20 a2207.01          | 01000010   | Actav      |                  |                               |                                     |                           |  |  |
|                           | 5 A2307.0              | 11000010   | Acces      | 2 🕨 📲 (3)(5.0)   | 0011111000                    |                                     | 0011111000000300          |  |  |
|                           | b a31[7:0]             | 00000111   | Acray      | 🛊 🕨 🖬 y112154    | 01100001001                   |                                     | 1100001001001             |  |  |
|                           | b a32[7.0]             | 10010010   | Acray      | ► 🖬 y12[15:d]    | 11110110104                   |                                     | \$113012030001120         |  |  |
|                           | b 👬 a33[7:5]           | 11111100   | Array      | I > 14 y13(15.0) | 1110310011                    |                                     | 110010011011101           |  |  |
|                           | 5 💑 611[7/0]           | 11100010   | Acray      | 1 N 121/150      | 1010000000                    |                                     | 0 1000000000000           |  |  |
|                           | b 12(7:0)              | 01000010   | Actay      |                  |                               |                                     |                           |  |  |
|                           | b 💑 013(7/0)           | 10101010   | Actay      | 🗄 🕨 🍟 y22[15.0]  | 0003381031                    |                                     | 0000001001113001          |  |  |
|                           | b 21(7:0)              | 11111000   | Actay      | 21 🕨 📲 y23(15/3  | 00010501104                   |                                     | 0001000110011100          |  |  |
|                           | 0 022(7.0)             | 00000001   | Acray      | 🚞 🕨 📢 y31(15-0)  | 0011101111                    |                                     | Q011301111130010          |  |  |
|                           | b 23(7:0)              | 11100011   | Actay      | ► ₩ y32[15:0]    | 1111110101                    |                                     | 11111001010010010         |  |  |
|                           | b 🛐 b31(7)0            | 10101011   | Acray      | ► 📢 y33(15:0)    | 11000100000                   |                                     | 12000 10000 100000        |  |  |
|                           | D 32[7:0]              | 11111111   | Acray      | P 10 922(351     | 1100010000                    |                                     | 100100010000              |  |  |
|                           | D - 0 833(7/0)         | 00111111   | Acray      |                  |                               |                                     |                           |  |  |
|                           | b antipol              |            | Acray      |                  |                               | X1/ 2,000,000 ps                    |                           |  |  |
|                           | b cin2(0:0)            | 1          | Actay      |                  |                               | ALL BITTERITE DE                    |                           |  |  |



|   | 8-bit                           |        |            |                  |       |  |  |  |  |
|---|---------------------------------|--------|------------|------------------|-------|--|--|--|--|
| 1 | TYPE OF THE<br>ARCHITCTURE      | D      | ELAY<br>ns | FREQUENCY<br>MHz |       |  |  |  |  |
|   |                                 | 4 bit  | 8-bit      | 4-bit            | 8-bit |  |  |  |  |
|   | Normal MM                       | 28.388 | 54.509     | 35.22            | 18.34 |  |  |  |  |
| : | 3:2 compressor/<br>Improved 3:2 | 28.107 | 54.104     | 35.57            | 18.48 |  |  |  |  |

Fig. 8. Synthesis Report

S.Subathradevi, C.Vennila, S.Balavinayagam, M.Bhagavan, V.Muruganantham and M.Pradeep : Delay Optimized Matrix Multiplication using 3:2 Compressors for Signal Processing Applications

| TYPES OF<br>ARCHITECTU<br>RE                                  | NUMBER OF SLICES<br>OUT OF<br>4656<br>(AREA) |       | OUT   | ER OF LUTs<br>OF 9312<br>AREA) | DELAY in us |        |  |
|---------------------------------------------------------------|----------------------------------------------|-------|-------|--------------------------------|-------------|--------|--|
|                                                               | 4-bit                                        | 8-bit | 4-bit | 8-bit                          | 4-bit       | 8-bit  |  |
| Normal Matrix<br>Multiplication                               | 68                                           | 2272  | 119   | 3951                           | 28.388      | 54.509 |  |
| 3:2 compressor<br>/modified using<br>matrix<br>multiplication | 600                                          | 2245  | 1071  | 4005                           | 28.107      | 54.104 |  |

Fig. 9. Synthesis Report



By doing this analysis on survey of various literatures, it was concluded that the 3:2 compressor using Matrix Multiplication is produced the more delay efficient than the previous method as an average of 5% for having the data size as 4-bits and 8-bits. Due to reduced delay this architecture will find applications in high speed system design.

### **V. CONCLUSION**

In this proposed work, a conventional 3X3 Matrix Multiplication was implemented using a 3:2 compressors for the realization of 3-bit addition as a technique to result lower delay in the product matrix. Also as a conventional method a matrix multiplication was implemented using binary array multiplier and ripple carry adder. By using the computation time of the MAC operation, computation time of the Matrix Multiplication is reduced and at the same time delay also has got reduced. Finally High speed and less computation time was achieved for Matrix Multiplication while using 3:2 compressors using VERILOG HDL code in Xilinx ISE (version14.5).

The metrics are obtained as a expected results and it was synthesized using the tool XST and the simulation was being done using ISim. From the metics the high speed was achieved in terms of number of computation and path delay, the veilog code was implemented on FPGA by having the target device as Spartan 3E. This work shows the increased in speed as 4% as an average (for 4-bits and 8-bits), by computing the frequency it was realized. Matrix multiplication is a binary operation that produces a product matrix from two matrices which is much useful in signal processing applications.

#### REFERENCES

- [1] Ahmad Khayyat, *Member*, Naraig Manjikian , "Analysis of Blocking and Scheduling for FPGA-Based Floating-Point Matrix Multiplication," Canadian Journal of ELECTRICAL AND COMPUTER ENGINEERING, VOL. 37, NO. 2, Spring 2014
- [2] Zhuo Wang, Jintao Zhang, Naveen Verma, " Realizing Low-Energy Classification Systemsby Implementing Matrix Multiplication Directly Within an ADC", IEEE Transaction on Biomedical Circuits and Systems 1.
- [3] L. Yavits, A. Morad, R. Ginosar ,"Sparse Matrix Multiplication On Anassociative Processor", IEEE Transactions on Parallel and Distributed Systems
- [4] Yang su,Xiaoyuan Yang,Yuechuan Wei, "Research and design of Reconfigurable Matrix Multiplication over Finite field in VLIW processor"
- [5] M. Ozan Karsavuran, Kadir Akbudak, and Cevdet Aykanat, "Locality-Aware Parallel Sparse Matrix-Vector and Matrix-Transpose-Vector Multiplication on Many-Core Processors" IEEE Transactions on Parallel and Distributed Systems ,Vol. 6, No. 1, January 2007.
- [6] Soydan Redif, Server Kasap, "Novel Reconfigurable Hardware Architecture for Polynomial Matrix Multiplications" IEEE Transaction on Very Large Scale Integration (VLSI) systems.
- [7] Khaled Thabet, Sumaia AL-Ghuribi"Matrix Multiplication Algorithms", JJCSNS International Journal of Computer Science and Network Security, VOL.12 No.2, February 2012Manuscript received February 5, 2012Manuscript revised February 20, 2012
- [8] Jorge tonfat, Ricardo Reis, Low power 3:2 compressor and 4:2 adder compressor implemented using ASTRAN, Grupo de microelectronica.

asuaisz

