Hi all,
Now let's take a look at the Brent-Kung adder, a fairly advanced prefix-tree adder design, like the Kogge-Stone adder I discussed previously. The Brent-Kung (BK) adder strikes a good balance between area and power cost (where the Kogge-Stone adder falls short) and performance. It has a complex carry tree and inverse carry tree that can be quite challenging to implement generically in VHDL. Here's a look at the overall carry tree for the Brent-Kung adder:
An examination of the structure reveals that it can be divided into a forward tree and an inverse tree. The forward (upper) tree is sparse: stage i only combines at bit positions j where j mod 2**i = 2**i - 1, so the active nodes thin out at successive powers of 2. The inverse tree is offset by 1, beginning from the bottom of the matrix and expanding outward at powers of 2 (so as not to interfere with previously carried bits). The rest of the code is nearly identical to the Kogge-Stone adder. Let's take a look at the VHDL code for the Brent-Kung adder:
VHDL Code:
LIBRARY ieee;
USE ieee.std_logic_1164.all;
USE work.my_funs.all;

ENTITY bk_adder IS
    GENERIC (
        width : INTEGER := 7
    );
    PORT (
        a     : IN  STD_LOGIC_VECTOR(width-1 DOWNTO 0);
        b     : IN  STD_LOGIC_VECTOR(width-1 DOWNTO 0);
        c_in  : IN  STD_LOGIC;
        sum   : OUT STD_LOGIC_VECTOR(width-1 DOWNTO 0);
        c_out : OUT STD_LOGIC
    );
END bk_adder;

ARCHITECTURE behavioral OF bk_adder IS
    CONSTANT nn: INTEGER := clogb2(width);
    CONSTANT inv_nn: INTEGER := clogb2(width+2**(nn-2))-2;
    -- T(stage, bit)(0) holds the generate bit, T(stage, bit)(1) the propagate bit
    TYPE T_type IS ARRAY(nn+inv_nn-1 DOWNTO 0, width-1 DOWNTO 0) OF STD_LOGIC_VECTOR(1 DOWNTO 0);
    SIGNAL T: T_type;
BEGIN

    -- Carry tree with maximum number of stages
    tree_proc: PROCESS(T,a,b,c_in)
    BEGIN
        -- First bit is a full adder
        T(0,0)(0) <= (a(0) AND b(0)) OR (c_in AND (a(0) XOR b(0)));
        T(0,0)(1) <= a(0) XOR b(0) XOR c_in;

        -- Leaves of tree
        FOR j IN width-1 DOWNTO 1 LOOP
            T(0,j)(0) <= a(j) AND b(j); -- Generate bit base
            T(0,j)(1) <= a(j) XOR b(j); -- Propagate bit base
        END LOOP;

        -- Carry tree
        FOR i IN 1 TO nn-1 LOOP
            FOR j IN width-1 DOWNTO 0 LOOP
                IF(j mod 2**i = (2**i)-1) THEN
                    IF((j-2**(i-1)) >= 0) THEN
                        T(i,j)(0) <= (T(i-1,j)(1) AND T(i-1,j-2**(i-1))(0)) OR T(i-1,j)(0); -- G = (P_i and G_i_prev) or G_i
                        T(i,j)(1) <= T(i-1,j)(1) AND T(i-1,j-2**(i-1))(1); -- P = P_i and P_i_prev
                    ELSE
                        T(i,j)(0) <= T(i-1,j)(0); -- G = G_i (since we are at tree's edge, there is no G_i_prev)
                        T(i,j)(1) <= T(i-1,j)(1); -- P = P_i (since we are at tree's edge, there is no P_i_prev)
                    END IF;
                ELSE
                    -- Not an active node at this stage: pass through
                    T(i,j)(0) <= T(i-1,j)(0);
                    T(i,j)(1) <= T(i-1,j)(1);
                END IF;
            END LOOP;
        END LOOP;

        -- Inverse carry tree
        FOR i IN nn+inv_nn DOWNTO nn+1 LOOP
            FOR j IN width-1 DOWNTO 0 LOOP
                IF((j-2**(nn+inv_nn-(i))) mod 2**((nn+inv_nn-(i))+1) = 2**((nn+inv_nn-(i))+1)-1) THEN
                    IF(j >= 2**(nn+inv_nn-i)) THEN
                        T(i-1,j)(0) <= (T(i-2,j)(1) AND T(i-2,j-2**((nn+inv_nn-(i))))(0)) OR T(i-2,j)(0); -- G = (P_i and G_i_prev) or G_i
                        T(i-1,j)(1) <= T(i-2,j)(1) AND T(i-2,j-2**((nn+inv_nn-(i))))(1); -- P = P_i and P_i_prev
                    ELSE
                        T(i-1,j)(0) <= T(i-2,j)(0);
                        T(i-1,j)(1) <= T(i-2,j)(1);
                    END IF;
                ELSE
                    -- Not an active node at this stage: pass through
                    T(i-1,j)(0) <= T(i-2,j)(0);
                    T(i-1,j)(1) <= T(i-2,j)(1);
                END IF;
            END LOOP;
        END LOOP;
    END PROCESS;

    -- Basic summation for carry tree
    sum_proc: PROCESS(T)
    BEGIN
        sum(0) <= T(0,0)(1);
        FOR i IN width-1 DOWNTO 1 LOOP
            sum(i) <= T(0,i)(1) XOR T(nn+inv_nn-1,i-1)(0);
        END LOOP;
    END PROCESS;

    c_out <= T(nn+inv_nn-1,width-1)(0) OR (T(nn+inv_nn-1,width-1)(1) AND T(nn+inv_nn-1,width-2)(0));

END behavioral;
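A quick aside on clogb2: the adder pulls it in from work.my_funs, which isn't shown in this post. Below is a minimal sketch of what that package might look like, assuming clogb2(n) returns the ceiling of log2(n) (so clogb2(6) = 3 and clogb2(8) = 3). That reading is my assumption based on how nn and inv_nn are used to size the tree; the actual package may be written differently.

```vhdl
-- Hypothetical stand-in for the my_funs package referenced by bk_adder.
-- Assumption: clogb2(n) = ceiling(log2(n)); the original package is not shown.
PACKAGE my_funs IS
    FUNCTION clogb2(n : POSITIVE) RETURN NATURAL;
END PACKAGE my_funs;

PACKAGE BODY my_funs IS
    FUNCTION clogb2(n : POSITIVE) RETURN NATURAL IS
        VARIABLE result : NATURAL  := 0;
        VARIABLE span   : POSITIVE := 1;
    BEGIN
        -- Double span until it covers n; the number of doublings is ceil(log2(n))
        WHILE span < n LOOP
            span   := span * 2;
            result := result + 1;
        END LOOP;
        RETURN result;
    END FUNCTION clogb2;
END PACKAGE BODY my_funs;
```

Under that assumption, width = 6 gives nn = 3 and inv_nn = clogb2(6 + 2) - 2 = 1, so T has four rows: the leaves, two forward stages and one inverse stage.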
I created two separate processes: one for the carry tree (forward and inverse) and one for the final summation. The inverse tree can be challenging to understand but functions correctly (however, if you find any bugs, please let me know!). Next, I decided to write the test bench for a particularly tricky configuration of the Brent-Kung adder: operands of width 6. Here's a look at the bench VHDL code:
VHDL Code:
LIBRARY ieee;
USE ieee.std_logic_1164.all;
USE ieee.numeric_std.all;
USE std.textio.all;

ENTITY bk_adder_tb IS
    GENERIC (
        width: INTEGER := 6
    );
END bk_adder_tb;

ARCHITECTURE tb OF bk_adder_tb IS
    SIGNAL t_a: STD_LOGIC_VECTOR(width-1 DOWNTO 0);
    SIGNAL t_b: STD_LOGIC_VECTOR(width-1 DOWNTO 0);
    SIGNAL t_sum: STD_LOGIC_VECTOR(width-1 DOWNTO 0);
    SIGNAL t_c_in: STD_LOGIC := '1';
    SIGNAL t_c_out: STD_LOGIC;

    COMPONENT bk_adder
        GENERIC (
            width: INTEGER := 16
        );
        PORT (
            a     : IN  STD_LOGIC_VECTOR(width-1 DOWNTO 0);
            b     : IN  STD_LOGIC_VECTOR(width-1 DOWNTO 0);
            c_in  : IN  STD_LOGIC;
            sum   : OUT STD_LOGIC_VECTOR(width-1 DOWNTO 0);
            c_out : OUT STD_LOGIC
        );
    END COMPONENT;

    -- Formats a vector as its unsigned decimal value
    FUNCTION to_string(sv: Std_Logic_Vector) return string is
        USE Std.TextIO.all;
        USE ieee.std_logic_textio.all;
        VARIABLE lp: line;
    BEGIN
        write(lp, to_integer(unsigned(sv)));
        RETURN lp.all;
    END;

BEGIN

    U_bk_adder: bk_adder
        GENERIC MAP ( width => width )
        PORT MAP (
            a     => t_a,
            b     => t_b,
            c_in  => t_c_in,
            sum   => t_sum,
            c_out => t_c_out
        );

    -- Input Processes
    inp_prc: PROCESS
        VARIABLE v_a: INTEGER := 0;
        VARIABLE v_b: INTEGER := 2**(width-1);
        VARIABLE v_c_in: INTEGER := 0;
    BEGIN
        FOR i IN 0 TO 2**width LOOP
            v_a := v_a + 1;
            v_b := v_b - i;
            IF(v_b < 0) THEN
                v_b := v_b + 2**width-1;
            END IF;
            t_a <= std_logic_vector(to_unsigned(v_a,width));
            t_b <= std_logic_vector(to_unsigned(v_b,width));
            WAIT FOR 1 ns;
            IF t_c_in = '1' THEN
                v_c_in := 1;
            ELSE
                v_c_in := 0;
            END IF;
            -- t_sum is only width bits wide, so compare against the expected sum modulo 2**width
            ASSERT TO_INTEGER(UNSIGNED(t_sum)) = (v_a + v_b + v_c_in) mod 2**width
                REPORT "Invalid sum! "&to_string(t_sum)&" != "&to_string(std_logic_vector(to_unsigned(v_a + v_b + v_c_in,width)))&"!"&LF&to_string(std_logic_vector(to_unsigned(v_a,width)))&" + "&to_string(std_logic_vector(to_unsigned(v_b,width)))&" + "&to_string(std_logic_vector(to_unsigned(v_c_in,width)));
            WAIT FOR 9 ns;
        END LOOP;
    END PROCESS;

    -- Toggle the carry-in once per test vector
    c_in_process: PROCESS
    BEGIN
        t_c_in <= not(t_c_in);
        WAIT FOR 10 ns;
    END PROCESS;

END tb;
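Since a 6-bit adder has only 2 x 64 x 64 = 8192 input combinations, an exhaustive check is also feasible. Here is a rough sketch of one, not part of the bench above: the entity name bk_adder_exhaustive_tb and the process layout are my own choices for illustration, and it assumes bk_adder is compiled into the work library. It compares the (width+1)-bit result formed from c_out and sum against the integer sum of the operands, so it covers the carry-out as well.

```vhdl
LIBRARY ieee;
USE ieee.std_logic_1164.all;
USE ieee.numeric_std.all;

-- Hypothetical exhaustive bench (entity name chosen for illustration only)
ENTITY bk_adder_exhaustive_tb IS
    GENERIC ( width : INTEGER := 6 );
END bk_adder_exhaustive_tb;

ARCHITECTURE tb OF bk_adder_exhaustive_tb IS
    SIGNAL t_a, t_b, t_sum : STD_LOGIC_VECTOR(width-1 DOWNTO 0);
    SIGNAL t_c_in          : STD_LOGIC := '0';
    SIGNAL t_c_out         : STD_LOGIC;
BEGIN
    -- Direct entity instantiation; assumes bk_adder lives in library work
    dut: ENTITY work.bk_adder
        GENERIC MAP ( width => width )
        PORT MAP ( a => t_a, b => t_b, c_in => t_c_in,
                   sum => t_sum, c_out => t_c_out );

    check_prc: PROCESS
        VARIABLE expected : UNSIGNED(width DOWNTO 0);
        VARIABLE actual   : UNSIGNED(width DOWNTO 0);
    BEGIN
        FOR c IN 0 TO 1 LOOP
            FOR va IN 0 TO 2**width-1 LOOP
                FOR vb IN 0 TO 2**width-1 LOOP
                    t_a <= std_logic_vector(to_unsigned(va, width));
                    t_b <= std_logic_vector(to_unsigned(vb, width));
                    IF c = 1 THEN
                        t_c_in <= '1';
                    ELSE
                        t_c_in <= '0';
                    END IF;
                    WAIT FOR 1 ns; -- let the combinational logic settle
                    expected := to_unsigned(va + vb + c, width+1);
                    actual   := unsigned(t_c_out & t_sum); -- carry-out is the top bit
                    ASSERT actual = expected
                        REPORT "Mismatch: " & INTEGER'IMAGE(va) & " + " &
                               INTEGER'IMAGE(vb) & " + " & INTEGER'IMAGE(c) &
                               " gave " & INTEGER'IMAGE(to_integer(actual))
                        SEVERITY ERROR;
                END LOOP;
            END LOOP;
        END LOOP;
        REPORT "Exhaustive check finished";
        WAIT; -- stop after one full pass
    END PROCESS;
END tb;
```

This approach only scales to small widths; for something like 64-bit operands, directed or random vectors are the practical option.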
And, running the test bench through Modelsim, we can verify proper operation of the design:
For our 6-bit operand BK adder, here's the synthesized logic:
Synthesizing this design with 64-bit operands in Synplify Pro, targeting a Xilinx Virtex-II XC2V40 in the CS144 package at -6 speed grade, yields the following performance data:
Code:
Performance Summary
*******************

Worst slack in design: -1.028

                   Requested     Estimated     Requested     Estimated                Clock        Clock
Starting Clock     Frequency     Frequency     Period        Period        Slack      Type         Group
------------------------------------------------------------------------------------------------------------------------
top|clk            167.6 MHz     143.0 MHz     5.967         6.994         -1.028     inferred     Autoconstr_clkgroup_0
========================================================================================================================
...
Resource Usage Report for bk_adder

Mapping to part: xc2v40cs144-6
Cell usage:
FD              229 uses
MUXF5           1 use
LUT2            52 uses
LUT3            86 uses
LUT4            270 uses

Mapping Summary:
Total LUTs: 408 (79%)

Upon inspection, we can see that this 64-bit Brent-Kung adder runs slightly slower (due to the greater number of stages) but occupies about half the area and uses significantly less routing than the 64-bit Kogge-Stone adder previously discussed.
Take care!
VHDLCoder


