RTL-Language Data for LLM Finetuning
Example code-completion record for LLM training:
{
"instruction": "Given the detailed specifications of a module, generate the corresponding Verilog code.",
"input": "Module-level description:\nThe dout_monitor module tracks the rising edge of a sample clock signal to increment an address register. Implementation entails two primary blocks: one capturing the previous state of the sample clock using a temporary register, and another using this captured state to detect rising edges and conditionally increment the address, which is reset to 448 on a low reset signal.\nModule head:\n\nmodule dout_monitor(\ninput clk,rst_n,\noutput reg [`ADDRESS_WIDTH-1:0] addr,\ninput [`DATA_WIDTH-1:0] d_ch0,\ninput [`DATA_WIDTH-1:0] d_ch1,\ninput sample_clk\n)",
"output": "`include \"defines.v\"\nmodule dout_monitor(\ninput clk,rst_n,\noutput reg [`ADDRESS_WIDTH-1:0] addr,\ninput [`DATA_WIDTH-1:0] d_ch0,\ninput [`DATA_WIDTH-1:0] d_ch1,\ninput sample_clk\n);\nreg sample_clk_t;\nalways @(posedge clk)\nbegin\n\tsample_clk_t<=sample_clk;\nend\nalways @(posedge clk or negedge rst_n)\nbegin\n\tif (!rst_n) begin\n\t\taddr<=10'd448;\n\tend else begin\n\t\tif (sample_clk_t == 1'b0 && sample_clk == 1'b1)\n\t\tbegin\n\t\t\taddr <= addr + 10'd1;\n\t\tend\n\tend\nend\nendmodule"
}

The code completion dataset download URL:
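Each training record pairs an `instruction` and `input` with a Verilog `output`, as in the example above. Below is a minimal sketch of one way to turn such a record into a supervised-finetuning prompt; the `### Instruction:`/`### Input:`/`### Response:` template is an assumption for illustration, not the dataset authors' official format.

```python
# Hedged sketch: building a finetuning prompt from one dataset record.
# Field names ("instruction", "input", "output") come from the example
# record above; the prompt template itself is a common convention and
# an assumption here.
RECORD = {
    "instruction": "Given the detailed specifications of a module, "
                   "generate the corresponding Verilog code.",
    "input": "Module-level description:\nThe dout_monitor module tracks "
             "the rising edge of a sample clock signal ...",
    "output": "`include \"defines.v\"\nmodule dout_monitor(\n...\nendmodule",
}

def build_prompt(record: dict) -> str:
    """Concatenate instruction and input into a single model prompt."""
    return (
        f"### Instruction:\n{record['instruction']}\n\n"
        f"### Input:\n{record['input']}\n\n"
        f"### Response:\n"
    )

prompt = build_prompt(RECORD)
target = RECORD["output"]  # the Verilog module is the supervision target
```

The same pattern applies to the generation and understanding records, with only the instruction and output fields changing.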
Example code-generation record for LLM training:
The code generation dataset download URL:
Example code-understanding record for LLM training:
The code understanding dataset download URL:
Testing Dataset & Benchmark:
Dataset Counts for RTL Code Tasks

| Task | Counts by RTL category | Total |
| --- | --- | --- |
| RTL Code Understanding | 6,386 / 14,499 / 1,348 / 3,922 | 26,155 |
| RTL Code Completion | 6,178 / 14,131 / 1,312 / 3,822 | 25,443 |
| RTL Code Generation | 6,479 / 16,511 / 1,393 / 3,950 | 28,333 |
This table displays the data distribution for code understanding, completion, and generation tasks across different RTL categories.
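As a quick consistency check, each task's total equals the sum of its four per-category counts. A small sketch using the numbers from the table above:

```python
# The per-task counts from the table above: four category counts plus
# the stated total. The category names are not reproduced here.
COUNTS = {
    "RTL Code Understanding": ([6386, 14499, 1348, 3922], 26155),
    "RTL Code Completion":    ([6178, 14131, 1312, 3822], 25443),
    "RTL Code Generation":    ([6479, 16511, 1393, 3950], 28333),
}

# Each stated total is the sum of its category counts.
totals_consistent = all(sum(cats) == total for cats, total in COUNTS.values())

# Records across all three tasks combined.
grand_total = sum(total for _, total in COUNTS.values())  # 79,931
```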
RTL Code Understanding
This task evaluates the model’s ability to interpret and describe RTL code. Given a module’s RTL code as input, the model generates a detailed yet concise description covering key aspects such as the module’s purpose, input/output signals, internal logic, and overall behavior. This task is crucial for assessing the model’s ability to produce human-readable explanations for code analysis and documentation.
RTL Code Completion
In this task, the model is provided with partial RTL code (typically the module header with input/output ports and parameters). The goal is for the model to complete the code by generating the missing internal logic, control structures, and signal definitions. This task mirrors the autocompletion functionality found in modern code editors and evaluates the model’s ability to infer and generate code from context.
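Completion examples of this shape can be derived from full modules by splitting at the end of the port list. The sketch below splits on the first `");"`, which is an assumption that holds for simple headers like the `dout_monitor` example; real preprocessing would need a proper Verilog parser to handle comments and nested parentheses.

```python
# Hedged sketch: deriving a (header prompt, body target) pair for the
# code-completion task by splitting a full module at the end of its
# port list. Splitting on the first ");" is a simplifying assumption.
FULL_MODULE = """module dout_monitor(
input clk,rst_n,
input sample_clk
);
reg sample_clk_t;
endmodule"""

def split_header_body(src: str) -> tuple[str, str]:
    """Return (module header incl. port list, remaining module body)."""
    head, sep, body = src.partition(");")
    return head + sep, body.lstrip("\n")

header, body = split_header_body(FULL_MODULE)
# header ends at the closing ");" of the port list;
# body contains the internal logic through "endmodule".
```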
RTL Code Generation
In the RTL code generation task, the model is tasked with producing a full implementation of RTL code based on a high-level description and specified input and output parameters. The goal is to generate a fully functional Verilog module that adheres to the provided specifications. This task assesses the model’s ability to translate design requirements into precise RTL implementations, which is critical for automating the hardware design process.
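Assessing generated modules usually relies on simulation or formal equivalence checking; as a much weaker first pass, one can at least verify basic structure. The sketch below is an assumption-laden sanity check, not the benchmark's actual evaluation method.

```python
import re

# Hedged sketch: minimal structural checks on generated Verilog. This is
# far weaker than functional verification (simulation or equivalence
# checking), which this kind of benchmark would typically rely on.
def basic_checks(code: str, expected_name: str) -> bool:
    """True if code declares a module with the expected name and
    terminates with 'endmodule'."""
    decl = re.search(r"\bmodule\s+(\w+)", code)
    return bool(
        decl
        and decl.group(1) == expected_name
        and code.rstrip().endswith("endmodule")
    )

ok = basic_checks("module dout_monitor(\ninput clk\n);\nendmodule",
                  "dout_monitor")
```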