RTL code annotations by GPT

To construct the RTL-language dataset, we organize the data into four distinct levels: repository, file, module, and block. A detailed example is shown in the figure.

We employ a Chain of Thought (CoT) approach for RTL code annotation, leveraging GPT-4 and Claude to generate detailed comments, descriptions, and question-answer pairs.

| RTL Category | Module-Level Annotations | Block-Level Annotations | Repository-Level Annotations |
| --- | --- | --- | --- |
| Chip | 5,471 | 36,955 | 84 |
| IP | 12,863 | 20,101 | 183 |
| Module | 28,901 | - | 1,389 |
| RISC-V | 2,116 | - | 560 |

The table illustrates the number of annotations at the module, block, and repository levels for various RTL categories.
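To make the relative sizes of the annotation levels easier to compare, the short sketch below sums the counts from the table per level. The counts are copied from the table. The dictionary layout and the function name `total_per_level` are illustrative choices and are not part of the dataset tooling. Cells shown as "-" are treated as "no count reported" and skipped.

```python
# Annotation counts copied from the table above.
# None marks a "-" cell, i.e. no count is reported for that level.
ANNOTATION_COUNTS = {
    "Chip":   {"module": 5471,  "block": 36955, "repository": 84},
    "IP":     {"module": 12863, "block": 20101, "repository": 183},
    "Module": {"module": 28901, "block": None,  "repository": 1389},
    "RISC-V": {"module": 2116,  "block": None,  "repository": 560},
}


def total_per_level(counts):
    """Sum the reported counts for each annotation level, skipping missing cells."""
    totals = {}
    for per_level in counts.values():
        for level, n in per_level.items():
            if n is not None:
                totals[level] = totals.get(level, 0) + n
    return totals


if __name__ == "__main__":
    for level, total in total_per_level(ANNOTATION_COUNTS).items():
        print(f"{level}: {total:,}")
```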

Annotation download URL:

All the RTL code and corresponding different level annotations

If you run into problems with this link (for example, the archive cannot be downloaded or unzipped, or the link is no longer valid), try this alternative link: https://huggingface.co/datasets/zeju-0727/DeepCirCuitX_Dataset

Annotation test case download URL:

One complete case of our annotation data (with RTL code)

An example of our data structure:

chip/Communications_Processor/Design-of-reduced-latency-and-increased-throughput-Polar-Decoder

The structure of the annotated Verilog code in the 'Design-of-reduced-latency-and-increased-throughput-Polar-Decoder' project is as follows.

The design_files folder contains individual Verilog files, with pe_1 serving as an example. Each module's source code (e.g., pe_1.v) is accompanied by various annotation files, such as intermediate comments, specifications, and a textual description (pe_1.txt).

These annotations are organized into subdirectories like intermediate_comment and spec. This structure enables detailed documentation and analysis of the Verilog code for various modules across the project.
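The layout described above can be traversed with a few lines of Python. This is a minimal sketch under stated assumptions, not the project's own tooling. It assumes that `pe_1.v` and `pe_1.txt` sit directly in `design_files`, that `intermediate_comment` and `spec` are subdirectories of `design_files`, and that annotation files inside them are named `<module>.<extension>`. The helper name `collect_module_annotations` is hypothetical.

```python
from pathlib import Path

# Annotation subdirectories named in the text above. Their location directly
# under design_files is an assumption of this sketch.
ANNOTATION_SUBDIRS = ("intermediate_comment", "spec")


def collect_module_annotations(design_dir, module_name):
    """Gather the source file and annotation files for one module.

    Assumed (hypothetical) layout:
        design_files/<module>.v                        Verilog source
        design_files/<module>.txt                      textual description
        design_files/intermediate_comment/<module>.*   intermediate comments
        design_files/spec/<module>.*                   specifications
    """
    design_dir = Path(design_dir)
    source = design_dir / f"{module_name}.v"
    description = design_dir / f"{module_name}.txt"
    record = {
        "source": source if source.is_file() else None,
        "description": description if description.is_file() else None,
    }
    for sub in ANNOTATION_SUBDIRS:
        sub_dir = design_dir / sub
        if sub_dir.is_dir():
            # "<module>.*" matches pe_1.txt but not pe_10.txt
            record[sub] = sorted(
                p for p in sub_dir.glob(f"{module_name}.*") if p.is_file()
            )
        else:
            record[sub] = []
    return record
```

For example, `collect_module_annotations("design_files", "pe_1")` returns a dictionary with the keys `source`, `description`, `intermediate_comment`, and `spec`. Files that are missing show up as `None` or as an empty list, so a mismatch between the assumed and the actual layout is visible rather than silently ignored.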

Illustration of the dataset repository structure with multi-level annotations
