DeepCircuitX is a holistic, repository-level dataset curated to address limitations in existing datasets. It provides data and annotations across multiple levels:
Chip Level: 109 repositories, 5,508 files.
IP Level: 225 repositories, 12,961 files.
Module Level: 2,383 repositories, 38,692 files.
RISC-V: 2,078 repositories, 98,450 files.
Key Features:
Multi-level source RTL code: repository, file, module, and block.
Multi-level annotations by GPT-4o: repository, file, module, and block.
Includes synthesized netlists, PPA metrics, and layout designs.
Benchmarks for RTL understanding, generation, and completion.
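The repository, file, module, and block hierarchy listed above can be pictured as a nested record. The following is a minimal sketch only: every class and field name is hypothetical and chosen for illustration, and it does not reflect the dataset's actual on-disk schema.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional

# NOTE: all names below are hypothetical; they only illustrate the
# repository -> file -> module -> block nesting described in the text.


@dataclass
class Block:
    code: str
    annotation: Optional[str] = None


@dataclass
class Module:
    name: str
    code: str
    annotation: Optional[str] = None
    blocks: List[Block] = field(default_factory=list)


@dataclass
class RTLFile:
    path: str
    annotation: Optional[str] = None
    modules: List[Module] = field(default_factory=list)


@dataclass
class Repository:
    name: str
    level: str  # one of "Chip", "IP", "Module", "RISC-V"
    annotation: Optional[str] = None
    files: List[RTLFile] = field(default_factory=list)


def count_annotations(repo: Repository) -> Dict[str, int]:
    """Count the non-empty annotations at each level of one repository."""
    counts = {"repository": 0, "file": 0, "module": 0, "block": 0}
    if repo.annotation:
        counts["repository"] += 1
    for rtl_file in repo.files:
        if rtl_file.annotation:
            counts["file"] += 1
        for module in rtl_file.modules:
            if module.annotation:
                counts["module"] += 1
            counts["block"] += sum(1 for b in module.blocks if b.annotation)
    return counts
```

Keeping the annotation optional at every level lets a record represent categories where some levels are not annotated.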
Pipeline overview of the proposed framework, illustrating the key stages: data collection from GitHub using keywords, data annotation via chain-of-thought (CoT) prompting, circuit transformation, and evaluation, including RTL code tasks for LLMs and PPA prediction.
Table 1: Dataset Summary of DeepCircuitX
| Level | Functional Categories | Number of Repositories | Number of RTL Files |
| --- | --- | --- | --- |
| Chip | 17 | 109 | 5,508 |
| IP | 3 | 225 | 12,961 |
| Module | 57 | 2,383 | 38,692 |
| RISC-V | - | 2,078 | 98,450 |
This table summarizes the number of functional categories, repositories, and RTL files across different levels of DeepCircuitX, including Chip, IP, Module, and RISC-V levels.
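As a quick illustration of the dataset's overall scale, the counts in Table 1 can be summed. This sketch assumes the four levels are disjoint (no repository is counted under two levels), which the source does not state; the variable names are illustrative.

```python
# Counts copied from Table 1: level -> (repositories, RTL files).
TABLE_1 = {
    "Chip": (109, 5_508),
    "IP": (225, 12_961),
    "Module": (2_383, 38_692),
    "RISC-V": (2_078, 98_450),
}

# Totals are only meaningful if the levels do not overlap (an assumption).
total_repos = sum(repos for repos, _ in TABLE_1.values())
total_files = sum(files for _, files in TABLE_1.values())

# Average number of RTL files per repository at each level.
files_per_repo = {
    level: files / repos for level, (repos, files) in TABLE_1.items()
}

print(total_repos, total_files)  # 4795 155611
```

Under that assumption, the four levels together cover 4,795 repositories and 155,611 RTL files.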
Table 2: Overview of Annotations in DeepCircuitX
| RTL Category | Module-Level Annotations | Block-Level Annotations | Repository-Level Annotations |
| --- | --- | --- | --- |
| Chip | 5,471 | 36,955 | 84 |
| IP | 12,863 | 20,101 | 183 |
| Module | 28,901 | - | 1,389 |
| RISC-V | 2,116 | - | 560 |
The table illustrates the number of annotations at the module, block, and repository levels for various RTL categories.
Table 3: Dataset Counts for RTL Code Tasks
| Task | IP | Module | RISC-V | Chip | Total |
| --- | --- | --- | --- | --- | --- |
| RTL Code Understanding | 6,386 | 14,499 | 1,348 | 3,922 | 26,155 |
| RTL Code Completion | 6,178 | 14,131 | 1,312 | 3,822 | 25,443 |
| RTL Code Generation | 6,479 | 16,511 | 1,393 | 3,950 | 28,333 |
This table displays the data distribution for code understanding, completion, and generation tasks across different RTL categories.
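The Total column of Table 3 can be cross-checked against the per-category counts with a few lines of arithmetic. The sketch below copies the numbers from the table; the dictionary layout and helper name are illustrative, not part of the dataset tooling.

```python
# Counts copied from Table 3: task -> per-category counts and reported total.
TABLE_3 = {
    "RTL Code Understanding": {
        "IP": 6_386, "Module": 14_499, "RISC-V": 1_348, "Chip": 3_922,
        "Total": 26_155,
    },
    "RTL Code Completion": {
        "IP": 6_178, "Module": 14_131, "RISC-V": 1_312, "Chip": 3_822,
        "Total": 25_443,
    },
    "RTL Code Generation": {
        "IP": 6_479, "Module": 16_511, "RISC-V": 1_393, "Chip": 3_950,
        "Total": 28_333,
    },
}


def row_sum(row):
    """Sum every category column except the reported Total."""
    return sum(count for name, count in row.items() if name != "Total")


# Each reported total should equal the sum of its four category counts.
mismatches = [
    task for task, row in TABLE_3.items() if row_sum(row) != row["Total"]
]

grand_total = sum(row["Total"] for row in TABLE_3.values())

print(mismatches, grand_total)  # [] 79931
```

All three rows are internally consistent, and the three tasks together comprise 79,931 samples.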