LCM-Team

Introduction

Last updated 3 months ago

Overview

DeepCircuitX is a holistic, repository-level dataset curated to address limitations in existing datasets. It provides data and annotations across multiple levels:

  • Chip Level: 109 repositories, 5,508 files.

  • IP Level: 225 repositories, 12,961 files.

  • Module Level: 2,383 repositories, 38,692 files.

  • RISC-V: 2,078 repositories, 98,450 files.

Key Features:

  • Multi-level source RTL code: repository, file, module, and block.

  • Multi-level annotations by GPT-4o: repository, file, module, and block.

  • Includes synthesized netlists, PPA metrics, and layout designs.

  • Benchmarks for RTL understanding, generation, and completion.

Table 1: Dataset Summary of DeepCircuitX

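| Level  | Functional Categories | Number of Repositories | Number of RTL Files |
|--------|-----------------------|------------------------|---------------------|
| Chip   | 17                    | 109                    | 5,508               |
| IP     | 3                     | 225                    | 12,961              |
| Module | 57                    | 2,383                  | 38,692              |
| RISC-V | -                     | 2,078                  | 98,450              |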

This table summarizes the number of functional categories, repositories, and RTL files across different levels of DeepCircuitX, including Chip, IP, Module, and RISC-V levels.
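
As a quick sanity check on Table 1, the sketch below derives the average number of RTL files per repository at each level, plus the overall totals. The counts are copied from the table; the names `TABLE_1` and `files_per_repo` are illustrative and are not part of the dataset's tooling.

```python
# Repository and RTL-file counts copied from Table 1.
TABLE_1 = {
    "Chip": (109, 5_508),
    "IP": (225, 12_961),
    "Module": (2_383, 38_692),
    "RISC-V": (2_078, 98_450),
}


def files_per_repo(table):
    """Return the average number of RTL files per repository for each level."""
    return {level: files / repos for level, (repos, files) in table.items()}


averages = files_per_repo(TABLE_1)
total_repos = sum(repos for repos, _ in TABLE_1.values())
total_files = sum(files for _, files in TABLE_1.values())

for level, avg in averages.items():
    print(f"{level:<7} {avg:6.1f} files per repository")
print(f"Total   {total_repos:,} repositories, {total_files:,} files")
```

By this measure, Module-level repositories are considerably smaller on average (roughly 16 files each) than Chip, IP, or RISC-V repositories (roughly 47 to 58 files each).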

Table 2: Overview of Annotations in DeepCircuitX

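| RTL Category | Module-Level Annotations | Block-Level Annotations | Repository-Level Annotations |
|--------------|--------------------------|-------------------------|------------------------------|
| Chip         | 5,471                    | 36,955                  | 84                           |
| IP           | 12,863                   | 20,101                  | 183                          |
| Module       | 28,901                   | -                       | 1,389                        |
| RISC-V       | 2,116                    | -                       | 560                          |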

The table illustrates the number of annotations at the module, block, and repository levels for various RTL categories.

Table 3: Dataset Counts for RTL Code Tasks

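| Task                   | IP    | Module | RISC-V | Chip  | Total  |
|------------------------|-------|--------|--------|-------|--------|
| RTL Code Understanding | 6,386 | 14,499 | 1,348  | 3,922 | 26,155 |
| RTL Code Completion    | 6,178 | 14,131 | 1,312  | 3,822 | 25,443 |
| RTL Code Generation    | 6,479 | 16,511 | 1,393  | 3,950 | 28,333 |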

This table displays the data distribution for code understanding, completion, and generation tasks across different RTL categories.
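
The Total column of Table 3 can be cross-checked against the per-category counts. The sketch below does this using only the numbers in the table; `TABLE_3` and `check_totals` are illustrative names, not part of any released script.

```python
# Sample counts per task and RTL category, copied from Table 3.
CATEGORIES = ("IP", "Module", "RISC-V", "Chip")
TABLE_3 = {
    "RTL Code Understanding": ((6_386, 14_499, 1_348, 3_922), 26_155),
    "RTL Code Completion": ((6_178, 14_131, 1_312, 3_822), 25_443),
    "RTL Code Generation": ((6_479, 16_511, 1_393, 3_950), 28_333),
}


def check_totals(table):
    """Return the tasks whose per-category counts do not sum to the Total."""
    return [task for task, (counts, total) in table.items() if sum(counts) != total]


mismatches = check_totals(TABLE_3)
grand_total = sum(total for _, total in TABLE_3.values())

print("mismatched rows:", mismatches)
print("samples across all tasks:", grand_total)
```

An empty mismatch list means every row's Total equals the sum of its IP, Module, RISC-V, and Chip counts.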

Figure: Pipeline overview of the proposed framework, illustrating the key stages: data collection from GitHub using keywords, data annotation via chain-of-thought (CoT) prompting, circuit transformation, and evaluation, which covers RTL code tasks for LLMs and PPA prediction.
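
To make the first stage more concrete, here is a minimal sketch of how keyword-based collection from GitHub might be set up. It only builds query URLs for GitHub's public repository-search endpoint and sends no requests. The page does not state which API, keywords, or filters the authors used, so the keyword list, the `language:Verilog` qualifier, and the helper name `build_search_url` are all assumptions made for illustration.

```python
from urllib.parse import urlencode

# Public GitHub REST endpoint for repository search.
SEARCH_ENDPOINT = "https://api.github.com/search/repositories"


def build_search_url(keyword, language="Verilog", per_page=50, page=1):
    """Build a repository-search URL for one keyword (no request is sent)."""
    # Assumed filter: restrict results to one HDL via the language qualifier.
    query = f"{keyword} language:{language}"
    params = {"q": query, "per_page": per_page, "page": page}
    return f"{SEARCH_ENDPOINT}?{urlencode(params)}"


# Hypothetical keywords, chosen only for illustration.
EXAMPLE_KEYWORDS = ["risc-v core", "uart", "soc"]

urls = [build_search_url(keyword) for keyword in EXAMPLE_KEYWORDS]
for url in urls:
    print(url)
```

A real crawler would additionally need authentication, rate-limit handling, pagination, and license filtering, none of which are described on this page.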