
# Introduction

### Overview

DeepCircuitX is a holistic, repository-level dataset curated to address limitations in existing datasets. It provides data and annotations across multiple levels:

* **Chip Level**: 109 repositories, 5,508 files.
* **IP Level**: 225 repositories, 12,961 files.
* **Module Level**: 2,383 repositories, 38,692 files.
* **RISC-V**: 2,078 repositories, 98,450 files.

**Key Features:**

* Multi-level source RTL code: repository, file, module, and block.
* Multi-level annotations by GPT-4o: repository, file, module, and block.
* Includes synthesized netlists, PPA metrics, and layout designs.
* Benchmarks for RTL understanding, generation, and completion.

<figure><img src="/files/MKf7Yz20L1DtwIjpFOPY" alt=""><figcaption><p>Pipeline overview of the proposed framework, illustrating the key stages: data collection from GitHub using keywords, data annotation via chain-of-thought (CoT), circuit transformation, and evaluation, including RTL code tasks for LLMs and PPA prediction.</p></figcaption></figure>

#### Table 1: Dataset Summary of DeepCircuitX

| Level      | Functional Categories | Number of Repositories | Number of RTL Files |
| ---------- | --------------------- | ---------------------- | ------------------- |
| **Chip**   | 17                    | 109                    | 5,508               |
| **IP**     | 3                     | 225                    | 12,961              |
| **Module** | 57                    | 2,383                  | 38,692              |
| **RISC-V** | -                     | 2,078                  | 98,450              |

This table summarizes the number of functional categories, repositories, and RTL files across different levels of DeepCircuitX, including Chip, IP, Module, and RISC-V levels.
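As a quick sanity check on Table 1, the per-level counts can be aggregated with a few lines of Python. This is only a sketch: the dictionary layout and the function names (`totals`, `files_per_repository`) are illustrative and not part of any DeepCircuitX tooling, and summing across levels assumes that the levels do not share repositories, which the table itself does not state.

```python
# Counts copied from Table 1. The dictionary layout is illustrative only;
# it is not a format shipped with the dataset.
TABLE_1 = {
    "Chip": {"repositories": 109, "rtl_files": 5_508},
    "IP": {"repositories": 225, "rtl_files": 12_961},
    "Module": {"repositories": 2_383, "rtl_files": 38_692},
    "RISC-V": {"repositories": 2_078, "rtl_files": 98_450},
}


def totals(table):
    """Sum repositories and RTL files over all levels.

    Assumption: levels are disjoint, so plain addition does not double count.
    """
    return {
        key: sum(row[key] for row in table.values())
        for key in ("repositories", "rtl_files")
    }


def files_per_repository(table):
    """Average number of RTL files per repository for each level."""
    return {
        level: row["rtl_files"] / row["repositories"]
        for level, row in table.items()
    }


print(totals(TABLE_1))
for level, ratio in files_per_repository(TABLE_1).items():
    print(f"{level}: {ratio:.1f} RTL files per repository")
```

Under the disjointness assumption, the levels add up to 4,795 repositories and 155,611 RTL files.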

#### Table 2: Overview of Annotations in DeepCircuitX

| RTL Category | Module-Level Annotations | Block-Level Annotations | Repository-Level Annotations |
| ------------ | ------------------------ | ----------------------- | ---------------------------- |
| **Chip**     | 5,471                    | 36,955                  | 84                           |
| **IP**       | 12,863                   | 20,101                  | 183                          |
| **Module**   | 28,901                   | -                       | 1,389                        |
| **RISC-V**   | 2,116                    | -                       | 560                          |

The table illustrates the number of annotations at the module, block, and repository levels for various RTL categories.

#### Table 3: Dataset Counts for RTL Code Tasks

<table><thead><tr><th width="163">Task</th><th width="107">IP</th><th width="143">Module</th><th width="138">RISC-V</th><th>Chip</th><th>Total</th></tr></thead><tbody><tr><td><strong>RTL Code Understanding</strong></td><td>6,386</td><td>14,499</td><td>1,348</td><td>3,922</td><td>26,155</td></tr><tr><td><strong>RTL Code Completion</strong></td><td>6,178</td><td>14,131</td><td>1,312</td><td>3,822</td><td>25,443</td></tr><tr><td><strong>RTL Code Generation</strong></td><td>6,479</td><td>16,511</td><td>1,393</td><td>3,950</td><td>28,333</td></tr></tbody></table>

This table displays the data distribution for code understanding, completion, and generation tasks across different RTL categories.
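The Total column of Table 3 is the sum of the four category columns for each task. A minimal Python sketch of that arithmetic follows; the names `TASK_COUNTS`, `REPORTED_TOTALS`, and `row_totals` are hypothetical and chosen for illustration, and the numbers are copied directly from the table.

```python
# Per-category sample counts copied from Table 3.
TASK_COUNTS = {
    "RTL Code Understanding": {"IP": 6_386, "Module": 14_499, "RISC-V": 1_348, "Chip": 3_922},
    "RTL Code Completion": {"IP": 6_178, "Module": 14_131, "RISC-V": 1_312, "Chip": 3_822},
    "RTL Code Generation": {"IP": 6_479, "Module": 16_511, "RISC-V": 1_393, "Chip": 3_950},
}

# Values of the "Total" column as reported in Table 3.
REPORTED_TOTALS = {
    "RTL Code Understanding": 26_155,
    "RTL Code Completion": 25_443,
    "RTL Code Generation": 28_333,
}


def row_totals(counts):
    """Recompute each task's total by summing its per-category counts."""
    return {task: sum(per_category.values()) for task, per_category in counts.items()}


computed = row_totals(TASK_COUNTS)
for task, total in computed.items():
    status = "matches" if total == REPORTED_TOTALS[task] else "differs from"
    print(f"{task}: {total} ({status} the reported total)")
```

The recomputed sums agree with the reported totals for all three tasks.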
