Introduction
High-Level Synthesis (HLS) plays a crucial role in modern hardware design by transforming high-level code into optimized hardware implementations. However, progress in applying machine learning (ML) to HLS optimization has been hindered by a shortage of sufficiently large and diverse datasets.
To bridge this gap, we introduce ForgeHLS, a large-scale, open-source dataset explicitly designed for ML-driven HLS research. ForgeHLS comprises over 400,000 diverse designs generated from 536 kernels spanning a broad range of application domains. Each kernel is augmented with automatically inserted pragmas (loop unrolling, pipelining, array partitioning) and subjected to extensive design space exploration driven by Bayesian optimization.
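To give a flavor of what automated pragma insertion involves, the sketch below inserts an HLS pragma after each loop header in a C kernel. This is a simplified illustration, not the pipeline used to build ForgeHLS: the regex heuristic, the helper name `insert_pragma`, and the example kernel are all hypothetical, though the pragma syntax follows common Vitis HLS conventions.

```python
import re

def insert_pragma(src: str, pragma: str) -> str:
    """Insert '#pragma HLS <pragma>' after each for-loop header.

    Simplified heuristic: only matches single-line loop headers that
    end with '{'. A real tool would parse the AST instead.
    """
    out = []
    for line in src.splitlines():
        out.append(line)
        if re.match(r"\s*for\s*\(.*\)\s*\{\s*$", line):
            indent = line[: len(line) - len(line.lstrip())]
            out.append(f"{indent}    #pragma HLS {pragma}")
    return "\n".join(out)

# Hypothetical example kernel for illustration.
kernel = """void vadd(int a[64], int b[64], int c[64]) {
    for (int i = 0; i < 64; i++) {
        c[i] = a[i] + b[i];
    }
}"""

print(insert_pragma(kernel, "unroll factor=8"))
```

Sweeping the pragma argument (e.g. unroll factors 1, 2, 4, 8) over each loop is what produces the combinatorial design space that the Bayesian optimizer then explores.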
Compared to existing datasets, ForgeHLS substantially improves scale, diversity, and design-space coverage. We further define and evaluate representative downstream tasks, such as Quality of Result (QoR) prediction and automated pragma exploration, demonstrating ForgeHLS's utility for developing and improving ML-based HLS optimization methodologies.
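As a minimal sketch of the QoR prediction task, the snippet below fits a toy latency model as a function of unroll factor using ordinary least squares on a log-log scale. The data points are invented purely for illustration; in practice, the latency/resource reports in ForgeHLS would supply the training labels, and a real model would use far richer features than a single pragma parameter.

```python
import math

# Hypothetical (unroll factor, latency in cycles) pairs for one kernel.
samples = [(1, 640), (2, 330), (4, 175), (8, 96)]

# Ordinary least squares in log2-log2 space: log2(latency) ~ a + b*log2(factor).
xs = [math.log2(f) for f, _ in samples]
ys = [math.log2(lat) for _, lat in samples]
n = len(samples)
mx, my = sum(xs) / n, sum(ys) / n
slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sum(
    (x - mx) ** 2 for x in xs
)
intercept = my - slope * mx

def predict_latency(factor: int) -> float:
    """Predict latency for an unseen unroll factor from the fitted line."""
    return 2 ** (intercept + slope * math.log2(factor))

print(f"predicted latency at factor 16: {predict_latency(16):.0f} cycles")
```

Even this toy surrogate captures the qualitative trend (latency falls as unrolling increases), which is the kind of relationship a QoR predictor trained on ForgeHLS would learn at scale.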
Dataset Comparison Summary
Table 1 provides a feature-level comparison across mainstream HLS datasets, highlighting ForgeHLS’s comprehensive support across both source code and post-HLS annotations. Most existing datasets either lack pragma-annotated source code or do not support graph-based representations essential for modern ML pipelines. In contrast, ForgeHLS offers fully open-source kernels with rich annotations, including pragma directives, latency/resource reports, and IR-level graphs.
Table 2 further details the scale and structure of ForgeHLS. Our dataset spans 536 kernels and over 429,000 HLS designs across both real-world and synthetic domains, significantly surpassing prior datasets in volume and coverage. This diversity provides a robust foundation for training ML models and evaluating generalization across applications.