LLM Fine-Tuning Results
The experimental results, shown in Table VI, demonstrate the performance of various base Large Language Models (LLMs) and their fine-tuned counterparts on our RTL code understanding benchmark, using the evaluation metrics BLEU-4, METEOR, ROUGE-1, ROUGE-2, and ROUGE-L, which offer insight into the quality of the generated outputs in terms of surface-level linguistic similarity.
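For reference, the sketch below shows how these surface-level similarity metrics could be computed for a single reference/candidate pair; it assumes the `nltk` and `rouge-score` Python packages (plus the NLTK WordNet data for METEOR), and the example strings are illustrative rather than drawn from our evaluation.

```python
# Minimal sketch: BLEU-4, METEOR, and ROUGE-1/2/L for one reference/candidate
# pair. Assumes the `nltk` and `rouge-score` packages (and the NLTK 'wordnet'
# data for METEOR); the strings below are illustrative only.
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction
from nltk.translate.meteor_score import meteor_score
from rouge_score import rouge_scorer

reference = "adds two 8-bit operands and produces a 9-bit sum with carry out"
candidate = "adds two 8-bit inputs and outputs a 9-bit sum including the carry"

ref_tokens, cand_tokens = reference.split(), candidate.split()

# BLEU-4: geometric mean of 1- to 4-gram precisions (smoothed for short texts).
bleu4 = sentence_bleu(
    [ref_tokens], cand_tokens,
    weights=(0.25, 0.25, 0.25, 0.25),
    smoothing_function=SmoothingFunction().method1,
)

# METEOR: unigram matching with stemming/synonymy plus a fragmentation penalty.
meteor = meteor_score([ref_tokens], cand_tokens)

# ROUGE-1/2/L: unigram, bigram, and longest-common-subsequence overlap (F1).
scorer = rouge_scorer.RougeScorer(["rouge1", "rouge2", "rougeL"], use_stemmer=True)
rouge = scorer.score(reference, candidate)

print(f"BLEU-4: {bleu4:.3f}  METEOR: {meteor:.3f}")
print({name: round(score.fmeasure, 3) for name, score in rouge.items()})
```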
Before fine-tuning, the original versions of the LLMs, such as CodeLlama, CodeT5+, CodeGen2, and DeepSeek, exhibit relatively low performance across most metrics.
After fine-tuning on our dataset, every model demonstrates significantly better performance on BLEU-4, METEOR, ROUGE-1, ROUGE-2, and ROUGE-L than its original, non-fine-tuned counterpart, highlighting the effectiveness of our dataset. Moreover, models of various sizes, from the 220M CodeT5 to the larger 7B and 16B models, all show substantial improvements after fine-tuning. This indicates that our dataset is well suited to models of different scales, demonstrating strong adaptability and generalization.
In RTL code completion and generation, evaluating model performance is critical to advancing intelligent programming tools. The Pass@k metric serves as a pivotal measure in this domain: it quantifies the accuracy of a code generation model by assessing whether at least one valid solution appears among the model's top-k generated candidates, thereby providing insight into both the effectiveness and the reliability of its predictions.
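As a point of reference, the sketch below implements the commonly used unbiased Pass@k estimator, under the assumption that n candidate completions are sampled per problem and c of them pass that problem's functional checks; the sample counts shown are illustrative only and do not reflect our benchmark results.

```python
# Minimal sketch of the standard unbiased Pass@k estimator: the probability
# that at least one of k samples (drawn without replacement from n generated
# candidates, of which c are correct) passes the functional checks.
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased estimate of Pass@k for one problem: 1 - C(n-c, k) / C(n, k)."""
    if n - c < k:
        # Every size-k subset must contain at least one correct sample.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

# Illustrative usage: 20 samples per problem, averaged over three problems.
results = [(20, 3), (20, 0), (20, 12)]  # (n, c) per problem, made-up numbers
for k in (1, 5):
    score = sum(pass_at_k(n, c, k) for n, c in results) / len(results)
    print(f"Pass@{k}: {score:.2%}")
```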
Table VII compares the performance of the original and fine-tuned LLMs on RTL code completion and generation tasks, focusing on Pass@1 and Pass@5 on two evaluation benchmarks, RTLLM [8] and VerilogEval [28]. The baseline LLMs, i.e., the original versions of CodeLlama, CodeT5+, CodeGen2, and CodeGen2.5, exhibit negligible performance, with most Pass@k scores at or near 0%. Notably, every model fine-tuned with our dataset significantly outperforms its original, non-fine-tuned counterpart, demonstrating the effectiveness of our data. Additionally, models of different scales, such as the 220M CodeT5 and the 7B models, all show substantial improvements after fine-tuning. This highlights the adaptability and generalization capability of our dataset across various model sizes.