LLM Fine-Tuning Results
The experimental results, shown in Table VI, demonstrate the performance of various base Large Language Models (LLMs) and their fine-tuned counterparts on our RTL code understanding benchmark, using the evaluation metrics BLEU-4, METEOR, ROUGE-1, ROUGE-2, and ROUGE-L, which offer insight into the quality of the generated outputs in terms of surface-level linguistic similarity.
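For reference, the sketch below shows how these surface-level similarity metrics could be computed for a single reference/candidate pair; it assumes the `nltk` and `rouge-score` Python packages (plus the NLTK WordNet data for METEOR), and the example strings are illustrative rather than drawn from our evaluation.

```python
# Minimal sketch: BLEU-4, METEOR, and ROUGE-1/2/L for one reference/candidate
# pair. Assumes the `nltk` and `rouge-score` packages (and the NLTK 'wordnet'
# data for METEOR); the strings below are illustrative only.
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction
from nltk.translate.meteor_score import meteor_score
from rouge_score import rouge_scorer

reference = "adds two 8-bit operands and produces a 9-bit sum with carry out"
candidate = "adds two 8-bit inputs and outputs a 9-bit sum including the carry"

ref_tokens, cand_tokens = reference.split(), candidate.split()

# BLEU-4: geometric mean of 1- to 4-gram precisions (smoothed for short texts).
bleu4 = sentence_bleu(
    [ref_tokens], cand_tokens,
    weights=(0.25, 0.25, 0.25, 0.25),
    smoothing_function=SmoothingFunction().method1,
)

# METEOR: unigram matching with stemming/synonymy plus a fragmentation penalty.
meteor = meteor_score([ref_tokens], cand_tokens)

# ROUGE-1/2/L: unigram, bigram, and longest-common-subsequence overlap (F1).
scorer = rouge_scorer.RougeScorer(["rouge1", "rouge2", "rougeL"], use_stemmer=True)
rouge = scorer.score(reference, candidate)

print(f"BLEU-4: {bleu4:.3f}  METEOR: {meteor:.3f}")
print({name: round(score.fmeasure, 3) for name, score in rouge.items()})
```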
Before fine-tuning, the original versions of the LLMs, such as CodeLlama, CodeT5+, CodeGen2, and DeepSeek, exhibit relatively low performance across most metrics.
After fine-tuning on our dataset, every model demonstrates significantly better performance on BLEU-4, METEOR, ROUGE-1, ROUGE-2, and ROUGE-L than its original, non-fine-tuned counterpart, highlighting the effectiveness of our dataset. Moreover, models of various sizes, from the 220M CodeT5 to the larger 7B and 16B models, all show substantial improvements after fine-tuning. This indicates that our dataset is well suited to models of different scales, demonstrating strong adaptability and generalization.
In RTL code completion and generation, evaluating model performance is critical to advancing intelligent programming tools. The Pass@k metric serves as a pivotal measure in this domain: it quantifies the accuracy of a code generation model by assessing whether at least one valid solution appears among the model's top-k generated candidates, thereby providing insight into both the effectiveness and the reliability of its predictions.
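As a point of reference, the sketch below implements the commonly used unbiased Pass@k estimator, under the assumption that n candidate completions are sampled per problem and c of them pass that problem's functional checks; the sample counts shown are illustrative only and do not reflect our benchmark results.

```python
# Minimal sketch of the standard unbiased Pass@k estimator: the probability
# that at least one of k samples (drawn without replacement from n generated
# candidates, of which c are correct) passes the functional checks.
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased estimate of Pass@k for one problem: 1 - C(n-c, k) / C(n, k)."""
    if n - c < k:
        # Every size-k subset must contain at least one correct sample.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

# Illustrative usage: 20 samples per problem, averaged over three problems.
results = [(20, 3), (20, 0), (20, 12)]  # (n, c) per problem, made-up numbers
for k in (1, 5):
    score = sum(pass_at_k(n, c, k) for n, c in results) / len(results)
    print(f"Pass@{k}: {score:.2%}")
```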
Table VII compares the performance of the original and fine-tuned LLMs on RTL code completion and generation tasks, focusing on Pass@1 and Pass@5 on two evaluation benchmarks, RTLLM [8] and VerilogEval [28]. The baseline LLMs, i.e., the original versions of CodeLlama, CodeT5+, CodeGen2, and CodeGen2.5, exhibit negligible performance, with most Pass@k scores at or near 0%. Notably, every model fine-tuned with our dataset significantly outperforms its original, non-fine-tuned counterpart, demonstrating the effectiveness of our data. Additionally, models of different scales, such as the 220M CodeT5 and the 7B models, all show substantial improvements after fine-tuning. This highlights the adaptability and generalization capability of our dataset across various model sizes.