

# LLMs and the Future of Chip Design: Unveiling Security Risks and Building Trust

Zeng Wang,<sup>†</sup> Lilas Alrahis,<sup>‡</sup> Likhitha Mankali,<sup>†</sup> Johann Knechtel,<sup>‡</sup> and Ozgur Sinanoglu<sup>‡</sup>

<sup>†</sup>New York University, USA

<sup>‡</sup>New York University Abu Dhabi, UAE

Email:{zw3464, lma387, likhitha.mankali, johann, ozgursin}@nyu.edu

**Abstract**—Chip design is about to be revolutionized by the integration of large language, multimodal, and circuit models (collectively LxMs). While exploring this exciting frontier with tremendous potential, the community must also carefully consider the related security risks and the need for building trust into using LxMs for chip design. First, we review the recent surge of using LxMs for chip design in general. We cover state-of-the-art works for the automation of hardware description language code generation and for scripting and guidance of essential but cumbersome tasks for electronic design automation tools, e.g., design-space exploration, tuning, or designer training. Second, we raise and provide initial answers to novel research questions on critical issues for security and trustworthiness of LxM-powered chip design from both the attack and defense perspectives.

**Index Terms**—Electronic Design Automation; Integrated Circuits; Large Language Models; Hardware Security.

## I. INTRODUCTION

The application of artificial intelligence (AI) in chip design workflows, particularly within electronic design automation (EDA), has significantly accelerated the design process in recent years for critical tasks such as high-level synthesis (HLS) [1]. Besides, recent advancements in natural language processing (NLP) and large language models (LLMs) have demonstrated remarkable success in tasks such as creative writing, chatbot interaction, and commonsense reasoning [2], [3]. Given the similarities between NLP applications and certain tasks in EDA, including pattern recognition and code generation, along with the demonstrated success of other AI models in EDA, researchers have recently begun to investigate the potential of LLMs across various aspects of EDA.<sup>1</sup>

In this paper, we explore a broad spectrum of critical issues, presenting selected research that focuses on the use of LLMs, large multimodal models (LMMs), and large circuit models (LCMs), or collectively LxMs, for chip design, verification, and security analysis. Our contributions are as follows.

- 1) We provide a comprehensive survey, reviewing selected state-of-the-art works and providing detailed descriptions of LLM tasks, framework designs, training schemes, etc.
- 2) We compile resources on LLMs for chip design, focusing on various applications of LLMs, security concerns, and open-source resources. We summarize these findings in our **LLM4IC hub** [<https://github.com/DfX-NYUAD/LLM4IC>].

<sup>1</sup>In fact, LLMs are already used for commercial EDA applications; e.g., *Synopsys.ai Copilot* is a generative AI tool that aids in reducing time to market and systemic complexity through conversational intelligence [4].

## II. BACKGROUND

### A. Large Language Models

LLMs are transformer-based deep neural networks. The transformer architecture comprises encoder and decoder layers that utilize self-attention mechanisms to process input sequences, such as text sentences or paragraphs. These models are trained on massive text datasets and are capable of performing various NLP tasks, including text recognition, translation, prediction, and generation.

**1) Learning Paradigms:** We focus on four learning/usage paradigms in the field of LLMs, as follows.

**Training.** Generic LLMs are trained on large-scale datasets and can be directly used for various tasks without modification [2]. *Domain-adaptive pretraining* (DAPT) involves pretraining the LLM on domain-specific text, allowing it to fully adapt to the specifics of a particular domain [5].

**Prompting** involves crafting specific prompts that guide the pretrained LLM in generating desired outputs without modifying the model. Prompts that guide the LLM with a few examples or illustrations are known as *few-shot*. Similarly, *one-shot* uses one example, and *zero-shot* uses none [6].

**Fine-tuning (FT)** is the process of further training a pretrained LLM on task-specific data to improve its performance on these seen tasks [2]. Note that this differs from DAPT.

**Instruct-tuning (IT)** is the process of enhancing the performance of pretrained LLMs by further training them with data for instruction-and-example pairs to handle unseen tasks [7].
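To make the prompting paradigm above concrete, the following minimal sketch assembles zero-shot, one-shot, and few-shot prompts for a Verilog generation task. The function name `build_prompt`, the section markers, and the example pairs are hypothetical choices for illustration; they are not taken from any of the surveyed frameworks.

```python
from typing import List, Tuple


def build_prompt(task: str, examples: List[Tuple[str, str]]) -> str:
    """Assemble a prompt from worked examples followed by the new task.

    An empty `examples` list gives a zero-shot prompt, one example gives a
    one-shot prompt, and several examples give a few-shot prompt.
    """
    parts = []
    for spec, code in examples:
        parts.append(f"### Specification\n{spec}\n### Verilog\n{code}\n")
    # The new task ends with an open "Verilog" header for the model to complete.
    parts.append(f"### Specification\n{task}\n### Verilog\n")
    return "\n".join(parts)


# Hypothetical example pairs, for illustration only.
EXAMPLES = [
    ("2-input AND gate",
     "module and2(input a, input b, output y);\n  assign y = a & b;\nendmodule"),
    ("2-input OR gate",
     "module or2(input a, input b, output y);\n  assign y = a | b;\nendmodule"),
]

zero_shot = build_prompt("2-input XOR gate", [])
one_shot = build_prompt("2-input XOR gate", EXAMPLES[:1])
few_shot = build_prompt("2-input XOR gate", EXAMPLES)
```

In all three cases the model itself is unchanged; only the amount of in-prompt guidance differs, which is what distinguishes prompting from FT and IT.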

**2) Interacting Mechanisms:** Retrieval-augmented generation (RAG) is a mechanism that extracts queries from input prompts and then fetches relevant information from external databases. This information is used by a generative component/agent to produce follow-up responses to the LLM [8]. Besides, the reasoning and action (ReAct) mechanism prompts LLMs to provide both reasoning traces and actions related to a task in an interleaved fashion, thereby stimulating the model's dynamic problem-solving abilities [9].
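The retrieve-then-generate idea behind RAG can be sketched as follows. This is a simplified illustration under stated assumptions: the knowledge base `DOCS` is a hypothetical in-memory list, and plain word overlap stands in for the embedding-based similarity search that practical systems would typically use. The resulting prompt would then be passed to the generative model.

```python
from typing import List, Set

# Hypothetical knowledge base; a real flow might index tool manuals or design docs.
DOCS = [
    "Use non-blocking assignments in clocked always blocks.",
    "Use blocking assignments in combinational always blocks.",
    "A testbench should drive the clock and reset signals.",
]


def tokenize(text: str) -> Set[str]:
    """Lower-case the words and strip simple trailing punctuation."""
    return {w.strip(".,?").lower() for w in text.split()}


def retrieve(query: str, docs: List[str], k: int = 1) -> List[str]:
    """Return the k documents sharing the most words with the query."""
    q = tokenize(query)
    ranked = sorted(docs, key=lambda d: len(q & tokenize(d)), reverse=True)
    return ranked[:k]


def augment(query: str, docs: List[str], k: int = 1) -> str:
    """Prepend the retrieved context to the user query."""
    context = "\n".join(retrieve(query, docs, k))
    return f"Context:\n{context}\n\nQuestion: {query}\nAnswer:"


prompt = augment("Which assignments belong in clocked always blocks?", DOCS)
```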

**3) LMM/LCM:** LMMs integrate multiple data types (e.g., text, images) and various neural network architectures. This enables them to effectively handle diverse data with complex relations often found in real-world problems [10]. Similarly, LCMs integrate multiple hardware representations of chip designs to work on complex chips and electronic systems [11].

### B. Electronic Design Automation

EDA utilizes a range of specialized software tools to realize automated workflows for chip design. Detailed technology information as well as reusable design components, so-called intellectual property (IP) modules, are utilized as well. A typical EDA flow incorporates distinct stages as follows.<sup>2</sup>

First, a system specification is translated or compiled by an HLS tool into a hardware description language (HDL), typically at the abstraction of register-transfer level (RTL) instructions. Toward that end, HLS tools initially devise and/or allocate functional units, both possibly with the help of IP modules. Then, HLS tools bind all the system-level tasks to those units. In other words, the chip's operation is planned out into a behavioural model described in HDL/RTL.

Second, that behavioural HDL/RTL model is translated by a logic-synthesis tool into a gate-level netlist. During this process, technology mapping is applied, *i.e.*, the generic design description is translated into a technology-specific one.

Third, the netlist is translated by a physical-synthesis tool into an actual layout. This process is known to be the most challenging task for chip design, especially for advanced technology nodes with complex geometries and design rules.

### C. Hardware Security

The increase in chip manufacturing costs due to continued advances in technology nodes has led design companies to adopt a globalized supply chain. In such a so-called fabless business model, design companies outsource fabrication, testing, and packaging to third-party facilities. However, outsourcing to potentially untrusted entities has led to different security threats. These threats could be enacted either at different stages of the supply chain or by end-users [14], [15].

**1) Hardware Trojans (HTs):** Attackers located either in the fabrication facility (foundry) or with a third-party design house could insert malicious modifications into the design.

**2) IP Piracy and Chip Overbuilding:** Attackers located in the foundry may illegally pirate the design IP or overproduce chips without the consent of the designer.

**3) Side-Channel Attacks (SCA):** Attackers can extract secret information by exploiting physical information from chips in operation, such as power consumption, timing behavior, etc.

**4) Further Considerations:** There are other inherent vulnerabilities (*i.e.*, flaws in the design) that can result in security breaks. For instance, vulnerabilities in micro-architecture design have raised severe implications [16], [17].

Besides, any hardware vulnerabilities discovered after fabrication usually cannot be fixed/patched. Thus, researchers have proposed different vulnerability detection techniques, such as fuzzing [18], information flow tracking [19], and formal verification [20]. These techniques often utilize assertions to detect hardware vulnerabilities [21].

## III. PROPOSED TAXONOMY

In this section, we discuss selected works on LLMs for various chip-related tasks, categorizing them primarily into two parts, as shown in Fig. 1: LLMs for EDA (Section III-A) and LLMs for hardware security (Section III-B). We further summarize the key details of the discussed works in Table I.

Fig. 1. Overview of selected LLM applications for EDA in general (blue) and for hardware security in particular (purple).

### A. LLMs for Electronic Design Automation

LLMs have recently been explored to support chip design, focusing primarily on HLS and HDL code generation. Still, researchers have also started to investigate LLMs for other tasks, like generating architecture specifications [22], fixing HDL bugs [23], streamlining printed circuit board design [24], etc.

Here, we review prior art that applies LLMs for a variety of such tasks, and we also outline related limitations.

**1) Verilog RTL Generation:** The promising code generation ability of LLMs led researchers to investigate the potential for Verilog RTL generation, aiming to relieve the burden of manual design efforts. However, there are challenges due to the limited hardware-related knowledge with existing pretrained LLMs. To address these challenges, researchers proposed various prompting-based and FT-based frameworks.<sup>3</sup>

**1a) Prompting-based Frameworks:** Chip-Chat [33] interacts with conversational generative pretrained transformer (GPT)-4 with prompts, alongside experienced engineers, to collaboratively design an 8-bit accumulator-based microprocessor. This work shows the possibility of going from design specifications to a functional implementation with minimal effort, but it lacks verification and modern security features.

ChipGPT [39] interacts with GPT-3.5 through prompts and finalizes a design considering its functional correctness and power/performance/area (PPA) optimization. This method is tested on simple designs, demonstrating improvements in area and correctness of the generated designs.

<sup>2</sup>Contemporary flows also include verification procedures as well as feedback loops across each stage; for simplicity, this is not further discussed here. Besides, see also [12] for more background on EDA in general and [13] for a security-centric review of EDA.

<sup>3</sup>Recall that prompting-based frameworks iteratively guide pretrained models with external knowledge to deal with limited training datasets and avoid further training, whereas FT-based frameworks address the challenge directly and build training datasets to fine-tune pretrained LLMs.

TABLE I  
SUMMARY OF LLMs FOR CHIP DESIGN AND HARDWARE SECURITY.  
PT: PRETRAINING, P: PROMPTING, FT: FINE-TUNING, IT: INSTRUCT-TUNING, NA: NOT APPLICABLE.

| Platform | LLM Task | Base Model | Training | FT/IT Datasets | Expertise Needed | Access |
|----------|----------|------------|----------|----------------|------------------|--------|
| ChipNeMo [5] | Interaction with EDA Tools | LLaMA2-70B | PT+FT | 24.1B samples from Nvidia's internal database | For training dataset generation and evaluation | Closed |
| ChatEDA [25] | Interaction with EDA Tools | LLaMA2-70B | IT | $\approx$ 1,500 instructions | For EDA expert knowledge | Closed |
| DAVE [26] | Verilog RTL Generation | GPT-2 | FT | > 17K customized task-and-result pairs | For training dataset generation | Closed |
| VeriGen [27] | Verilog RTL Generation | CodeGen-16B | FT | 400MB Verilog from GitHub & Textbooks | For evaluation and hand-designed prompts | [28] |
| VerilogEval [29] | Verilog RTL Generation | VeriGen | FT | 8,502 synthetic problem-and-code pairs | For training dataset generation | [30] |
| RTLCoder [31] | Verilog RTL Generation | Mistral-7B-v0.1, DeepSeek-Coder-6.7b | FT | 27K instruction-and-code pairs | For keywords preparation in training dataset | [32] |
| Chip-Chat [33] | Verilog RTL Generation | GPT-4 | P | NA | For syntax check and verification | [34] |
| AutoChip [35] | Verilog RTL Generation | GPT-3.5, GPT-4 | P | NA | To compile files, for simulation | [36] |
| RTLLM [37] | Verilog RTL Generation | GPT-4 | P | NA | For syntax checks, and for functionality and quality evaluation | [38] |
| ChipGPT [39] | Verilog RTL Generation | GPT-3.5 | P | NA | For human feedback with manual design correction | Closed |
| BetterV [40] | Design Optimization | CodeLlama-7B-Instruct | IT | a set of Verilog-and-C pairs and a set of Verilog definition-and-body pairs | For training dataset preparation and generative discriminator | Closed |
| DeLorenzo et al. [41] | Design Optimization | VeriGen-2B | P | NA | For optimization and manual prompts | Closed |
| GPT4AIGChip [42] | HDL Code Generation | GPT-4 | P | NA | For demo library construction and final manual design corrections | Closed |
| Wan et al. [43] | HDL Code Generation | GPT-4 | P | NA | For setting bugs and insertion rules | [44] |
| RTLFixer [23] | HDL Bugs Detection and Repair | GPT-3.5 | P | NA | For retrieval database (syntax check) | Closed |
| HDLdebugger [45] | HDL Bugs Detection and Repair | CodeLlama-13b | FT | 92,143 distinct HDL instances from Huawei | For dataset generation and doc-and-code RAG | Closed |
| LLM4SecHW [46] | HDL Bugs Detection and Repair | Falcon-7B, StableLM-7B, LLaMA2-7B | FT | 15,000 samples of commit message and pre/post-revision code | For hardware version control training dataset | [47] |
| Pearce et al. [48] | Security Bugs Fixing | GPT-2 | P | NA | For bug localization and fixing suggestions | [49] |
| Nair et al. [50] | Security Bugs Fixing | GPT-3.5 | P | NA | For manual security-driven prompts | Closed |
| Ahmad et al. [51] | Security Bugs Fixing | code-davinci-002 | P | NA | For bug localization and repair instruction | [52] |
| NSPG [53] | Assertions and Security Properties Generation | BERT | PT+FT | 4,427 sentences from hardware documentation | For labeled security property dataset | Closed |
| DIVAS [54] | Assertions and Security Properties Generation | GPT-4, BARD | P | NA | For filtering related CWEs | Closed |
| AutoSVA2 [55] | Assertions and Security Properties Generation | GPT-4 | P | NA | For SVA safety rules refinement | [56] |
| Kande et al. [57] | Assertions and Security Properties Generation | code-davinci-002 | P | NA | For selecting designs with CWEs | Closed |
| ChiRAAG [58] | Assertions and Security Properties Generation | GPT-4 | P | NA | For manual error checks | Closed |
| AssertLLM [59] | Assertions and Security Properties Generation | GPT-4 | P | NA | For SVA-related data extraction | [60] |
| Kokolakis et al. [61] | HT Insertion | ChatGPT | P | NA | For manually designed prompts | Closed |
| Netlist Whisperer [62] | SCA Counter-measures | GPT-3 (1x Ada-based, 1x Curie-based) | FT | 10K secure-vs-leaking tokens, 3.5k algebraic-normal-form pairs | For training dataset generation | Closed |
| SCAR [63] | SCA Counter-measures | Falcon-7B | FT | synthetic vulnerable lines from AES HDL codes | For leakage detection and localization | Closed |

Further enhancing the quality of generated designs, RTLLM [37] covers three objectives: syntax, functionality, and design quality. It evaluates the capability of GPT-3.5 and GPT-4 in generating RTL codes for these objectives and then improves performance by requiring LLMs to provide reasoning steps and syntax error checking in initial prompts.

AutoChip [35] utilizes a feedback loop that returns the compilation and simulation results for the generated Verilog to the LLM, for more accurate code generation without human intervention. Although the iterative feedback improves the success rate, it cannot handle larger designs due to prompting limitations.
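The general shape of such a tool-feedback loop can be sketched as below. This is not AutoChip's implementation: `refine`, `toy_llm`, and `toy_check` are hypothetical names, the "LLM" is a stub that only produces correct output after seeing feedback, and the "compiler" is a trivial string check standing in for real compilation and simulation.

```python
from typing import Callable, Optional, Tuple


def refine(
    spec: str,
    generate: Callable[[str], str],
    check: Callable[[str], Optional[str]],
    max_rounds: int = 3,
) -> Tuple[Optional[str], int]:
    """Generate code, check it, and feed any error message back into the prompt.

    `check` returns None on success or an error string on failure.
    Returns (code, rounds_used), or (None, max_rounds) if no attempt passed.
    """
    prompt = spec
    for round_no in range(1, max_rounds + 1):
        code = generate(prompt)
        error = check(code)
        if error is None:
            return code, round_no
        # Append the tool feedback so the next attempt can react to it.
        prompt = f"{spec}\n\nPrevious attempt:\n{code}\n\nTool feedback:\n{error}"
    return None, max_rounds


# --- Stand-ins for illustration only (both are hypothetical) ---
def toy_llm(prompt: str) -> str:
    body = "module inv(input a, output y);\n  assign y = ~a;\n"
    # The toy model only closes the module after seeing feedback.
    return body + "endmodule\n" if "Tool feedback" in prompt else body


def toy_check(code: str) -> Optional[str]:
    return None if "endmodule" in code else "syntax error: missing 'endmodule'"


code, rounds = refine("Write an inverter in Verilog.", toy_llm, toy_check)
```

Because each round appends the previous attempt and the tool output to the prompt, the prompt grows with design size and round count, which is consistent with the scaling limitation noted above.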

**1b) Fine-tuning-based Frameworks:** DAVE [26] fine-tunes GPT-2 with a dataset of English-task-and-Verilog-result pairs for deriving Verilog snippets from English.

VeriGen [27] fine-tunes the CodeGen-16B [64] LLM using compiled Verilog datasets sourced from GitHub and textbooks in an unsupervised manner. Its generated Verilog codes still contain functional errors, requiring designers to debug and refine them. For instance, VeriGen on complex RTLLM [37] benchmarks underperforms compared to GPT-3.5 [31].

VerilogEval [29] further fine-tunes VeriGen with synthetic description-and-code pairs, demonstrating improved performance. It open-sources two synthetic benchmark datasets, called VerilogEval-machine and VerilogEval-human. However, it ignores downstream task performance.

RTLCoder [31] provides various instruction-and-Verilog pairs to fine-tune an LLM based on code quality feedback. It utilizes quantized parameters with minimal loss of accuracy, allowing it to serve as a local assistant for engineers. However, it underperforms compared to GPT-4 on RTLLM [37] and encounters functionality errors on complex designs.

**2) Design Optimization:** LLMs for Verilog design optimization are also being investigated. ChipGPT [39] proposes a post-LLM search to find generated designs with optimal PPA.
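As a generic illustration of selecting among LLM-generated candidates by PPA, consider the sketch below. It is not ChipGPT's search procedure: the `Candidate` fields, the normalized weighted-sum cost, and all numbers are assumptions made up for this example, and the metrics are assumed to be strictly positive.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class Candidate:
    name: str
    functional: bool   # whether the design passed functional checks
    power_mw: float
    delay_ns: float
    area_um2: float


def select_best(
    cands: List[Candidate],
    weights: Tuple[float, float, float] = (1.0, 1.0, 1.0),
) -> Optional[Candidate]:
    """Pick the functionally correct candidate with the lowest weighted PPA cost."""
    valid = [c for c in cands if c.functional]
    if not valid:
        return None
    # Normalize each metric by its smallest value so that units do not matter.
    min_p = min(c.power_mw for c in valid)
    min_d = min(c.delay_ns for c in valid)
    min_a = min(c.area_um2 for c in valid)
    wp, wd, wa = weights

    def cost(c: Candidate) -> float:
        return wp * c.power_mw / min_p + wd * c.delay_ns / min_d + wa * c.area_um2 / min_a

    return min(valid, key=cost)


# Made-up candidates, for illustration only.
CANDS = [
    Candidate("A", True, 2.0, 1.0, 100.0),
    Candidate("B", True, 1.0, 1.2, 120.0),
    Candidate("C", False, 0.5, 0.5, 50.0),  # smallest, but functionally wrong
]

best = select_best(CANDS)
```

Filtering on functional correctness before ranking reflects the priority that a smaller or faster design is of no use if it is incorrect; the weights let a designer bias the choice toward power, performance, or area.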

BetterV [40] utilizes domain-specific instruct-tuning methods to fine-tune LLMs for Verilog domain knowledge. Its generative discriminator, trained with augmented data from fine-tuned LLMs, serves for optimized Verilog implementations.

DeLorenzo et al. [41] integrate Monte Carlo tree search to guide VeriGen [27] in producing optimized designs for adders, multipliers and multiply-accumulate units of different sizes.

**3) HLS Code Generation:** GPT4AIGChip [42] employs decoupled HLS design templates and demo-augmented prompts for GPT-4 to generate AI accelerator designs. The authors then test the accelerator efficiency on six different neural networks. However, the need for demo prompts and the human involvement for verification limit the effectiveness.

Wan et al. [43] use GPT-4 to insert bugs into open-source HLS designs utilizing refined prompts. However, their approach is limited by predefined bugs and response structures.

**4) Interaction with EDA Tools:** ChatEDA [25] applies instruct-tuning to train its model called AutoMage, which then decomposes complex EDA tasks into sub-tasks and generates corresponding EDA scripts for RTL-to-layout flows. While those scripts can be executed directly, their performance hinges on the specific tools setup, and generalization is elusive.

ChipNeMo [5] employs DAPT techniques to improve LLM performance in three specific applications: engineering assistant chatbot, EDA tool script generation, and bug summarization and analysis. It integrates RAG to minimize errors and utilizes domain-specific fine-tuning to enhance accuracy. The resulting DAPT models are shown to surpass standard LLMs in domain-specific benchmarks and expert assessments.

**5) Detection and Repair of HDL Bugs:** RTLFixer [23] introduces a debugging framework that combines the RAG and ReAct prompting mechanisms, enabling LLMs to function as agents for interactively debugging code with feedback. The iterative code refinement succeeds in fixing simple implementation errors but still lacks advanced reasoning and problem-solving skills for complex problems.

HDLdebugger [45] observes that RTLFixer falls short of industry standards. Thus, HDLdebugger introduces a fine-tuning process to enhance debugging capabilities. It applies supervised fine-tuning of the LLM through optimized HDL documents and RAG from search engines, along with generated thoughts for buggy code. However, handling complex scenarios is still challenging due to the limited scope of LLMs and the high cost of data gathering.

LLM4SecHW [46] fine-tunes LLMs on a debugging-oriented dataset, compiled from version control data of open-source designs and their remediation steps. This dataset addresses the scarcity of functional hardware designs and enables fine-tuning of LLMs for detecting and repairing hardware bugs. Its modest performance indicates the need for an even larger dataset to improve effectiveness.

### B. LLMs for Hardware Security

Here, we discuss selected applications of LLMs for hardware security, including bug fixes, security properties and assertions generation, and handling threats like HTs and SCAs. As before, we outline related limitations as well.

**1) Fixing of Hardware-Security Bugs:** Pearce et al. [48] explore LLMs for repairing vulnerabilities in a zero-shot setting. This work showcases that the considered black-box LLMs can indeed generate fixes without additional training, although they require strategically devised prompts. Also, the approach is ineffective for generating plausible fixes related to real-world scenarios and, thus, is not yet reliable for critical fixes.

Nair et al. [50] systematically explore strategies to guide ChatGPT for secure HDL generation under 10 well-known common weakness enumerations (CWEs) [65]. This work first shows the potential security vulnerabilities in the generated code and then proposes techniques for creating robust prompts that generate secure code.

Ahmad et al. [51] explore the use of LLMs for automatically repairing security bugs. They evaluate repairs based on functionality and security gains. They utilize domain knowledge of CWE bugs, along with proposed fixing instructions, to generate prompts. The limited use of generative instructions and the lack of a comprehensive functionality-versus-security evaluation, however, impact the performance of such LLM-based fixes. Besides, this work shows that fixes which span multiple lines or require removing buggy lines are elusive.

**2) Generation of Security Properties and Assertions:** NSPG [53] is an NLP-based security property generator, extracting properties from system-on-chip (SoC) documentation. It pretrains the BERT [66] model with SoC documentation and then fine-tunes the model with customized hardware-domain datasets. Thus, its effectiveness relies on the quality of the documentation and the scope of the datasets.

Kande et al. [57] evaluate LLMs for security assertion generation. They generate prompts using designed templates, to guide LLMs to produce assertions, which are then used in a SystemVerilog simulator. The simulation logs are parsed for assertion violations and analyzed against a set of “golden assertions.” Designer expertise is needed to analyze the security properties and to accurately assess their effectiveness.
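One plausible way to automate the log-based comparison described above is sketched below. The log format (`ASSERT_FAIL <name> @ <time>`), the function names, and the agreement criterion (both assertions fire at the same simulation times) are assumptions for illustration; real simulators report assertion failures in tool-specific formats, and the surveyed work may use a different criterion.

```python
import re
from typing import Set

# Hypothetical log format; real simulators print assertion failures differently.
FAIL_RE = re.compile(r"^ASSERT_FAIL\s+(\w+)\s+@\s*(\d+)", re.MULTILINE)


def violations(log: str) -> Set[int]:
    """Return the set of simulation times at which any assertion failed."""
    return {int(t) for _, t in FAIL_RE.findall(log)}


def matches_golden(generated_log: str, golden_log: str) -> bool:
    """Treat a generated assertion as agreeing with the golden one
    if both fire at exactly the same simulation times."""
    return violations(generated_log) == violations(golden_log)


# Made-up logs, for illustration only.
golden_log = (
    "INFO start\n"
    "ASSERT_FAIL golden_lock @ 40\n"
    "ASSERT_FAIL golden_lock @ 90\n"
    "INFO done\n"
)
gen_ok = "ASSERT_FAIL gen_lock @ 40\nASSERT_FAIL gen_lock @ 90\n"
gen_bad = "ASSERT_FAIL gen_lock @ 40\n"  # misses the violation at time 90
```

Such a check only establishes agreement on the simulated stimuli; as noted above, designer expertise is still needed to judge whether the assertion captures the intended security property.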

DIVAS [54] leverages LLMs to identify relevant CWEs [65] within selected SoC specifications, then generates SystemVerilog assertions (SVAs) using LLMs for verification, all toward automated, policy-based SoC design flows. While identifying CWEs using LLMs is feasible, this work also shows that there are limitations in accuracy and scope.

AutoSVA2 [55] combines formal verification and GPT-4 to generate SVAs directly from RTL, *i.e.*, without relying on specifications. This method successfully detects a previously unknown bug in a RISC-V Ariane SoC. It shows the potential to fine-tune LLMs using curated datasets of SVAs and RTL.

ChiRAAG [58] systematically deconstructs design specifications into standardized texts and then uses LLMs to generate assertions from those. It also generates testbenches to validate the generated assertions, serving as feedback to enhance the LLM’s performance. Still, further exploration is needed to assess both the quality of unstructured specifications and the extent of functional coverage of assertions.

AssertLLM [59] automatically generates assertions. It simplifies the process into three steps: extracting relevant details from the specifications, aligning signal names with their definitions in HDL codes, and generating SVAs. This approach utilizes three specialized LLMs to handle each step separately. Naturally, the quality of the input specification and the effectiveness of the LLMs dictate the quality of the generated SVAs.

**3) HT Insertion:** Kokolakis et al. [61] study the use of LLMs to assist attackers with HT insertion into SoC designs. Their method starts with a filtering process to identify suitable modules for HT insertion. Next, the attacker provides the RTL code of the selected modules to the LLM, which then aids in implanting the HTs by modifying the modules' code. The approach for automated insertion with fine-tuned LLMs still requires further exploration to support complex scenarios.

**4) SCA Countermeasures:** Netlist Whisperer [62] utilizes LLMs to identify SCA vulnerabilities in Verilog netlists based on a two-phase, pre-silicon flow. First, a fine-tuned Ada-based GPT-3 model identifies problematic nets that induce power leakage. Second, a fine-tuned Curie-based GPT-3 model generates an SCA-resistant version of the netlist. This approach streamlines and speeds up the SCA assessment and fixing process by eliminating the need for power trace collection.

SCAR [63] converts the RTL codes of cryptography accelerators to control-data flow graphs, to identify modules susceptible to SCA leakage, and employs a deep-learning explainer for detection and localization of vulnerabilities. It then utilizes a pretrained LLM to automatically generate and insert additional code into the localized areas.

## IV. DISCUSSION AND FUTURE DIRECTIONS

Here, we raise important and novel research questions. We also outline possible answers/solutions and future directions.

### A. Research Questions

###### Research Question 1

Can malicious modifications (HTs) be more easily introduced by integration of LLMs into design flows?

HTs can indeed be introduced through prompting by attackers [61], albeit with manual efforts. Still, automating this process seems feasible when leveraging the multi-reasoning capabilities of LLMs. This complex task could be sub-divided into steps, e.g., locating vulnerable design parts, inserting HTs, verifying their functional correctness, and testing their effectiveness. Prompts can be augmented with related security assessment and/or simulation results toward more successful HT insertion.

###### Research Question 2

Can valuable design IP, from sensitive in-house training data, be leaked through LLMs?

Fine-tuning pretrained LLMs with domain-specific data enables more effective RTL code generation [31], [40]. Thus, to achieve high-quality RTL generation, vendors may have to fine-tune these models using valuable IP designs. However, this process could inadvertently expose the actual IP to attackers that seek to extract the training data. In fact, this privacy risk has already been studied by the machine learning community, e.g., [67] showed that an adversary with no prior knowledge can achieve $\approx 75\%$ accuracy when retrieving sensitive medical data from BERT responses. Privacy-preserving mapping [68] can be applied to counter such adversaries; such an approach should also be applicable to protect sensitive IP for EDA-centric LLMs.

###### Research Question 3

Can LLMs be trained to seamlessly integrate best practices for hardware security into the design flow, minimizing human error and vulnerabilities against various attacks?

The success of HDL debugging with LLMs [45], facilitated by synthetic supervised fine-tuning, highlights the potential for similar work in enhancing hardware security in general. Designers could create datasets with pairs of attacks and suitable countermeasures. By fine-tuning LLMs with such specialized datasets, the models could become adept at not only recognizing security vulnerabilities but also generating secure designs. Besides, Netlist Whisperer [62] presents an alternative strategy in which LLMs are fine-tuned to pinpoint vulnerabilities and, once identified, another fine-tuned LLM could help to fix those to generate more secure designs.

###### Research Question 4

Can LLMs assist in devising better design-for-trust techniques such as advanced root-of-trust modules or novel logic-locking schemes?

LLMs have shown potential in identifying the security properties of chip designs [53]. This opens up the possibility of fine-tuning LLMs to discover previously unrecognized properties of secure designs. Designers may not fully understand why a particular design is sufficiently secure—the properties identified by LLMs can provide valuable insights, helping designers harden their chips to the fullest extent.

##### B. Future Directions

The current application of LLMs has demonstrated their formidable capabilities for chip design in general as well as for hardware security in particular. However, numerous aspects require further exploration to fully integrate LxMs—not only LLMs—into the chip design domain.

Below, we outline several key research directions:

**1) Representation Alignment:** Recent LMMs and LCMs offer to integrate diverse formats from various design stages, such as specifications, HLS, and RTL, to provide a more accurate and coherent foundation for AI assistance with complex design tasks, like verification and innovative design-space exploration [10], [11]. We call for more work utilizing LxMs toward such highly promising alignment for chip design in general and hardware security in particular.

**2) Optimization:** Techniques for enhancing LxM performance remain underexplored, particularly in balancing security and functionality. For example, generative design can be approached as a search space primarily concerned with functional correctness. Then, selecting and combining generated modules from such space should also account for security.

**3) Design Automation:** HDL code generation and verification are still underdeveloped. For example, generating complex yet correct designs and identifying possible CWEs, all without designer guidance and expertise, has yet to be achieved.

**4) LxM Security:** Threats to LxMs themselves, including those arising from the sensitive nature of datasets and fine-tuned models, require thorough investigation. For example, using untrusted datasets can lead to unreliable LxMs, resulting in vulnerable designs.
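As a rough illustration of direction 2) above, the following sketch treats functional correctness as a hard constraint on generated candidates and then ranks the remaining ones by a security metric, with area as a tie-breaker. The `Candidate` fields, the findings count as a security metric, and the function `select_candidate` are assumptions made for illustration only; in practice, these values would come from testbenches, security checkers, and synthesis reports.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Candidate:
    """One generated module with hypothetical evaluation results."""
    name: str
    passes_testbench: bool   # functional correctness (hard constraint)
    security_findings: int   # e.g., number of flagged weaknesses
    area: float              # e.g., post-synthesis area estimate


def select_candidate(candidates: list[Candidate]) -> Optional[Candidate]:
    """Pick the functionally correct candidate with the fewest security findings.

    Ties on security are broken by smaller area. Returns None if no
    candidate passes its testbench.
    """
    correct = [c for c in candidates if c.passes_testbench]
    return min(
        correct,
        key=lambda c: (c.security_findings, c.area),
        default=None,
    )
```

Ordering security before area is one possible design choice; a weighted cost or a Pareto front over both objectives would be equally plausible.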

## V. CONCLUSION

Researchers and practitioners are actively exploring various approaches to enhance the capabilities of large language models (LLMs) toward handling chip design. While LLMs have demonstrated remarkable success for certain hardware tasks, particularly in HDL and security assertion generation, other advanced aspects of chip design, like hardware security risks and trust issues, still need to be explored further. Here, we have reviewed prominent prior art, which is also summarized in our LLM4IC hub. We further raised several research questions toward important future directions.

## REFERENCES

- [1] M. Rapp *et al.*, “MLCAD: A Survey of Research in Machine Learning for CAD Keynote Paper,” *IEEE TCAD*, vol. 41, pp. 3162–3181, 2021.
- [2] OpenAI, “GPT-4 Technical Report,” vol. 2, no. 5, 2023.
- [3] H. Touvron *et al.*, “Llama 2: Open Foundation and Fine-Tuned Chat Models,” *arXiv:2307.09288*.
- [4] “Synopsys Showcases,” URL: <https://bit.ly/3xZu2Oh>.
- [5] M. Liu *et al.*, “ChipNeMo: Domain-Adapted LLMs for Chip Design,” *arXiv:2311.00176*.
- [6] J. Wei *et al.*, “Finetuned Language Models Are Zero-Shot Learners,” *arXiv:2109.01652*.
- [7] S. Zhang *et al.*, “Instruction tuning for large language models: A survey,” *arXiv:2308.10792*.
- [8] P. Lewis *et al.*, “Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks,” *Advances in NeurIPS*, vol. 33, 2020.
- [9] S. Yao *et al.*, “ReAct: Synergizing Reasoning and Acting in Language Models,” *arXiv:2210.03629*.
- [10] J. Wu *et al.*, “Multimodal Large Language Models: A Survey,” in *2023 IEEE BigData*. IEEE, 2023, pp. 2247–2256.
- [11] L. Chen *et al.*, “The Dawn of AI-Native EDA: Promises and Challenges of Large Circuit Models,” *arXiv:2403.07257*.
- [12] A. B. Kahng *et al.*, *VLSI Physical Design: From Graph Partitioning to Timing Closure*. Springer, 2011.
- [13] J. Knechtel *et al.*, “Towards Secure Composition of Integrated Circuits and Electronic Systems: On the Role of EDA,” in *Proc. DATE*, 2020.
- [14] M. Rostami *et al.*, “A Primer on Hardware Security: Models, Methods, and Metrics,” *Proceedings of the IEEE*, vol. 102, pp. 1283–1295, 2014.
- [15] J. Knechtel, “Hardware security for and beyond CMOS technology: An overview on fundamentals, applications, challenges, and prospects,” in *Proc. ISPD*, 2020, pp. 75–86.
- [16] P. Kocher *et al.*, “Spectre Attacks: Exploiting Speculative Execution,” in *IEEE S&P*, 2019, pp. 1–19.
- [17] M. Lipp *et al.*, “Meltdown: Reading Kernel Memory from User Space,” in *USENIX Security*, Aug. 2018, pp. 973–990.
- [18] T. Trippel *et al.*, “Fuzzing Hardware Like Software,” in *USENIX Security Symposium*, Aug. 2022, pp. 3237–3254.
- [19] A. Ardeshiricham *et al.*, “Register Transfer Level Information Flow Tracking for Provably Secure Hardware Design,” in *DATE*, 2017.
- [20] C. Kern and M. R. Greenstreet, “Formal Verification In Hardware Design: A Survey,” *ACM TODAES*, vol. 4, no. 2, pp. 123–193, Apr. 1999.
- [21] H. Witharana *et al.*, “A Survey on Assertion-based Hardware Verification,” *ACM Comput. Surv.*, vol. 54, no. 11s, 2022.
- [22] M. Li *et al.*, “SpecLLM: Exploring Generation and Review of VLSI Design Specification with Large Language Model,” *arXiv:2401.13266*.
- [23] Y. Tsai *et al.*, “RTLFixer: Automatically Fixing RTL Syntax Errors with Large Language Models,” *arXiv:2311.16543*.
- [24] B. Han *et al.*, “New Interaction Paradigm for Complex EDA Software Leveraging GPT,” *arXiv:2307.14740*.
- [25] H. Wu *et al.*, “ChatEDA: A Large Language Model Powered Autonomous Agent for EDA,” *IEEE TCAD*, 2024.
- [26] H. Pearce *et al.*, “DAVE: Deriving Automatically Verilog from English,” in *Proceedings of the 2020 ACM/IEEE Workshop on MLCAD*, 2020.
- [27] S. Thakur *et al.*, “VeriGen: A Large Language Model for Verilog Code Generation,” *ACM TODAES*, 2023.
- [28] VeriGen. Available: <https://github.com/shailja-thakur/VGen>
- [29] M. Liu *et al.*, “VerilogEval: Evaluating Large Language Models for Verilog Code Generation,” in *2023 IEEE/ACM ICCAD*. IEEE, 2023.
- [30] VerilogEval. Available: <https://github.com/NVlabs/verilog-eval>
- [31] S. Liu *et al.*, “RTLCoder: Outperforming GPT-3.5 in Design RTL Generation with Our Open-Source Dataset and Lightweight Solution,” *arXiv:2312.08617*.
- [32] RTLCoder. Available: <https://github.com/hkust-zhiyao/RTLCoder>
- [33] J. Blocklove *et al.*, “Chip-Chat: Challenges and Opportunities in Conversational Hardware Design,” in *2023 ACM/IEEE 5th MLCAD Workshop*.
- [34] Chip-Chat. Available: <https://zenodo.org/records/7953725>
- [35] S. Thakur *et al.*, “AutoChip: Automating HDL Generation Using LLM Feedback,” *arXiv:2311.04887*.
- [36] AutoChip. Available: <https://github.com/shailja-thakur/AutoChip>
- [37] Y. Lu *et al.*, “RTLLM: An Open-Source Benchmark for Design RTL Generation with Large Language Model,” in *2024 29th ASP-DAC*.
- [38] RTLLM. Available: <https://github.com/hkust-zhiyao/RTLLM>
- [39] K. Chang *et al.*, “ChipGPT: How far are we from natural language hardware design,” *arXiv:2305.14019*.
- [40] Z. Pei *et al.*, “BetterV: Controlled Verilog Generation with Discriminative Guidance,” *arXiv:2402.03375*.
- [41] M. DeLorenzo *et al.*, “Make Every Move Count: LLM-based High-Quality RTL Code Generation Using MCTS,” *arXiv:2402.03289*.
- [42] Y. Fu *et al.*, “GPT4AIGChip: Towards Next-Generation AI Accelerator Design Automation via Large Language Models,” in *2023 ICCAD*.
- [43] L. J. Wan *et al.*, “Software/hardware co-design for LLM and its application for design verification,” in *2024 29th ASP-DAC*. IEEE, pp. 435–441.
- [44] Chrysalis. Available: <https://github.com/UIUC-ChenLab/Chrysalis-HLS>
- [45] X. Yao *et al.*, “HDLdebugger: Streamlining HDL debugging with Large Language Models,” *arXiv:2403.11671*.
- [46] W. Fu *et al.*, “LLM4SecHW: Leveraging domain-specific large language model for hardware debugging,” in *2023 AsianHOST*. IEEE, 2023.
- [47] LLM4SecHW. Available: <https://huggingface.co/datasets/KSU-HW-SEC/LLM4SecHW-OSHD>
- [48] H. Pearce *et al.*, “Examining Zero-Shot Vulnerability Repair with Large Language Models,” in *2023 IEEE S&P*. IEEE, 2023, pp. 2339–2356.
- [49] ———. Available: <https://drive.google.com/drive/folders/1xJz2Wvvg7JSaxfTQdxayXFEmoF3y0ET>
- [50] M. Nair *et al.*, “Generating Secure Hardware using ChatGPT Resistant to CWEs,” *Cryptology ePrint Archive*, 2023.
- [51] B. Ahmad *et al.*, “On Hardware Security Bug Code Fixes By Prompting Large Language Models,” *IEEE TIFS*, 2024.
- [52] ———. Available: <https://zenodo.org/records/10416865>
- [53] X. Meng *et al.*, “Unlocking Hardware Security Assurance: The Potential of LLMs,” *arXiv:2308.11042*.
- [54] S. Paria *et al.*, “DIVAS: An LLM-based End-to-End Framework for SoC Security Analysis and Policy-based Protection,” *arXiv:2308.06932*.
- [55] M. Orenes-Vera *et al.*, “From RTL to SVA: LLM-assisted generation of Formal Verification Testbenches,” *arXiv:2309.09437*.
- [56] AutoSVA2. Available: <https://github.com/gpt4rtl/AutoSVA2>
- [57] R. Kande *et al.*, “(Security) Assertions by Large Language Models,” *IEEE TIFS*, 2024.
- [58] B. Mali *et al.*, “ChIRAG: ChatGPT Informed Rapid and Automated Assertion Generation,” *arXiv:2402.00093*.
- [59] W. Fang *et al.*, “AssertLLM: Generating and Evaluating Hardware Verification Assertions from Design Specifications via Multi-LLMs,” *arXiv:2402.00386*.
- [60] AssertLLM. Available: <https://github.com/hkust-zhiyao/AssertLLM>
- [61] G. Kokolakis *et al.*, “Harnessing the power of general-purpose LLMs in hardware trojan design,” in *Proc. 5th AIHWS | ACNS Workshop*, 2024.
- [62] M. Nair *et al.*, “Netlist Whisperer: AI and NLP Fight Circuit Leakage!” in *Proceedings of the 2023 Workshop on ASHES*, 2023, pp. 83–92.
- [63] A. Srivastava *et al.*, “SCAR: Power Side-Channel Analysis at RTL Level,” *IEEE TVLSI*, 2024.
- [64] E. Nijkamp *et al.*, “CodeGen: An Open Large Language Model for Code with Multi-Turn Program Synthesis,” *arXiv:2203.13474*.
- [65] The MITRE Corporation, “CWE-1194: CWE VIEW: Hardware Design,” <https://cwe.mitre.org/data/definitions/1194.html>, 2021.
- [66] J. Devlin *et al.*, “BERT: Pre-training of deep bidirectional transformers for language understanding,” *arXiv:1810.04805*.
- [67] X. Pan *et al.*, “Privacy risks of general-purpose language models,” in *2020 IEEE S&P*. IEEE, 2020, pp. 1314–1331.
- [68] S. Salamatian *et al.*, “Managing your private and public data: Bringing down inference attacks against your privacy,” *IEEE J-STSP*, vol. 9, 2015.