AI code assistants need security training

Illustration: Si Weon Kim

Multiple studies have found that generative neural networks that produce code also reproduce the security vulnerabilities present in their training datasets.

Generative neural networks, such as ChatGPT, promise to change many facets of how humans interact with machines, with one of the most significant being how developers create applications.

Using natural language, pseudo-code comments and function definitions, developers can quickly use the technology to auto-complete the body of a function or other code snippet. GitHub’s Copilot, for example, can easily create the login code for a PHP application just by using a comment — such as “// login page” — or create a simple Python Flask application with comments — such as “# show all posts” or “# say hello to user,” Kadir Arslan, a security engineer with Invicti Security, said in an analysis of Copilot-generated code.

Yet, because the artificial intelligence (AI) assistant is trained using billions of lines of code stored on GitHub by developers, its suggestions come with common mistakes as well. Arslan found five vulnerabilities in the simple PHP app and seven security issues in the Python app, including SQL injection flaws and cross-site scripting (XSS) vulnerabilities.
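Neither flaw class requires exotic code to demonstrate. The sketch below is hypothetical — plain Python with an in-memory database rather than a full Flask app, and function and table names that are illustrative, not taken from Arslan's analysis — but it shows both patterns he describes: string-built SQL that permits injection, and unescaped output that permits XSS, next to the standard fixes.

```python
import sqlite3
from html import escape

def get_posts(author: str, safe: bool = True):
    # In-memory database standing in for the app's real datastore.
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE posts (author TEXT, title TEXT)")
    conn.execute("INSERT INTO posts VALUES ('alice', 'First post')")
    if safe:
        # Parameterized query: input is bound as data, never parsed as SQL.
        rows = conn.execute(
            "SELECT title FROM posts WHERE author = ?", (author,)
        ).fetchall()
    else:
        # The injectable pattern assistants often reproduce:
        # user input interpolated directly into the SQL string.
        rows = conn.execute(
            f"SELECT title FROM posts WHERE author = '{author}'"
        ).fetchall()
    return rows

def greet(name: str, safe: bool = True) -> str:
    # Unescaped interpolation reflects attacker-controlled markup
    # into the page (XSS); html.escape() neutralizes it.
    return f"<p>Hello, {escape(name) if safe else name}!</p>"
```

The payload `' OR '1'='1` returns every row from the unsafe query and nothing from the parameterized one; likewise, a `<script>` tag survives the unsafe greeting but not the escaped one.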

“GitHub Copilot is a very clever and convenient tool for reducing developer workload — it can provide you with boilerplate code for typical tasks in seconds,” Arslan said in the analysis. “In terms of security, however, you have to be very careful and treat Copilot suggestions only as a starting point.”

Machine-learning models learn from their training datasets, which also means they have all the mistakes and biases hidden in those examples. Copilot, for example, is a variant of GPT-3 based on the OpenAI Codex engine, a large language model (LLM) trained on a massive dataset of natural language and billions of lines of code. Unfortunately, that code is generated by humans — many of whom have historically not been trained to focus on software security — in development teams which are incentivized to create new features more efficiently and not necessarily create secure code.

While Copilot has garnered a great deal of attention, GitHub is far from the only one to offer machine-learning-based technology for developers: Amazon released CodeWhisperer in June 2022, startup Tabnine released updated AI models the same month, and Google has developed its Pathways Language Model (PaLM) capable of code completion as well as AlphaCode, which generates code to solve programming problems.

Making developers more efficient

Development involves numerous repetitive tasks, each of which requires specific knowledge of a programming language, its syntax and, in many cases, knowledge of a particular framework, such as the Python-based Flask, JavaScript-based React and Java-based Spring frameworks. These issues can both slow developers down and make it harder for them to be creative, Ryan J. Salva, vice president of product management at GitHub, told README.


“We do not expect GitHub Copilot to replace developers,” Salva said. “Rather, we expect GitHub Copilot to partner with developers, augment their capabilities and enable them to be more productive, reduce manual tasks and help them focus on interesting work.”

Developers do appear to gain productivity when using AI assistants, with assisted developers submitting more lines of code, according to research conducted by a team from New York University. Rather than the common exercise of searching online for answers and then incorporating that syntax into their code, developers can easily use an AI assistant to generate significant blocks of code.

More specifically, developers who accept more of the suggestions made by an AI code generator tend to be more productive, even though the average developer accepts about one out of every five suggestions. The median developer using GitHub Copilot, for example, uses 23% of its code suggestions during non-working hours and 21% during working hours, according to an analysis of 2,000 anonymized GitHub Copilot users.

Historical mistakes become future flaws

The problem with those suggestions is that developers who blindly accept code from AI assistants will often increase the number of security vulnerabilities in their code. Researchers from Stanford University found that developers using AI-generated code tended to write significantly less secure code, yet believe that code was secure, compared with developers who did not use an AI assistant.

“[G]iving an AI assistant too much agency (e.g. automating parameter selection) may encourage users to be less diligent in guarding against security vulnerabilities,” the researchers said in their paper. “Furthermore, AI assistants have the potential to decrease user pro-activeness to carefully search for API and safe implement details in library documentation directly, which can be concerning given that several of the security vulnerabilities we saw involved improper library selection or usage.”
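One concrete instance of the "improper library selection or usage" the researchers mention is password hashing. The sketch below is a hypothetical illustration using only the standard library — nothing here is drawn from the study itself, and the iteration count is a commonly recommended floor rather than a universal rule.

```python
import hashlib
import hmac
import os

def hash_password_weak(password: str) -> str:
    # A pattern an assistant can pick up from old training examples:
    # fast, unsalted MD5, trivially cracked with precomputed tables.
    return hashlib.md5(password.encode()).hexdigest()

def hash_password(password: str) -> tuple[bytes, bytes]:
    # Safer stdlib choice: a salted, deliberately slow key-derivation
    # function (PBKDF2-HMAC-SHA256 with a high iteration count).
    salt = os.urandom(16)
    digest = hashlib.pbkdf2_hmac("sha256", password.encode(), salt, 600_000)
    return salt, digest

def verify_password(password: str, salt: bytes, digest: bytes) -> bool:
    # Recompute and compare in constant time to avoid timing leaks.
    candidate = hashlib.pbkdf2_hmac("sha256", password.encode(), salt, 600_000)
    return hmac.compare_digest(candidate, digest)
```

Both versions "work" in testing, which is exactly why an inattentive developer accepts the weak one: the difference only shows up when the password database leaks.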

The over-reliance on AI systems, especially by non-experts, is a general problem for all systems based on machine learning, Berryville Institute of Machine Learning founder and CEO Gary McGraw told README. When this problem affects software development, then vulnerabilities will go unchecked.

“Any naive application of machine learning is going to help attackers, because people are going to naively assume that their stuff doesn’t carry risk,” he says. “It is the exact same thing as people using software without thinking about the security of that software.”

Not a replacement for secure design

Developers who use AI assistance but still pay attention to security can benefit from the productivity gains without suffering more bugs, the research team from New York University stated in a paper, which found that AI-assisted developers produced critical security bugs (the highest severity) only slightly more often than developers without AI assistance.

“This suggests that security concerns with LLM assistants might not be as severe as initially suggested, although studies with larger sample sizes and diverse user groups are warranted,” the team stated.

Moreover, these problems will likely yield to further research effort and time, as the level of investment and innovation continues to grow. Google, for example, has improved its generalized coding engine, AlphaCode, to construct computer programs that solve problems expressed in natural language. AlphaCode currently performs about as well as the average programmer: in a coding competition, it ranked in the top 54% of participants. While that might not sound impressive, Go-playing computers required significant handicaps to beat teenage players in 2000, and 17 years later, Google's AlphaGo went on to beat the top professional players.

GitHub’s Salva argues that it will only take time to further improve code-generating AI to get to the point where assisted developers will produce much more secure code much faster than non-assisted developers.

“AI assistants like GitHub Copilot will, in a few short years, produce code that is more secure, performant, [and] accessible,” he says. “In fact, we’re already running experiments in production where we apply security analysis on code to discourage developers from implementing common vulnerabilities.”

To get there, however, will require not just advances in the design of the machine-learning models that power AI systems, but also better efforts to exclude insecure coding examples from the datasets used to train AI coding assistants. And low-probability edge cases will continue to cause problems, especially if examples of those situations never make it into the training data, says BIML's McGraw.

He points to self-driving cars as an example: The AI may not know that a ball bouncing out into the street is often followed by a child chasing the ball and so will not slow down the vehicle. Similarly, attackers can find edge cases in software — such as an input that an application designer never expected — and exploit them. Because AI assistants have evolved to the point where some people treat them as human, those developers may expect them to react in human ways — a tendency to anthropomorphize machines known as the Eliza effect.
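A classic software analogue of the bouncing ball is an input the designer never imagined, such as a negative number where only positive ones were expected. This hypothetical sketch (not one of McGraw's examples) shows how a missing bounds check turns a withdrawal into a deposit:

```python
def withdraw_naive(balance: int, amount: int) -> int:
    # Only guards against the expected failure mode: overdraft.
    if amount > balance:
        raise ValueError("insufficient funds")
    return balance - amount

def withdraw_checked(balance: int, amount: int) -> int:
    # Also rejects the edge case: a negative "withdrawal" that
    # would silently credit the account.
    if amount <= 0:
        raise ValueError("amount must be positive")
    if amount > balance:
        raise ValueError("insufficient funds")
    return balance - amount
```

Calling `withdraw_naive(100, -50)` happily returns a balance of 150; the checked version raises an error instead. The naive version passes every test a developer is likely to think of, which is precisely what makes such edge cases attractive to attackers.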

“It is those sorts of things where the Eliza effect of — it mostly seems to work, and wow it’s so cool, and so it must be doing it the way we are doing it — is absolutely misleading,” he says. “Many of the things that we talk about as risks — and even as attacks — are pretty rare, but that doesn’t mean that they are not going to happen.”