Chapter 1. AI for Code: Ready for the Masses?

Just as integrated development environments (IDEs) revolutionized software development in the 1990s, AI-driven code generation and refactoring tools promise a step-function improvement in the way developers write, maintain, and optimize code today. Practically every software engineering organization is now evaluating and implementing AI solutions, with goals such as helping teams deliver more features faster, filling skill gaps, gaining efficiencies, improving code quality, reducing technical debt, and saving costs.

The shining star of AI for code thus far has been generative AI (GenAI) assistants or chatbots. These assistants, readily accessible on the market, are making it easier than ever for developers to write code. We’re already seeing the huge impact of tools such as GitHub Copilot, with studies reporting 55% faster developer task completion as well as code quality improvements. What an exciting time for developers!

Meanwhile, at Moderne, we have been focused on helping enterprise companies manage their large and growing codebases—including their proprietary source code along with thousands of open source software components. Our platform enables accurate, automated code refactoring and analysis across these massive codebases using deterministic recipes and a novel dataset for code. We have also been excited by the surge of AI models and techniques available to us now and have readily jumped in to discover what value they can bring to our platform and our customers working at scale.

During our AI explorations, we have faced some distinct challenges using GenAI alone to maintain and secure existing codebases. Every AI edit is based on immediate context and is merely a suggestion, requiring a “human in the loop” to review it. This approach obviously doesn’t scale when you are refactoring and analyzing across multiple code repositories at once. To refactor at scale, you need accuracy and trust in your system. We see a number of ways to leverage the value of AI while improving accuracy for mass-scale code change.

Tip

AI is an umbrella term for algorithms that mimic intelligence. Some earlier AI systems, such as A.L.I.C.E., could mimic a conversation to some extent by applying heuristic pattern-matching rules to a human’s input. Machine learning (ML) is a subtype of AI that identifies patterns in past data to make predictions or decisions, and deep learning is in turn a subtype of ML. Deep learning is what we mostly mean when we talk about AI and large language models (LLMs) today.

We’ll start this report by sharing the impact of AI assistants on software development. Then, we’ll dive into the heart of the matter: maintaining and securing massive codebases—and how AI can be optimized for mass-scale code refactoring and analysis, based on Moderne’s own experience.

The Rise of AI Assistants in Software Development

The last wave of machine learning gave you infinite interns who could read anything for you, but you had to check, and now we have infinite interns that can write anything for you, but you have to check.

Benedict Evans1

GenAI is a probabilistic system incentivized to do what its name denotes: generate. By leveraging ML models trained on vast datasets, these coding assistants can predict and suggest text and code snippets. They can also supply additional data or context to the model at query or generation time as developers code.

AI assistants—those infinite interns—are changing the way developers work. Instead of writing everything by hand, assisted by a rules-based IDE that is 100% correct, developers now interact with their AI assistants, evaluating suggestions that they accept or reject. The rules-based system may still help write the majority of code, but AI suggestions offer another level of developer productivity improvement.

At Moderne, when our developers were just starting to use Copilot, they were often amazed that it seemed to know about novel code they were working on. Surely it was not trained on this yet! What developers quickly learn is that Copilot doesn’t just use the data it was trained on to provide suggestions; it also leverages content from the developer’s IDE, such as the current file, other open files, and recently viewed files.

As developers learn to harness the power of Copilot, they learn to guide it by staging additional context: opening similar files that contain the type of code they are trying to write. This signals to Copilot what to include in the prompt, making the large language model (LLM) output more relevant. It works really well for code generation.

The Challenges of Using AI to Refactor and Analyze Code at Scale

We may all be exiting the “hype” stage of technology adoption when it comes to GenAI coding assistants and getting down to the practical realities of using these tools. As we have come to experience, AI suggestions can seem prescient, intuitive, and extremely helpful, but they can also be wrong, unnecessary, buggy, and prone to hallucinations.

The downside of AI inaccuracies can be seen in rising code churn rates. Churn occurs when incomplete or error-prone code is committed or pushed to the source code repository and then quickly reverted, removed, or updated. This is like two steps forward and one step back for development teams.

The proclivity of GenAI to “generate” is also leading to less reuse and refactoring of code, which can exacerbate the problems of code maintenance. One study shows a 17% year-over-year decline in moved code since the adoption of GitHub Copilot. Experienced developers readily see the value in reusing code because it’s already battle tested in production and has likely been touched by other developers. But GenAI models such as Copilot like to re-create rather than reuse.

At Moderne, our developers use Copilot regularly and have experienced its shortcomings. One developer using Copilot implemented a feature that required supplying configuration properties for deployment. One of those properties was JAVA_HOME. Copilot, which uses OpenAI LLMs trained on vast amounts of data, including GitHub open source software (OSS) and Stack Overflow community content, suggested JAVE_HOME instead of JAVA_HOME for the configuration: a typo inherited from its training data. We were able to track it down to the exact Stack Overflow post after the fact.
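To make the failure mode concrete, here is a hypothetical reconstruction of what such a suggestion might look like; the property names and values are illustrative, not the actual configuration:

    # Deployment properties as suggested by the assistant (illustrative only)
    APP_NAME=feature-service
    HEAP_SIZE=2g
    JAVE_HOME=/usr/lib/jvm/java-17    # misspelled; should be JAVA_HOME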

Our developer accepted the suggestion, not noticing this one error among many correct configuration properties suggested by Copilot. Deploying this feature in production, however, caused an outage, as the Java version was not found. After some panic and intense scrutiny by the team, we discovered the misspelling of JAVA_HOME in the merged pull request (PR). This illustrates that even the most senior developers can be susceptible to AI errors and hallucinations—things that in quick review may look correct but aren’t.

As engineers working with AI, we often find ourselves fine-tuning our queries to achieve the precise results we need. For instance, adding explicit instructions such as “do not alter any other cases” can improve the AI’s performance. Despite this, even a small margin of error—say, one incorrect output out of ten—can be problematic when scaling up. Imagine a scenario where your codebase contains 1,000 switch cases. A 1% error rate means 10 of them might have silently altered logic, posing significant risks. In addition, asking the model multiple times does not improve this error rate, as is sometimes claimed. That claim mistakes the relationship between reliability and validity: a model may reliably answer inaccurately (or unreliably answer accurately).
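A quick back-of-the-envelope calculation shows how fast that risk compounds, assuming for simplicity that each edit errs independently:

    P(no errors in 1,000 edits) = 0.99^1000 ≈ 4.3 × 10^-5
    P(at least one bad edit) = 1 − 0.99^1000 ≈ 99.996%

At codebase scale, in other words, some erroneous edits are a near certainty, and each one still needs a human to catch it.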

Knowing What We Now Know, How Can We Use AI for Mass-Scale Code Refactoring?

Mass-scale code refactoring can involve effecting changes across many individual cursor positions in thousands of repositories, representing millions to billions of lines of code. It’s a multipoint operation that requires coordination, accuracy, consistency, and broad code visibility. With current AI assistants, you are working file by file, augmenting them with local, limited context. You can end up with incomplete, error-ridden, and very developer-time-intensive code maintenance.

Incorporating techniques such as retrieval-augmented generation (RAG), which we’ll discuss in more depth in Chapter 2, “Techniques for Scaling AI for Large Codebases,” ensures models remain efficient and effective without the overhead of larger systems. Additionally, employing a strategy that uses multiple models, from least to most computationally demanding, along with OpenRewrite recipes or different libraries for various rules-based subtasks, can further optimize performance and capabilities.
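As a rough illustration of that least-to-most strategy, here is a minimal Java sketch. The Stage interface and the stages themselves are assumptions made for illustration, not a Moderne or OpenRewrite API: cheap, deterministic fixes run first, and model-backed stages are consulted only when needed.

    import java.util.List;
    import java.util.Optional;
    import java.util.function.Function;

    public class CascadeSketch {
        // Each stage either produces a validated fix or declines (empty).
        interface Stage extends Function<String, Optional<String>> {}

        // Returns the result of the cheapest stage that succeeds.
        static Optional<String> firstSuccessful(List<Stage> stages, String code) {
            for (Stage stage : stages) {
                Optional<String> result = stage.apply(code);
                if (result.isPresent()) {
                    return result;
                }
            }
            return Optional.empty();
        }

        public static void main(String[] args) {
            // Cheapest first: a deterministic, rules-based fix.
            Stage deterministicRecipe = code ->
                code.contains("JAVE_HOME")
                    ? Optional.of(code.replace("JAVE_HOME", "JAVA_HOME"))
                    : Optional.empty();
            // Placeholders: real stages would call small and large LLMs.
            Stage smallModel = code -> Optional.empty();
            Stage largeModel = code -> Optional.empty();

            firstSuccessful(List.of(deterministicRecipe, smallModel, largeModel),
                    "export JAVE_HOME=/usr/lib/jvm/java-17")
                .ifPresent(System.out::println);
        }
    }

The design choice is simple economics: every change a validated rules-based stage handles is a change that never incurs LLM latency, cost, or review risk.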

Ultimately, for mass-scale changes, you may find that it is more time- and cost-efficient to take a rules-based recipe that’s already been validated to fix the issue, then run that recipe automatically across the repositories to fix the code. You can leverage GenAI to help develop the recipe (because recipes are tested before they are put to use). You can also use AI to help you identify applicable recipes to fix the problem, or even to search the codebase for more detailed analysis that yields recipe recommendations.
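For instance, a validated rules-based fix for the JAVE_HOME typo described earlier could be written as an OpenRewrite recipe. The following is a minimal sketch against the open source OpenRewrite API (exact class and method names may vary by version):

    import org.openrewrite.ExecutionContext;
    import org.openrewrite.Recipe;
    import org.openrewrite.TreeVisitor;
    import org.openrewrite.text.PlainText;
    import org.openrewrite.text.PlainTextVisitor;

    public class FixJavaHomeTypo extends Recipe {
        @Override
        public String getDisplayName() {
            return "Fix JAVE_HOME typo";
        }

        @Override
        public String getDescription() {
            return "Replaces the misspelled JAVE_HOME key with JAVA_HOME in plain-text configuration files.";
        }

        @Override
        public TreeVisitor<?, ExecutionContext> getVisitor() {
            return new PlainTextVisitor<ExecutionContext>() {
                @Override
                public PlainText visitText(PlainText text, ExecutionContext ctx) {
                    // Deterministic, reviewable change: no generation involved.
                    if (text.getText().contains("JAVE_HOME")) {
                        return text.withText(text.getText().replace("JAVE_HOME", "JAVA_HOME"));
                    }
                    return text;
                }
            };
        }
    }

Once a recipe like this passes its tests, it can be run unchanged across thousands of repositories and produce the same deterministic result in each one.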

In this report, we’ll cover what mass-scale automated code refactoring with AI looks like in practice. You’ll get details of:

  • AI techniques and LLMs to employ

  • Use cases for AI when working with large codebases

  • Considerations for large enterprises adopting AI for mass-scale refactoring

1 Benedict Evans, “AI and the Automation of Work,” ben-evans.com (website), July 2, 2023.
