Radar / AI & ML

Quality Assurance, Errors, and AI

Some Thoughts about Software Quality

By Mike Loukides

April 9, 2024

Rhombus and hex weave. (source: Joel Cooper on Flickr)

A recent article in Fast Company makes the claim “Thanks to AI, the Coder is no longer King. All Hail the QA Engineer.” It’s worth reading, and its argument is probably correct. Generative AI will be used to create more and more software; AI makes mistakes and it’s difficult to foresee a future in which it doesn’t; therefore, if we want software that works, Quality Assurance teams will rise in importance. “Hail the QA Engineer” may be clickbait, but it isn’t controversial to say that testing and debugging will rise in importance. Even if generative AI becomes much more reliable, the problem of finding the “last bug” will never go away.

However, the rise of QA raises a number of questions. First, one of the cornerstones of QA is testing. Generative AI can generate tests, of course—at least it can generate unit tests, which are fairly simple. Integration tests (tests of multiple modules) and acceptance tests (tests of entire systems) are more difficult. Even with unit tests, though, we run into the basic problem of AI: it can generate a test suite, but that test suite can have its own errors. What does “testing” mean when the test suite itself may have bugs? Testing is difficult because good testing goes beyond simply verifying specific behaviors.

Learn faster. Dig deeper. See farther.

Join the O'Reilly online learning platform. Get a free trial today and find answers on the fly, or master something new and useful.

Learn more

The problem grows with the complexity of the test. Finding bugs that arise when integrating multiple modules is more difficult and becomes even more difficult when you’re testing the entire application. The AI might need to use Selenium or some other test framework to simulate clicking on the user interface. It would need to anticipate how users might become confused, as well as how users might abuse (unintentionally or intentionally) the application.

Another difficulty with testing is that bugs aren’t just minor slips and oversights. The most important bugs result from misunderstandings: misunderstanding a specification or correctly implementing a specification that doesn’t reflect what the customer needs. Can an AI generate tests for these situations? An AI might be able to read and interpret a specification (particularly if the specification was written in a machine-readable format—though that would be another form of programming). But it isn’t clear how an AI could ever evaluate the relationship between a specification and the original intention: what does the customer really want? What is the software really supposed to do?

Security is yet another issue: is an AI system able to red-team an application? I’ll grant that AI should be able to do an excellent job of fuzzing, and we’ve seen game playing AI discover “cheats.” Still, the more complex the test, the more difficult it is to know whether you’re debugging the test or the software under test. We quickly run into an extension of Kernighan’s Law: debugging is twice as hard as writing code. So if you write code that’s at the limits of your understanding, you’re not smart enough to debug it. What does this mean for code that you haven’t written? Humans have to test and debug code that they didn’t write all the time; that’s called “maintaining legacy code.” But that doesn’t make it easy or (for that matter) enjoyable.

Programming culture is another problem. At the first two companies I worked at, QA and testing were definitely not high-prestige jobs. Being assigned to QA was, if anything, a demotion, usually reserved for a good programmer who couldn’t work well with the rest of the team. Has the culture changed since then? Cultures change very slowly; I doubt it. Unit testing has become a widespread practice. However, it’s easy to write a test suite that give good coverage on paper, but that actually tests very little. As software developers realize the value of unit testing, they begin to write better, more comprehensive test suites. But what about AI? Will AI yield to the “temptation” to write low-value tests?

Perhaps the biggest problem, though, is that prioritizing QA doesn’t solve the problem that has plagued computing from the beginning: programmers who never understand the problem they’re being asked to solve well enough. Answering a Quora question that has nothing to do with AI, Alan Mellor wrote:

We all start programming thinking about mastering a language, maybe using a design pattern only clever people know.
Then our first real work shows us a whole new vista.
The language is the easy bit. The problem domain is hard.
I’ve programmed industrial controllers. I can now talk about factories, and PID control, and PLCs and acceleration of fragile goods.
I worked in PC games. I can talk about rigid body dynamics, matrix normalization, quaternions. A bit.
I worked in marketing automation. I can talk about sales funnels, double opt in, transactional emails, drip feeds.
I worked in mobile games. I can talk about level design. Of one way systems to force player flow. Of stepped reward systems.
Do you see that we have to learn about the business we code for?
Code is literally nothing. Language nothing. Tech stack nothing. Nobody gives a monkeys [sic], we can all do that.
To write a real app, you have to understand why it will succeed. What problem it solves. How it relates to the real world. Understand the domain, in other words.

Exactly. This is an excellent description of what programming is really about. Elsewhere, I’ve written that AI might make a programmer 50% more productive, though this figure is probably optimistic. But programmers only spend about 20% of their time coding. Getting 50% of 20% of your time back is important, but it’s not revolutionary. To make it revolutionary, we will have to do something better than spending more time writing test suites. That’s where Mellor’s insight into the nature of software so crucial. Cranking out lines of code isn’t what makes software good; that’s the easy part. Nor is cranking out test suites, and if generative AI can help write tests without compromising the quality of the testing, that would be a huge step forward. (I’m skeptical, at least for the present.) The important part of software development is understanding the problem you’re trying to solve. Grinding out test suites in a QA group doesn’t help much if the software you’re testing doesn’t solve the right problem.

Software developers will need to devote more time to testing and QA. That’s a given. But if all we get out of AI is the ability to do what we can already do, we’re playing a losing game. The only way to win is to do a better job of understanding the problems we need to solve.

Post topics: AI & ML, Artificial Intelligence

Post tags: Signals

Quality Assurance, Errors, and AI

Learn faster. Dig deeper. See farther.

Get the O’Reilly Radar Trends to Watch newsletter

Thank you for subscribing.