Our Study

The categorization we have presented here came out of our experiences analyzing code clones that we found in several open source software systems, including the Linux kernel, the Apache httpd web server, the PostgreSQL relational database system, the Gnumeric spreadsheet application, the Columba email client, nine text editors (including vim and emacs), and eight X11 window managers. Since we had spotted each of these patterns multiple times across several systems, we were pretty confident that the patterns were “real” and not peculiar to a particular system or application domain. And although we were also pretty sure that cloning was often used as a principled practice, we lacked any concrete quantitative evidence. So we set out to examine two large open source systems from different domains and to measure just how commonly cloning is used as a principled design tool, at least in those systems.

To detect and categorize the clones in these large systems, we used our own clone detection tool, called CLICS (CLoning Interpretation and Categorization System). CLICS tokenizes the source code input and then employs suffix arrays to perform parameterized string matching; this is similar to the approach of other tools such as CCFinder. To detect clones in which variable names may have been changed, this technique maps all identifiers to a single proxy token; that is, all identifiers match each other, and the remaining tokens in the input stream (keywords, operators, and separators) play ...
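The matching idea described above can be illustrated with a small sketch. This is not the CLICS implementation: the keyword set, the regular-expression tokenizer, the `ID` proxy name, and the function names are all assumptions made for illustration, the suffix array is built naively by sorting, and literals are left unchanged because the text does not say how they are handled.

```python
import re

# Hypothetical, deliberately tiny keyword set; a real tool would use a full lexer.
KEYWORDS = {"if", "else", "for", "while", "return", "int", "char", "void"}

IDENT_RE = re.compile(r"[A-Za-z_]\w*")
# Identifiers/keywords, integer literals, or any single non-space character.
TOKEN_RE = re.compile(r"[A-Za-z_]\w*|\d+|\S")


def normalize(source):
    """Tokenize source and replace every identifier with the proxy token 'ID'."""
    tokens = []
    for tok in TOKEN_RE.findall(source):
        if IDENT_RE.fullmatch(tok) and tok not in KEYWORDS:
            tokens.append("ID")  # all identifiers match each other
        else:
            tokens.append(tok)   # keywords, operators, separators, literals
    return tokens


def suffix_array(tokens):
    """Naive suffix array: start positions sorted by the token suffix they begin."""
    return sorted(range(len(tokens)), key=lambda i: tokens[i:])


def common_prefix_len(tokens, i, j):
    """Number of leading tokens shared by the suffixes starting at i and j."""
    n = 0
    while i + n < len(tokens) and j + n < len(tokens) and tokens[i + n] == tokens[j + n]:
        n += 1
    return n


def find_clones(tokens, min_len):
    """Return (pos_a, pos_b, length) for neighbouring suffixes sharing >= min_len tokens."""
    sa = suffix_array(tokens)
    clones = []
    for a, b in zip(sa, sa[1:]):
        n = common_prefix_len(tokens, a, b)
        n = min(n, abs(a - b))  # keep the two occurrences from overlapping
        if n >= min_len:
            clones.append((min(a, b), max(a, b), n))
    return clones


if __name__ == "__main__":
    toks = normalize("int a = b + 1; int c = d + 1;")
    print(find_clones(toks, 7))
```

In this example the two statements differ only in their variable names, so after normalization they form the same seven-token sequence and are reported as one clone pair. Sorting whole suffixes is quadratic or worse in practice; it is used here only to keep the sketch short.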
