GitHub’s Commercial AI Tool Was Built From Open Source Code
Earlier this month, Armin Ronacher, a prominent open-source developer, was experimenting with a new code-generating tool from GitHub called Copilot when it began to produce a curiously familiar stretch of code. The lines, drawn from the source code of the 1999 video game Quake III, are infamous among programmers—a combo of little tricks that add up to some pretty basic math, imprecisely. The original Quake coders knew they were hacking. “What the fuck,” one commented in the code beside an especially egregious shortcut.
So it was strange for Ronacher to see such code generated by Copilot, an artificial intelligence tool that is marketed to generate code that is both novel and efficient. The AI was plagiarizing—copying the hack (including the profane comment) verbatim. Worse yet, the code it had chosen to copy was under copyright protection. Ronacher posted a screenshot to Twitter, where it was entered as evidence in a roiling trial-by-social-media over whether Copilot is exploiting programmers’ labor.
Copilot, which GitHub calls “your AI pair programmer,” is the result of a collaboration with OpenAI, the formerly nonprofit research lab known for powerful language-generating AI models such as GPT-3. At its heart is a neural network that is trained using massive volumes of data. Instead of text, though, Copilot’s source material is code: millions of lines uploaded by the 65 million users of GitHub, the world’s largest platform for developers to collaborate and share their work. The aim is for Copilot to learn enough about the patterns in that code that it can do some hacking itself. It can take the incomplete code of a human partner and finish the job. For the most part, it appears successful at doing so. GitHub, which was purchased by Microsoft in 2018, plans to sell access to the tool to developers.