Software could expose the identity of hackers

Researchers working with ARL use "stylometry" to sniff out the stylistic fingerprints of code writers.

January 15, 2016

Scholars have for years used the science of stylometry, the study of a writer’s individual stylistic traits, to try to identify the authors of disputed documents: which parts of the Federalist Papers were written by James Madison, Alexander Hamilton or John Jay, whether Homer wrote all of the Iliad and the Odyssey, or whether other authors contributed to Shakespeare’s sonnets and plays.

Now an academic team working with the Army Research Laboratory has applied the same type of technique to identifying authors of malicious computer code, as a way to potentially track down hackers.

The team, including researchers from Princeton University, Drexel University and the University of Göttingen in Germany, used machine learning algorithms to parse computer code and identify its author. The result, presented by Princeton post-doctoral candidate Aylin Caliskan-Islam at the 32nd Chaos Communication Congress, was 94 percent accuracy when examining samples from 1,600 programmers, according to an ARL release. And when researchers could narrow down the field to the five most likely suspects, they were on the mark practically every time.
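To give a rough sense of what ranking "the five most likely suspects" could look like, here is a minimal Python sketch. It is not the team's method: the ARL release as described here does not spell out the features or the model used, so the four layout features, the distance measure and the function names `style_features` and `rank_authors` are all assumptions made for illustration.

```python
import math

def style_features(code):
    """Crude layout features; a hypothetical stand-in for a real feature set."""
    lines = [ln for ln in code.splitlines() if ln.strip()]
    n = len(lines) or 1
    return [
        sum(ln.startswith("\t") for ln in lines) / n,  # share of tab-indented lines
        sum(ln.strip() == "{" for ln in lines) / n,    # share of lines holding only "{"
        sum(len(ln) for ln in lines) / n / 80,         # mean line length, scaled by 80
        code.count("_") / (len(code) or 1) * 10,       # underscore density, scaled by 10
    ]

def rank_authors(sample, known, k=5):
    """Return up to k candidate authors, closest coding style first."""
    target = style_features(sample)

    def distance(author):
        # Euclidean distance between the sample's features and the author's
        return math.dist(target, style_features(known[author]))

    return sorted(known, key=distance)[:k]
```

Given a dictionary that maps each known author to a sample of that author's code, `rank_authors(unknown_code, known)` returns a shortlist of names. A real system would need far richer features and a trained classifier to approach the accuracy reported above.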

This was a lab experiment with samples of code from known authors, and the researchers say they need to expand their work to a more real-world environment, dealing with ways that malware writers can try to mask their code. (Though researchers also have attributed authorship on “real-world” code in single-author GitHub repositories, ARL said.) Eventually, an automated tool kit that can help identify malware authors could go a long way toward solving a big problem with responding to a cyberattack—attributing where it came from. Because online attacks can be routed around the world, an attack that appears to come from, say, China, might not be from China.

Security experts have used a stylometry-like approach to identify the source of cyberattacks, for example citing similarities in code from previous attacks to attribute the Sony hack to North Korea. But it’s a time-consuming process. Having a software tool that does a lot of the heavy lifting could speed things up.

"Attribution is a real challenge, as it is done manually by experts who have to reconcile forensics following an attack," said Richard Harang, an ARL network security researcher and technical lead for the research. "Currently, human analysis is the common tool. It works, but it can be slow and take a lot of resources. We are developing a toolkit to make it a lot faster and cheaper to support analysts in identifying bad actors."

In writing, stylometry analyzes word choice, sentence structures, syntax, spelling and punctuation to identify a writer’s stylistic “fingerprint.” As the New York Times has pointed out, Madison’s tendency to use “whilst” while Hamilton preferred “while” helped identify their roles in writing the Federalist Papers.
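The "whilst" versus "while" signal can be sketched in a few lines of Python. This is a toy illustration only, assuming just two marker words and a simple nearest-profile comparison; the names `MARKERS`, `marker_rates` and `closest_author` are invented for this example, and actual stylometric studies rely on much larger word lists and proper statistical models.

```python
import re
from collections import Counter

# Assumed marker words for illustration; real studies use many more.
MARKERS = ("while", "whilst")

def marker_rates(text, markers=MARKERS):
    """Return each marker's frequency per 1,000 words of text."""
    words = re.findall(r"[a-z']+", text.lower())
    counts = Counter(words)
    total = len(words) or 1
    return {m: 1000 * counts[m] / total for m in markers}

def closest_author(sample, known):
    """Pick the known author whose marker rates are nearest to the sample's."""
    target = marker_rates(sample)

    def distance(author):
        profile = marker_rates(known[author])
        # Squared difference in rates, summed over the marker words
        return sum((target[m] - profile[m]) ** 2 for m in MARKERS)

    return min(known, key=distance)
```

Here `known` maps each candidate author to a body of text that author is known to have written, and `closest_author` returns the name whose marker-word rates best match the disputed sample.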

And while applying those principles to code-writing could help identify hackers, stylometry also has other potential repercussions, such as identifying whistle-blowers or human rights activists. In fact, when Drexel researchers released an early version of their stylometry tool in 2012, they also released another tool called Anonymouth, which authors can use to conceal their stylistic identity.

For now, researchers are just looking to continue to improve their tools. "This basic research shows that identifying authors of computer programs based on coding style is possible and worth pursuing," Harang said. "This is collaborative research that builds upon a lot of good work before us."
