Order Matters: Semantic-Aware Neural Networks for Binary Code Similarity Detection

  • Zeping Yu, Tencent Security Keen Lab
  • Rui Cao, Tencent Security Keen Lab
  • Qiyi Tang, Tencent Security Keen Lab
  • Sen Nie, Tencent Security Keen Lab
  • Junzhou Huang, Tencent AI Lab
  • Shi Wu, Tencent Security Keen Lab

Abstract

Binary code similarity detection, whose goal is to detect similar binary functions without access to the source code, is an essential task in computer security. Traditional methods usually use graph matching algorithms, which are slow and inaccurate. Recently, neural network-based approaches have achieved strong results. A binary function is first represented as a control-flow graph (CFG) with manually selected block features, and then a graph neural network (GNN) is adopted to compute the graph embedding. While these methods are effective and efficient, they cannot capture enough semantic information of the binary code. In this paper we propose semantic-aware neural networks to extract the semantic information of the binary code. Specifically, we use BERT to pre-train the binary code on one token-level task, one block-level task, and two graph-level tasks. Moreover, we find that the order of the CFG's nodes is important for graph similarity detection, so we apply a convolutional neural network (CNN) to adjacency matrices to extract the order information. We conduct experiments on two tasks with four datasets. The results demonstrate that our method outperforms the state-of-the-art models.
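
To make the node-order point concrete, here is a minimal sketch, not the authors' implementation. It builds the adjacency matrix of one small CFG under two different node orderings. The toy graph, the block names, and the helper `adjacency_matrix` are assumptions made for illustration only. The graph is the same in both cases, but the matrices differ, so a model that reads the matrix directly (as a CNN would) sees whatever the ordering encodes.

```python
def adjacency_matrix(edges, order):
    """Build a 0/1 adjacency matrix whose rows and columns follow `order`."""
    index = {node: i for i, node in enumerate(order)}
    n = len(order)
    matrix = [[0] * n for _ in range(n)]
    for src, dst in edges:
        matrix[index[src]][index[dst]] = 1
    return matrix


# Hypothetical CFG of an if/else: entry -> cond -> (then | else) -> exit
edges = [
    ("entry", "cond"),
    ("cond", "then"),
    ("cond", "else"),
    ("then", "exit"),
    ("else", "exit"),
]

# Ordering that follows the code layout: every edge points "forward".
layout_order = ["entry", "cond", "then", "else", "exit"]
# Arbitrary ordering of the same blocks.
shuffled_order = ["exit", "then", "entry", "else", "cond"]

a = adjacency_matrix(edges, layout_order)
b = adjacency_matrix(edges, shuffled_order)

for row in a:
    print(row)
print()
for row in b:
    print(row)
```

With the layout ordering, all edges fall above the diagonal, a regular pattern that reflects the structure of the branch. With the shuffled ordering, the same edges are scattered across the matrix.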

Published
2020-04-03
Section
AAAI Technical Track: Applications