capa is the Mandiant FLARE team’s open-source tool for automatically identifying the capabilities of programs. Reverse engineers and malware analysts run capa against suspected malware to uncover its underlying functionality by matching extracted features against a well-defined collection of rules. This lets analysts quickly narrow their scope to areas of interest within a sample, taking advantage of the speed gains provided by years of cumulative research.
Since its inception, capa has been adopted across the industry through platform integrations and through support for several popular backends drawn from other open-source and proprietary infosec projects. These projects include VirusTotal, Hex-Rays’ IDA Pro, vivisect, dnfile, and Binary Ninja. My goal this summer was to further expand capa adoption by integrating a popular open-source reverse engineering framework, Ghidra, as a backend. Adding capa support for Ghidra extends capa analysis to users who wish to tightly integrate results with their disassembly framework of choice.
Deliverables and Status
All planned deliverables for the Google Summer of Code (GSoC) period have been completed and integrated. The feature extractors are the core of the Ghidra backend: they let capa tap into the rich databases populated by Ghidra analysis. capa defines an abstract FeatureExtractor class that each backend implements, so I implemented a new GhidraFeatureExtractor class to initialize capa’s access to the Ghidra databases.
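The relationship between the abstract class and a backend can be sketched as follows. This is a simplified illustration, not capa's actual code: the two methods are a reduced subset, the (feature, address) tuples are plain strings and ints, and the dictionary `fake_program` is a hypothetical stand-in for the Ghidra program database so the example runs outside Ghidra.

```python
from abc import ABC, abstractmethod
from typing import Dict, Iterator, Tuple

Feature = Tuple[str, int]  # (description, address); simplified for illustration


class FeatureExtractor(ABC):
    """Simplified stand-in for the abstract class each backend implements."""

    @abstractmethod
    def extract_global_features(self) -> Iterator[Feature]:
        """Yield features that apply to the whole sample (OS, architecture)."""

    @abstractmethod
    def extract_file_features(self) -> Iterator[Feature]:
        """Yield features found at file scope (imports, exports, strings)."""


class GhidraFeatureExtractor(FeatureExtractor):
    """Backend sketch; `program` is a hypothetical dict, not a Ghidra object."""

    def __init__(self, program: Dict):
        self.program = program

    def extract_global_features(self) -> Iterator[Feature]:
        yield f"os: {self.program['os']}", 0
        yield f"arch: {self.program['arch']}", 0

    def extract_file_features(self) -> Iterator[Feature]:
        for name, address in self.program["imports"].items():
            yield f"import: {name}", address


# Hypothetical data standing in for an analyzed Ghidra database.
fake_program = {
    "os": "windows",
    "arch": "i386",
    "imports": {"kernel32.CreateFileA": 0x401000},
}

extractor = GhidraFeatureExtractor(fake_program)
features = list(extractor.extract_global_features())
features += list(extractor.extract_file_features())
```

Because the rule-matching side only depends on the abstract interface, a new backend needs to supply the extraction methods and nothing else.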
After defining the class, I created a set of Ghidra feature extractor modules: global_.py, file.py, function.py, basicblock.py, and insn.py. Each module extracts features at one scope, and the scopes grow progressively more granular. The scopes are as follows:
- Global Scope
  - Operating system detection
  - Architecture detection
- File Scope
  - File format & sections
  - File imports & exports
  - File strings
  - Statically-linked library functions
  - Embedded portable executable (PE) file detection
- Function Scope
  - Function call references
  - Loops within functions
  - Recursive function calls
- Basic Block Scope
  - Tight loops
  - Embedded stack strings
- Instruction Scope
  - Instruction mnemonics
  - Number and offset operands
  - Operating system level API calls
  - Embedded bytes and strings
  - Cross-section control flow & file segment access
  - Indirect & referenced calls
  - Bitwise non-zeroing XOR features
  - Windows Process Environment Block (PEB) accesses
  - Obfuscated instruction call exploitation
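To make two of the finer-grained features concrete, here is a minimal sketch of a tight-loop check (basic block scope) and a non-zeroing XOR check (instruction scope). The `Insn` and `BasicBlock` types are hypothetical models invented for this example, and the checks are deliberately simplified; they are not the logic used in the Ghidra backend.

```python
from typing import NamedTuple, Tuple


class Insn(NamedTuple):
    """Hypothetical instruction model: a mnemonic plus operand strings."""
    mnemonic: str
    operands: Tuple[str, ...]


class BasicBlock(NamedTuple):
    """Hypothetical basic block model: start address and successor addresses."""
    start: int
    successors: Tuple[int, ...]


def is_tight_loop(bb: BasicBlock) -> bool:
    # A tight loop is a basic block that branches back to its own start.
    return bb.start in bb.successors


def is_nzxor(insn: Insn) -> bool:
    # `xor eax, eax` merely zeroes a register, so only XORs whose two
    # operands differ are treated as potentially interesting.
    return (
        insn.mnemonic == "xor"
        and len(insn.operands) == 2
        and insn.operands[0] != insn.operands[1]
    )


loop_block = BasicBlock(start=0x401000, successors=(0x401000, 0x401020))
plain_block = BasicBlock(start=0x401020, successors=(0x401040,))

zeroing = Insn("xor", ("eax", "eax"))
decoding = Insn("xor", ("al", "0x5a"))
```

Each scope builds on the one above it: instruction-level checks run over the instructions of every basic block, and basic-block checks run over the blocks of every function.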
The final two deliverables of the project were designed to support continued development and integration of the Ghidra backend. capa uses GitHub Actions runners to run a Continuous Integration (CI) workflow that executes automated unit tests for each backend. The Ghidra feature extractor is tested on two concurrently running Ubuntu servers, one using Python 3.8 and the other Python 3.11. Each server is automatically configured with capa and its associated dependencies, the latest Ghidra release, and the latest Ghidrathon release. After configuration, the runner executes an automated unit test via pytest. The test runs capa with the Ghidra feature extractor against a sample and asserts that specific features from each scope are found.
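The shape of such a test can be sketched as follows. This is an illustration only: `extract_features`, the scope names, and the feature strings are hypothetical, and the stub returns canned data instead of invoking Ghidra, so it does not reflect capa's actual test code.

```python
from typing import Dict, List, Set, Tuple


def extract_features(sample_path: str) -> Dict[str, Set[str]]:
    """Hypothetical stand-in for running the Ghidra extractor on a sample."""
    return {
        "global": {"os: windows", "arch: i386"},
        "file": {"import: kernel32.CreateFileA"},
        "function": {"characteristic: loop"},
        "basic block": {"characteristic: tight loop"},
        "instruction": {"mnemonic: xor"},
    }


# One expected feature per scope; the values are made up for the example.
EXPECTED: List[Tuple[str, str]] = [
    ("global", "os: windows"),
    ("file", "import: kernel32.CreateFileA"),
    ("function", "characteristic: loop"),
    ("basic block", "characteristic: tight loop"),
    ("instruction", "mnemonic: xor"),
]


def missing_features(found, expected):
    """Return the (scope, feature) pairs that the extractor did not report."""
    return [
        (scope, feature)
        for scope, feature in expected
        if feature not in found.get(scope, set())
    ]


def test_ghidra_features():
    # pytest collects functions named test_*; a bare assert is enough.
    found = extract_features("sample.exe_")
    assert missing_features(found, EXPECTED) == []
```

Reporting the missing pairs, rather than a single boolean, makes a CI failure point directly at the scope that regressed.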
GitHub Contributions
All work on this project is tracked in the capa-ghidra pull requests. The main pull request used to track GSoC-associated changes may be accessed here.
Mini Demo
wumbo@flarenix:~/capa$ analyzeHeadless ~/Desktop/ghidra_projects/ capa_test -process 'Practical Malware Analysis Lab 16-01.exe_' -ScriptPath ./capa/ghidra/ -PostScript capa_ghidra.py "./rules"
[...]
(AutoAnalysisManager)
INFO REPORT: Analysis succeeded for file: /Practical Malware Analysis Lab 16-01.exe_ (HeadlessAnalyzer)
INFO SCRIPT: /home/wumbo/capa/./capa/ghidra/capa_ghidra.py (HeadlessAnalyzer)
┍━━━━━━━━━━━━━━━━━━━━━━━━┯━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┑
│ md5 │ 7faafc7e4a5c736ebfee6abbbc812d80 │
│ sha1 │ │
│ sha256 │ 309217d8088871e09a7a03ee68ee46f60583a73945006f95021ec85fc1ec959e │
│ os │ windows │
│ format │ Portable Executable (PE) │
│ arch │ x86 │
│ path │ /home/wumbo/capa/./tests/data/Practical Malware Analysis Lab 16-01.exe_ │
┕━━━━━━━━━━━━━━━━━━━━━━━━┷━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┙
┍━━━━━━━━━━━━━━━━━━━━━━━━┯━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┑
│ ATT&CK Tactic │ ATT&CK Technique │
┝━━━━━━━━━━━━━━━━━━━━━━━━┿━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┥
│ DEFENSE EVASION │ Indicator Removal::File Deletion T1070.004 │
│ │ Indicator Removal::Timestomp T1070.006 │
│ │ Modify Registry T1112 │
├────────────────────────┼────────────────────────────────────────────────────────────────────────────────────┤
│ DISCOVERY │ File and Directory Discovery T1083 │
│ │ Query Registry T1012 │
│ │ System Information Discovery T1082 │
├────────────────────────┼────────────────────────────────────────────────────────────────────────────────────┤
│ EXECUTION │ Command and Scripting Interpreter T1059 │
│ │ Shared Modules T1129 │
│ │ System Services::Service Execution T1569.002 │
├────────────────────────┼────────────────────────────────────────────────────────────────────────────────────┤
│ PERSISTENCE │ Create or Modify System Process::Windows Service T1543.003 │
┕━━━━━━━━━━━━━━━━━━━━━━━━┷━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┙
┍━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┯━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┑
│ MBC Objective │ MBC Behavior │
┝━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┿━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┥
│ ANTI-BEHAVIORAL ANALYSIS │ Debugger Detection::Process Environment Block BeingDebugged [B0001.035] │
│ │ Debugger Detection::Process Environment Block NtGlobalFlag [B0001.036] │
├─────────────────────────────┼───────────────────────────────────────────────────────────────────────────────┤
│ COMMUNICATION │ Interprocess Communication::Create Pipe [C0003.001] │
├─────────────────────────────┼───────────────────────────────────────────────────────────────────────────────┤
│ DEFENSE EVASION │ Self Deletion::COMSPEC Environment Variable [F0007.001] │
├─────────────────────────────┼───────────────────────────────────────────────────────────────────────────────┤
│ DISCOVERY │ File and Directory Discovery [E1083] │
│ │ System Information Discovery [E1082] │
├─────────────────────────────┼───────────────────────────────────────────────────────────────────────────────┤
│ EXECUTION │ Command and Scripting Interpreter [E1059] │
├─────────────────────────────┼───────────────────────────────────────────────────────────────────────────────┤
│ FILE SYSTEM │ Copy File [C0045] │
│ │ Delete File [C0047] │
│ │ Get File Attributes [C0049] │
│ │ Read File [C0051] │
│ │ Writes File [C0052] │
├─────────────────────────────┼───────────────────────────────────────────────────────────────────────────────┤
│ OPERATING SYSTEM │ Environment Variable::Set Variable [C0034.001] │
│ │ Registry::Delete Registry Value [C0036.007] │
│ │ Registry::Query Registry Value [C0036.006] │
│ │ Registry::Set Registry Key [C0036.001] │
├─────────────────────────────┼───────────────────────────────────────────────────────────────────────────────┤
│ PROCESS │ Create Process [C0017] │
│ │ Terminate Process [C0018] │
┕━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┷━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┙
┍━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┯━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┑
│ Capability │ Namespace │
┝━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┿━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┥
│ check for PEB BeingDebugged flag (24 matches) │ anti-analysis/anti-debugging/debugger-detection │
│ check for PEB NtGlobalFlag flag (24 matches) │ anti-analysis/anti-debugging/debugger-detection │
│ self delete │ anti-analysis/anti-forensic/self-deletion │
│ timestomp file │ anti-analysis/anti-forensic/timestomp │
│ create pipe │ communication/named-pipe/create │
│ accept command line arguments │ host-interaction/cli │
│ query environment variable (4 matches) │ host-interaction/environment-variable │
│ set environment variable │ host-interaction/environment-variable │
│ get common file path │ host-interaction/file-system │
│ copy file │ host-interaction/file-system/copy │
│ delete file │ host-interaction/file-system/delete │
│ check if file exists │ host-interaction/file-system/exists │
│ get file attributes │ host-interaction/file-system/meta │
│ read file on Windows (2 matches) │ host-interaction/file-system/read │
│ write file on Windows (3 matches) │ host-interaction/file-system/write │
│ check OS version │ host-interaction/os/version │
│ create process on Windows (2 matches) │ host-interaction/process/create │
│ terminate process (2 matches) │ host-interaction/process/terminate │
│ query or enumerate registry value (2 matches) │ host-interaction/registry │
│ set registry value │ host-interaction/registry/create │
│ delete registry value │ host-interaction/registry/delete │
│ create service │ host-interaction/service/create │
│ delete service │ host-interaction/service/delete │
│ modify service │ host-interaction/service/modify │
│ link function at runtime on Windows │ linking/runtime-linking │
│ persist via Windows service │ persistence/service │
┕━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┷━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┙
Script /home/wumbo/capa/./capa/ghidra/capa_ghidra.py called exit with code 0
Challenges
The Ghidra feature extractor was only made possible by another Mandiant FLARE team project, Ghidrathon. Ghidra natively supports Jython, a Java-Python integration that is only compatible with Python 2. This made Python 3 modules and projects, such as capa, incompatible with the native scripting interface; Ghidrathon closes that gap by adding Python 3 scripting to Ghidra. However, Ghidrathon had not been used to this extent before, so a large portion of development went towards uncovering and fixing issues that are likely to arise in similar projects.
Ghidrathon Issues Found
Data Type Conversion
The first problem we encountered was working out which data types needed to be converted as values passed between the Java side and the Python 3 side. The most prominent example was byte extraction. Calls to the Ghidra API returned bytes as signed ints rather than as a compatible Python 3 bytes type.
This required us to handle the conversion in Python 3 before passing the data to the appropriate routines and functions. The first step was to convert the signed ints to unsigned ints. Fortunately, all this required was a bitwise AND with 0xFF (& 0xFF) for each int. From there, we needed to convert the result to the Python 3 bytes type. The original implementation used the built-in to_bytes() method; however, our performance testing revealed that this was incredibly inefficient.
To fix the performance issue, we switched to a Python 3 list comprehension passed to the built-in bytes() constructor. This made our conversions approximately 100x faster. The pull request addressing this issue may be found here.
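The two approaches can be sketched as follows. The function names and sample values are hypothetical, and the slower variant is only one plausible shape of a to_bytes()-based conversion, not the original code; the sketch shows the technique and makes no claim about the measured speedup.

```python
from typing import Iterable


def signed_to_bytes_slow(values: Iterable[int]) -> bytes:
    # One int.to_bytes() call per value, then a join of one-byte objects.
    return b"".join((v & 0xFF).to_bytes(1, "little") for v in values)


def signed_to_bytes_fast(values: Iterable[int]) -> bytes:
    # Mask each signed value into the 0..255 range with & 0xFF, then build
    # the bytes object in a single bytes() call over the list.
    return bytes([v & 0xFF for v in values])


# Java bytes are signed (-128..127), so 0xFF arrives as -1 and 0x80 as -128.
signed_values = [-1, -128, 0, 127, 77, 90]

converted = signed_to_bytes_fast(signed_values)
```

Both functions produce the same bytes; the second avoids creating and joining a separate one-byte object for every value.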
CPython Module Accessibility
Ghidrathon embeds CPython interpreters into the Java Virtual Machine (JVM) so that Python 3 code can be executed in the JVM. Originally, re-importing certain Python modules caused crashes, because these modules do not support multiple interpreter instances within the same process. Because Ghidra feature extraction runs entirely within a single Java process, fixing this required re-architecting Ghidrathon.
To remedy this issue, Ghidrathon v2.2.0 introduced shared interpreters, which make each imported module accessible to every shared interpreter.
Maintaining Unique Ghidra Script State Variables for each CPython Shared Interpreter
After shared interpreters were implemented, the Ghidra Script state variables that Ghidrathon added to the builtins scope of each interpreter were no longer unique. As a result, sequential runs of the Ghidra feature extractor used stale state from these objects and produced incorrect results. Ghidrathon does the crucial job of exposing the objects that relate to each Ghidra database. In particular, the Ghidra feature extractor makes heavy use of currentProgram and monitor to access the data needed for capa processing.
These objects were originally accessible as normal Python 3 objects, for example, currentProgram.getFunctionManager(). However, to maintain the proper state, accessing these objects was changed from direct access to a function call. With these changes, the same line as above is written as currentProgram().getFunctionManager().
These changes were added in the release of Ghidrathon v3.0.0.
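Why a function call helps can be shown with a small simulation that runs outside Ghidra. Everything here is hypothetical: `ScriptState`, `FakeProgram`, and this `currentProgram()` only mimic the idea of resolving state at call time and are not Ghidrathon's implementation.

```python
class FakeProgram:
    """Hypothetical stand-in for a Ghidra program object."""

    def __init__(self, name: str):
        self._name = name

    def getName(self) -> str:
        return self._name


class ScriptState:
    """Hypothetical holder for the state of the script that is running now."""
    current = None


def currentProgram() -> FakeProgram:
    # Resolved on every call, so it always reflects the active script state.
    return ScriptState.current


# First run: the program is captured once, like a plain shared variable.
ScriptState.current = FakeProgram("first.exe_")
captured_program = ScriptState.current

# Second run in the same process: the state changes underneath.
ScriptState.current = FakeProgram("second.exe_")

stale_name = captured_program.getName()   # still refers to the first run
fresh_name = currentProgram().getName()   # follows the active state
```

A value bound once keeps pointing at whatever was current when it was bound, while an accessor function is evaluated again on each use, which is the property sequential runs need.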
Conclusion
Because Ghidrathon was a relatively new and untested project, my project served as a great stepping stone towards improving Ghidrathon’s ability to support others. This benefits the industry by giving most existing Python 3 binary analysis tooling access to the feature-rich Ghidra scripting framework.
Acknowledgement
I’d like to thank the Mandiant FLARE team and all of the GSoC mentors who volunteered their time to guide me and others. A special acknowledgement goes to my primary mentor, Mike Hunhoff. He contributed greatly to my growth by asking me the questions I failed to ask myself, providing sound advice and great design suggestions, and making the super quick Ghidrathon changes necessary to roll the project to production!