Studying the Effect of Data in Commit-Based Static Analysis
Introduction
Software security has grown in importance as industries have become increasingly dependent on software. One of the major challenges in ensuring software security is identifying and addressing vulnerabilities in the code. If left undetected, vulnerabilities can result in significant harm. Static analysis remains a prevalent tool for detecting vulnerabilities in software. It analyzes code without executing it, which allows vulnerabilities to be detected early in the development process. In this project, we explore commit-based static analysis, which focuses on individual code changes, to improve the accuracy and effectiveness of vulnerability detection. We evaluate the performance of various machine learning models trained on vulnerable code commits to identify vulnerabilities. The goal of this research is to enhance commit-based static analysis and improve software security by developing more effective methods for identifying and addressing vulnerabilities in software.
Contents
- Data/: contains the datasets used in the project and the scripts to process them.
- Docker/: contains the Docker containers for easy setup of the project.
- README.md: this file, which provides an overview of the repository.
Dataset
The dataset can be found here.
Dependencies
- Python 3.11.2
- Docker >= 20.10.5
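The Docker version floor above can be checked before setup. The following is a minimal sketch, not part of the repository: the function names version_ge and check_docker_version are hypothetical, and it assumes that sort supports version ordering via -V (as in GNU coreutils) and that docker --version prints a line of the form "Docker version X.Y.Z, build ...".

```shell
# version_ge A B succeeds when version A is at least version B.
# Assumption: `sort -V` (version sort) is available, as in GNU coreutils.
version_ge() {
    [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n 1)" = "$2" ]
}

# Compare the installed Docker client version against the 20.10.5 floor.
check_docker_version() {
    # Extract the dotted version number from the `docker --version` line.
    installed="$(docker --version | sed -E 's/^Docker version ([0-9.]+).*/\1/')"
    if version_ge "$installed" "20.10.5"; then
        echo "Docker $installed OK"
    else
        echo "Docker $installed is older than 20.10.5"
        return 1
    fi
}
```

After sourcing the snippet, calling check_docker_version reports whether the installed client meets the requirement and returns a non-zero status when it does not.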
Usage
Make sure to install the dependencies before running the project.
Setup with Docker
- In the root directory of the project, start the databases: docker compose up -d mongo redis mysql
- Start populating the CVE-Search database: docker compose up -d mongo_python
- Start the main container: docker compose up -d --build commit_analysis
NOTE: Check the README.md in the Docker/ directory for more information!
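The three setup steps above can be wrapped in a single shell function so that a failure in an earlier step stops the later ones. This is only a sketch: the service names come from the commands in this README, while the function name start_stack is hypothetical and not part of the repository.

```shell
# Sketch: bring up the stack in the order described in this README.
# Run from the root directory of the project.
start_stack() {
    # 1. Start the databases in the background.
    docker compose up -d mongo redis mysql || return 1
    # 2. Start populating the CVE-Search database.
    docker compose up -d mongo_python || return 1
    # 3. Build the image and start the main container.
    docker compose up -d --build commit_analysis || return 1
}
```

After sourcing the snippet in the project root, calling start_stack runs the steps in order and returns a non-zero status as soon as one of them fails.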
Rebuilding the commit_analysis container
The current code is copied to the commit_analysis container.
If any changes are made in the init_scripts, the container needs to be rebuilt.
This can be done by running: docker compose up -d --build commit_analysis
Use of the pre-populated database
The databases expose their ports to the host machine.
To get the ports, check the docker-compose.yml file.
You can use those ports to connect to the databases and pre-populate them with the data provided.
Then you do not need to run all mappings.
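As an alternative to reading the compose file, the host bindings of running containers can be queried with docker compose port. The sketch below makes an assumption that is not confirmed by this README: that the containers listen on the upstream default ports (MongoDB 27017, Redis 6379, MySQL 3306). The docker-compose.yml file remains authoritative, and the function name show_db_ports is hypothetical.

```shell
# Sketch: print the host address each database service is published on.
# Assumption: default container ports; adjust the list to match docker-compose.yml.
show_db_ports() {
    for pair in mongo:27017 redis:6379 mysql:3306; do
        service="${pair%%:*}"       # text before the colon
        private_port="${pair##*:}"  # text after the colon
        printf '%s: ' "$service"
        # `docker compose port` prints the host binding for a container port;
        # it fails if the service is not running or the port is not published.
        docker compose port "$service" "$private_port" || echo "unavailable"
    done
}
```

Calling show_db_ports from the project root prints one line per service, which can then be used as the connection address when loading the provided data.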
License
Distributed under the GPLv3 License. See LICENSE for more information.
Authors
- Rawel Ahmad - rawel.ahmad@stud.tu-darmstadt.de
- Nikolaos Alexopoulos - coordination <GitHub>
Acknowledgements
This project builds on the following projects and publications:
- BIC-Tracker by JayJayJay1: The crawler for syzkaller crash reports was adapted from this project. (See Data/DatasetSources/syzkaller/)
- What Happens When We Fuzz? Investigating OSS-Fuzz Bug History by Keller et al.: The extraction method for the OSS-Fuzz dataset was adapted from this paper. (See Data/DatasetSources/oss-fuzz/processor.py)
- VulnerabilityLifetimes by manuelbrack: The heuristic was adapted from this project. (See Data/Heuristic/)
Thesis information
This project is part of a bachelor's thesis at Technische Universität Darmstadt.