Gregor Godbersen : Matching source code dump to git repository

Note that this entry was originally published 7 years ago. It may present outdated technologies or knowledge.

For a start-up project, we were working with the Xilinx zturn7 FPGA development board from MYiR, which was running an old Linux kernel. To update the board and build a custom embedded Linux distribution using the Yocto Project, we needed to know which modifications had been made to the stock Linux kernel.

Luckily, the manufacturer observed the GPL requirements, and the source was provided as a tarball on the accompanying CD. However, it was a plain file dump with no easy way to identify the custom changes. The upstream source was known and available as a git repository, but we did not know which commit had been used as the base for the changes.

The following Rust script tries to find this base. Using the git2 library makes it easy to work directly with git's pack file format, which improves speed. The script assumes that the manufacturer's code was based on a named release tag or a branch head, and for each of them it calculates the number of changed files and lines relative to the source dump. To integrate the dump into the git structure, it is added to the upstream repository as a separate branch named “source-dump”.

extern crate git2;
use git2::{DiffOptions, ObjectType, Repository};

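fn main() {
    let repo = Repository::open("/tmp/linux-xlnx").expect("failed to open repo");
    // Compare the source dump against every tag ...
    for name in repo.tag_names(Some("*")).expect("Error loading tags").iter() {
        let name = name.expect("Tag name is not valid UTF-8");
        diff_tree(&repo, "source-dump", name);
    }
    // ... and against every local and remote branch head.
    for branch in repo.branches(None).expect("Error loading branches") {
        let (branch, _typ) = branch.expect("Could not load branch");
        let name = branch.name().expect("Could not get branch name").unwrap();
        // Skip the dump itself; it would trivially report zero changes.
        if name != "source-dump" {
            diff_tree(&repo, "source-dump", name);
        }
    }
}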

fn diff_tree(repo: &git2::Repository, name1: &str, name2: &str) {
    // Resolve both revision names and peel them down to their trees.
    let tree1 = repo
        .revparse_single(name1).expect("Error parsing rev1")
        .peel(ObjectType::Tree).expect("Error peeling tree1");
    let tree2 = repo
        .revparse_single(name2).expect("Error parsing rev2")
        .peel(ObjectType::Tree).expect("Error peeling tree2");
    let mut opts = DiffOptions::new();
    let diff = repo
        .diff_tree_to_tree(tree1.as_tree(), tree2.as_tree(), Some(&mut opts))
        .expect("Error diffing");
    let stats = diff.stats().expect("Could not load diff stats");
    // Output: total changed lines, then name: files changed, insertions, deletions.
    println!(
        "{} \t\t{}: {}, {}, {}",
        stats.insertions() + stats.deletions(),
        name2,
        stats.files_changed(),
        stats.insertions(),
        stats.deletions()
    );
}

Initially, I planned to apply a more sophisticated binary search procedure, but the simple process proved fast enough, and it identified a commit with a very small change set.
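Because the script prints one line per tag or branch, the best candidate still has to be picked out of the output. The following is a minimal sketch of that last step, assuming the output format of the `println!` call above. The `Candidate` type and the `parse_line` and `rank` helpers are hypothetical names introduced for illustration; they are not part of git2 or of the original script.

```rust
/// One row of the script's output: a tag or branch name plus its diff
/// statistics against the `source-dump` branch. (Hypothetical type.)
#[derive(Debug, Clone, PartialEq)]
struct Candidate {
    name: String,
    files_changed: usize,
    insertions: usize,
    deletions: usize,
}

impl Candidate {
    /// Total changed lines, the same sum the script prints in its first column.
    fn total(&self) -> usize {
        self.insertions + self.deletions
    }
}

/// Parse one line in the shape the script prints:
/// "<total> \t\t<name>: <files>, <insertions>, <deletions>".
/// Returns `None` for lines that do not match.
fn parse_line(line: &str) -> Option<Candidate> {
    // Drop the leading total; it is recomputed from insertions + deletions.
    let (_total, rest) = line.split_once('\t')?;
    // Split at the last ": " so that the statistics are always the final part.
    let (name, stats) = rest.trim().rsplit_once(": ")?;
    let mut nums = stats.split(',').map(|s| s.trim().parse::<usize>());
    let files_changed = nums.next()?.ok()?;
    let insertions = nums.next()?.ok()?;
    let deletions = nums.next()?.ok()?;
    Some(Candidate {
        name: name.to_string(),
        files_changed,
        insertions,
        deletions,
    })
}

/// Sort candidates by ascending total of changed lines.
/// The dump branch itself is dropped, since it always differs by zero lines.
fn rank(mut candidates: Vec<Candidate>) -> Vec<Candidate> {
    candidates.retain(|c| c.name != "source-dump");
    candidates.sort_by_key(|c| c.total());
    candidates
}
```

Feeding every output line through `parse_line` and passing the results to `rank` puts the most likely base first. Piping the script's output through `sort -n` achieves much the same on the command line, since the total is already the first column.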