dirverify: concurrent sha256 verifier written in Rust
In a follow up to
Copying large data and learning Rust!
(dircopy):
I’ve worked on a companion tool
dirverify:
that verifies directories against dircopy shasum.*.txt files.
By default it will attempt to use concurrency to achieve performance;
- File reading is running in parallel with
SHA256calculations. - Multiple directories can be verified in parallel by specifying
dirverify dir1 dir2 dir3. So you can for example verify source and destination drives in parallel without needing progress to wait on a slower drive.
Table of Contents:
- Usage
- Restructuring source directory for multiple binaries
- Cross-platform path name processing
- Inspect directory and obtain hash files
- Parallel directory verification
- Parallel file read and hashing
Usage
Manual: dirverify.md
dirverify -h
A directory verifier. Searches for shasum.*.txt files in directories
Usage: dirverify [OPTIONS] [DIR]...
Arguments:
[DIR]... Directories with files to be verified
Options:
--hash-file <HASH_FILE> Specify sha256-file, and disable automatic search for shasum*.txt files
--silent Inhibit all stdout print outs
--no-convert-paths Keep paths exactly as is. Do not try to workaround unix, dos mismatches
--no-summary Do not print a summary
--no-threaded-sha Disable threaded sha read/hash behavior
--no-parallell Do not check multiple directories at the same time
--queue-size <QUEUE_SIZE> Size of queue between reader and hasher thread. Tuning parameter [default: 2]
--block-size <BLOCK_SIZE> Size of blocks between reader and hasher thread. Tuning parameter [default: 128K]
--verbose Print informative messages helpful for understanding processing
-h, --help Print help
-V, --version Print version
Restructuring source directory for multiple binaries
With two binary deliveries out from one cargo repository,
having all code in src/main.rs doesn’t work any more.
The new directory structure was as follows:
- src/bin/dircopy.rs
dircopybinary source code. - src/bin/dirverify.rs
dirverifybinary source code. - src/bin/texttools/mod.rs
shared module
texttoolswith some basic string features.
Cross-platform path name processing
Path, directory and file names are annoying as the path separator differs between Windows/DOS and Linux/WSL.
To help cross platform users,
dirverify is converting between \ and /
automatically in shasum.*.txt files.
So dirverify can use shasum.*.txt files generated on Windows in Linux, or vise versa.
let filename_corrected;
if self.convert_paths {
if !filename.contains(MAIN_SEPARATOR_STR) {
match MAIN_SEPARATOR_STR {
"\\" => filename_corrected = filename.replace("/", "\\"),
"/" => filename_corrected = filename.replace("\\", "/"),
&_ => filename_corrected = filename.to_string(),
}
} else {
filename_corrected = filename.to_string();
}
} else {
filename_corrected = filename.to_string();
}
Inspect directory and obtain hash files
When executing dirverify dir, the tool needs to perform a few checks on directory dir;
- is it a directory?
- is the directory accessible?
- find any
shasum.*.txtfiles in the directory, and return them to verifier. Return error if no such file. - Optionally skip the
shasum.*.txtcheck, if the hash-file has been specified elsewhere via command line options.
fn inspect_dir(dir: &std::path::PathBuf, detect_sha_files: bool) -> Result<Vec<String>, String> {
if !dir.is_dir() {
return Err(format!("Not a directory {}", dir.display()));
}
let read_dir_maybe = fs::read_dir(&dir);
let read_dir;
match read_dir_maybe {
Ok(rd) => read_dir = rd,
Err(e) => {
return Err(format!("{}: {}", dir.display(), e));
}
}
if !detect_sha_files {
return Ok(Vec::new());
}
let mut names: Vec<String> = Vec::new();
for entry in read_dir {
let name;
match entry {
Ok(file_entry) => {
match file_entry.file_name().into_string() {
Ok(n) => name = n,
Err(e) => {
return Err(format!("Error reading file name: {}", e.display()));
}
}
match file_entry.file_type() {
Ok(file_type) => {
if !file_type.is_file() {
continue;
}
}
Err(e) => {
return Err(format!("Error determining file type; {} {}", name, e));
}
}
}
Err(e) => {
return Err(format!("Unexpected: {}", e));
}
}
if !name.starts_with("shasum.") {
continue;
}
if !name.ends_with(".txt") {
continue;
}
names.push(name);
}
Ok(names)
}
Parallel directory verification
The main function checks if should be run in parallel or not;
// ------ run the verifier ------
if args.no_parallell {
stats = run_sequential(dirverify, args.hash_file, sha_files);
} else {
stats = run_parallell(dirverify, args.hash_file, sha_files);
}
run_sequential just loops over the directories,
and verifies all files,
except the shasum.*.txt files themselves:
fn run_sequential(
dirverify: DirVerify,
hash_file: Option<std::path::PathBuf>,
sha_files: Vec<(std::path::PathBuf, Vec<String>)>,
) -> Statistics {
let mut stats = Statistics::new();
for (dir, names) in &sha_files {
let hash_names = match hash_file {
Some(_) => None,
None => Some(names.clone()),
};
dirverify.verify_all_lists(&mut stats, &dir, &hash_names, &hash_file);
}
stats
}
The run_parallell version basically just spawns a thread per
directory, and then awaits all threads to terminate.
For all other purposes, it is the same for-loop.
fn run_parallell(
dirverify: DirVerify,
hash_file: Option<std::path::PathBuf>,
sha_files: Vec<(std::path::PathBuf, Vec<String>)>,
) -> Statistics {
let mut stats = Statistics::new();
let mut threads = Vec::new();
for (dir, names) in &sha_files {
let hash_file = hash_file.clone();
let dir_thread = dir.clone();
let names_thread = names.clone();
let dirverify_thread = dirverify.clone();
let thread = thread::spawn(move || -> Statistics {
let mut thread_stats = Statistics::new();
let hash_names = match hash_file {
Some(_) => None,
None => Some(names_thread.clone()),
};
dirverify_thread.verify_all_lists(
&mut thread_stats,
&dir_thread,
&hash_names,
&hash_file,
);
thread_stats
});
threads.push(thread);
}
for thread in threads {
match thread.join() {
Ok(x) => stats.add(&x),
Err(err) => {
stats.errors += 1;
eprintln!("{}", format!("Join error: {:?}", err));
}
}
}
stats
}
Parallel file read and hashing
The tool can run file read and SHA-hash in parallel.
While your mileage may vary, on my tests with modern processor and fast SSDs, the parallel version is a few percent faster. So it saves some wall time clock, but the gains aren’t enormous.
fn sha_file(&self, stats: &mut Statistics, file: &mut File) -> Result<String, String> {
if self.threaded_sha_reader {
self.sha_file_multithread(stats, file)
} else {
self.sha_file_single_thread(stats, file)
}
}
The single thread version is as you would expect just a read, update loop.
fn sha_file_single_thread(
&self,
stats: &mut Statistics,
file: &mut File,
) -> Result<String, String> {
let block_size = self.block_size;
let mut h1 = Sha256::new();
let mut heap_buf: Vec<u8> = Vec::with_capacity(block_size);
heap_buf.resize(block_size, 0x00);
loop {
match file.read(&mut heap_buf[0..block_size]) {
Ok(0) => break,
Ok(n) => {
h1.update(&heap_buf[0..n]);
stats.read_bytes += n;
}
Err(e) => {
return Err(e.to_string());
}
}
}
let digest = h1.finalize();
let strdigest = format!("{:x}", digest);
Ok(strdigest)
}
The multithreaded version is basically the same thing,
but with an thread spawn for the SHA256 hasher,
a sync_channel queue between threads,
and a thread join to obtain the SHA256 result.
fn sha_file_multithread(
&self,
stats: &mut Statistics,
file: &mut File,
) -> Result<String, String> {
let block_size = self.block_size;
let queue_size = self.queue_size;
let (read_tx, sha_rx) = sync_channel::<Message>(queue_size);
let sha_thread = thread::spawn(move || -> Result<String, String> {
let mut h1 = Sha256::new();
loop {
match sha_rx.recv() {
Ok(Message::Block(block)) => {
h1.update(&block);
}
Ok(Message::Error) => {
return Err(String::from("T-Read: sent error"));
}
Ok(Message::Done) => {
break;
}
Err(e) => {
return Err(format!("T-SHA: {}", e));
}
}
}
let digest = h1.finalize();
let strdigest = format!("{:x}", digest);
return Ok(strdigest);
});
let mut heap_buf: Vec<u8> = Vec::with_capacity(block_size);
heap_buf.resize(block_size, 0x00);
loop {
match file.read(&mut heap_buf[0..block_size]) {
Ok(0) => {
if let Err(e) = read_tx.send(Message::Done) {
return Err(format!("Error: {}", e));
}
break;
}
Ok(n) => {
stats.read_bytes += n;
if let Err(e) = read_tx.send(Message::Block(heap_buf[0..n].to_vec())) {
return Err(format!("Error: {}", e));
}
}
Err(e) => {
_ = read_tx.send(Message::Error);
return Err(e.to_string());
}
}
}
match sha_thread.join() {
Ok(x) => x,
Err(err) => Err(format!("Join error: {:?}", err)),
}
}