Omar Abid

How command line arguments get parsed

This blog post assumes you have basic knowledge of Rust.

Suppose you want to create a Rust program. This program will read the arguments passed and parse them.

./myprogram a bunch of arguments

This is fairly straightforward. Not only can you read these arguments, but they are also already parsed for you.

fn main() {
    let args: Vec<String> = std::env::args().collect();
    dbg!(args);
}

Which gives

[src/main.rs:6] args = [
    "target/debug/myprogram",
    "a",
    "bunch",
    "of",
    "arguments",
]

Simple, right? But what if the command arrives as a single String instead?

let command = "myprogram a bunch of arguments";

To keep things simple, we can split the String on whitespace.

let command = "myprogram a bunch of arguments";
let arguments = command.split_whitespace().collect::<Vec<&str>>();
dbg!(arguments);

We get roughly the same output, at least for the arguments we care about. This approach works fine until an argument contains quoted whitespace.

let command = "myprogram commit -m \"commit message\"";

Our little whitespace-splitting operation will yield this:

[src/main.rs:10] arguments = [
    "myprogram",
    "commit",
    "-m",
    "\"commit",
    "message\"",
]

That doesn't look quite right: our message has been split in two. Does the Rust parsing mechanism do that? If we run the initial program with the same arguments, we get:

[src/main.rs:6] args = [
    "target/debug/myprogram",
    "commit",
    "-m",
    "commit message",
]

This is interesting; it not only kept the message in one piece but also removed the quotation marks. This suggests that parsing arguments is a bit trickier than splitting on whitespace. As such, it would be preferable to find out how Rust is doing it instead of trying to implement it ourselves. One way to do this is by reviewing the Rust source code. Good news: the source code of the Rust standard library is available to read.

std::env::args

The place to start is std::env::args. The first thing to learn is that there are Operating System differences in how arguments are handled:

On Unix systems the shell usually expands unquoted arguments with glob patterns (such as * and ?). On Windows this is not done, and such arguments are passed as-is.

We can then follow the code to find the responsible part:

    // Step 1: https://doc.rust-lang.org/src/std/env.rs.html#759
    Args { inner: args_os() }
   
    // Step 2: https://doc.rust-lang.org/src/std/env.rs.html#794
    ArgsOs { inner: sys::args::args() }

Finally, we land on the file that matters: sys::unix::args. There are different implementations for the different targets out there. However, the comment at the top of the file steers us in a different direction.

//! Global initialization and retrieval of command line arguments.
//!
//! On some platforms these are stored during runtime startup,
//! and on some they are retrieved from the system on demand.

So the parsed arguments are actually passed from the Operating System itself (or retrieved from it)? The plot thickens...

The Operating System

LWN.net has a great article on how programs are run under Linux.

For Linux versions up to and including 3.18, the only system call that invokes a new program is execve(), which has the following prototype:

int execve(const char *filename, char *const argv[], char *const envp[]);

The filename argument specifies the program to be executed, and the argv and envp arguments are NULL-terminated lists that specify the command line arguments and environment variables for the new program.

In order to execute a program under Linux, you have to make a syscall with execve. The only problem is that you still have to specify the arguments yourself, already separated into a list. This implies that Linux itself is not involved in parsing the arguments.

So, who is? The party responsible for executing the program should be the one parsing the arguments, whoever that party may be. If you are executing the command in the terminal, it should be the shell's responsibility. But does this mean that the shell is parsing the arguments?
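To make the shape of that argv parameter concrete, here is a minimal sketch of building a NULL-terminated argument list in Rust. It does not call execve; it only shows that the caller must hand over arguments that are already split. The names Argv and build_argv are hypothetical and exist only for illustration.

```rust
use std::ffi::{CStr, CString};
use std::os::raw::c_char;
use std::ptr;

/// Owns the C strings and exposes them in the layout `execve` expects:
/// an array of pointers followed by a terminating NULL.
struct Argv {
    // Keeps the string data alive for as long as the pointers are used.
    _owned: Vec<CString>,
    pointers: Vec<*const c_char>,
}

/// Hypothetical helper: `args` must ALREADY be split into separate arguments.
fn build_argv(args: &[&str]) -> Argv {
    let owned: Vec<CString> = args
        .iter()
        .map(|a| CString::new(*a).expect("argument contains an interior NUL byte"))
        .collect();
    let mut pointers: Vec<*const c_char> = owned.iter().map(|s| s.as_ptr()).collect();
    pointers.push(ptr::null()); // the NULL terminator that marks the end of the list
    Argv { _owned: owned, pointers }
}

impl Argv {
    /// Reads entry `i` back the way a C program would; `None` at the terminator.
    fn get(&self, i: usize) -> Option<String> {
        let p = *self.pointers.get(i)?;
        if p.is_null() {
            return None;
        }
        // SAFETY: every non-null pointer comes from a CString owned by `self`.
        Some(unsafe { CStr::from_ptr(p) }.to_string_lossy().into_owned())
    }
}
```

Calling build_argv(&["myprogram", "commit", "-m", "commit message"]) produces five pointers: four arguments and the terminating NULL. Nothing in this layout knows about quotes or whitespace; "commit message" is one entry simply because the caller passed it as one.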

The Shell

The shell is a program whose main function is to interpret commands. Shells are not terminals, although frequent use of the terminal can blur the distinction.

This is problematic, however. If different shells parse differently, how can programs account for that? Does Clap, for example, account for that? Or are all shells supposed to behave the same?

Luckily, there is a POSIX standard. Shells and Operating Systems might not necessarily comply with it, but it is a starting point.

The details about parsing are a bit too technical and can be overwhelming. I found that this slide can give a good overview. You can also check the code for the GNU bash parser if you are curious about a particular implementation.
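To get a feel for what such quoting rules involve, here is a minimal sketch of a splitter that handles only a small subset of POSIX-style quoting: single quotes, double quotes, and backslash escapes. It is not how bash does it, and it ignores expansions, globs, comments, and operators entirely. The function name split_simplified is hypothetical.

```rust
/// Splits a command line using a small subset of POSIX-style quoting rules.
/// Returns `None` if a quote is left open or the input ends with a lone backslash.
fn split_simplified(input: &str) -> Option<Vec<String>> {
    #[derive(PartialEq)]
    enum State {
        Unquoted,
        Single,
        Double,
    }

    let mut words = Vec::new();
    let mut current = String::new();
    let mut in_word = false; // lets "" count as an (empty) argument
    let mut state = State::Unquoted;
    let mut chars = input.chars();

    while let Some(c) = chars.next() {
        match state {
            State::Unquoted => match c {
                '\'' => { state = State::Single; in_word = true; }
                '"' => { state = State::Double; in_word = true; }
                // An unquoted backslash keeps the next character literally.
                '\\' => { current.push(chars.next()?); in_word = true; }
                c if c.is_whitespace() => {
                    if in_word {
                        words.push(std::mem::take(&mut current));
                        in_word = false;
                    }
                }
                c => { current.push(c); in_word = true; }
            },
            // Inside single quotes everything is literal until the closing quote.
            State::Single => match c {
                '\'' => state = State::Unquoted,
                c => current.push(c),
            },
            // Inside double quotes, a backslash only escapes `"` and `\` here
            // (a real shell also treats `$`, backticks, and newlines specially).
            State::Double => match c {
                '"' => state = State::Unquoted,
                '\\' => match chars.next()? {
                    e @ ('"' | '\\') => current.push(e),
                    other => { current.push('\\'); current.push(other); }
                },
                c => current.push(c),
            },
        }
    }

    if state != State::Unquoted {
        return None; // unterminated quote
    }
    if in_word {
        words.push(current);
    }
    Some(words)
}
```

With this sketch, the input myprogram commit -m "commit message" comes back as four words, with the quotes removed and the message intact. Even this toy version needs a small state machine, which is a good argument for reaching for an existing implementation.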

shlex

Lucky for us, there is a crate that can do this parsing: shlex.

Let's give it a shot!

    let shlex_command = "myprogram commit -m \"commit message\"";
    let shlex_arguments = shlex::split(shlex_command).unwrap();
    dbg!(shlex_arguments);

Bingo!

[src/main.rs:14] shlex_arguments = [
    "myprogram",
    "commit",
    "-m",
    "commit message",
]

std::process::Command

If you have used std::process::Command then you might have wondered, like I did, why you have to parse the arguments before passing them.

use std::process::Command;

// Each element of the array is passed on as exactly one argument.
Command::new("cmd")
    .args(["/C", "echo hello"])
    .output()
    .expect("failed to execute process");

Command accepts an array of arguments, and these have to be split already. You can't pass them as a single string; it would be treated as one argument, and your command would most likely fail. This is because, behind the scenes on Unix, Rust hands the argument list to the Operating System as an argv array, the same form execve takes, just like the shell does. So it expects you to do the parsing of the arguments!
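One way to see that Command does no splitting of its own is to inspect what it has recorded, without running anything. The sketch below uses Command::get_args from the standard library; the helper name recorded_args is hypothetical.

```rust
use std::process::Command;

/// Returns the arguments exactly as `Command` has recorded them.
/// Nothing is executed; the program name is only stored.
fn recorded_args(program: &str, args: &[&str]) -> Vec<String> {
    let mut command = Command::new(program);
    command.args(args);
    command
        .get_args()
        .map(|arg| arg.to_string_lossy().into_owned())
        .collect()
}
```

Passing &["commit", "-m", "commit message"] records three arguments, while passing the single string "commit -m \"commit message\"" records exactly one, quotes included. Any splitting has to happen before the arguments reach Command.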