Writing a basic shell using Flex and Lemon: Part 1

Introduction

I’ve recently decided to switch one of my projects from using the Bison parser generator to Lemon. Bison is the go to standard for parser generators, but after looking into Lemon, I really think that Lemon is safer, easier to use, and a better choice for writing a parser.

Before I start using new technologies, I like to produce a non-trivial sample program so that I can better understand the ins and outs of it before I jump in with a full-sized project. One good-sized problem for this domain is writing a simple shell; the grammar is not terribly complex, but it’s a trivial problem either. It should give an idea of the kinds of problems to expect when expanding to a larger project.

In this tutorial, we will be writing a very simple shell using Flex and Lemon. With this shell, you will be able to run commands with arguments, use command substitution, and pipe output from one program into another. A simple case would just be the command

ls

while a more complex case would be

cat $(find . | grep -i access) | sed -r 's/http[s]?:\/\///g' | cut -d '/' -f 1 | sort | uniq -c

This tutorial won’t produce a shell that can run programs in the background, redirect output to a file, support up/down-arrow history, support command line editing, allow argument expansion, or any other advanced features (although you could add those in). It won’t be powerful enough to replace your main shell any time soon, but it should give you an idea of how Lemon and Flex work.

Lemon

Why not Bison?

As stated earlier, we will be using Lemon to generate a parser. Lemon has several features that I feel make it a better choice than Bison. First, with Bison, the parser directly calls the scanner, which creates a tight coupling between the two. It’s difficult to debug your parser until your scanner is already working. With Lemon, you call the scanner yourself and feed the results into the scanner. This lets you unit test both pieces more easily. Second, Bison supports numerical arguments for its rules, which makes reading the code more difficult and error-prone. For example, a Bison rule for a C-style function would look like:

function:
  returnType LEFT_PARENTHESIS parameterList RIGHT_PARENTHESIS methodBody
  {
     $$ = new FunctionNode($1, $3, $5);
  }

where $$ is the return value for this rule and the $1 $3 $5 tokens are the return values for the returnType parameterList methodBody rules respectively. This is difficult to read because understanding the meaning of the action requires looking at the rule. Quick, without looking back at the rule, what’s the $5 in the action represent?

Contrast this with Lemon’s syntax:

function(FUNCTION) ::= returnType(TYPE) LEFT_PARENTHESIS parameterList(PARAMS) RIGHT_PARENTHESIS methodBody(BODY) .
{
  FUNCTION = new FunctionNode(TYPE, PARAMS, BODY);
}

With Lemon, we provide a name for each of the tokens in parentheses that we want to use in the action. Then, in the action, we refer to the name and don’t have to refer back to the rule to understand which token we are referring to. We can also modify the rule by inserting more tokens without having to update the $1 $3 $5 arguments in the action to match the new positions of the tokens. Lemon will also warn you if you give a name to a token but don’t use it in the action. The dot at the end rule indicates the end of the rule.

Getting Started

Let’s start writing a basic Lemon parser for our shell. Create a file called shellparser.y; this will be our Lemon grammar file.

Let’s start by defining our grammar rules. By convention, terminal symbols (that is, grammar rules that can’t be broken down any more) are specified in all capital letters, and other tokens start with a lower-case letter. For now, we won’t include any actions – we’ll add those after we test our grammar.

Our first rule should allow a bunch of possibly piped commands. Let’s call this commandList.

start ::= commandList .
{
}
commandList ::= command PIPE commandList .
{
}
commandList ::= command .
{
}

Here, I’ve specified that a commandList either consists of a single command followed by a pipe and another commandList, or just a single command. This way, a long list of commands will continue expanding until it hits the last command.

Our next rule is for a single command, which consists of a program’s name and a (possibly empty) series of arguments. Note that because other filenames can be used as arguments, we will have to define a rule for arguments to use either filenames or regular arguments.

command ::= FILENAME argumentList .
{
}
command ::= FILENAME .
{
}
command ::= COMMAND_SUBSTITUTION_START commandList COMAMND_SUBSTITUTION_END .
{
}

argumentList ::= argument argumentList .
{
}
argumentList ::= argument .
{
}
argument ::= ARGUMENT .
{
}
argument ::= FILENAME .
{
}
argument ::= COMMAND_SUBSTITUTION_START commandList COMAMND_SUBSTITUTION_END .
{
}

That should be it for our rules. However, because we are also planning on running these commands at some point, we’ll want to store what our ARGUMENTs were. To do so, we need to declare the types of these terminals. For now, we’ll define them as C-style strings. Add this above the rules:

%token_type {const char*}

If Lemon encounters an error while parsing, it will run the code defined in the %syntax_error section. Let’s define that near the top of our file:

%syntax_error
{
    std::cerr << "Error parsing command\n";
}

To get that to work properly, we will also have to include the iostream header file. Lemon also uses assertions, so we will have to include the header file for those. To make Lemon do that, add this to your file:

%include
{
    #include <cassert>
    #include <iostream>
}

Your full Lemon grammar should look like this (this code is also available on my GitHub):

%include
{
    #include <cassert>
    #include <iostream>
}

%syntax_error
{
    std::cerr << "Error parsing command\n";
}

%token_type {const char*}

start ::= commandList .
commandList ::= command PIPE commandList .
{
}
commandList ::= command .
{
}

command ::= FILENAME argumentList .
{
}
command ::= FILENAME .
{
}
command ::= COMAND_SUBSTITUTION_START commandList COMMAND_SUBSTITUTION_END .
{
}

argumentList ::= argument argumentList .
{
}
argumentList ::= argument .
{
}
argument ::= ARGUMENT .
{
}
argument ::= FILENAME .
{
}
argumentList ::= COMAND_SUBSTITUTION_START commandList COMMAND_SUBSTITUTION_END .
{
}

Testing

Driver

To test our grammar file, let’s write a simple driver file and a Makefile to compile everything. Start by writing a file named Makefile first. Add the following code to it:

shell: main.o shellparser.o
	$(CXX) main.o shellparser.o -o shell

main.o: main.cpp shellparser.hpp

shellparser.cpp: lemonfiles

shellparser.hpp: lemonfiles

.PHONY: lemonfiles
lemonfiles: shellparser.y
	lemon shellparser.y -s
	mv shellparser.c shellparser.cpp
	mv shellparser.h shellparser.hpp

make can figure out some simple rules on its own; for example, shellparser.o only relies on shellparser.cpp, so make will automatically compile it. Some files, such as main.o, need to have their dependencies manually listed so that make will recompile them if their include files change. In this case, make is able to figure out a default action for creating main.o.

The most interesting part is with the .PHONY line. Normally, make expects that a target (the bit before the colon) will be made upon running the actions that follow the target. However, if we simply listed shellparser.cpp and shellparser.hpp as targets with actions that ran Lemon, make would attempt to run Lemon twice, which can cause weird problems. By using .PHONY, we can tell make that the target lemonfiles won’t actually be created by the actions, but that it can still be used as a dependency target.

Now, let’s write our test main file. Lemon will produce a header file that defines the values of the various tokens that it recognizes. Unfortunately, this header does not define the interface to the parser, so we will have to forward declare the interface ourselves. To start the parser, we call the ParseAlloc function with a function to allocate memory with. After that, we call the Parse function with values of tokens that we have scanned. The end of the tokens is indicated with a token value of 0. Finally, we clean up the parser by calling ParseFree.

Edit the file main.cpp:

#include <cstdlib>
#include <iostream>
#include "shellparser.hpp"
using namespace std;

void* ParseAlloc(void* (*allocProc)(size_t));
void Parse(void*, int, const char*);
void ParseFree(void*, void(*freeProc)(void*));

int main()
{
    void* shellParser = ParseAlloc(malloc);
    // Simulate a command line such as "cat main.cpp | wc"
    Parse(shellParser, FILENAME, "cat");
    Parse(shellParser, ARGUMENT, "main.cpp");
    Parse(shellParser, PIPE, 0);
    Parse(shellParser, FILENAME, "wc");
    Parse(shellParser, 0, 0); // Signal end of tokens
    ParseFree(shellParser, free);
    return 0;
}

Compiling

Let’s test our program by compiling it. Because we specified a valid list of commands in our driver, it should run without calling the %syntax_error directive we defined in our grammar file. Compile everything by typing:

make

and run the driver by typing:

./shell

If no message is printed, then everything worked!

Breaking

Now let’s check to see if the parser fails when given invalid tokens. Our shell should refuse to parse the command line:

cat main.cpp | | wc

because there are two pipes. Edit the main.cpp file and make two copies of the PIPE line to simulate this command line. Recompile and run your program, and it should print an error.

Advertisements
This entry was posted in Uncategorized and tagged , . Bookmark the permalink.

6 Responses to Writing a basic shell using Flex and Lemon: Part 1

  1. Pingback: Writing a simple shell using Flex and Lemon: part 2 | Brandon's Thoughts

  2. Small fix for your main.cpp: s/PROGRAM/FILENAME/g

    Thanks for a nice tutorial! See my comment on part 2 as well.

    • bskari says:

      Whoops, you’re right. I’ll go fix that now, thanks!

      • NP, I’ve got another one 🙂 I think the return type for Parse and ParseFree should have void return type, not void* (copy/paste error from ParseAlloc?). I think that’s why I’m currently getting a link error on Windows/MSVC (need to wait for VM to start up before I know for sure. GCC had no problem with it.

  3. Now confirmed that that s/void*/void/ on the return type for ParseFree and Parse fixes linking on Windows. Seems MSVC is pickier than GCC and wants the types to match exactly. I sent a pull request on GitHub.

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s