Writing a simple shell using Flex and Lemon: part 2

Introduction

In part 1, we created a simple parser and driver for a basic shell that supports piping and command substitution. In order to test our parser, the driver in our last part only simulated reading tokens. In this part, we’re going to be using Flex to generate a real scanner that’s capable of reading tokens and passing them to the parser.

One other reason that I prefer Lemon over Bison is that Lemon is automatically reentrant and thread safe. Flex is not normally thread safe, but we’ll do some extra work to make it that way so that we can use it with multiple threads if we need to. In the process, we’ll also write less tightly coupled code.

Flex

Flex is a tool that lets us write a scanner using a higher level descriptive input, similar to how we used our Lemon grammar file to produce a parser. Flex is nice in that it allows us to use regular expressions to match tokens, and can automatically ignore white space.

Our scanner needs are pretty simple: our parser only has four tokens, PIPE, two tokens for starting and ending command substitutions, FILENAME which will match Unix files, and ARGUMENT, which could be a program name, a file name, a flag, or some other argument. The first three are relatively easy, but the last will require a little more work because we’re going to allow quoting of parameters.

Flex syntax

A Flex specification file consists of three sections. The first section contains configuration options, the second contains the rules for matching tokens, and the third allows you to enter code that will be copied to the end of the output file. Each section is separated by the token %%. We will only use the first two sections here.

It’s worth mentioning that Flex will automatically match the longest rule that it can find. If you have a regular expression that can match anything, make sure that Flex only matches a single character; otherwise, it will prefer that rule over other ones, even though the others may be a better fit.

Configurations section

Start editing a file named shellscanner.l. The first thing that we will want to do is to include the generated tokens definition file from Lemon, so add this at the top:

%{
    #include "shellparser.h"
%}

Next, we can put various options to control Flex’s behavior. The first option we’re interested in is one that makes Flex reentrant, so add this next:

%option reentrant

Flex will also call a function named yywrap after it has finished scanning so that you can, for example, scan multiple files in a chain. We don’t care about this, so add

%option noyywrap

to prevent Flex from trying to call it.

Now we will define various states that the scanner can move into. In the normal state, the scanner just reads tokens and returns their values, but there are times where we want to change into a special state and have differently defined actions. For example, in C, identifiers consist of a string of alphanumeric characters, so Flex will return an IDENTIFIER value whenever it reads one. However, if we are parsing a comment, then we want Flex to ignore these strings, so we define a separate state for parsing while we are in a comment.

In our example, we’ll have one state for quoted parameters. Other shells behave differently when presented with single-quoted versus double-quoted parameters (e.g. double-quoted parameters honor escaping, but single-quoted parameters don’t) but we’ll ignore those differences here. Define the two states by entering:

%x SINGLE_QUOTED
%x DOUBLE_QUOTED

That should be it for this section, so enter in

%
%

and we’ll get on to the next section.

Rules section

Rules in Flex start with an optional state identifier in angle brackets (if not in the default state), a regular expression to match in the input, followed by some action to take in curly braces. The normal action that you will take is to return the value that represents the token that you matched, although sometimes you may not want to return anything at all, such as when you are parsing whitespace.

Our first rule is our easiest. If we see the pipe character, we return the PIPE token.

"|"	{return PIPE;}

We’ll do similar things for the command substitution rules.

"$("    {return COMMAND_SUBSTITUTION_START;}
")"     {return COMMAND_SUBSTITUTION_END;}

Our next rule will ignore any whitespace that shows up in our command line.

[ \t\n\r]	{}

The next rule will match Unix filenames. Filenames in Unix can use just about any character, but we’re going to limit it to alphanumeric characters and underscores, periods, and dashes. Our action in this case will be to just return the FILENAME token defined in the Lemon-generated header file.

[a-zA-Z0-9_\.\-]	{return FILENAME;}

Note that we had to escape the period and the dash, as both have special meanings in regular expressions.

Our next rules will handle quoted arguments. These are more complicated because we have to use states. To start, we’ll specify how to enter a state, then what actions to take while we’re in that state, and how to get out of it, and what to return when we’re done.

We’ll start with single quoted arguments first. To start, we enter the QUOTED state any time we encounter a single quote character. Flex has a special action named BEGIN that will switch to other states that we have defined in the first section, or back to the DEFAULT state.

"'"	{BEGIN(SINGLE_QUOTED);}

Once we’re in the SINGLE_QUOTED state, we want to ignore keep reading characters until we hit the closing quote, at which point we are can go back to the normal INITIAL scanning state.

<SINGLE_QUOTED>[^']+	{}
<SINGLE_QUOTED>"'"	{BEGIN(INITIAL); return ARGUMENT;}

One final note is that if we start quoting an argument but never find the closing quote, we need to signal an error. Flex has a special rule named &lt&ltEOF>>, so if we encounter it, we’ll just return -1 because -1 is an invalid state.

<SINGLE_QUOTED><<EOF>>	{return -1};

Our double quoted arguments will be similar, except that we’ll start and end with double quotes instead.

["]	{BEGIN(DOUBLE_QUOTED);}
<DOUBLE_QUOTED>[^"]	{}
<DOUBLE_QUOTED>["]	{BEGIN(INITIAL); return ARGUMENT;}
<DOUBLE_QUOTED><<EOF>>	{return -1};

The only thing we have left to match is other arguments, such as --color=auto. For this case, we just want to match any string consisting of not whitespace and quotes.

[^ \t\r\n|'"]+	{return ARGUMENT;}

Finally, we can close this section and finish this file by entering

%
%

Your full file should look like this (you can also grab the source from my GitHub:

%{
    #include "shellparser.hpp"
%}

%option reentrant
%option noyywrap

%x SINGLE_QUOTED
%x DOUBLE_QUOTED

%%

"|"    {return PIPE;}
"$("    {return COMMAND_SUBSTITUTION_START;}
")"     {return COMMAND_SUBSTITUTION_END;}

[ \t\r\n]    {}

[a-zA-Z0-9\-\._]+    {return FILENAME;}

[']                     {BEGIN(SINGLE_QUOTED);}
<SINGLE_QUOTED>[^']+    {}
<SINGLE_QUOTED>[']      {BEGIN(INITIAL); return ARGUMENT;}
<SINGLE_QUOTED><<EOF>>  {return -1;}

["]                     {BEGIN(DOUBLE_QUOTED);}
<DOUBLE_QUOTED>[^"]+    {}
<DOUBLE_QUOTED>["]      {BEGIN(INITIAL); return ARGUMENT;}
<DOUBLE_QUOTED><<EOF>>  {return -1;}

[^ \t\r\n|'"]+    {return ARGUMENT;}

%%

One final note. If Flex can’t match some input against a rule, it will by default just print that input. A lot of tutorials therefore recommend putting a rule at the end that will match any single character and then returning an error if that rule is matched. I haven’t done that here because we should be matching everything with our own rules. The good news is that Flex is smart enough to recognize if a rule can’t be matched and will print a warning, so you can test that everything is being matched by adding this rule at the end:

.    {return -1;}

Updating the Lemon parser

In part 1, if Lemon ever encountered a syntax error, it would just print a message to stderr. That worked fine when we were just testing the parser, but now that we’re integrating a scanner, we’ll need a better way of reporting errors.

Lemon allows you define and send an additional parameter to the parse function that will then be accessible in the parser’s rules. If you were implementing a compiler or something similar, you could use this parameter to send a struct that counts how many tokens you had seen or keep track of other information as the parser is running. For our purposes, we’ll just send a boolean by pointer that the parser will set to false if it encounters an error.

First, define the extra parameter by adding:

%extra_argument {bool* valid}

then change the %syntax_error section to read:

%syntax_error {
    *valid = false;
}

Updating the driver

We’ll use a driver that’s very similar to what we did last time, but now we’ll manually call the scanner. We also have to initialize the scanner ourselves with the input that we read from the user.

First, we’ll write a separate function that we will call from main to parse a single command line. It will take a string that is the command line to be parsed. In this function, we’ll have to initialize the Flex scanner and keep reading tokens and calling the parser until we either hit the end of file or encounter an error. Then, our main function just needs to read one line at a time and call the parse function with that line. You can also grab this code from my GitHub. If C is more your style, you can check out David E. Wheeler’s port too.

#include <cstdlib>
#include <iostream>
#include <string>
#include "shellparser.hpp"
#include "shellscanner.yy.hpp"
using namespace std;

void* ParseAlloc(void* (*allocProc)(size_t));
void Parse(void* parser, int token, const char*, bool* valid);
void ParseFree(void* parser, void(*freeProc)(void*));

void parse(const string& commandLine) {
    // Set up the scanner
    yyscan_t scanner;
    yylex_init(&scanner);
    YY_BUFFER_STATE bufferState = yy_scan_string(commandLine.c_str(), scanner);

    // Set up the parser
    void* shellParser = ParseAlloc(malloc);

    bool validParse;
    int lexCode;
    do {
        lexCode = yylex(scanner);
        Parse(shellParser, lexCode, NULL, &validParse);
    }
    while (lexCode > 0 && validParse);

    if (-1 == lexCode) {
        cerr << "The scanner encountered an error.\n";
    }

    if (!validParse) {
        cerr << "The parser encountered an error.\n";
    }

    // Cleanup the scanner and parser
    yy_delete_buffer(bufferState, scanner);
    yylex_destroy(scanner);
    ParseFree(shellParser, free);
}

int main() {
    string commandLine;
    while (getline(cin, commandLine)) {
        parse(commandLine);
    }
    return 0;
}

Updating the Makefile

Now that we’ve added some new files, we’ll have to udpdate the dependencies and add some new rules to make the Flex source files by running flex against the scanner specification file. We’ll use the same .PHONY specification trick as we did with the parser to create a fake target and list it as a dependency.

Also, because our folder is getting somewhat cluttered with files generated from Lemon and Flex and with object files, we’ll create another target named clean that will remove all of these genereated files. We can clean up our directory by just running

make clean

Just like with our Flex and Lemon targets, clean isn’t a real target, so we’ll use the same .PHONY trick. You can also grab this code from my GitHub.

shell: main.o shellparser.o shellscanner.yy.o
	$(CXX) main.o shellparser.o shellscanner.yy.o -o shell

main.o: main.cpp shellscanner.yy.hpp shellparser.hpp

shellparser.cpp: lemonfiles

shellparser.hpp: lemonfiles

.PHONY: lemonfiles
lemonfiles: shellparser.y
	lemon shellparser.y -s
	mv shellparser.c shellparser.cpp
	mv shellparser.h shellparser.hpp

shellscanner.o: shellscanner.yy.hpp shellscanner.yy.cpp

shellscanner.yy.cpp: flexfiles

shellscanner.yy.hpp: flexfiles

.PHONY: flexfiles
flexfiles: shellscanner.l
	flex --outfile=shellscanner.yy.cpp --header-file=shellscanner.yy.hpp shellscanner.l

.PHONY: clean
clean:
	rm -f *.o
	rm -f shellscanner.yy.cpp shellscanner.yy.hpp
	rm -f shellparser.cpp shellparser.hpp shellparser.out
	rm -f shell

Testing

Now that we have everything in place, we should be ready to compile and test our program. Compile it by typing

make

and run it by typing

./shell

You can now start entering command line strings. If nothing gets printed, then that means the command was successfully parsed; an error message means that something went wrong. Test it by typing some example lines, such as:

cat main.cpp "hello world.txt" | wc -l
ls
cat $(find . | grep 'cpp$') | grep -v -P '^\s*$' | wc -l
echo "foo" 'bar' |
| ./shell
find . | grep test | grep tos | grep -v -i photos | wc
find . | |

Some of these should parse perfectly.

Advertisements
This entry was posted in Uncategorized and tagged , . Bookmark the permalink.

5 Responses to Writing a simple shell using Flex and Lemon: part 2

  1. FYI: I ported this to C and uploaded it to GitHub. Thanks for the tutorial, very helpful.

  2. Great tutorial. However, I can’t find the code for part 2 on your GitHub, only part 1?

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s