One important thing that computers do is recognise whether a sequence of objects forms a valid statement in a language. For example,
x := x + 1
is a Pascal statement, while
the cat eats the canary.
is a statement of the English language: a Pascal compiler must be able to recognise that the first statement is legal Pascal, while a program aiming to provide an intelligent natural-language interface to users should be able to recognise that the second statement is a sentence of English.
A parser is a Scheme function which recognises a phrase in a language, that is to say in the case of English, some well-formed part of a sentence. For example, we regard "the cat" as a phrase of the English language. Why do we focus on recognising phrases? Because we can only build a parser for something as complicated as a sentence by building parsers for small parts of a language first. We shall see in this lecture how we can systematically build up a parser for a given language by building up parsers to recognise ever larger phrases of the language. For example, if we are creating a parser for English, we will create parsers to recognise nouns, and determiners (the words "a" and "the") and integrate these parsers into one for recognising noun-phrases; we then build a parser for sentences out of noun- and verb- phrase parsers.
Formally, languages are defined by grammars. Commonly grammars are written using productions, which specify how a high-level concept, such as sentence can be expressed or rewritten in terms of lower-level concepts such as phrases, which in turn are expressed in terms of still lower level concepts such as words. For example, the production
sentence -> noun_phrase verb_phrase
says that a sentence is a noun-phrase followed by a verb-phrase. In Scheme terms, this is saying that the list of words which constitutes a sentence, for example:
'(the cat eats the canary)
can be computed by appending the list of words which constitutes a noun-phrase, (the cat), to the list of words which constitutes a verb-phrase, (eats the canary):
(append '(the cat) '(eats the canary))
We speak of terminal symbols as being those entities in the grammar which cannot be rewritten, and which are thus words (or "tokens", or "lexemes") of the language. Those entities in the grammar which can be rewritten are called "non-terminal symbols". So, in writing a grammar, it is necessary to distinguish between terminal and non-terminal symbols. We shall use the notation that we have already been using informally to describe Scheme, in which non-terminals are written using bold typewriter characters, while terminals are written using plain typewriter characters. A grammar for a small subset of English might be written as
sentence -> noun_phrase verb_phrase
noun_phrase -> determiner noun
determiner -> the
determiner -> a
noun -> cat
noun -> dog
noun -> canary
verb_phrase -> verb noun_phrase
verb -> eats
verb -> likes
For convenience, the vertical bar can be used to indicate alternative right-hand sides:
verb -> eats | likes
The earliest convention for writing productions used by computer
scientists is Backus-Naur Form (BNF), in which non-terminal
symbols are enclosed in angle brackets. For example
<sentence> -> <noun_phrase> <verb_phrase>
<determiner> -> the | a
Another convention that is used is to quote terminal symbols, thus:
sentence -> noun_phrase verb_phrase
determiner -> "the" | "a"
We have said above that a parser is a function for recognising that a particular sequence of terminal symbols is a legal statement in a language. As well as performing the recognition, practical applications of parsers usually require that some kind of parse-tree is produced to represent the logical structure of the statement. For example, a compiler needs a parse-tree in order to generate code. A natural language query system requires a parse tree in order to "understand" a question put to it and to generate an answer.
In computer science we normally call the terminal symbols fed to a parser "tokens" - they are the equivalent of words in the English language.
If we are writing a parser in Scheme, the natural thing to do is to use Scheme lists to represent parse-trees. So, a Pascal parser might parse:
'(x + 2 * y)
as the Scheme structure
'(+ x (* 2 y))
We might think of a parser as being a function that takes a list of tokens and returns a parse-tree. However, that raises two problems. The first is relatively trivial: we have to find some way of signalling that a list of tokens is not in fact a sentence of the language. We could return #f, provided we were sure that #f couldn't be mistaken for an actual parse-tree.
The second problem is more serious - it's to do with what might be called compositionality. It's obviously hopeless to try to write one huge function which will perform a parse of a complex language. So, we are going to have to build a parser out of smaller functions that we write. What sort of functions? Well, the obvious choice is to build big parsers out of smaller ones. We can use the language definition as a guide to doing this.
sentence -> noun_phrase verb_phrase
If we want a parser for a sentence, presumably it makes sense to try to build it out of a parser for the grammatical class noun_phrase and of a parser for the class verb_phrase. So, suppose we want our sentence-parser to parse this sentence:
(the cat eats the canary)
What text are the noun_phrase parser and the verb_phrase parser going to have to work on? Clearly the noun_phrase-parser is going to work on the whole list:
(the cat eats the canary)
recognising (the cat) as being a noun_phrase. But the verb_phrase-parser is going to have to work on (eats the canary). So we have to provide some way of giving it its correct argument. If we simply gave it the cddr of the original list, this would be making the assumption that each noun phrase consists of exactly two tokens, which is actually true for our existing grammar, but would build in an assumption that could be very wrong if we wanted to extend it, perhaps by allowing optional adjectives in front of the noun, as in (the furry cat eats the canary). And indeed, it would be a very bad mistake when creating parsers to assume that any given grammatical class had a fixed length.
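For instance, taking the cddr works only by accident of our tiny grammar:
(cddr '(the cat eats the canary))         ; => (eats the canary)    -- happens to be right
(cddr '(the furry cat eats the canary))   ; => (cat eats the canary) -- not the verb phrase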
Much better is to require a parser to return as its result a record which contains the part of the original list of tokens which remains unparsed. That way we can assemble the two parsers for noun phrases and verb phrases together (or any other two). We can think of the first parser as "eating" tokens until it is satisfied, when what remains is given to the second parser.
Note that it is adequate if we require that a parser recognise that the list of tokens that it is given as an argument begin with a sequence of tokens drawn from the language to be recognised. Thus a parser for English would not accept '(gobbledegook the cat eats the canary).
So we need records to store our parses in. Each record has two components, the tree_parse component is a representation of what the parser has actually found, and the rest_parse component is the list of tokens remaining unparsed.
If we are using UMASS Scheme we can define a record class for parses as follows:
(define class_parse (record-class 'parse '(full full)))
(define cons_parse (car class_parse))
(define sel_parse (caddr class_parse))
(define tree_parse (car sel_parse))
(define rest_parse (cadr sel_parse))
Here the record-class function creates a class of opaque records which can only be created by a constructor function, and only be accessed by selector functions. The call of record-class produces a list, class_parse, containing the functions necessary to create and access records.
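For example, assuming the UMASS Scheme record class behaves as just described, we could construct and inspect a parse record by hand:
(define p (cons_parse 'the '(cat eats the canary)))   ; make a parse record
(tree_parse p)   ; => the                       -- the parse-tree component
(rest_parse p)   ; => (cat eats the canary)     -- the unparsed tokens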
Now, a parser can signal failure by returning #f without risk of confusion, since if it succeeds it will always return one of these parse-records.
Thus we can regard a parser as a function which takes as argument a list of tokens and returns as result either a parse-record, if the list begins with a phrase of the grammatical class being parsed, or #f if it does not.
For example, we might write a parser which recognised expressions in the Pascal language. In this case we could return a Scheme expression (in internal form) which represented the Pascal expression. Thus the Pascal expression x+2*y would be represented by the Scheme expression
(+ x (* 2 y))
Thus suppose we have a parser for Pascal called parse_pascal, and we apply it to a list of Pascal tokens
(define p (parse_pascal '(x + 2 * y ; z := 4;)) )
then the Scheme variable p will have the value which prints as:
<parse (begin (+ x (* 2 y)) (:= z 4)) '()>
if we choose an obvious way to represent Pascal as Scheme.
To make a parser for the whole English language is hard; even to make a parser for a programming language is quite hard. However, let us make a start at a parser for some English noun phrases:
noun_phrase -> determiner noun
determiner -> the
determiner -> a
That is to say, a determiner is the word "a" or "the". So let's write the parser for a determiner. The function takes a list of tokens as argument. If (1) the list of tokens begins with the symbol 'a or with the symbol 'the, then we do indeed have a determiner, so (2) we create a parse-record, consisting of the actual symbol found (3), and the unparsed list (4). Otherwise (5) we return #f to indicate failure.
(define parse_determiner
  (lambda (list_of_tokens)
    (cond
      ((null? list_of_tokens) #f)
      ((member? (car list_of_tokens) '(a the)) ; (1) determiner?
       (cons_parse                             ; (2) yes!, make parse
         (car list_of_tokens)                  ; (3) tree
         (cdr list_of_tokens)))                ; (4) unparsed list
      (else #f)                                ; (5) no. fail
    )                                          ; end cond
  )                                            ; end lambda
)                                              ; end definition
We will need the definition of member? from Lecture 5:
(define (member? x list)
  (if (null? list)
      #f
      (if (equal? x (car list))
          #t
          (member? x (cdr list)))))
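For example:
(member? 'the '(a the))   ; => #t
(member? 'cat '(a the))   ; => #f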
We can now try this out. The call
(parse_determiner '(the cat eats the canary))
returns the parse-record:
<parse the (cat eats the canary)>
So, here the tree_parse component of the parse record is the symbol 'the (which is a very simple tree...), while the rest_parse component is '(cat eats the canary), that is, the original list with the determiner removed.
While if there is no determiner:
(parse_determiner '(eats the canary))
we get
#f
Note that we have to allow for the possibility that the list of tokens might be empty: the (null? list_of_tokens) clause at the start of the cond handles this, returning #f rather than attempting to take the car of an empty list.
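So, for example:
(parse_determiner '())      ; => #f  -- empty token list, rejected by the null? clause
(parse_determiner '(the))   ; => a parse record printing as <parse the ()>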
Likewise we could define a noun as a member of a list of words known to be nouns. Now there are tens if not hundreds of thousands of nouns in the English language, so this would be a long list (and expensive to look through), but we can restrict our vocabulary.
(define noun '(cat dog child woman man bone cabbage canary))

(define parse_noun
  (lambda (list_of_tokens)
    (cond
      ((null? list_of_tokens) #f)
      ((member? (car list_of_tokens) noun)     ; (1) noun?
       (cons_parse                             ; (2) yes!, make parse
         (car list_of_tokens)                  ; (3) tree
         (cdr list_of_tokens)))                ; (4) unparsed list
      (else #f)                                ; (5) no. fail
    )                                          ; end cond
  )                                            ; end lambda
)                                              ; end definition
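We can try this out in the same way as parse_determiner:
(parse_noun '(cat eats the canary))   ; => <parse cat (eats the canary)>
(parse_noun '(eats the canary))       ; => #f  -- eats is not in our noun list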
Now, can we write a parser for a noun_phrase? We want to write this in a way which makes use of our two existing parsers, rather than trying to write a function that sees if the list_of_tokens has a determiner as a first word and a noun as second. This is NOT what we should do:
(define parse_noun_phrase
  (lambda (list_of_tokens)                     ; DON'T DO THIS
    (if (and (member? (car list_of_tokens) '(a the))
             (member? (cadr list_of_tokens) noun))
        (cons_parse
          (list 'noun_phrase
                (car list_of_tokens)
                (cadr list_of_tokens))
          (cddr list_of_tokens))
        #f)))
Why is writing the above function not a good way to proceed? Well, primarily because it does not match our grammar well.
noun_phrase -> determiner noun
There's no way we could extend the kind of parser we see above to handle a language with a complex grammar. Instead we should rely on our little parsers being designed to work together to make a big parser.
At (1) we call parse_determiner to see if it can find a determiner at the start of the list of tokens. If (2) it has succeeded, we call (3) parse_noun to find a noun starting where parse_determiner left off. If this second parse has succeeded (4), then (5) we create a parse-record whose tree (6) consists of the trees from the parse-record for the determiner (7) and from the noun-record (8). The list of remaining tokens (9) for the parse-record consists of those remaining from the parsing of the noun. If the second parse failed (10), then we return #f, and likewise if the first parse failed (11) we return #f.
(define parse_noun_phrase
  (lambda (list_of_tokens)
    (let ((p_det (parse_determiner list_of_tokens)))    ; (1)
      (if p_det                                         ; (2)
        (let ((p_n (parse_noun (rest_parse p_det))))    ; (3)
          (if p_n                                       ; (4)
            (cons_parse                                 ; (5)
              (list 'noun_phrase                        ; (6)
                (tree_parse p_det)                      ; (7)
                (tree_parse p_n))                       ; (8)
              (rest_parse p_n))                         ; (9)
            #f)                                         ;(10)
        )                                               ;end let
        #f)                                             ;(11)
    )                                                   ;end let
)) ; end def. parse_noun_phrase
Now we can try this out
(parse_noun_phrase '(the cat eats the canary))
obtaining
<parse (noun_phrase the cat) (eats the canary)>
So, here the tree_parse component of the parse record is the tree '(noun_phrase the cat), while the rest_parse component is '(eats the canary), that is, the original list with the noun-phrase removed.
And, if the parse fails:
(parse_noun_phrase '(eats the canary))
we get
#f
Likewise if we have a determiner first, but no noun second:
(parse_noun_phrase '(the the canary))
we get
#f
Similarly, we can define a parser for verbs:
(define verb '(likes eats hugs))

(define parse_verb
  (lambda (list_of_tokens)
    (cond
      ((null? list_of_tokens) #f)
      ((member? (car list_of_tokens) verb)     ; (1) verb?
       (cons_parse                             ; (2) yes!, make parse
         (car list_of_tokens)                  ; (3) tree
         (cdr list_of_tokens)))                ; (4) unparsed list
      (else #f)                                ; (5) no. fail
    )                                          ; end cond
  )                                            ; end lambda
)                                              ; end definition
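Again we can try it out:
(parse_verb '(eats the canary))   ; => <parse eats (the canary)>
(parse_verb '(the canary))        ; => #f  -- the is not a known verb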
We are going to get fed up with changing the names of the p_... variables, so let us call them p1 and p2.
(define parse_verb_phrase
  (lambda (list_of_tokens)
    (let ((p1 (parse_verb list_of_tokens)))
      (if p1
        (let ((p2 (parse_noun_phrase (rest_parse p1))))
          (if p2
            (cons_parse
              (list 'verb_phrase
                (tree_parse p1)
                (tree_parse p2))
              (rest_parse p2))
            #f)
        )                                               ;end let
        #f)
    )                                                   ;end let
)) ; end def. parse_verb_phrase
(example '(parse_verb_phrase '(eats the canary))
         (cons_parse '(verb_phrase eats (noun_phrase the canary))
                     '()))
Now we can define a parser for a sentence of the English language, following the production
sentence -> noun_phrase verb_phrase
(define parse_sentence
  (lambda (list_of_tokens)
    (let ((p1 (parse_noun_phrase list_of_tokens)))
      (if p1
        (let ((p2 (parse_verb_phrase (rest_parse p1))))
          (if p2
            (cons_parse
              (list 'sentence
                (tree_parse p1)
                (tree_parse p2))
              (rest_parse p2))
            #f)
        )                                               ;end let
        #f)
    )                                                   ;end let
)) ; end def. parse_sentence
Now we can try out our complete sentence-parser. If we try it on the sentence '(the cat eats the canary) we see that we obtain a parse:
(example '(parse_sentence '(the cat eats the canary))
         (cons_parse
           '(sentence (noun_phrase the cat)                        ; parse-tree
                      (verb_phrase eats (noun_phrase the canary))) ; end of parse-tree
           '()                                                     ; unparsed
         )                                                         ; end of parse
)
If we give it a non-sentence like '(canary the cat eats) we get:
(example '(parse_sentence '(canary the cat eats)) #f)
that is, the parse fails.
If we give it a list that begins with a sentence but has some nonsense at the end, the parse succeeds, leaving the nonsense unparsed. We can use the example capability to compare the result of the parse (1) with what we expect (2). Run the example - it works!
(example '(parse_sentence '(the dog eats the bone 4 5 6))          ;(1)
         (cons_parse                                                ;(2)
           '(sentence (noun_phrase the dog)
                      (verb_phrase eats (noun_phrase the bone)))
           '(4 5 6)
         )
)
Here we have the parse-tree:
'(sentence (noun_phrase the dog) (verb_phrase eats (noun_phrase the bone)))
while the unparsed residual is the list '(4 5 6).
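Notice that parse_noun_phrase, parse_verb_phrase and parse_sentence all have exactly the same two-step shape: run one parser, and if it succeeds run a second parser on whatever it left unparsed. Purely as an illustration of that shared pattern (the helper parse_sequence below is a hypothetical name, not part of the lecture's code), we could capture it in a single function:
; A sketch only: parse_sequence is a hypothetical helper, not used elsewhere
; in these notes. It runs parser1, then parser2 on the leftover tokens, and
; labels the combined tree with the given symbol.
(define parse_sequence
  (lambda (parser1 parser2 label list_of_tokens)
    (let ((p1 (parser1 list_of_tokens)))
      (if p1
          (let ((p2 (parser2 (rest_parse p1))))
            (if p2
                (cons_parse
                  (list label (tree_parse p1) (tree_parse p2))
                  (rest_parse p2))
                #f))
          #f))))

; parse_sentence could then have been written as:
; (define parse_sentence
;   (lambda (list_of_tokens)
;     (parse_sequence parse_noun_phrase parse_verb_phrase 'sentence list_of_tokens)))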
Naturally(!) there is quite a lot our parser does not know about English. For example, if we try out
(example '(parse_sentence '(the cat eats the canary with the yellow feathers))
         (cons_parse
           '(sentence (noun_phrase the cat)
                      (verb_phrase eats (noun_phrase the canary)))
           '(with the yellow feathers)
         )
)
that is, the prepositional phrase '(with the yellow feathers) is left unrecognised. Note that this form of sentence also poses a problem of ambiguity: does the canary have yellow feathers, or does the cat use yellow feathers as an instrument with which to eat the unfortunate bird? We know that canaries are yellow-feathered birds and that cats are not given to using tools as instruments - but that is semantic knowledge that cannot readily be built into syntax. The parallel sentence "the man eats the turkey with the knife and fork" has the opposite structure. Ambiguity can occur in computer languages, but language designers try to avoid it. Moreover, we can parse nonsense sentences such as:
(example '(parse_sentence '(the cabbage eats the man))
         (cons_parse
           '(sentence (noun_phrase the cabbage)
                      (verb_phrase eats (noun_phrase the man)))
           '()
         )
)
Distinguishing grammatically correct sense from grammatically correct nonsense is an issue of semantics.
We can make good use of the trace function to help us debug our parsers. Below is a trace, slightly edited to make it viewable on a web-browser.
(trace parse_verb_phrase)
(trace parse_verb)
(trace parse_noun_phrase)
(parse_verb_phrase '(eats the canary))
This produces the output:
(parse_verb_phrase (eats the canary) )
|(parse_verb (eats the canary) )
|parse_verb = <parse eats (the canary)>
|(parse_noun_phrase (the canary) )
|parse_noun_phrase = <parse (noun_phrase the canary) ()>
parse_verb_phrase = <parse (verb_phrase eats (noun_phrase the canary)) ()>