Let's Write a Compiler from Scratch - Writing the Lexical Analyzer (2)

For every state other than Normal, I only need to care about two transitions:

  1. When to jump from Normal into that state.
  2. When to jump from that state back to Normal.

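In the previous chapter I wrote the code that jumps from the Normal state to each of the other states. In this chapter I will finish the code that handles characters in the non-Normal states, along with the code that jumps back to Normal. The one exception is the Sign state, which is still marked TODO in the full listing at the end.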

First, Identifier:

Recall from the previous chapter: when does the Normal state jump to the Identifier state?

if(state == State.Normal) {
    if(inIdentifierSetButNotRear(c)) {
        state = State.Identifier;
    }
}

And while in the Identifier state:

else if(state == State.Identifier) {

    if(inIdentifierSetButNotRear(c)) {
        readBuffer.append(c);

    } else if(include(IdentifierRearSign, c)) {
        createType = Type.Identifier;
        readBuffer.append(c);
        state = State.Normal;

    } else {
        createType = Type.Identifier;
        state = State.Normal;
        moveCursor = false;
    }
}

This code uses IdentifierRearSign, so here is its definition.

private static final char[] IdentifierRearSign = new char[] {'?', '!'};

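The code above says the following. In the Normal state, reading a digit, an English letter, or an underscore jumps to the Identifier state. After that, as long as the lexer keeps reading digits, letters, or underscores, it buffers them and stays in the Identifier state, until one of two things happens:

  1. It reads "?" or "!", the two characters that may only appear at the end of an Identifier. It appends that character, immediately creates an Identifier Token, and jumps back to the Normal state.

  2. It reads any other character, that is, one that is not a digit, letter, or underscore. It immediately creates the Identifier Token and jumps back to the Normal state, but does not move the cursor, so that the Normal-state code can decide what kind of Token this character belongs to.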
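The two exit rules are easier to see when the state machine is flattened into a plain loop over a string. The following is only a sketch under that assumption: IdentifierScanSketch and scanIdentifier are hypothetical names that do not appear in the lexer, and the returned index plays the role of the cursor.

```java
// Standalone sketch of the Identifier rules. IdentifierScanSketch and
// scanIdentifier are hypothetical names, not part of LexicalAnalysis.
final class IdentifierScanSketch {

    // Same character set as inIdentifierSetButNotRear in the lexer.
    private static boolean isIdentifierBody(char c) {
        return (c >= 'a' && c <= 'z') || (c >= 'A' && c <= 'Z')
                || (c >= '0' && c <= '9') || c == '_';
    }

    /**
     * Scans one identifier starting at src[start] and returns the index of
     * the first character that does NOT belong to it.
     */
    static int scanIdentifier(String src, int start) {
        int i = start;
        while (i < src.length() && isIdentifierBody(src.charAt(i))) {
            i++; // ordinary identifier character: consume it and keep going
        }
        if (i > start && i < src.length()
                && (src.charAt(i) == '?' || src.charAt(i) == '!')) {
            i++; // rear sign: consume it, and the identifier ends right here
        }
        return i; // whatever sits at index i is left for the Normal state
    }

    public static void main(String[] args) {
        String src = "empty?+x";
        int end = scanIdentifier(src, 0);
        System.out.println(src.substring(0, end)); // empty?
        System.out.println(src.charAt(end));       // +
    }
}
```

In the first case the rear sign is consumed, so the cursor ends up past it. In the second case the cursor stops on the foreign character, which corresponds to moveCursor = false in the real code.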

Next, Annotation (comments):

Recall from the previous chapter: when does the Normal state jump to the Annotation state?

if(state == State.Normal) {
    ...
    else if(c == '#') {
        state = State.Annotation;
    }
}

The code for the Annotation state is as follows.

else if(state == State.Annotation) {

    if(c != '\n' && c != '\0') {
        readBuffer.append(c);

    } else {
        createType = Type.Annotation;
        state = State.Normal;
        moveCursor = false;
    }
}

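An Annotation starts at the "#" character and ends when a line break is read (or when the source code runs out). Note that the cursor does not move when the comment ends, because the last character read must be handled by the Normal state that we jump back to. (That produces a NewLine Token or an EndSymbol Token.)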
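Since read() is still a TODO at this point, it may help to see how a driver loop could honor the moveCursor flag. The sketch below is a guess at that shape, not the actual implementation: MoveCursorSketch and lex are hypothetical names, the machine has only the Normal and Annotation states, and tokens are plain strings instead of Token objects.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical two-state lexer (Normal + Annotation only) showing how a
// driver loop can honor moveCursor. None of these names come from the
// original LexicalAnalysis class.
final class MoveCursorSketch {

    private enum State { Normal, Annotation }

    static List<String> lex(String source) {
        List<String> tokens = new ArrayList<>();
        StringBuilder buffer = new StringBuilder();
        State state = State.Normal;
        String text = source + '\0'; // '\0' marks the end of the source
        int cursor = 0;

        while (cursor < text.length()) {
            char c = text.charAt(cursor);
            boolean moveCursor = true;

            if (state == State.Normal) {
                if (c == '#') {
                    buffer.append(c);
                    state = State.Annotation;
                } else if (c == '\n') {
                    tokens.add("NewLine");
                } else if (c == '\0') {
                    tokens.add("EndSymbol");
                }
                // every other character is simply skipped in this sketch
            } else { // State.Annotation
                if (c != '\n' && c != '\0') {
                    buffer.append(c);
                } else {
                    tokens.add("Annotation(" + buffer + ")");
                    buffer.setLength(0);
                    state = State.Normal;
                    moveCursor = false; // let Normal see this character again
                }
            }
            if (moveCursor) {
                cursor++;
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        // [Annotation(# hi), NewLine, Annotation(# bye), EndSymbol]
        System.out.println(lex("# hi\n# bye"));
    }
}
```

The loop cannot spin forever here, because the only branch that keeps the cursor in place also switches back to Normal, and Normal always advances.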

Then come String and RegEx (regular expressions):

Recall from the previous chapter: when does the Normal state jump to these two states?

if(state == State.Normal) {
    ...
    else if(c == '\"' || c == '\'') {
        state = State.String;
    }
    else if(c == '`') {
        state = State.RegEx;
    }
}

The code for the String and RegEx states is as follows.

else if(state == State.String) {

    if(c == '\n') {
        throw new LexicalAnalysisException(c);

    } else if(c == '\0') {
        throw new LexicalAnalysisException(c);

    } else if(transferredMeaningSign) {

        Character tms = StringTMMap.get(c);
        if(tms == null) {
            throw new LexicalAnalysisException(c);
        }
        readBuffer.append(tms);
        transferredMeaningSign = false;

    } else if(c == '\\') {
        transferredMeaningSign = true;

    } else {
        readBuffer.append(c);
        char firstChar = readBuffer.charAt(0);
        if(firstChar == c) {
            createType = Type.String;
            state = State.Normal;
        }
    }
} else if(state == State.RegEx) {

    if(transferredMeaningSign) {

        if(c != '`') {
            throw new LexicalAnalysisException(c);
        }
        readBuffer.append(c);
        transferredMeaningSign = false;

    } else if(c =='\\') {
        transferredMeaningSign = true;

    } else if(c == '\0') {
        throw new LexicalAnalysisException(c);

    } else if(c == '`') {
        readBuffer.append(c);
        createType = Type.RegEx;
        state = State.Normal;

    } else {
        readBuffer.append(c);
    }
} 

This introduces a new member variable, declared below. It is used to handle the escape character "\".

private boolean transferredMeaningSign;

Naturally, this variable must be initialized whenever the lexer jumps from Normal to the String or RegEx state, so the Normal-state code needs a small change as well.

if(state == State.Normal) {
    ...
    else if(c == '\"' || c == '\'') {
        state = State.String;
        transferredMeaningSign = false;
    }
    else if(c == '`') {
        state = State.RegEx;
        transferredMeaningSign = false;
    }
}

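To illustrate escaping: a string can be written as "hello world.", beginning with a double quote and ending with a double quote. If I want a double quote to appear inside the string, I must use the escape character. For example: "he said \"hello world\"."

In addition, some special invisible characters can be written with the escape character. For example, \n and \t stand for a line break and a tab. I built a HashMap to hold these mappings.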

private static final HashMap<Character, Character> StringTMMap = new HashMap<>();

static {
    StringTMMap.put('\"', '\"');
    StringTMMap.put('\'', '\'');
    StringTMMap.put('\\', '\\');
    StringTMMap.put('b', '\b');
    StringTMMap.put('f', '\f');
    StringTMMap.put('t', '\t');
    StringTMMap.put('r', '\r');
    StringTMMap.put('n', '\n');
}

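Because String and RegEx both have an explicit terminator, all that is needed is to buffer the characters read, generate the corresponding Token when the terminator is read, and jump back to the Normal state.

The only complication is the escape character, which needs special handling.

Also, the source code must not end while a String or RegEx is being read. If '\0' is read there, it is a lexical error. Escaping anything unexpected is a lexical error too: a String accepts only the escapes listed in StringTMMap, and a RegEx accepts only an escaped backquote. A String has further lexical errors of its own, for example it must not contain a line break.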
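The String rules can be tried out in isolation with a small decoder. This is a sketch under a few assumptions: StringEscapeSketch and decode are hypothetical names, IllegalArgumentException stands in for LexicalAnalysisException, and the result drops the surrounding quotes, whereas the lexer above keeps them in readBuffer.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical standalone decoder mirroring the String state. The names are
// illustrative, and IllegalArgumentException replaces LexicalAnalysisException.
final class StringEscapeSketch {

    // Same mappings as StringTMMap in the lexer.
    private static final Map<Character, Character> ESCAPES = new HashMap<>();

    static {
        ESCAPES.put('"', '"');
        ESCAPES.put('\'', '\'');
        ESCAPES.put('\\', '\\');
        ESCAPES.put('b', '\b');
        ESCAPES.put('f', '\f');
        ESCAPES.put('t', '\t');
        ESCAPES.put('r', '\r');
        ESCAPES.put('n', '\n');
    }

    /** Decodes a literal whose opening quote is src[0]; returns the content without quotes. */
    static String decode(String src) {
        char quote = src.charAt(0); // the closing quote must match the opening one
        StringBuilder out = new StringBuilder();
        boolean escaped = false;

        for (int i = 1; i < src.length(); i++) {
            char c = src.charAt(i);
            if (c == '\n') {
                throw new IllegalArgumentException("line break inside string");
            } else if (escaped) {
                Character mapped = ESCAPES.get(c);
                if (mapped == null) {
                    throw new IllegalArgumentException("unknown escape: \\" + c);
                }
                out.append(mapped);
                escaped = false;
            } else if (c == '\\') {
                escaped = true; // remember the backslash, emit nothing yet
            } else if (c == quote) {
                return out.toString(); // unescaped matching quote ends the string
            } else {
                out.append(c);
            }
        }
        throw new IllegalArgumentException("source ended inside string");
    }

    public static void main(String[] args) {
        // prints: he said "hello world".
        System.out.println(decode("\"he said \\\"hello world\\\".\""));
    }
}
```

As in the lexer, the escape check comes before the quote check, which is why an escaped quote never terminates the string.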

Finally, Space (whitespace):

Recall from the previous chapter the code that goes from the Normal state to the Space state.

else if(include(Space, c)) {
    state = State.Space;
}

And the code for the Space state.

} else if(state == State.Space) {

    if(include(Space, c)) {
        readBuffer.append(c);

    } else {
        createType = Type.Space;
        state = State.Normal;
        moveCursor = false;
    }
}

Nothing more needs to be said here.

Last of all, there are a few cases that the Normal state can handle without jumping to another state:

else if(c == '\n') {
    createType = Type.NewLine;
}
else if(c == '\0') {
    createType = Type.EndSymbol;
}

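These are the NewLine line break and the EndSymbol terminator. Again, nothing more needs to be said.

All the code written in the previous chapter and this one: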

package com.taozeyu.taolan.analysis;

import java.io.IOException;
import java.io.Reader;
import java.util.HashMap;
import java.util.LinkedList;

import com.taozeyu.taolan.analysis.Token.Type;

public class LexicalAnalysis {

    private static enum State {
        Normal, 
        Identifier, Sign, Annotation,
        String, RegEx, Space;
    }

    private static final char[] IdentifierRearSign = new char[] {'?', '!'};
    private static final char[] Space = new char[] {' ', '\t'};

    private static final HashMap<Character, Character> StringTMMap = new HashMap<>();

    static {
        StringTMMap.put('\"', '\"');
        StringTMMap.put('\'', '\'');
        StringTMMap.put('\\', '\\');
        StringTMMap.put('b', '\b');
        StringTMMap.put('f', '\f');
        StringTMMap.put('t', '\t');
        StringTMMap.put('r', '\r');
        StringTMMap.put('n', '\n');
    }

    public LexicalAnalysis(Reader reader) {
        //TODO
    }

    Token read() throws IOException, LexicalAnalysisException {
        //TODO
        return null;
    }

    private State state;
    private final LinkedList<Token> tokenBuffer = new LinkedList<>();
    private StringBuilder readBuffer = null;

    private boolean transferredMeaningSign = false;

    private void refreshBuffer(char c) {
        readBuffer = new StringBuilder();
        readBuffer.append(c);
    }

    private void createToken(Type type) {
        Token token = new Token(type, readBuffer.toString());
        tokenBuffer.addFirst(token);
        readBuffer = null;
    }

    private boolean readChar(char c) throws LexicalAnalysisException {

        boolean moveCursor = true;
        Type createType = null;

        if(state == State.Normal) {

            if(inIdentifierSetButNotRear(c)) {
                state = State.Identifier;
            }
            else if(SignParser.inCharSet(c)) {
                state = State.Sign;
            }
            else if(c == '#') {
                state = State.Annotation;
            }
            else if(c == '\"' || c == '\'') {
                state = State.String;
                transferredMeaningSign = false;
            }
            else if(c == '`') {
                state = State.RegEx;
                transferredMeaningSign = false;
            }
            else if(include(Space, c)) {
                state = State.Space;
            }
            else if(c == '\n') {
                createType = Type.NewLine;
            }
            else if(c == '\0') {
                createType = Type.EndSymbol;
            }
            else {
                throw new LexicalAnalysisException(c);
            }
            refreshBuffer(c);

        } else if(state == State.Identifier) {

            if(inIdentifierSetButNotRear(c)) {
                readBuffer.append(c);

            } else if(include(IdentifierRearSign, c)) {
                createType = Type.Identifier;
                readBuffer.append(c);
                state = State.Normal;

            } else {
                createType = Type.Identifier;
                state = State.Normal;
                moveCursor = false;
            }
        } else if(state == State.Sign) {        
            //TODO

        } else if(state == State.Annotation) {

            if(c != '\n' && c != '\0') {
                readBuffer.append(c);

            } else {
                createType = Type.Annotation;
                state = State.Normal;
                moveCursor = false;
            }
        } else if(state == State.String) {

            if(c == '\n') {
                throw new LexicalAnalysisException(c);

            } else if(c == '\0') {
                throw new LexicalAnalysisException(c);

            } else if(transferredMeaningSign) {

                Character tms = StringTMMap.get(c);
                if(tms == null) {
                    throw new LexicalAnalysisException(c);
                }
                readBuffer.append(tms);
                transferredMeaningSign = false;

            } else if(c == '\\') {
                transferredMeaningSign = true;

            } else {
                readBuffer.append(c);
                char firstChar = readBuffer.charAt(0);
                if(firstChar == c) {
                    createType = Type.String;
                    state = State.Normal;
                }
            }
        } else if(state == State.RegEx) {

            if(transferredMeaningSign) {

                if(c != '`') {
                    throw new LexicalAnalysisException(c);
                }
                readBuffer.append(c);
                transferredMeaningSign = false;

            } else if(c =='\\') {
                transferredMeaningSign = true;

            } else if(c == '\0') {
                throw new LexicalAnalysisException(c);

            } else if(c == '`') {
                readBuffer.append(c);
                createType = Type.RegEx;
                state = State.Normal;

            } else {
                readBuffer.append(c);
            }

        } else if(state == State.Space) {

            if(include(Space, c)) {
                readBuffer.append(c);

            } else {
                createType = Type.Space;
                state = State.Normal;
                moveCursor = false;
            }
        }

        if(createType != null) {
            createToken(createType);
        }
        return moveCursor;
    }

    private boolean inIdentifierSetButNotRear(char c) {
        return (c >= 'a' && c <= 'z') || (c >= 'A' && c <= 'Z') || (c >= '0' && c <= '9') || (c == '_');
    }

    private boolean include(char[] range, char c) {
        boolean include = false;
        for(int i=0; i<range.length; ++i) {
            if(range[i] == c) {
                include = true;
                break;
            }
        }
        return include;
    }
}