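For each non-Normal state, I only need to care about two things:
- when to jump from Normal into that state;
- when to jump from that state back to Normal.
In the previous chapter I wrote the code that jumps from the Normal state to each of the other states. In this chapter I will finish the character-handling code for the non-Normal states, along with the code that jumps back to Normal.
First, Identifier:
Recall from the previous chapter: when does the Normal state jump to the Identifier state?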
if(state == State.Normal) {
    if(inIdentifierSetButNotRear(c)) {
        state = State.Identifier;
    }
}
And this is what happens while in the Identifier state.
else if(state == State.Identifier) {
    if(inIdentifierSetButNotRear(c)) {
        readBuffer.append(c);
    } else if(include(IdentifierRearSign, c)) {
        createType = Type.Identifier;
        readBuffer.append(c);
        state = State.Normal;
    } else {
        createType = Type.Identifier;
        state = State.Normal;
        moveCursor = false;
    }
}
This code uses IdentifierRearSign; its definition is pasted below.
private static final char[] IdentifierRearSign = new char[] {'?', '!'};
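The code above says the following. In the Normal state, reading a digit, an English letter, or an underscore causes a jump to the Identifier state. After that, as long as the lexer keeps reading digits, letters, or underscores, it buffers those characters and stays in the Identifier state, until one of two things happens:
- It reads "?" or "!", the two characters that may only appear at the end of an Identifier. It appends that character, immediately creates an Identifier Token, and jumps back to the Normal state.
- It reads a character that is not a digit, letter, or underscore. It creates the Identifier Token and immediately jumps back to the Normal state, but does not move the cursor, so that the Normal-state code can decide which kind of Token this character belongs to.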
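The two ending rules can also be seen outside the state machine. Below is a minimal sketch, not the lexer's real code: the class IdentifierSketch and the method scanIdentifier are names invented here for illustration. It scans one identifier in a plain String and returns the index just past it, which mirrors "consume the rear sign, but leave any other character for the next step".

```java
final class IdentifierSketch {
    // Same two characters as IdentifierRearSign in the article.
    private static final char[] REAR_SIGNS = {'?', '!'};

    // Mirrors inIdentifierSetButNotRear: letters, digits, underscore.
    static boolean isIdentifierBody(char c) {
        return (c >= 'a' && c <= 'z') || (c >= 'A' && c <= 'Z')
            || (c >= '0' && c <= '9') || c == '_';
    }

    static boolean isRearSign(char c) {
        for (char r : REAR_SIGNS) {
            if (r == c) {
                return true;
            }
        }
        return false;
    }

    /** Returns the index just past the identifier that begins at start (hypothetical helper). */
    static int scanIdentifier(String src, int start) {
        int i = start;
        while (i < src.length() && isIdentifierBody(src.charAt(i))) {
            i++;
        }
        // A single '?' or '!' is consumed as part of the identifier.
        if (i > start && i < src.length() && isRearSign(src.charAt(i))) {
            i++;
        }
        // Any other character is left unread: the equivalent of moveCursor = false.
        return i;
    }

    public static void main(String[] args) {
        String src = "empty? x";
        int end = scanIdentifier(src, 0);
        System.out.println(src.substring(0, end)); // prints "empty?"
    }
}
```

With the input "foo+bar" the scan stops at index 3, so the "+" is still waiting to be classified by whatever runs next.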
Next, Annotation (comments):
Recall from the previous chapter: when does the Normal state jump to the Annotation state?
if(state == State.Normal) {
    ...
    else if(c == '#') {
        state = State.Annotation;
    }
}
The code for the Annotation state is as follows.
else if(state == State.Annotation) {
    if(c != '\n' && c != '\0') {
        readBuffer.append(c);
    } else {
        createType = Type.Annotation;
        state = State.Normal;
        moveCursor = false;
    }
}
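An Annotation starts at the "#" symbol and ends when a line break is read (or the source code runs out). Note that the cursor is not moved when the comment ends, because the last character read must be handled by the Normal state we jump back to. (That usually produces a Token of type NewLine or EndSymbol.)
After that, String and RegEx (regular expressions):
Recall from the previous chapter: when does the Normal state jump to these two states?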
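To make the "do not move the cursor" idea concrete, here is a reduced sketch with only the Normal and Annotation states. It rests on assumptions: the class CursorSketch, the tokenize driver loop, and the use of plain strings as tokens are all invented for illustration, since the real read() method is still a TODO. The point is only that a false return value makes the driver feed the same character again.

```java
import java.util.ArrayList;
import java.util.List;

final class CursorSketch {
    private enum State { NORMAL, ANNOTATION }

    private State state = State.NORMAL;
    private final StringBuilder buffer = new StringBuilder();
    final List<String> tokens = new ArrayList<>();

    /** Feeds one character; returns false if the same character must be fed again. */
    boolean readChar(char c) {
        if (state == State.NORMAL) {
            if (c == '#') {
                buffer.setLength(0);
                buffer.append(c);
                state = State.ANNOTATION;
            } else if (c == '\n') {
                tokens.add("NewLine");
            } else if (c == '\0') {
                tokens.add("EndSymbol");
            }
            // Every other character is ignored in this reduced sketch.
            return true;
        }
        // ANNOTATION state: buffer until a line break or the end marker.
        if (c != '\n' && c != '\0') {
            buffer.append(c);
            return true;
        }
        tokens.add("Annotation(" + buffer + ")");
        state = State.NORMAL;
        return false; // leave the terminator for the NORMAL branch
    }

    /** Hypothetical driver: advances the cursor only when readChar allows it. */
    static List<String> tokenize(String source) {
        CursorSketch lexer = new CursorSketch();
        String input = source + '\0'; // '\0' marks the end of the source
        int cursor = 0;
        while (cursor < input.length()) {
            if (lexer.readChar(input.charAt(cursor))) {
                cursor++;
            }
        }
        return lexer.tokens;
    }

    public static void main(String[] args) {
        // prints [Annotation(# hi), NewLine, Annotation(# bye), EndSymbol]
        System.out.println(tokenize("# hi\n# bye"));
    }
}
```

Because the terminator is fed twice, the comment Token and the NewLine or EndSymbol Token come out as two separate Tokens without the Annotation branch needing to know about either.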
if(state == State.Normal) {
    ...
    else if(c == '\"' || c == '\'') {
        state = State.String;
    }
    else if(c == '`') {
        state = State.RegEx;
    }
}
The code for the String and RegEx states is as follows.
else if(state == State.String) {
    if(c == '\n') {
        throw new LexicalAnalysisException(c);
    } else if(c == '\0') {
        throw new LexicalAnalysisException(c);
    } else if(transferredMeaningSign) {
        Character tms = StringTMMap.get(c);
        if(tms == null) {
            throw new LexicalAnalysisException(c);
        }
        readBuffer.append(tms);
        transferredMeaningSign = false;
    } else if(c == '\\') {
        transferredMeaningSign = true;
    } else {
        readBuffer.append(c);
        char firstChar = readBuffer.charAt(0);
        if(firstChar == c) {
            createType = Type.String;
            state = State.Normal;
        }
    }
} else if(state == State.RegEx) {
    if(transferredMeaningSign) {
        if(c != '`') {
            throw new LexicalAnalysisException(c);
        }
        readBuffer.append(c);
        transferredMeaningSign = false;
    } else if(c == '\\') {
        transferredMeaningSign = true;
    } else if(c == '\0') {
        throw new LexicalAnalysisException(c);
    } else if(c == '`') {
        readBuffer.append(c);
        createType = Type.RegEx;
        state = State.Normal;
    } else {
        readBuffer.append(c);
    }
}
This introduces a new member variable, declared below. It is used to handle the escape character "\".
private boolean transferredMeaningSign;
This variable must be initialized whenever the Normal state jumps to the String or RegEx state, so the Normal-state code needs a small change as well.
if(state == State.Normal) {
    ...
    else if(c == '\"' || c == '\'') {
        state = State.String;
        transferredMeaningSign = false;
    }
    else if(c == '`') {
        state = State.RegEx;
        transferredMeaningSign = false;
    }
}
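To illustrate escaping: a string can be written as "hello world.", beginning with a double quote and ending with a double quote. If I want a double quote to appear inside the string, I must use the escape character, as in "he said \"hello world\".".
Some special non-printing characters can also be written with the escape character; for example, \n and \t stand for a line break and a tab. I built a HashMap to hold these mappings.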
private static final HashMap<Character, Character> StringTMMap = new HashMap<>();
static {
    StringTMMap.put('\"', '\"');
    StringTMMap.put('\'', '\'');
    StringTMMap.put('\\', '\\');
    StringTMMap.put('b', '\b');
    StringTMMap.put('f', '\f');
    StringTMMap.put('t', '\t');
    StringTMMap.put('r', '\r');
    StringTMMap.put('n', '\n');
}
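To see the escape table at work in isolation, here is a small sketch that decodes the body of a string in one pass. It is an illustration under assumptions: EscapeSketch and unescape are invented names, and it throws IllegalArgumentException where the lexer itself would throw LexicalAnalysisException. The escaping flag plays the role of transferredMeaningSign.

```java
import java.util.HashMap;
import java.util.Map;

final class EscapeSketch {
    // Same mappings as StringTMMap in the article.
    private static final Map<Character, Character> ESCAPES = new HashMap<>();
    static {
        ESCAPES.put('"', '"');
        ESCAPES.put('\'', '\'');
        ESCAPES.put('\\', '\\');
        ESCAPES.put('b', '\b');
        ESCAPES.put('f', '\f');
        ESCAPES.put('t', '\t');
        ESCAPES.put('r', '\r');
        ESCAPES.put('n', '\n');
    }

    /** Replaces each backslash escape in raw with the character it stands for (hypothetical helper). */
    static String unescape(String raw) {
        StringBuilder out = new StringBuilder();
        boolean escaping = false; // true right after a backslash
        for (int i = 0; i < raw.length(); i++) {
            char c = raw.charAt(i);
            if (escaping) {
                Character mapped = ESCAPES.get(c);
                if (mapped == null) {
                    throw new IllegalArgumentException("unknown escape: \\" + c);
                }
                out.append(mapped.charValue());
                escaping = false;
            } else if (c == '\\') {
                escaping = true;
            } else {
                out.append(c);
            }
        }
        if (escaping) {
            throw new IllegalArgumentException("dangling backslash at end of input");
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // The Java literal below holds the 6 characters: a \ t b \ n
        System.out.println(unescape("a\\tb\\n").length()); // prints 4
    }
}
```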
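Because String and RegEx both have an explicit terminator, the lexer only needs to buffer the characters it reads, generate the corresponding Token when it reads the terminator, and jump back to the Normal state. For a String, the terminator is whichever quote character opened it; for a RegEx, it is the backtick.
The escape character just needs some special handling.
In addition, the source code must not end while a String or RegEx is being read; reading '\0' there is treated as a lexical error. Escaping a character that has no defined escape is also a lexical error. String has some further lexical errors of its own, such as containing a line break.
Finally, Space (whitespace):
Recall from the previous chapter the code that goes from the Normal state to the Space state.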
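The String error rules can be summarized in a standalone sketch. As before, this is an assumption-laden illustration rather than the lexer's code: StringScanSketch and scanString are invented names, the function only finds where the literal ends (it does not decode escapes), and it uses IllegalStateException in place of LexicalAnalysisException.

```java
final class StringScanSketch {
    /**
     * Returns the index just past the closing quote of the literal that
     * begins at start. Throws if a line break or the end of the input is
     * reached first. (Hypothetical helper, not part of LexicalAnalysis.)
     */
    static int scanString(String src, int start) {
        char quote = src.charAt(start);
        if (quote != '"' && quote != '\'') {
            throw new IllegalArgumentException("not a string literal");
        }
        boolean escaping = false; // true right after a backslash
        for (int i = start + 1; i < src.length(); i++) {
            char c = src.charAt(i);
            if (c == '\n') {
                throw new IllegalStateException("line break inside string at index " + i);
            }
            if (escaping) {
                escaping = false;      // an escaped quote does not end the literal
            } else if (c == '\\') {
                escaping = true;
            } else if (c == quote) {
                return i + 1;          // only the opening quote character closes it
            }
        }
        throw new IllegalStateException("source ended inside string");
    }

    public static void main(String[] args) {
        String src = "'it\"s' + x";
        System.out.println(src.substring(0, scanString(src, 0))); // prints 'it"s'
    }
}
```

Note how a double quote inside a single-quoted literal is just an ordinary character, which is the effect of comparing against the first buffered character in the String branch.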
else if(include(Space, c)) {
    state = State.Space;
}
And the code for the Space state.
else if(state == State.Space) {
    if(include(Space, c)) {
        readBuffer.append(c);
    } else {
        createType = Type.Space;
        state = State.Normal;
        moveCursor = false;
    }
}
Nothing more needs to be said here.
Last of all, there are a few cases that the Normal state can handle without jumping to another state:
else if(c == '\n') {
    createType = Type.NewLine;
}
else if(c == '\0') {
    createType = Type.EndSymbol;
}
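These are NewLine (the line break) and EndSymbol (the terminator). Nothing more needs to be said about them either.
All of the code written in the previous chapter and this one: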
package com.taozeyu.taolan.analysis;
import java.io.IOException;
import java.io.Reader;
import java.util.HashMap;
import java.util.LinkedList;
import com.taozeyu.taolan.analysis.Token.Type;
public class LexicalAnalysis {

    private static enum State {
        Normal,
        Identifier, Sign, Annotation,
        String, RegEx, Space;
    }

    private static final char[] IdentifierRearSign = new char[] {'?', '!'};
    private static final char[] Space = new char[] {' ', '\t'};

    private static final HashMap<Character, Character> StringTMMap = new HashMap<>();
    static {
        StringTMMap.put('\"', '\"');
        StringTMMap.put('\'', '\'');
        StringTMMap.put('\\', '\\');
        StringTMMap.put('b', '\b');
        StringTMMap.put('f', '\f');
        StringTMMap.put('t', '\t');
        StringTMMap.put('r', '\r');
        StringTMMap.put('n', '\n');
    }

    public LexicalAnalysis(Reader reader) {
        //TODO
    }

    Token read() throws IOException, LexicalAnalysisException {
        //TODO
        return null;
    }

    private State state;
    private final LinkedList<Token> tokenBuffer = new LinkedList<>();
    private StringBuilder readBuffer = null;
    private boolean transferredMeaningSign = false;

    private void refreshBuffer(char c) {
        readBuffer = new StringBuilder();
        readBuffer.append(c);
    }

    private void createToken(Type type) {
        Token token = new Token(type, readBuffer.toString());
        tokenBuffer.addFirst(token);
        readBuffer = null;
    }

    private boolean readChar(char c) throws LexicalAnalysisException {
        boolean moveCursor = true;
        Type createType = null;

        if(state == State.Normal) {
            if(inIdentifierSetButNotRear(c)) {
                state = State.Identifier;
            }
            else if(SignParser.inCharSet(c)) {
                state = State.Sign;
            }
            else if(c == '#') {
                state = State.Annotation;
            }
            else if(c == '\"' || c == '\'') {
                state = State.String;
                transferredMeaningSign = false;
            }
            else if(c == '`') {
                state = State.RegEx;
                transferredMeaningSign = false;
            }
            else if(include(Space, c)) {
                state = State.Space;
            }
            else if(c == '\n') {
                createType = Type.NewLine;
            }
            else if(c == '\0') {
                createType = Type.EndSymbol;
            }
            else {
                throw new LexicalAnalysisException(c);
            }
            refreshBuffer(c);
        } else if(state == State.Identifier) {
            if(inIdentifierSetButNotRear(c)) {
                readBuffer.append(c);
            } else if(include(IdentifierRearSign, c)) {
                createType = Type.Identifier;
                readBuffer.append(c);
                state = State.Normal;
            } else {
                createType = Type.Identifier;
                state = State.Normal;
                moveCursor = false;
            }
        } else if(state == State.Sign) {
            //TODO
        } else if(state == State.Annotation) {
            if(c != '\n' && c != '\0') {
                readBuffer.append(c);
            } else {
                createType = Type.Annotation;
                state = State.Normal;
                moveCursor = false;
            }
        } else if(state == State.String) {
            if(c == '\n') {
                throw new LexicalAnalysisException(c);
            } else if(c == '\0') {
                throw new LexicalAnalysisException(c);
            } else if(transferredMeaningSign) {
                Character tms = StringTMMap.get(c);
                if(tms == null) {
                    throw new LexicalAnalysisException(c);
                }
                readBuffer.append(tms);
                transferredMeaningSign = false;
            } else if(c == '\\') {
                transferredMeaningSign = true;
            } else {
                readBuffer.append(c);
                char firstChar = readBuffer.charAt(0);
                if(firstChar == c) {
                    createType = Type.String;
                    state = State.Normal;
                }
            }
        } else if(state == State.RegEx) {
            if(transferredMeaningSign) {
                if(c != '`') {
                    throw new LexicalAnalysisException(c);
                }
                readBuffer.append(c);
                transferredMeaningSign = false;
            } else if(c == '\\') {
                transferredMeaningSign = true;
            } else if(c == '\0') {
                throw new LexicalAnalysisException(c);
            } else if(c == '`') {
                readBuffer.append(c);
                createType = Type.RegEx;
                state = State.Normal;
            } else {
                readBuffer.append(c);
            }
        } else if(state == State.Space) {
            if(include(Space, c)) {
                readBuffer.append(c);
            } else {
                createType = Type.Space;
                state = State.Normal;
                moveCursor = false;
            }
        }

        if(createType != null) {
            createToken(createType);
        }
        return moveCursor;
    }

    private boolean inIdentifierSetButNotRear(char c) {
        return (c >= 'a' && c <= 'z') || (c >= 'A' && c <= 'Z') || (c >= '0' && c <= '9') || (c == '_');
    }

    private boolean include(char[] range, char c) {
        boolean include = false;
        for(int i=0; i<range.length; ++i) {
            if(range[i] == c) {
                include = true;
                break;
            }
        }
        return include;
    }
}