Creating a Java Lexer for Tokenizing Source Code

Java Lexer for Source Code Analysis

This walkthrough presents a Java lexer that tokenizes source code, reports malformed input through exceptions, and is organized into small, focused methods. It is intended as a resource for anyone who needs help with a Java assignment, whether you are a student or a programmer who wants to get more comfortable with source code analysis.

Block 1: Package and Class Declaration


package plc.project;

import java.util.ArrayList;
import java.util.List;

public final class Lexer {
    // Class implementation
}

This block contains the package and class declaration for the Lexer class, which is defined within the plc.project package.
The class is marked as final, meaning it cannot be subclassed.

Block 2: Class Documentation Comment


/**
 * The lexer works through three main parts:
 *
 * - {@link #lex()}, which repeatedly calls lexToken() and skips whitespace
 * - {@link #lexToken()}, which lexes the next token
 * - {@link CharStream}, which manages the state of the lexer and literals
 *
 * If the lexer fails to parse something (such as an unterminated string) you
 * should throw a {@link ParseException} with an index at the character which is
 * invalid or missing.
 *
 * The {@link #peek(String...)} and {@link #match(String...)} functions are
 * helpers you need to use; they will make the implementation a lot easier.
 */
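
The comment refers to ParseException, and later blocks build Token values with a Token.Type, but neither class is shown in this walkthrough. The following is a minimal sketch of what they might look like. Only the six Token.Type constants and the (message, index) constructor arguments come from the lexer code; the field names, accessors, and the choice of RuntimeException are assumptions (the lexer methods declare no throws clause, which suggests an unchecked exception).

```java
// Hypothetical supporting types; the real project may define them differently.
final class Token {
    // The six token types emitted by the lexer methods in this article.
    enum Type { IDENTIFIER, INTEGER, DECIMAL, CHARACTER, STRING, OPERATOR }

    private final Type type;
    private final String literal; // exact source text of the token
    private final int index;      // position of the token's first character

    Token(Type type, String literal, int index) {
        this.type = type;
        this.literal = literal;
        this.index = index;
    }

    Type getType() { return type; }
    String getLiteral() { return literal; }
    int getIndex() { return index; }
}

// Assumed to be unchecked, so the lexer methods need no throws clause.
final class ParseException extends RuntimeException {
    private final int index; // position of the invalid or missing character

    ParseException(String message, int index) {
        super(message);
        this.index = index;
    }

    int getIndex() { return index; }
}
```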

Block 3: Constructor


// Character stream that tracks the lexer's position in the input
private final CharStream chars;

public Lexer(String input) {
    chars = new CharStream(input);
}

This block defines the constructor of the Lexer class, which takes an input string.
It initializes the chars field with a new instance of the CharStream class, passing the input string to it.

Block 4: lex() Method


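public List<Token> lex() {
    List<Token> tokens = new ArrayList<>();
    // Loop on has(0) rather than peek("."): the regex '.' does not match
    // line terminators, so it cannot be used to test for remaining input.
    while (chars.has(0)) {
        if (match("[ \b\n\r\t]")) {
            // ignore whitespace: drop the consumed character
            chars.skip();
        } else {
            tokens.add(lexToken());
        }
    }
    return tokens;
}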

This block defines the lex() method, which is responsible for lexing the input text and returning a list of tokens.
It initializes an empty list called tokens to store the tokens.
It enters a loop that continues while there are more characters in the input to process.
Within the loop, it checks for whitespace characters and skips them using the chars.skip() method.
If a non-whitespace character is encountered, it calls lexToken() to lex the next token and adds it to the tokens list.
Finally, it returns the list of tokens.
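
The skip-or-collect loop described above can be tried in isolation. The sketch below is a deliberately simplified stand-in, not the article's Lexer: it treats every run of non-whitespace characters as one lexeme and works on a plain String with an int index. The class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

final class LexLoopSketch {
    // Same whitespace set as the lexer: space, backspace, newline, carriage return, tab.
    private static final String WHITESPACE = " \b\n\r\t";

    // Simplification: a lexeme here is any run of non-whitespace characters.
    static List<String> lexemes(String input) {
        List<String> result = new ArrayList<>();
        int index = 0;
        while (index < input.length()) {                  // more input to process
            if (WHITESPACE.indexOf(input.charAt(index)) >= 0) {
                index++;                                  // skip whitespace, emit nothing
            } else {
                int start = index;
                while (index < input.length()
                        && WHITESPACE.indexOf(input.charAt(index)) < 0) {
                    index++;                              // consume one lexeme
                }
                result.add(input.substring(start, index));
            }
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(lexemes("x = 1\n\ty = 2")); // [x, =, 1, y, =, 2]
    }
}
```

Note that the input spans two lines: the loop has to keep going past the newline, which is why the loop condition should test for remaining characters rather than for a character that matches some pattern.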

Block 5: lexToken() Method


public Token lexToken() {
    if (peek("[A-Za-z_]")) {
        return lexIdentifier();
    } else if (peek("[\\+\\-]", "[0-9]") || peek("[0-9]")) {
        return lexNumber();
    } else if (peek("'")) {
        return lexCharacter();
    } else if (peek("\"")) {
        return lexString();
    } else {
        return lexOperator();
    }
}

This block defines the lexToken() method, which identifies and lexes the next token in the input.
It determines the type of the next token by examining the first character and then delegates to the appropriate lexing method for that token type.
The possible token types include identifiers, numbers, characters, strings, and operators.
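
To make the dispatch rule concrete, here is a small stand-alone sketch that reports which lex method would be chosen for a given input. It is an illustration only: the class name and the idea of returning the method name as a String are assumptions, and it inspects a plain String instead of the CharStream.

```java
final class DispatchSketch {
    // Returns the name of the lex method that lexToken() would pick for this input.
    static String dispatch(String input) {
        String first = input.isEmpty() ? "" : input.substring(0, 1);
        String second = input.length() < 2 ? "" : input.substring(1, 2);
        if (first.matches("[A-Za-z_]")) {
            return "lexIdentifier";
        } else if ((first.matches("[+-]") && second.matches("[0-9]")) || first.matches("[0-9]")) {
            // A sign only starts a number when a digit follows it.
            return "lexNumber";
        } else if (first.equals("'")) {
            return "lexCharacter";
        } else if (first.equals("\"")) {
            return "lexString";
        } else {
            return "lexOperator";
        }
    }
}
```

The number case is the only one that needs two characters of lookahead: "-5" starts a number, while "-x" falls through to the operator case.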

Block 6: lexIdentifier() Method


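public Token lexIdentifier() {
    // Consume every character that may appear in an identifier
    while (match("[A-Za-z0-9_-]")) {
        // match() already advanced the stream
    }
    return chars.emit(Token.Type.IDENTIFIER);
}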

This block defines the lexIdentifier() method, responsible for lexing identifiers (e.g., variable names).
It enters a loop that matches characters that are part of valid identifiers (letters, digits, underscores, and hyphens).
Once the identifier has been fully matched, it emits an IDENTIFIER token using chars.emit() and returns it.
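
The same rule can be written as a single regular expression, which makes its consequences easy to test. This is a sketch under the assumption that identifiers start with a letter or underscore (as the dispatch in lexToken() requires); the class and method names are hypothetical.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class IdentifierSketch {
    // A letter or underscore first, then any mix of letters, digits, '_' and '-'.
    private static final Pattern IDENTIFIER = Pattern.compile("[A-Za-z_][A-Za-z0-9_-]*");

    // Returns the identifier at the start of the input, or null if there is none.
    static String lexIdentifier(String input) {
        Matcher m = IDENTIFIER.matcher(input);
        return m.lookingAt() ? m.group() : null;
    }
}
```

Because hyphens are allowed inside identifiers, an input such as a-1 is read as one identifier rather than as a subtraction.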

Block 7: lexNumber() Method


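public Token lexNumber() {
    // Optional sign
    match("[\\+\\-]");
    // Integer part
    while (match("[0-9]")) {
        // match() already advanced the stream
    }
    // A fraction needs a decimal point followed by at least one digit
    if (!peek("\\.", "[0-9]")) {
        return chars.emit(Token.Type.INTEGER);
    }
    match("\\.", "[0-9]");
    while (match("[0-9]")) {
        // match() already advanced the stream
    }
    return chars.emit(Token.Type.DECIMAL);
}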

This block defines the lexNumber() method, responsible for lexing numeric literals (integers and decimals).
It first checks for an optional leading plus or minus sign.
Then, it enters a loop to match digits (0-9) to identify the integer part of a number.
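If a decimal point (.) immediately followed by a digit is encountered, it transitions to the decimal part and matches the remaining digits; a decimal point with no digit after it is left in the stream and the number is emitted as an INTEGER.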
Finally, it emits an appropriate token (either INTEGER or DECIMAL) using chars.emit() and returns it.
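
The number rule can also be expressed as one regular expression. The sketch below is an approximation for experimenting with edge cases, not the lexer's own code: it assumes at least one digit in the integer part (which lexToken() guarantees before calling lexNumber()), and the class name and the "TYPE text" return format are made up for illustration.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class NumberSketch {
    // Optional sign, one or more digits, then an optional fraction that
    // must contain at least one digit after the decimal point.
    private static final Pattern NUMBER = Pattern.compile("[+-]?[0-9]+(\\.[0-9]+)?");

    // Returns "INTEGER <text>" or "DECIMAL <text>" for the number at the start
    // of the input, or null if the input does not start with a number.
    static String lexNumber(String input) {
        Matcher m = NUMBER.matcher(input);
        if (!m.lookingAt()) {
            return null;
        }
        String type = m.group(1) == null ? "INTEGER" : "DECIMAL";
        return type + " " + m.group();
    }
}
```

For the input 1.x this yields an INTEGER with text 1, leaving the dot to be lexed as a separate token, which mirrors the two-pattern peek in lexNumber().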

Block 8: lexCharacter() Method


public Token lexCharacter() {
    // Opening quote
    match("'");
    if (peek("\\\\")) {
        lexEscape();
    } else if (!match("[^'\r\n]")) {
        throw new ParseException("Invalid character", chars.index);
    }
    // Closing quote
    if (match("'")) {
        return chars.emit(Token.Type.CHARACTER);
    } else {
        throw new ParseException("Invalid character", chars.index);
    }
}

This block defines the lexCharacter() method, responsible for lexing character literals (e.g., 'a' or '\n').
It starts by matching a single quote character (') to indicate the beginning of a character literal.
If a backslash is encountered (indicating an escaped character), it calls the lexEscape() method to handle it.
Otherwise, it expects exactly one ordinary character: if the next character is a single quote or a line break, or the input has ended, it throws a ParseException with an error message.
If a closing single quote is found, it emits a CHARACTER token and returns it. Otherwise, it throws a ParseException.
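
As a cross-check, the accepted character literals can be described by one regular expression. This is a sketch for testing inputs by hand, with a hypothetical class name; it validates a complete literal rather than consuming from a stream.

```java
import java.util.regex.Pattern;

final class CharLiteralSketch {
    // Opening quote, then either one ordinary character (not a quote,
    // backslash, or line break) or a backslash escape, then a closing quote.
    private static final Pattern CHAR_LITERAL =
            Pattern.compile("'([^'\\\\\\r\\n]|\\\\[bnrt'\"\\\\])'");

    static boolean isCharLiteral(String text) {
        return CHAR_LITERAL.matcher(text).matches();
    }
}
```

Under this pattern 'a' and '\n' are valid, while an empty literal, a literal with two characters, and an unknown escape such as '\q' are rejected.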

Block 9: lexString() Method


public Token lexString() {
    match("\"");
    while (chars.has(0) && !peek("\"")) {
        if (peek("\\\\")) {
            // lexEscape() consumes both characters of the escape sequence
            lexEscape();
        } else if (peek("[\n\r]")) {
            throw new ParseException("Invalid character in string", chars.index);
        } else {
            // any other character is part of the string
            chars.advance();
        }
    }
    if (!match("\"")) {
        throw new ParseException("Unterminated string", chars.index);
    } else {
        return chars.emit(Token.Type.STRING);
    }
}

This block defines the lexString() method, responsible for lexing string literals (e.g., "Hello, World!").
It starts by matching a double quote (") character to indicate the beginning of a string literal.
It then enters a loop that continues until it reaches an unescaped double quote (the end of the string) or runs out of input.
Within the loop, it handles escaped characters using the lexEscape() method and checks for line breaks (which are not allowed in strings).
If the loop completes without finding a closing double quote, it throws a ParseException indicating an unterminated string.
If a closing double quote is found, it emits a STRING token and returns it.
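
The scanning logic can be exercised without the rest of the lexer. The sketch below walks a plain String with an index and returns the literal including its quotes. It is an illustration under assumptions: IllegalArgumentException stands in for the article's ParseException, and the class and method names are hypothetical.

```java
final class StringLiteralSketch {
    // Returns the string literal (quotes included) at the start of the input.
    // IllegalArgumentException stands in for the article's ParseException.
    static String scan(String input) {
        if (input.isEmpty() || input.charAt(0) != '"') {
            throw new IllegalArgumentException("Expected opening quote at index 0");
        }
        int i = 1;
        while (i < input.length() && input.charAt(i) != '"') {
            char c = input.charAt(i);
            if (c == '\\') {
                // An escape is a backslash plus one character from a fixed set.
                if (i + 1 >= input.length() || "bnrt'\"\\".indexOf(input.charAt(i + 1)) < 0) {
                    throw new IllegalArgumentException("Invalid escape at index " + i);
                }
                i += 2; // consume both characters of the escape
            } else if (c == '\n' || c == '\r') {
                throw new IllegalArgumentException("Line break in string at index " + i);
            } else {
                i++; // ordinary character
            }
        }
        if (i >= input.length()) {
            throw new IllegalArgumentException("Unterminated string at index " + i);
        }
        return input.substring(0, i + 1);
    }
}
```

The important detail is that an escape consumes exactly two characters and nothing more. Consuming an extra character after an escape would swallow the closing quote of a string that ends in an escape, such as "a\n".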

Block 10: lexEscape() Method


public void lexEscape() {
    // A backslash followed by one of: b n r t ' " \
    if (!match("\\\\", "[bnrt'\"\\\\]")) {
        throw new ParseException("Invalid escaped character", chars.index);
    }
}

This block defines the lexEscape() method, responsible for handling escaped characters within character and string literals.
It checks that the next two characters form a valid escape sequence: a backslash followed by one of b, n, r, t, ', ", or \ (that is, \b, \n, \r, \t, \', \", or \\).
If the sequence does not match any valid escape sequence, it throws a ParseException indicating an invalid escaped character.
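
The pattern strings in lexEscape() are easy to misread because they pass through two layers of escaping: the Java string literal "\\\\" is the two-character regex \\, which in turn matches a single backslash. The following sketch, with a hypothetical class and method name, checks the two patterns against individual characters in the same way peek() does.

```java
final class EscapePatternSketch {
    // Two-step check using the same two patterns as lexEscape():
    // the first matches one backslash, the second one escape character.
    static boolean isEscape(char first, char second) {
        return String.valueOf(first).matches("\\\\")
                && String.valueOf(second).matches("[bnrt'\"\\\\]");
    }

    public static void main(String[] args) {
        System.out.println("\\\\".length());     // 2: the regex itself is two characters
        System.out.println(isEscape('\\', 'n')); // true
        System.out.println(isEscape('\\', 'q')); // false
    }
}
```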

Block 11: lexOperator() Method


public Token lexOperator() {
    if (match(">")) {
        match("=");
    } else if (match("<")) {
        match("=");
    } else if (match("!")) {
        match("=");
    } else if (match("=")) {
        match("=");
    } else {
        // Any other single character; (?s) lets '.' match line
        // terminators too, so the stream always advances.
        match("(?s).");
    }
    return chars.emit(Token.Type.OPERATOR);
}

This block defines the lexOperator() method, responsible for lexing operators.
It checks whether the next character is >, <, !, or =, and if so also consumes an = that immediately follows, producing the two-character operators >=, <=, !=, and ==.
Any other character is consumed on its own as a single-character operator.
Finally, it emits an OPERATOR token and returns it.
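
The operator rule is small enough to restate as a stand-alone function. This sketch works on a plain String and returns the operator text; the class name is hypothetical and the input is assumed to be non-empty.

```java
final class OperatorSketch {
    // Returns the operator lexeme at the start of a non-empty input.
    static String lexOperator(String input) {
        char first = input.charAt(0);
        boolean comparison = first == '>' || first == '<' || first == '!' || first == '=';
        // Only these four characters may absorb a following '='.
        if (comparison && input.length() > 1 && input.charAt(1) == '=') {
            return input.substring(0, 2);
        }
        return input.substring(0, 1);
    }
}
```

One consequence of the rule is that += is not a compound operator here: it lexes as + followed by =.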

Block 12: peek() Method


public boolean peek(String... patterns) {
    for (int i = 0; i < patterns.length; i++) {
        // Each pattern is tested against exactly one character
        if (!chars.has(i) || !String.valueOf(chars.get(i)).matches(patterns[i])) {
            return false;
        }
    }
    return true;
}

This block defines the peek() method, which checks if the next sequence of characters matches the given patterns (specified as regex).
It iterates through the patterns, checking each character in the input stream against the corresponding pattern.
If any character does not match its respective pattern, or if there are not enough characters in the input stream, it returns false. Otherwise, it returns true.

Block 13: match() Method


public boolean match(String... patterns) {
    boolean peek = peek(patterns);
    if (peek) {
        // Consume one character per pattern
        for (int i = 0; i < patterns.length; i++) {
            chars.advance();
        }
    }
    return peek;
}

This block defines the match() method, which is similar to peek(), but it also advances the character stream if the patterns match.
It first calls peek() to check if the patterns match the next characters in the input.
If the patterns match, it advances the character stream once per pattern, so the number of characters consumed equals the number of patterns.
It returns true if the patterns match, and false otherwise.
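
The relationship between peek() and match() is easiest to see when the position is observable. Below is a minimal sketch that reimplements both helpers over a plain String and an int index instead of the CharStream; the class name and the index() accessor are assumptions made for the example.

```java
final class PeekMatchSketch {
    private final String input;
    private int index = 0; // position of the next unread character

    PeekMatchSketch(String input) {
        this.input = input;
    }

    // True if the next patterns.length characters each match their regex.
    boolean peek(String... patterns) {
        for (int i = 0; i < patterns.length; i++) {
            int at = index + i;
            if (at >= input.length()
                    || !String.valueOf(input.charAt(at)).matches(patterns[i])) {
                return false;
            }
        }
        return true;
    }

    // Like peek, but consumes the characters when they match.
    boolean match(String... patterns) {
        boolean matched = peek(patterns);
        if (matched) {
            index += patterns.length;
        }
        return matched;
    }

    int index() { return index; }
}
```

For the input -5x, peek("[+-]", "[0-9]") returns true and leaves the index at 0, while match with the same patterns returns true and moves the index to 2. A failed match leaves the index unchanged, which is what lets the lexer try alternatives in sequence.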

Block 14: CharStream Inner Class


public static final class CharStream {
    // Nested class implementation
}

This block defines the static nested class CharStream, which is used to manage the state of the lexer and literals.
It encapsulates the input string, the current index of the character stream, and the length of the current token being matched.
This class is used by the lexer to keep track of the position in the input stream and emit tokens.
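
Since the body of CharStream is not shown, here is a minimal sketch of how such a class could work, based on the three pieces of state described above and on the calls the lexer makes (has, get, advance, skip, emit). It is an assumption, not the project's implementation: in particular, this emit() returns the token text as a String, whereas the lexer's emit takes a Token.Type and returns a Token.

```java
final class CharStreamSketch {
    private final String input;
    private int index = 0;  // position of the next unread character
    private int length = 0; // characters consumed for the current token

    CharStreamSketch(String input) {
        this.input = input;
    }

    // True if a character exists at the given offset from the current index.
    boolean has(int offset) {
        return index + offset < input.length();
    }

    // Character at the given offset from the current index.
    char get(int offset) {
        return input.charAt(index + offset);
    }

    // Consume one character as part of the current token.
    void advance() {
        index++;
        length++;
    }

    // Discard what has been consumed so far (used for whitespace).
    void skip() {
        length = 0;
    }

    // Return the text consumed since the last skip/emit and start a new token.
    String emit() {
        int start = index - length;
        skip();
        return input.substring(start, index);
    }
}
```

Keeping a length alongside the index means a token's text and starting position can be recovered at emit time without the lex methods having to remember where they began.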

Conclusion

In conclusion, this Java lexer is a compact example that both students and experienced programmers can learn from. It tokenizes source code into identifiers, numbers, characters, strings, and operators, and it reports malformed input, such as invalid escapes and unterminated strings, through ParseException with the index of the offending character. Its design, built from small lex methods on top of the peek() and match() helpers and the CharStream class, keeps each rule easy to read and to change. Whether you are working on a Java assignment or building your programming skills, it offers a practical starting point for source code analysis.
