# Explanation

## V1

Here we have a single input file, `input.txt`, which looks like this:

```plaintext
function hello {
 print "hello";
}
```

We tokenise by defining, for each character class, the set of characters that belong to it:

```python
alpha_lower = "abcdefghijklmnopqrstuvwxyz"
alpha_upper = alpha_lower.upper()
alpha = alpha_upper + alpha_lower
numeric = "0123456789"
alphanumeric = alpha + numeric
whitespace = " \t\r\n"
```

What we have is known as a state machine. The tokeniser starts in the `scan` state. It then proceeds character by character, deciding, based on the character it sees, what to do and which state to enter next.

Tools like [Antlr](/aw/lang/antlr) automate this process, but it is instructive to write your own tokeniser and parser without assisting tools. Once you understand *what a parser generator is doing for you*, you're better equipped to understand how computer languages work at the grammatical level.

We shall flesh out this language grammar and then write a script that parses the [grammar](LanguageGrammar).

# The source

## V1

The input:

```plaintext
function hello {
 print "hello";
}
```

The Python:

```python
input_text = open("input.txt").read()

# step 1: tokenise
state = "scan"
i = 0

alpha_lower = "abcdefghijklmnopqrstuvwxyz"
alpha_upper = alpha_lower.upper()
alpha = alpha_upper + alpha_lower
numeric = "0123456789"
alphanumeric = alpha + numeric
whitespace = " \t\r\n"

tokens = []
src = input_text

while i < len(src):
    c = src[i]
    print(f"scan <{state}> i={i} c={c}")

    if state == "scan":
        if c in alpha:
            i0 = i                # remember where the identifier starts
            state = "identifier"
        elif c in whitespace:
            i0 = i                # remember where the whitespace run starts
            state = "whitespace"
        elif c == "{":
            tokens.append(("lbrace", "{"))
            state = "scan"
        elif c == "}":
            tokens.append(("rbrace", "}"))
            state = "scan"
        elif c == ";":
            tokens.append(("semicolon", ";"))

    elif state == "identifier":
        if c in alphanumeric:
            pass                  # still inside the identifier
        else:
            # the identifier has ended: emit it, then rescan this character
            tokens.append((state, src[i0:i]))
            state = "scan"
            continue

    elif state == "whitespace":
        if c in whitespace:
            pass                  # still inside the whitespace run
        else:
            # the whitespace run has ended: emit it, then rescan this character
            tokens.append((state, src[i0:i]))
            state = "scan"
            continue

    i += 1

print(tokens)
```

The output (skipping the debug prints):

```plaintext
[('identifier', 'function'), ('whitespace', ' '), ('identifier', 'hello'),
 ('whitespace', ' '), ('lbrace', '{'), ('whitespace', '\n '),
 ('identifier', 'print'), ('whitespace', ' '), ('identifier', 'hello'),
 ('semicolon', ';'), ('whitespace', '\n'), ('rbrace', '}')]
```

Note that V1 has no rule for the `"` character: in the `scan` state it matches no branch and is simply skipped, which is why `"hello"` appears as a plain `identifier` token with its quotes dropped. String literals are left for a later version.
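The whitespace tokens are emitted faithfully in V1, but they carry no grammatical meaning. As a minimal sketch of the next step (my assumption, not part of V1), continuing from the `tokens` list built above, they could be filtered out before the parsing stage:

```python
# Hypothetical follow-on step: discard whitespace tokens before parsing,
# since the grammar only cares about identifiers, braces and semicolons.
significant = [tok for tok in tokens if tok[0] != "whitespace"]
print(significant)
# [('identifier', 'function'), ('identifier', 'hello'), ('lbrace', '{'),
#  ('identifier', 'print'), ('identifier', 'hello'), ('semicolon', ';'),
#  ('rbrace', '}')]
```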