# Explanation

## V1

Here we have a single input file, `input.txt`, which looks like this:

```plaintext
function hello {
 print "hello";
}
```

We tokenise by defining, for each character class, the set of characters that belong to it:

```python
alpha_lower = "abcdefghijklmnopqrstuvwxyz"
alpha_upper = alpha_lower.upper()
alpha = alpha_upper + alpha_lower
numeric = "0123456789"
alphanumeric = alpha + numeric
whitespace = " \t\r\n"
```

What we have is known as a state machine. The tokeniser starts in the `scan` state. It then proceeds character by character, deciding, based on the character it sees, what to do and which state to enter next.

Tools like [Antlr](/aw/lang/antlr) automate this process, but it is instructive to write your own tokeniser and parser without assisting tools. Once you understand *what a parser generator is doing for you*, you're better equipped to understand how computer languages work at the grammatical level.

We shall flesh out this language grammar and then write a script that parses the [grammar](LanguageGrammar).

# The source

## V1

The input:

```plaintext
function hello {
 print "hello";
}
```

The Python:

```python
input_text = open("input.txt").read()

# step 1: tokenise
state = "scan"
i = 0

alpha_lower = "abcdefghijklmnopqrstuvwxyz"
alpha_upper = alpha_lower.upper()
alpha = alpha_upper + alpha_lower
numeric = "0123456789"
alphanumeric = alpha + numeric
whitespace = " \t\r\n"

tokens = []
src = input_text

while i < len(src):
    c = src[i]
    print(f"scan <{state}> i={i} c={c}")

    if state == "scan":
        if c in alpha:
            i0 = i                # remember where the identifier starts
            state = "identifier"
        elif c in whitespace:
            i0 = i                # remember where the whitespace run starts
            state = "whitespace"
        elif c == "{":
            tokens.append(("lbrace", "{"))
            state = "scan"
        elif c == "}":
            tokens.append(("rbrace", "}"))
            state = "scan"
        elif c == ";":
            tokens.append(("semicolon", ";"))

    elif state == "identifier":
        if c in alphanumeric:
            pass                  # still inside the identifier
        else:
            # the identifier has ended: emit it, then rescan this character
            tokens.append((state, src[i0:i]))
            state = "scan"
            continue

    elif state == "whitespace":
        if c in whitespace:
            pass                  # still inside the whitespace run
        else:
            # the whitespace run has ended: emit it, then rescan this character
            tokens.append((state, src[i0:i]))
            state = "scan"
            continue

    i += 1

print(tokens)
```

The output (skipping the debug prints):

```plaintext
[('identifier', 'function'), ('whitespace', ' '), ('identifier', 'hello'),
 ('whitespace', ' '), ('lbrace', '{'), ('whitespace', '\n '),
 ('identifier', 'print'), ('whitespace', ' '), ('identifier', 'hello'),
 ('semicolon', ';'), ('whitespace', '\n'), ('rbrace', '}')]
```

Note that V1 has no rule for the `"` character: in the `scan` state it matches no branch and is simply skipped, which is why `"hello"` appears as a plain `identifier` token with its quotes dropped. String literals are left for a later version.
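The whitespace tokens are emitted faithfully in V1, but they carry no grammatical meaning. As a minimal sketch of the next step (my assumption, not part of V1), continuing from the `tokens` list built above, they could be filtered out before the parsing stage:

```python
# Hypothetical follow-on step: discard whitespace tokens before parsing,
# since the grammar only cares about identifiers, braces and semicolons.
significant = [tok for tok in tokens if tok[0] != "whitespace"]
print(significant)
# [('identifier', 'function'), ('identifier', 'hello'), ('lbrace', '{'),
#  ('identifier', 'print'), ('identifier', 'hello'), ('semicolon', ';'),
#  ('rbrace', '}')]
```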