Finnish Name Generator: from Markov chain baseline to a small neural net
Note: This article was written with the help of an AI based on my Jupyter Notebook of the project.
Short story first. I wanted to demonstrate core ML skills by building something concrete, inspectable, and fun: a Finnish-style name generator. The path starts with a dead-simple character bigram Markov model, then levels up to a compact character-level LSTM that learns longer patterns. The secondary goal is practical: a generator for fiction where a real Finnish name or an imported name would feel out of place or uncanny.
The dataset is a single text file, one name per line, sourced from Avoindata. I filtered it down to male and female first names that have been given to more than 100 people. Nothing exotic: plain text in, plain text out.
Below I walk through the project in two phases and include the code blocks exactly as I used them. You can run each cell as you read. Comments in prose point out what matters and why.
Phase I — A tiny bigram Markov chain
This is the warm-up. It captures immediate letter-to-letter transitions. No memory beyond the previous character. Surprisingly decent for short names, and very instructive.
Load names
# Cell 1: Import libraries and load data
import random
# Read in Finnish names from the text file (one name per line)
with open('finnish_names.txt', 'r', encoding='utf-8') as f:
    names = [line.strip() for line in f if line.strip()]
print(f"Loaded {len(names)} names")
Build a bigram model with explicit start/stop
# Cell 2: Build a bigram Markov model
# We'll use '^' as start-token and '$' as end-token
model = {}
for name in names:
    padded = '^' + name + '$'
    for i in range(len(padded) - 1):
        prev_char = padded[i]
        next_char = padded[i + 1]
        model.setdefault(prev_char, []).append(next_char)
# Inspect a sample of transitions for the letter 'a'
print("Sample transitions from 'a':", model.get('a', [])[:10])
Sample names by walking the chain
# Cell 3: Define a function to generate a new name
def generate_name(model, max_length=20):
    """
    Generate a single name by walking the Markov chain.
    - Start from '^'
    - At each step, randomly choose one of the observed next characters
    - Stop when we hit '$' or exceed max_length
    """
    result = ""
    current = '^'
    while True:
        choices = model.get(current)
        if not choices:
            break
        nxt = random.choice(choices)
        if nxt == '$' or len(result) >= max_length:
            break
        result += nxt
        current = nxt
    return result
Quick smoke test
# Cell 4: Generate and display example Finnish-style names
for _ in range(10):
    print(generate_name(model))
# Sample output:
Ha
Tuvija
Eukkutain
Singatinti
Navana
Pemiks
Nekanansarispiana
Ederolgvimiinpis
An
Eli
What this gives you. A fast, understandable baseline. You will see plausible bits and vowels doing vowel things. You will also see the limits: no sense of longer patterns, syllables, or vowel harmony beyond one step.
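If you want to see just how little context the chain has, normalize the raw transition lists into probabilities and look at what can follow a single character. This is a small inspection sketch of my own on top of the model dict from Cell 2, not one of the original cells:
from collections import Counter
def transition_probs(model, ch):
    """Empirical P(next char | ch) from the observed transitions."""
    counts = Counter(model.get(ch, []))
    total = sum(counts.values())
    return {nxt: n / total for nxt, n in counts.most_common()} if total else {}
# Everything the chain "knows" after seeing 'a': exactly one step of context
print(transition_probs(model, 'a'))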
Phase II — Character-level LSTM (a “proper” neural network)
Now we let the model remember more than one character. A small LSTM over characters is enough to learn multi-letter patterns common in Finnish names, including hyphenation; capitalization is handled by a small routine we apply post-generation.
Hyperparameters
# Cell 1: Imports and hyperparameters
import torch
import torch.nn as nn
from torch.utils.data import Dataset, DataLoader
import random
# Hyperparameters – feel free to tweak
BATCH_SIZE = 64
EMBED_SIZE = 32
HIDDEN_SIZE = 128
NUM_LAYERS = 2
LEARNING_RATE = 0.002
NUM_EPOCHS = 20
DEVICE = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
Load and normalize data, build vocabulary, and keep an “existing names” set
The model trains on lowercase with ^ and $ as explicit sequence boundaries. After sampling, we apply a small capitalization routine so anna-maija becomes Anna-Maija. We also maintain a set of original names to avoid outputting real names verbatim.
# Cell 2 (updated): Load data, lowercase, build vocab (with PAD token)
with open('finnish_names.txt', 'r', encoding='utf-8') as f:
    raw = [line.strip() for line in f if line.strip()]
# Lowercase all names for training
names = [name.lower() for name in raw]
# Add start/end tokens
all_text = ['^' + name + '$' for name in names]
chars = sorted(set(''.join(all_text)))
PAD = '<pad>'
chars.append(PAD)
char2idx = {ch:i for i,ch in enumerate(chars)}
idx2char = {i:ch for ch,i in char2idx.items()}
PAD_IDX = char2idx[PAD]
print(f"{len(names)} names, vocab size (incl. PAD): {len(chars)}")
# Cell 2a (after loading raw names): Build a set of existing names, post-processed
# We lowercase raw names, then post-process to get their canonical form
def capitalize_finnish(name: str) -> str:
    parts = name.split('-')
    return '-'.join(p.capitalize() for p in parts)
existing = {
    capitalize_finnish(name.lower())
    for name in raw
}
print(f"Found {len(existing)} unique original names")
Dataset, padding, and data loader
We convert each character to an index and create (input, target) pairs by shifting one step. Batches are padded and we tell the loss to ignore <pad> tokens so the model is trained only on real characters.
# Cell 3 (updated): Dataset + collate_fn + DataLoader
from torch.nn.utils.rnn import pad_sequence
class NameDataset(torch.utils.data.Dataset):
    def __init__(self, sequences, char2idx):
        self.seq_idxs = [
            torch.tensor([char2idx[ch] for ch in seq], dtype=torch.long)
            for seq in sequences
        ]
    def __len__(self):
        return len(self.seq_idxs)
    def __getitem__(self, i):
        seq = self.seq_idxs[i]
        return seq[:-1], seq[1:]  # inputs, targets
def collate_fn(batch):
    inputs, targets = zip(*batch)
    # pad both to max length in batch
    inputs_p = pad_sequence(inputs, batch_first=True, padding_value=PAD_IDX)
    targets_p = pad_sequence(targets, batch_first=True, padding_value=PAD_IDX)
    return inputs_p, targets_p
dataset = NameDataset(all_text, char2idx)
loader = DataLoader(
    dataset,
    batch_size=BATCH_SIZE,
    shuffle=True,
    drop_last=True,
    collate_fn=collate_fn
)
# Update loss to ignore PAD positions
criterion = nn.CrossEntropyLoss(ignore_index=PAD_IDX)
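To confirm the padding behaves as described, peeking at one batch is enough. This sanity check is an addition of mine, not one of the original cells:
# Pull a single padded batch: both tensors share the batch's max sequence length
sample_inputs, sample_targets = next(iter(loader))
print(sample_inputs.shape, sample_targets.shape)
print((sample_inputs == PAD_IDX).sum().item(), "PAD positions in this input batch")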
Note: In a later cell the loss is redefined without ignore_index. Keep the ignore_index=PAD_IDX version above for best results with padded batches.
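If the effect of ignore_index is not obvious, a toy comparison makes it concrete. The tensors below are purely illustrative, not notebook data:
# Four positions, two of which are PAD targets
toy_logits = torch.randn(4, len(chars))
toy_targets = torch.tensor([char2idx['a'], char2idx['n'], PAD_IDX, PAD_IDX])
loss_with_pad = nn.CrossEntropyLoss()(toy_logits, toy_targets)
loss_ignoring_pad = nn.CrossEntropyLoss(ignore_index=PAD_IDX)(toy_logits, toy_targets)
print(loss_with_pad.item(), loss_ignoring_pad.item())  # the second averages over the two real targets only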
Model: a lean character-level LSTM
An embedding layer turns indices into vectors, a stacked LSTM models the sequence, and a linear head projects back to character logits.
# Cell 4: Define the LSTM model
class NameLSTM(nn.Module):
    def __init__(self, vocab_size, embed_size, hidden_size, num_layers):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_size)
        self.lstm = nn.LSTM(embed_size, hidden_size, num_layers, batch_first=True)
        self.fc = nn.Linear(hidden_size, vocab_size)
    def forward(self, x, hidden=None):
        x = self.embed(x)  # (B, T) → (B, T, E)
        out, hidden = self.lstm(x, hidden)
        logits = self.fc(out)  # (B, T, H) → (B, T, V)
        return logits, hidden
model = NameLSTM(len(chars), EMBED_SIZE, HIDDEN_SIZE, NUM_LAYERS).to(DEVICE)
optimizer = torch.optim.Adam(model.parameters(), lr=LEARNING_RATE)
criterion = nn.CrossEntropyLoss()
As mentioned, prefer the earlier criterion = nn.CrossEntropyLoss(ignore_index=PAD_IDX). Either remove this redefinition or change it to the ignore-index version so you avoid training on <pad>.
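Before training, I like a one-batch dry run to confirm the shapes line up; a small check of my own, assuming the loader from Cell 3:
# Forward pass on one padded batch: logits should come out as (batch, time, vocab)
xb, yb = next(iter(loader))
with torch.no_grad():
    logits, _ = model(xb.to(DEVICE))
print(xb.shape, "→", logits.shape)  # (B, T) → (B, T, V)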
Training loop
Teacher forcing with next-character prediction. We reshape to (B*T, V) for the loss.
# Cell 5: Training loop
model.train()
for epoch in range(1, NUM_EPOCHS+1):
    total_loss = 0
    for inputs, targets in loader:
        inputs, targets = inputs.to(DEVICE), targets.to(DEVICE)
        optimizer.zero_grad()
        logits, _ = model(inputs)
        # reshape for loss: (B*T, V)
        loss = criterion(logits.view(-1, logits.size(-1)), targets.view(-1))
        loss.backward()
        optimizer.step()
        total_loss += loss.item()
    avg = total_loss / len(loader)
    print(f"Epoch {epoch}/{NUM_EPOCHS} loss: {avg:.4f}")
Temperature-controlled sampling with uniqueness filter
We sample one character at a time, with a softmax temperature. Before returning a name, we canonicalize capitalization and ensure it is not an exact match to any real name from the source list. Fiction stays fictional.
# Cell 6b (replace previous sampling with uniqueness loop)
import numpy as np
def sample_unique_name(model, char2idx, idx2char,
                       existing, max_len=20, temperature=1.0, max_tries=10):
    """
    Sample up to max_tries times until we get a name not in existing set.
    Falls back to last sample if uniqueness fails.
    """
    for _ in range(max_tries):
        # generate one name
        model.eval()
        with torch.no_grad():
            inp = torch.tensor([[char2idx['^']]], device=DEVICE)
            raw_name = ''
            hidden = None
            while True:
                logits, hidden = model(inp, hidden)
                logits = logits[0, -1] / temperature
                probs = torch.softmax(logits, dim=0).cpu().numpy()
                probs[char2idx['<pad>']] = 0.0  # never sample PAD
                probs = np.clip(probs, 1e-12, None)
                probs = probs / probs.sum()
                idx = np.random.choice(len(probs), p=probs)
                ch = idx2char[idx]
                if ch == '$' or len(raw_name) >= max_len:
                    break
                raw_name += ch
                inp = torch.tensor([[idx]], device=DEVICE)
        name = capitalize_finnish(raw_name)
        if name not in existing:
            return name
    # If all tries fail, return the last one anyway
    return name
Generate samples at different temperatures
Lower temperatures stick to high-probability patterns. Higher temperatures explore. I usually sweep a small grid and skim for “feel”.
# Cell 7: Generate examples
import numpy as np
for temp in [1.0, 1.2, 1.4, 1.6, 1.7, 1.8]:
    print(f"\n-- temperature {temp} --")
    for _ in range(20):
        print(sample_unique_name(model, char2idx, idx2char, existing, temperature=temp))
# Sample output:
-- temperature 1.0 --
Demiida
Einer
Mathian
Hertinpoika
Anttem
Nigsya
Roly
Cosas
Somelia
Terja
Fmaida
Colina
Eas
Teni
Adi-Pekka
Lilu
Jakal
Lilier
Juiny
Aldanno
-- temperature 1.2 --
Vrynek
Engeth
Glgbenpoika
Hilkkastaria
Hu
Salmi
Tiriic
Berd
Jussan
Ryu
Bestian
Styfe
Vijku
Marciska
Philiat
Paja
Mivenia
Päini
Ventho
Deobeor
-- temperature 1.4 --
Ead
Jurh
Julu
Hudinpoek
Hillab
Sampu
Ussi
Loure
Eedanpoika
Sare
Heannina
Ja
Antinti
Tram
Bervi
Filis
Cristiina
Bler
Rusnd
Myrvo
-- temperature 1.6 --
Rikuulfaed
Göbéc
Myhgina
Tiika
Christitta
Ankarina
Wofhemlija
Orvas
Vyalim
Tufais
Venalmiio
Yöhnan
Ksadian
Hjarni
Ossikhadehved
Petam
Caric
Astir
Joon
Jaakam
-- temperature 1.7 --
Vadfih
Ossud
Rosamary
Aviweiqe
Everas
Oxöpine
Oviablas
Ederia
Niimis
Hamfed
Vgsat
Walpris
Pytrin
Celiklia
Ronne
Limeriia
Chyrik
Monjas
Riwo
Nirkki-Pettej
-- temperature 1.8 --
Puälvan
Piliima
Jorks
Abdrel
.alyvi
Dtithm.er
Emprita
Denossfnhi
Uhpeiinpoika
Arlejiödo
Lisep
Dewevy
Valterus
Pheonitti
Ftrehiias
Vali
Ah
Whegvea
Catrisula
Icrahimj
What this demonstrates (primary point)
- Data handling: load, normalize, tokenize, pad, and build vocabularies.
- Baselines first: a Markov chain you can read with your eyes.
- Sequence modeling: a compact character-level LSTM with embeddings, training loop, and temperature sampling.
- Guardrails: canonicalization and an “existing names” filter to avoid reproducing source entries.
What this is good for (secondary point)
- A fiction-friendly Finnish-style name generator when a real registry name would be awkward.
- Rapid ideation: generate a list, cherry-pick the few that make you nod.
Tuning tips
- Temperature. 1.0–1.4 tends to be safe. 1.5–1.7 yields braver shapes, with occasional weirdness that might be exactly what you want.
- Length. Clamp max_len to 6–12 if you prefer compact names.
- Capacity. If you see repetition, try HIDDEN_SIZE = 256. If you overfit to real names (the uniqueness filter catches it), reduce epochs or try dropout in the LSTM.
- Bigram + LSTM hybrid. It is trivial to seed the LSTM with top-k bigram suggestions for the first letter if you want extra stability at the beginning; a rough sketch follows this list.
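Here is that hybrid seeding as a sketch, under my own assumptions: the "bigram suggestions" for the opening letter reduce to the most common first letters of the training names (which is exactly what the Phase I chain stores for '^'), and TOP_K and the helper name sample_seeded_name are mine.
from collections import Counter
# Top-k most common first letters in the lowercased training names
first_letter_counts = Counter(name[0] for name in names)
TOP_K = 5  # assumption: a small k keeps openings conventional
def sample_seeded_name(model, k=TOP_K, temperature=1.2, max_len=20):
    """Pick a common first letter, then let the LSTM finish the name."""
    seed = random.choice([ch for ch, _ in first_letter_counts.most_common(k)])
    model.eval()
    with torch.no_grad():
        hidden = None
        # Feed '^' and the seed letter so the hidden state reflects both
        for ch in ['^', seed]:
            logits, hidden = model(torch.tensor([[char2idx[ch]]], device=DEVICE), hidden)
        raw_name = seed
        while True:
            probs = torch.softmax(logits[0, -1] / temperature, dim=0).cpu().numpy()
            probs[PAD_IDX] = 0.0  # never sample PAD
            probs = probs / probs.sum()
            idx = np.random.choice(len(probs), p=probs)
            ch = idx2char[idx]
            if ch == '$' or len(raw_name) >= max_len:
                break
            raw_name += ch
            logits, hidden = model(torch.tensor([[idx]], device=DEVICE), hidden)
    return capitalize_finnish(raw_name)
print(sample_seeded_name(model))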
Sensible next steps
- Add a simple syllable model or a constraint that bans triple vowels/consonants unless observed; a minimal filter sketch appears after this list.
- Train a tiny Transformer for curiosity and compare sample quality.
- Add filters for stylistic knobs: hyphenated vs. non-hyphenated, ending in -a vs. -i, etc.
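As a concrete starting point for the run constraint mentioned above, here is a minimal sketch under my own assumptions: VOWELS, runs_of_three, and passes_run_filter are hypothetical helpers, and "observed" means the three-letter vowel or consonant run occurs somewhere in the training names.
VOWELS = set('aeiouyäö')  # assumption: adjust for å and loanword spellings
def runs_of_three(name):
    """Yield every length-3 window that is all vowels or all consonants."""
    low = name.lower()
    for i in range(len(low) - 2):
        window = low[i:i + 3]
        if not window.isalpha():
            continue  # skip windows containing hyphens
        if all(c in VOWELS for c in window) or all(c not in VOWELS for c in window):
            yield window
# Runs that actually occur in the training data are allowed
observed_runs = {w for name in names for w in runs_of_three(name)}
def passes_run_filter(candidate):
    """Reject names whose vowel/consonant runs never appear in real names."""
    return all(w in observed_runs for w in runs_of_three(candidate))
# Example: keep only samples the filter accepts
samples = [sample_unique_name(model, char2idx, idx2char, existing) for _ in range(20)]
print([s for s in samples if passes_run_filter(s)])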
If you read this far, you now have both a baseline and a “proper” neural approach, end-to-end, with code you can run in a single notebook. Clean, finite, and, dare I say, pleasantly Finnish.