WordPlay
Bidirectional character n-gram language models for word generation and the Hangman word-guessing challenge
Ever wondered why English has so many words that, when you hear them for the first time, make you ask how they even exist and why you have never come across them before?
Kibosh, svengali, chicanery, bulwark, clandestine, and so on. The discovery of new words does not stop even after memorizing the entire TOEFL vocabulary list; it goes on and on. It seems as if the language has some magical power to generate new words, give them a meaning, and wait for them to become popular. Making up new words is not a novel idea at all: history is witness to words being brought into popular culture for more than half a millennium, and at a faster pace than ever in the information age.
Find details about the technical implementation here on my GitHub.
- WordGuesser: n-gram language models that drive the guessing strategy for the word-guessing challenge.
- WordGenerator: produces new English words from the trained generative probabilistic n-gram models.
What might these new words be useful for?
Maybe you can use one of these generated words as the name for that secret company idea you have had for ages.
About the trained model: it is actually very simple and intuitive
The patterns you observe in the generated words arise from training the models on a dataset of 227,000 English words.
LexiNet is a character-level word modeling project built around bidirectional n-gram language models. The repository trains forward and reverse models with start/end padding, masked contexts, interpolation/backoff, and smoothing. The first live demo below uses the trained model tables to generate new English-like strings of a requested length directly inside the browser.
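To make the forward and reverse tables concrete, here is a minimal sketch of how such counts could be collected. It is not the repository's code: the padding symbols `^` and `$`, the function names `train_counts` and `prob`, and the plain add-k smoothing are all assumptions chosen for illustration, and interpolation/backoff is left out.

```python
from collections import Counter, defaultdict

START, END = "^", "$"  # hypothetical padding symbols, not taken from the repository


def train_counts(words, n=3):
    """Return (forward, reverse) tables mapping an (n-1)-character context to letter counts."""
    forward = defaultdict(Counter)
    reverse = defaultdict(Counter)
    for word in words:
        padded = START * (n - 1) + word.lower() + END * (n - 1)
        # Visit only the real letters; the padding supplies context at both edges.
        for i in range(n - 1, len(padded) - (n - 1)):
            # Forward model: predict the letter from the n-1 characters to its left.
            forward[padded[i - n + 1:i]][padded[i]] += 1
            # Reverse model: predict the letter from the n-1 characters to its right.
            reverse[padded[i + 1:i + n]][padded[i]] += 1
    return forward, reverse


def prob(table, context, letter, k=1.0, vocab_size=26):
    """Add-k smoothed P(letter | context); an unseen context falls back to uniform."""
    counts = table.get(context, {})
    total = sum(counts.values())
    return (counts.get(letter, 0) + k) / (total + k * vocab_size)


forward, reverse = train_counts(["cat", "car", "cot"], n=3)
print(dict(forward["ca"]))     # letters seen after "ca"
print(dict(reverse["t$"]))     # letters seen before a word-final "t"
print(prob(forward, "ca", "t"))
```

With padding on both sides, the first letter of a word still has a left context (`^^`) and the last letter still has a right context (`$$`), which is what lets a fixed-length blank word be filled in from either end.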
What is happening here? A technical note for the demos
The demo starts with a blank word of the requested length. At every step, it scores candidate letters at every remaining blank position using the trained forward and reverse n-gram contexts, samples one (position, letter) pair, fills that slot, and repeats until the word is complete. Lower temperatures make the generation more conservative; higher temperatures make it more exploratory.
This is not dictionary lookup; it is probabilistic generation from the learned context counts.
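The fill-in loop described above can be sketched roughly as follows. This is a simplified illustration under stated assumptions, not the demo's actual code: it uses only one neighbouring character on each side (bigram contexts) instead of the n=3 to n=5 tables, falls back to a unigram distribution when a neighbour is still blank, and uses add-k smoothing. The names `train`, `score`, and `generate` are hypothetical.

```python
import math
import random
from collections import Counter, defaultdict

ALPHABET = "abcdefghijklmnopqrstuvwxyz"
BLANK, START, END = "_", "^", "$"  # hypothetical marker symbols


def train(words):
    """Count letters given their left neighbour (fwd), right neighbour (rev), and overall (uni)."""
    fwd, rev, uni = defaultdict(Counter), defaultdict(Counter), defaultdict(Counter)
    for word in words:
        padded = START + word.lower() + END
        for i in range(1, len(padded) - 1):
            fwd[padded[i - 1]][padded[i]] += 1
            rev[padded[i + 1]][padded[i]] += 1
            uni[""][padded[i]] += 1
    return fwd, rev, uni


def letter_prob(table, context, letter, k=0.1):
    """Add-k smoothed P(letter | context)."""
    counts = table.get(context, {})
    return (counts.get(letter, 0) + k) / (sum(counts.values()) + k * len(ALPHABET))


def score(slots, pos, letter, fwd, rev, uni):
    """Log-score for placing `letter` at `pos`; a blank neighbour backs off to the unigram table."""
    padded = [START] + slots + [END]
    left, right = padded[pos], padded[pos + 2]
    p_left = letter_prob(uni, "", letter) if left == BLANK else letter_prob(fwd, left, letter)
    p_right = letter_prob(uni, "", letter) if right == BLANK else letter_prob(rev, right, letter)
    return math.log(p_left) + math.log(p_right)


def generate(length, fwd, rev, uni, temperature=1.0, rng=random):
    """Fill a blank word by repeatedly sampling one (position, letter) pair."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    slots = [BLANK] * length
    while BLANK in slots:
        candidates = [(p, c) for p, ch in enumerate(slots) if ch == BLANK for c in ALPHABET]
        logits = [score(slots, p, c, fwd, rev, uni) / temperature for p, c in candidates]
        top = max(logits)  # subtract the max before exponentiating for numerical stability
        weights = [math.exp(x - top) for x in logits]
        pos, letter = rng.choices(candidates, weights=weights, k=1)[0]
        slots[pos] = letter
    return "".join(slots)


fwd, rev, uni = train(["cat", "car", "cot", "dog", "dot", "cart", "coat"])
print(generate(5, fwd, rev, uni, temperature=0.8, rng=random.Random(0)))
```

Dividing the log-scores by the temperature before normalizing is what produces the behaviour described above: values below 1 sharpen the distribution toward the most likely (position, letter) pairs, while values above 1 flatten it.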
One might be so bold as to call these words hallucinations! Hah, that is not incorrect. The very premise of this generation is to find new words that were never seen in the training set; they surface from the gaps between the learned probability distributions, which is what pop culture calls hallucination.
Both demos run entirely in the browser from compact JSON exports of the trained n-gram count tables. The generator begins with blank slots and repeatedly samples a (position, letter) pair using bidirectional context probabilities. Temperature controls how sharply those probabilities are used: low temperature favors the highest-probability letters, while high temperature allows more unusual choices. The guessing game is deterministic and greedier: given the visible pattern, it scores every unguessed letter across all blank positions using forward and reverse context probabilities, guesses the highest-scoring letter, reveals matches, and loses one life only on misses. For now, the site uses the strongest loaded public model order from n=3 to n=5; the local n=6 model can be swapped in later if the larger browser asset feels worth it.
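The greedy guessing strategy can be sketched in the same spirit. As before, this is an illustration under assumptions rather than the site's implementation: contexts are a single neighbouring character, blank neighbours back off to unigram frequencies, and a letter's score is taken to be the sum over blank positions of its forward probability times its reverse probability. The names `next_guess` and `play` are hypothetical.

```python
from collections import Counter, defaultdict

ALPHABET = "abcdefghijklmnopqrstuvwxyz"
BLANK, START, END = "_", "^", "$"  # hypothetical marker symbols


def train(words):
    """Count letters given their left neighbour (fwd), right neighbour (rev), and overall (uni)."""
    fwd, rev, uni = defaultdict(Counter), defaultdict(Counter), defaultdict(Counter)
    for word in words:
        padded = START + word.lower() + END
        for i in range(1, len(padded) - 1):
            fwd[padded[i - 1]][padded[i]] += 1
            rev[padded[i + 1]][padded[i]] += 1
            uni[""][padded[i]] += 1
    return fwd, rev, uni


def letter_prob(table, context, letter, k=0.1):
    """Add-k smoothed P(letter | context)."""
    counts = table.get(context, {})
    return (counts.get(letter, 0) + k) / (sum(counts.values()) + k * len(ALPHABET))


def next_guess(pattern, guessed, fwd, rev, uni):
    """Return the unguessed letter with the highest total score across all blank positions."""
    padded = START + pattern + END
    totals = Counter()
    for i, ch in enumerate(pattern):
        if ch != BLANK:
            continue
        left, right = padded[i], padded[i + 2]
        for c in ALPHABET:
            if c in guessed:
                continue
            p_left = letter_prob(uni, "", c) if left == BLANK else letter_prob(fwd, left, c)
            p_right = letter_prob(uni, "", c) if right == BLANK else letter_prob(rev, right, c)
            totals[c] += p_left * p_right
    # Iterating in sorted order makes ties resolve alphabetically, keeping the game deterministic.
    return max(sorted(totals), key=totals.get)


def play(secret, fwd, rev, uni, lives=6):
    """Play one game; return the final pattern and the number of lives left."""
    pattern, guessed = BLANK * len(secret), set()
    while BLANK in pattern and lives > 0:
        guess = next_guess(pattern, guessed, fwd, rev, uni)
        guessed.add(guess)
        if guess in secret:
            # Reveal every occurrence of a correct letter.
            pattern = "".join(s if s == guess else p for s, p in zip(secret, pattern))
        else:
            lives -= 1  # only a miss costs a life
    return pattern, lives


fwd, rev, uni = train(["cat", "car", "cot", "dog", "dot", "cart", "coat"])
print(play("cat", fwd, rev, uni))
```

Because there is no sampling step, the same pattern and guess history always produce the same next guess, which is the deterministic behaviour the guessing game relies on.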