Introduction
In the world of text generation, simple n-gram models can produce decent results, but they often lack context-awareness and coherence. To address these limitations, I have developed SmartSimpleTextGenerator, an improved version of my previous project, ImprovedSimpleTextGenerator.
This new version enhances the text generation process by integrating Part-of-Speech (POS) tagging, n-gram models, and a back-off strategy, making the generated text more meaningful and contextually relevant.
Key Features of SmartSimpleTextGenerator
✅ N-Gram Model with POS Tagging – Uses an n-gram model (default n=3, i.e. trigrams) and applies POS tagging for better word prediction.
✅ Back-off Strategy – If a trigram sequence is unavailable, it falls back to bigrams and unigrams to ensure smooth text generation.
✅ Sentence Tokenization & Structure Preservation – Tokenizes input text properly while maintaining sentence integrity.
✅ Randomized Word Selection – Generates diverse outputs rather than repeating the same phrases.
✅ Handles Unknown Words Gracefully – Introduces a fallback mechanism to prevent abrupt text termination.
What’s New Compared to ImprovedSimpleTextGenerator?
🔹 Integration of POS Tagging – Unlike the previous version, which relied solely on word sequences, this version considers grammatical structure to enhance word selection.
🔹 Improved Text Coherence – The model now produces more fluent sentences by using part-of-speech-based word prediction.
🔹 More Robust Back-Off Strategy – If the highest-order n-gram isn’t available, the model smoothly transitions to lower-order n-grams, reducing abrupt sentence breaks.
🔹 Unigram Frequency Fallback – Ensures better handling of rare words, improving text quality compared to the previous version.
🔹 Better Sentence Termination – Generates text until a logical endpoint, rather than cutting off randomly.
How It Works
1️⃣ Training the Model:
- The input text is tokenized and assigned POS tags.
- Trigrams, bigrams, and unigrams are stored in a structured format.
- Word sequences and their probabilities are recorded for future predictions.
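The training steps above can be sketched in plain Python (a stdlib-only sketch: the regex tokenizer and the `train` helper are simplified stand-ins for the project's own code, which additionally records a POS tag for each word):

```python
from collections import defaultdict
import re

def tokenize(text):
    # Split text into lowercase word and punctuation tokens
    # (a stand-in for a full tokenizer).
    return re.findall(r"\w+|[.!?]", text.lower())

def train(text):
    """Count trigram, bigram, and unigram continuations from the input text."""
    tokens = tokenize(text)
    trigrams = defaultdict(list)   # (w1, w2) -> possible next words
    bigrams = defaultdict(list)    # (w1,)    -> possible next words
    unigrams = defaultdict(int)    # word     -> frequency
    for i, word in enumerate(tokens):
        unigrams[word] += 1
        if i >= 1:
            bigrams[(tokens[i - 1],)].append(word)
        if i >= 2:
            trigrams[(tokens[i - 2], tokens[i - 1])].append(word)
    return trigrams, bigrams, unigrams

trigrams, bigrams, unigrams = train("the cat sat. the cat ran.")
```

Storing every continuation in a list (rather than a probability table) keeps the sketch short; picking a random element from the list is equivalent to sampling proportionally to the observed counts.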
2️⃣ Generating Text:
- The model starts with a user-provided prompt.
- It predicts the next word using trigrams (or falls back to bigrams/unigrams).
- The process continues until sentence-ending punctuation is reached or the word limit is met.
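The generation loop can be sketched roughly like this (a stdlib-only illustration with hand-built toy counts; the `generate` name is hypothetical, and the real project also consults POS tags when choosing candidates):

```python
import random

def generate(trigrams, bigrams, unigrams, prompt, max_words=20):
    """Generate text from a prompt, backing off from trigrams to bigrams to unigrams."""
    words = prompt.lower().split()
    # The most frequent words overall serve as the last-resort fallback.
    top = sorted(unigrams, key=unigrams.get, reverse=True)[:5]
    while len(words) < max_words:
        candidates = (trigrams.get(tuple(words[-2:]))
                      or bigrams.get(tuple(words[-1:]))
                      or top)
        word = random.choice(candidates)
        words.append(word)
        if word in {".", "!", "?"}:
            break  # stop at sentence-ending punctuation
    return " ".join(words)

# Toy counts built by hand for illustration:
trigrams = {("the", "cat"): ["sat", "ran"]}
bigrams = {("cat",): ["sat", "ran"]}
unigrams = {"the": 2, "cat": 2, ".": 2, "sat": 1, "ran": 1}
print(generate(trigrams, bigrams, unigrams, "The cat"))
```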
Code & Installation
The project is available on GitHub; clone the repository to try it out.
A Closer Look at the Improvements
1️⃣ Using POS Tagging for Better Word Prediction
👉 Before (ImprovedSimpleTextGenerator):
It used only word-based transitions, which sometimes led to grammatically incorrect predictions.
👉 Now (SmartSimpleTextGenerator):
It stores POS tags along with words, helping predict grammatically correct words.
📌 Why is this better?
Instead of predicting "is" or "the" at random, the model now considers whether a noun, verb, or adjective should come next!
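As an illustration of POS-aware selection (with a hand-built tag lookup standing in for a real tagger such as NLTK's `pos_tag`, and a hypothetical `filter_by_pos` helper), candidate words can be filtered by the tag expected at the current position:

```python
# Hand-built tag lookup standing in for a real POS tagger (e.g. nltk.pos_tag).
POS = {"cat": "NOUN", "dog": "NOUN", "runs": "VERB", "quickly": "ADV"}

def filter_by_pos(candidates, expected_tag):
    """Keep only candidates whose POS tag matches; fall back to all candidates."""
    matching = [w for w in candidates if POS.get(w) == expected_tag]
    return matching or candidates

# After "the", a noun is expected: prefer "cat"/"dog" over "runs".
print(filter_by_pos(["runs", "cat", "dog"], "NOUN"))  # → ['cat', 'dog']
```

Falling back to the full candidate list when no tag matches keeps generation from stalling, in the same spirit as the n-gram back-off.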
2️⃣ Better Back-Off Strategy (Fallback to Bigram & Unigram)
👉 Before:
If the model couldn’t find a matching trigram, it stopped generating text.
👉 Now:
It first tries trigrams, then bigrams, and if both fail, it falls back to unigrams (most frequent words).
📌 Why is this better?
Even if the model doesn't find a perfect match, it still generates meaningful text instead of abruptly stopping.
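A minimal version of that back-off chain might look like this (the `next_word` helper and the toy counts are illustrative, not the project's actual API):

```python
import random

def next_word(trigrams, bigrams, unigrams, history):
    """Pick the next word, backing off: trigram -> bigram -> frequent unigrams."""
    candidates = trigrams.get(tuple(history[-2:]))
    if not candidates:
        candidates = bigrams.get(tuple(history[-1:]))
    if not candidates:
        # Final fallback: the most frequent words overall.
        candidates = sorted(unigrams, key=unigrams.get, reverse=True)[:3]
    return random.choice(candidates)

trigrams = {("the", "cat"): ["sat"]}
bigrams = {("cat",): ["sat", "ran"]}
unigrams = {"the": 3, "cat": 2, "sat": 1}
```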
3️⃣ Smarter Sentence Ending
👉 Before:
The model kept generating text endlessly or stopped too soon.
👉 Now:
It stops at sentence-ending punctuation (., !, ?) to ensure natural sentence structure.
📌 Why is this better?
Now, sentences end where they naturally should, making the generated text more realistic.
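The stopping condition itself is a one-line check, along these lines (illustrative helper name):

```python
SENTENCE_END = {".", "!", "?"}

def is_sentence_end(token):
    # A token ends the sentence if it is, or ends with, terminal punctuation.
    return token in SENTENCE_END or token[-1:] in SENTENCE_END

print(is_sentence_end("done."))  # → True
print(is_sentence_end("and"))    # → False
```

Checking the last character as well as the whole token handles tokenizers that keep punctuation attached to the preceding word.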
Final Thoughts
With SmartSimpleTextGenerator, text prediction and generation have become more contextually aware and grammatically structured. These enhancements ensure better fluency, diversity, and coherence compared to the older ImprovedSimpleTextGenerator.
Try it out, and feel free to contribute to the GitHub repository! 🚀