📝 Introduction
Text generation has come a long way, from simple rule-based approaches to powerful AI-driven models like GPT. In our previous post, we introduced a Simple Text Generator based on Markov Chains, capable of generating text from Wikipedia biographies.
In this post, we take it one step further by improving the coherence, readability, and structure of the generated text. Let’s explore what’s new and how you can build your own Improved Text Generator. 🚀
📌 How Does a Markov Chain Text Generator Work?
A Markov Chain is a simple probabilistic model that predicts the next word in a sequence based on previous words. In our case:
- We analyze Wikipedia text to learn word relationships.
- We build a transition graph mapping words to possible next words.
- We generate text by following word sequences probabilistically.
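Here is a minimal sketch of the bigram idea in Python (illustrative only; the function names are ours, not the repository's):

```python
import random
from collections import defaultdict

def build_transitions(words):
    """Map each word to the list of words observed right after it."""
    transitions = defaultdict(list)
    for current_word, next_word in zip(words, words[1:]):
        transitions[current_word].append(next_word)
    return transitions

def generate(transitions, start_word, length=20):
    """Walk the transition graph, picking each successor at random."""
    word = start_word
    output = [word]
    for _ in range(length - 1):
        candidates = transitions.get(word)
        if not candidates:
            break  # dead end: this word was never followed by anything
        word = random.choice(candidates)
        output.append(word)
    return " ".join(output)
```

Feed it tokenized Wikipedia text and a start word, and it produces a random walk through the observed word pairs.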
But the basic Markov Chain had some limitations:
❌ It could only use one preceding word for predictions (a bigram model).
❌ It struggled with readability, often generating disjointed text.
❌ It could only process one Wikipedia page at a time.
✨ What’s New in the Improved Version?
1️⃣ Smarter Text Generation with Trigrams
📌 Before: The original model used bigrams (1 preceding word).
📌 Now: It uses trigrams, meaning each word is predicted based on two preceding words!
✔️ This improves fluency and sentence coherence significantly.
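Concretely, the transition key changes from a single word to a pair of words. A minimal trigram sketch (again, names are illustrative, not the repository's):

```python
import random
from collections import defaultdict

def build_trigram_transitions(words):
    """Map each (w1, w2) pair to the words observed right after it."""
    transitions = defaultdict(list)
    for w1, w2, w3 in zip(words, words[1:], words[2:]):
        transitions[(w1, w2)].append(w3)
    return transitions

def generate_trigram(transitions, seed, length=20):
    """Generate text from a two-word seed, e.g. ("he", "was")."""
    w1, w2 = seed
    output = [w1, w2]
    for _ in range(length - 2):
        candidates = transitions.get((w1, w2))
        if not candidates:
            break
        w1, w2 = w2, random.choice(candidates)
        output.append(w2)
    return " ".join(output)
```

Because each prediction is conditioned on two words instead of one, locally implausible word pairs become much rarer.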
2️⃣ Supports Multiple Wikipedia Pages
📌 Before: The generator was trained on one biography at a time.
📌 Now: It can fetch and process multiple Wikipedia biographies, leading to better learning.
3️⃣ Cleaner & Better Tokenization
📌 Before: Basic tokenization removed punctuation, but was inconsistent.
📌 Now: Improved punctuation handling, better word splitting, and cleaner text processing.
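A regex-based tokenizer along these lines would do the job (a sketch under that assumption; the repo's implementation may differ):

```python
import re

def tokenize(text):
    """Lowercase the text, keep words (with internal apostrophes) and
    sentence-ending punctuation as separate tokens, drop the rest."""
    return re.findall(r"[a-z0-9]+(?:'[a-z]+)?|[.?!]", text.lower())
```

For example, `tokenize("He was born in 1920. Really?")` yields `['he', 'was', 'born', 'in', '1920', '.', 'really', '?']`.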
4️⃣ Readability Improvements
📌 Before: Generated text could have random punctuation mistakes.
📌 Now: It capitalizes sentences, fixes spaces before punctuation, and ensures proper sentence endings (., ?, !).
5️⃣ Output Logging for Better Tracking
📌 Before: Generated text was only printed on the screen.
📌 Now: All outputs are saved to a log file (generated_output.txt), along with timestamps and prompts for easy tracking.
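Logging can be as simple as appending to the file with a timestamp; a sketch (generated_output.txt comes from the post, the function itself is assumed):

```python
from datetime import datetime

def log_output(prompt, generated, path="generated_output.txt"):
    """Append the prompt and generated text to the log with a timestamp."""
    timestamp = datetime.now().strftime("%Y-%m-%d %H:%M:%S")
    with open(path, "a", encoding="utf-8") as f:
        f.write(f"[{timestamp}] Prompt: {prompt}\n{generated}\n\n")
```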
🛠️ How to Build & Run It
1️⃣ Install Dependencies
Run the following command:
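(Assuming the repository ships a requirements.txt, which is typical; check the repo to confirm.)

```
pip install -r requirements.txt
```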
Or use:
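(Here wikipedia is an assumed package name; the repository lists the actual dependencies.)

```
pip install wikipedia
```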
2️⃣ Fetch Wikipedia Biographies
Run the following script to download Wikipedia biographies:
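(The script name below is an assumption; check the repository for the actual filename.)

```
python fetch_biographies.py
```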
This will save the Wikipedia text inside the data/ folder.
3️⃣ Train & Generate Text
Run the following script to train the model and generate text:
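(As above, the script name is an assumption; see the repository for the real entry point.)

```
python improved_text_generator.py
```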
The generated text is printed to the screen and also saved to the log file.
📌 Sample Output
📝 Prompt: "He was"
🤖 Generated Text:
📂 Log file:
📂 GitHub Repository & Code
You can find the complete Improved Text Generator code on GitHub:
👉 https://github.com/jags-programming/ImprovedSimpleTextGenerator
🔮 What’s Next?
This is just the beginning! In future versions, we can:
🔹 Use POS-tagging to improve grammar accuracy.
🔹 Enhance text coherence with back-off strategies.
🔹 Experiment with Transformer models like GPT for deeper learning.
Stay tuned for more NLP projects and tutorials! 🚀
💡 What do you think of this improved version? Let us know in the comments! 🔥
📌 Next Steps for You:
✅ Try running the code on your own Wikipedia text!
✅ Modify the parameters to see different results.
✅ Check out the GitHub repository and start experimenting.
Would you like further improvements or a tutorial on advanced text generation? Let us know! 🎯