📝 Introduction
Text generation has come a long way, from simple rule-based approaches to powerful AI-driven models like GPT. In our previous post, we introduced a Simple Text Generator based on Markov Chains, capable of generating text from Wikipedia biographies.
In this post, we take it one step further by improving the coherence, readability, and structure of the generated text. Let’s explore what’s new and how you can build your own Improved Text Generator. 🚀
📌 How Does a Markov Chain Text Generator Work?
A Markov Chain is a simple probabilistic model that predicts the next word in a sequence based on previous words. In our case:
- We analyze Wikipedia text to learn word relationships.
- We build a transition graph mapping words to possible next words.
- We generate text by following word sequences probabilistically.
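Here is a minimal sketch of the bigram idea in Python (illustrative only; the function names are ours, not the repository's):

```python
import random
from collections import defaultdict

def build_transitions(words):
    """Map each word to the list of words observed right after it."""
    transitions = defaultdict(list)
    for current_word, next_word in zip(words, words[1:]):
        transitions[current_word].append(next_word)
    return transitions

def generate(transitions, start_word, length=20):
    """Walk the transition graph, picking each successor at random."""
    word = start_word
    output = [word]
    for _ in range(length - 1):
        candidates = transitions.get(word)
        if not candidates:
            break  # dead end: this word was never followed by anything
        word = random.choice(candidates)
        output.append(word)
    return " ".join(output)
```

Feed it tokenized Wikipedia text and a start word, and it produces a random walk through the observed word pairs.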
But the basic Markov Chain had some limitations:
❌ It could only use one preceding word for predictions (a bigram model).
❌ It struggled with readability, often generating disjointed text.
❌ It could only process one Wikipedia page at a time.
✨ What’s New in the Improved Version?
1️⃣ Smarter Text Generation with Trigrams
📌 Before: The original model used bigrams (1 preceding word).
📌 Now: It uses trigrams, meaning each word is predicted based on two preceding words!
✔️ This improves fluency and sentence coherence significantly.
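Concretely, the transition key changes from a single word to a pair of words. A minimal trigram sketch (again, names are illustrative, not the repository's):

```python
import random
from collections import defaultdict

def build_trigram_transitions(words):
    """Map each (w1, w2) pair to the words observed right after it."""
    transitions = defaultdict(list)
    for w1, w2, w3 in zip(words, words[1:], words[2:]):
        transitions[(w1, w2)].append(w3)
    return transitions

def generate_trigram(transitions, seed, length=20):
    """Generate text from a two-word seed, e.g. ("he", "was")."""
    w1, w2 = seed
    output = [w1, w2]
    for _ in range(length - 2):
        candidates = transitions.get((w1, w2))
        if not candidates:
            break
        w1, w2 = w2, random.choice(candidates)
        output.append(w2)
    return " ".join(output)
```

Because each prediction is conditioned on two words instead of one, locally implausible word pairs become much rarer.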
2️⃣ Supports Multiple Wikipedia Pages
📌 Before: The generator was trained on one biography at a time.
📌 Now: It can fetch and process multiple Wikipedia biographies, leading to better learning.
3️⃣ Cleaner & Better Tokenization
📌 Before: Basic tokenization removed punctuation, but was inconsistent.
📌 Now: Improved punctuation handling, better word splitting, and cleaner text processing.
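A regex-based tokenizer along these lines would do the job (a sketch under that assumption; the repo's implementation may differ):

```python
import re

def tokenize(text):
    """Lowercase the text, keep words (with internal apostrophes) and
    sentence-ending punctuation as separate tokens, drop the rest."""
    return re.findall(r"[a-z0-9]+(?:'[a-z]+)?|[.?!]", text.lower())
```

For example, `tokenize("He was born in 1920. Really?")` yields `['he', 'was', 'born', 'in', '1920', '.', 'really', '?']`.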
4️⃣ Readability Improvements
📌 Before: Generated text could have random punctuation mistakes.
📌 Now: It capitalizes sentences, fixes spaces before punctuation, and ensures proper sentence endings (., ?, !).
5️⃣ Output Logging for Better Tracking
📌 Before: Generated text was only printed on the screen.
📌 Now: All outputs are saved to a log file (generated_output.txt), along with timestamps and prompts for easy tracking.
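Logging can be as simple as appending to the file with a timestamp; a sketch (generated_output.txt comes from the post, the function itself is assumed):

```python
from datetime import datetime

def log_output(prompt, generated, path="generated_output.txt"):
    """Append the prompt and generated text to the log with a timestamp."""
    timestamp = datetime.now().strftime("%Y-%m-%d %H:%M:%S")
    with open(path, "a", encoding="utf-8") as f:
        f.write(f"[{timestamp}] Prompt: {prompt}\n{generated}\n\n")
```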
🛠️ How to Build & Run It
1️⃣ Install Dependencies
Run the following command:
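(Assuming the repository ships a requirements.txt, which is typical; check the repo to confirm.)

```
pip install -r requirements.txt
```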
Or use:
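(Here wikipedia is an assumed package name; the repository lists the actual dependencies.)

```
pip install wikipedia
```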
2️⃣ Fetch Wikipedia Biographies
Run the following script to download Wikipedia biographies:
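(The script name below is an assumption; check the repository for the actual filename.)

```
python fetch_biographies.py
```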
This will save the Wikipedia text inside the data/ folder.
3️⃣ Train & Generate Text
Run the following script to train the model and generate text:
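(As above, the script name is an assumption; see the repository for the real entry point.)

```
python improved_text_generator.py
```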
The generated text is printed to the screen and also saved to the log file.
📌 Sample Output
📝 Prompt: "He was"
🤖 Generated Text:
📂 Log file:
📂 GitHub Repository & Code
You can find the complete Improved Text Generator code on GitHub:
👉 https://github.com/jags-programming/ImprovedSimpleTextGenerator
🔮 What’s Next?
This is just the beginning! In future versions, we can:
🔹 Use POS-tagging to improve grammar accuracy.
🔹 Enhance text coherence with back-off strategies.
🔹 Experiment with Transformer models like GPT for deeper learning.
Stay tuned for more NLP projects and tutorials! 🚀
💡 What do you think of this improved version? Let us know in the comments! 🔥
📌 Next Steps for You:
✅ Try running the code on your own Wikipedia text!
✅ Modify the parameters to see different results.
✅ Check out the GitHub repository and start experimenting.
Would you like further improvements or a tutorial on advanced text generation? Let us know! 🎯