
Building an Improved Text Generator Using Markov Chains


📝 Introduction

Text generation has come a long way, from simple rule-based approaches to powerful AI-driven models like GPT. In our previous post, we introduced a Simple Text Generator based on Markov Chains, capable of generating text from Wikipedia biographies.

In this post, we take it one step further by improving the coherence, readability, and structure of the generated text. Let’s explore what’s new and how you can build your own Improved Text Generator. 🚀


📌 How Does a Markov Chain Text Generator Work?

A Markov Chain is a simple probabilistic model that predicts the next word in a sequence based only on the most recent word (or words), not the full sentence history. In our case:

  • We analyze Wikipedia text to learn word relationships.
  • We build a transition graph mapping words to possible next words.
  • We generate text by following word sequences probabilistically.
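The steps above can be sketched in a few lines of Python. This is a minimal illustration of the bigram idea, not the repository's actual code; the function names `build_chain` and `generate` are chosen here for clarity:

```python
import random
from collections import defaultdict

def build_chain(words):
    """Map each word to the list of words that follow it in the corpus."""
    chain = defaultdict(list)
    for current, nxt in zip(words, words[1:]):
        chain[current].append(nxt)
    return chain

def generate(chain, start, length=10):
    """Walk the chain, picking each next word at random."""
    word = start
    out = [word]
    for _ in range(length - 1):
        followers = chain.get(word)
        if not followers:  # dead end: no word ever followed this one
            break
        word = random.choice(followers)
        out.append(word)
    return " ".join(out)

corpus = "the cat sat on the mat and the cat slept".split()
chain = build_chain(corpus)
print(generate(chain, "the"))
```

Because followers are stored with repetition, frequent transitions are naturally picked more often, which is exactly the "probabilistic" part of the model.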

But the basic Markov Chain had some limitations:
❌ It could only use one preceding word for predictions (a bigram model).
❌ It struggled with readability, often generating disjointed text.
❌ It could only process one Wikipedia page at a time.


✨ What’s New in the Improved Version?

1️⃣ Smarter Text Generation with Trigrams

📌 Before: The original model used bigrams (1 preceding word).
📌 Now: It uses trigrams, meaning each word is predicted based on two preceding words!
✔️ This improves fluency and sentence coherence significantly.
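Here is a minimal sketch of the trigram idea, assuming hypothetical helper names (`build_trigram_chain`, `generate_trigram`) rather than the repository's actual functions. The key change is that the chain is keyed on a *pair* of words instead of a single word:

```python
import random
from collections import defaultdict

def build_trigram_chain(words):
    """Map each (w1, w2) pair to the list of words that follow it."""
    chain = defaultdict(list)
    for w1, w2, w3 in zip(words, words[1:], words[2:]):
        chain[(w1, w2)].append(w3)
    return chain

def generate_trigram(chain, w1, w2, length=12):
    """Generate text by sliding a two-word window along the chain."""
    out = [w1, w2]
    for _ in range(length - 2):
        followers = chain.get((w1, w2))
        if not followers:
            break
        w1, w2 = w2, random.choice(followers)
        out.append(w2)
    return " ".join(out)
```

With two words of context, locally implausible continuations become much rarer, which is where the fluency gain comes from.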

2️⃣ Supports Multiple Wikipedia Pages

📌 Before: The generator was trained on one biography at a time.
📌 Now: It can fetch and process multiple Wikipedia biographies, leading to better learning.
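Once multiple biographies are saved as text files, combining them into one training corpus can be as simple as concatenating every file in the data/ folder. This is a sketch of that step under the assumption that the downloads are stored as `.txt` files, which the fetch script below saves into `data/`:

```python
from pathlib import Path

def load_corpus(data_dir="data"):
    """Concatenate every .txt file in the data folder into one corpus."""
    parts = []
    for path in sorted(Path(data_dir).glob("*.txt")):
        parts.append(path.read_text(encoding="utf-8"))
    return "\n".join(parts)
```

Training on the combined corpus gives the chain many more transitions to learn from, so the generated text borrows phrasing from all the biographies at once.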

3️⃣ Cleaner & Better Tokenization

📌 Before: Basic tokenization removed punctuation, but was inconsistent.
📌 Now: Improved punctuation handling, better word splitting, and cleaner text processing.
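One simple way to get this kind of consistent tokenization is a single regular expression that keeps words (including apostrophes) and treats sentence-ending punctuation as its own token. This is an illustrative sketch, not the project's exact tokenizer:

```python
import re

def tokenize(text):
    """Split text into word tokens, keeping ., !, ? as separate tokens
    so the model can learn where sentences end."""
    return re.findall(r"[A-Za-z']+|[.!?]", text)

print(tokenize("Hello, world! It's fine."))
```

Keeping `.`, `!`, and `?` as tokens lets the chain learn sentence boundaries, while commas and other noise are simply dropped.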

4️⃣ Readability Improvements

📌 Before: Generated text could have random punctuation mistakes.
📌 Now: It capitalizes sentences, fixes spaces before punctuation, and ensures proper sentence endings (., ?, !).
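A post-processing pass like the one described can be sketched as follows; the function name `polish` and the exact rules are assumptions for illustration, not the repository's code:

```python
import re

def polish(text):
    """Tidy generated text: remove spaces before punctuation,
    capitalize sentence starts, and ensure a proper ending."""
    # "word ." -> "word."
    text = re.sub(r"\s+([.!?,])", r"\1", text)
    # Capitalize the first letter after each sentence boundary.
    parts = re.split(r"([.!?]\s*)", text)
    parts = [p[:1].upper() + p[1:] for p in parts]
    text = "".join(parts).strip()
    # Ensure the text ends with ., ?, or !
    if text and text[-1] not in ".!?":
        text += "."
    return text
```

For example, `polish("he was born in india . he inspired millions")` cleans up the stray space, capitalizes both sentences, and appends a final period.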

5️⃣ Output Logging for Better Tracking

📌 Before: Generated text was only printed on the screen.
📌 Now: All outputs are saved to a log file (generated_output.txt), along with timestamps and prompts for easy tracking.
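The logging step can be as simple as appending a timestamped record after each run. A minimal sketch, assuming a hypothetical `log_output` helper and the `generated_output.txt` filename mentioned above:

```python
from datetime import datetime

def log_output(prompt, generated, path="generated_output.txt"):
    """Append a timestamped prompt/output record to the log file."""
    stamp = datetime.now().strftime("%Y-%m-%d %H:%M:%S")
    with open(path, "a", encoding="utf-8") as f:
        f.write(f"[{stamp}] Prompt: {prompt}\n")
        f.write(f"Generated: {generated}\n")
        f.write("-" * 50 + "\n")
```

Appending (mode `"a"`) rather than overwriting means the log accumulates a history of every generation run.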


🛠️ How to Build & Run It

1️⃣ Install Dependencies

Run the following command:


pip install wikipedia-api

Or use:


pip install -r requirements.txt

2️⃣ Fetch Wikipedia Biographies

Run the following script to download Wikipedia biographies:

python wiki_getbio.py

This will save Wikipedia text inside the data/ folder.

3️⃣ Train & Generate Text

Run the following script to train the model and generate text:


python run_jmodel.py

The generated text will be printed to the screen and saved to the log file.


📌 Sample Output

📝 Prompt: "He was"
🤖 Generated Text:


He was born in India and became one of the most influential scientists. He inspired millions with his contributions.

📂 Log file:


[2025-03-10 12:30:45] Prompt: He was
Generated: He was born in India and became one of the most influential scientists.
--------------------------------------------------

📂 GitHub Repository & Code

You can find the complete Improved Text Generator code on GitHub:
👉 https://github.com/jags-programming/ImprovedSimpleTextGenerator


🔮 What’s Next?

This is just the beginning! In future versions, we can:
🔹 Use POS-tagging to improve grammar accuracy.
🔹 Enhance text coherence with back-off strategies.
🔹 Experiment with Transformer models like GPT for deeper learning.

Stay tuned for more NLP projects and tutorials! 🚀


💡 What do you think of this improved version? Let us know in the comments! 🔥


📌 Next Steps for You:

✅ Try running the code on your own Wikipedia text!
✅ Modify the parameters to see different results.
✅ Check out the GitHub repository and start experimenting.

Would you like further improvements or a tutorial on advanced text generation? Let us know! 🎯
