Markov Chain Text Generation

With the recent postings elsewhere about Markov Chains and text production, I figured I'd take a stab at it. I based my code on the code at the latter link. Instead of the King James Bible and Lovecraft, I joined two million words from sources found on Project Gutenberg:

  • The Mystery of Edwin Drood, by Charles Dickens
  • The Secret Agent, by Joseph Conrad
  • Superstition in All Ages (1732), by Jean Meslier
  • The Works of Edgar Allan Poe, by Edgar Allan Poe (five volumes)
  • The New Pun Book, by Thomas A. Brown and Thomas Joseph Carey
  • The Superstitions of Witchcraft, by Howard Williams
  • The House-Boat on the Styx, by John Kendrick Bangs
  • Bulfinch's Mythology: the Age of Fable, by Thomas Bulfinch
  • Dracula, by Bram Stoker
  • The World English Bible (WEB)
  • Frankenstein, by Mary Shelley
  • The Vision of Paradise, by Dante
  • The Devil's Dictionary, by Ambrose Bierce

An example of part of a run:

‘My dear girl!’ ‘You frightened me.’ ‘Most unintentionally, but I am the belief.

‘O! _that’s_ it, but cried out. "Dr. Seward,.

Of olives, easily I pass'd from out their.

‘Why, certainly, certainly,’ he doth kindle.

‘Eh? O, well, I suppose he.

Helena’s stay afar. They fly as.

‘_We_ are,’ said to the scriptures. St..

‘Why?’ ‘I don’t know, Mr. Neville’ (adopting the mind of.

‘Because I _don’t_,’ said the old building.

This almost makes sense.

My plan is to produce around 250,000 words and then edit that down to a more readable 60,000 to 70,000 words. I'm approaching this like an artist approaching a medium: the medium is something worked into a final form, not something that produces the art on its own.

Editing the above text might produce the following:

"My dear girl!"

"You frightened me."

"Most unintentionally, I believe."

"Oh! That's it," she cried out. "Dr. Seward."

He had easily passed by the olives.

"Why, certainly, certainly," he kindled.

"Eh? Oh, well, I suppose."

Helena held back. They were flying along.

"We are," he said, staring at the framed scripture.

"I don't know, Mr. Neville?" adopting her mindset.

"Because I don't," said the old building.

That's a light edit, but it's already suggesting a scene. Two people walking around a buffet in a display space. The room has some electronics that are listening in and participating in the conversation, so perhaps a light science fiction setting. It also seems as if they are in a ship of some kind, though I'm not sure if it's flying or floating. Or perhaps the "flying along" is a metaphor for how Helena feels about the conversation. Maybe a bit sarcastic?

Programming Details

I'm adding the program code at the end of this post, but the general outline is that the program reads all the source texts and parses them into words. These words are strung together into one long sequence used to seed the primary Markov Chain generator. The first words of each paragraph and the paragraph lengths are also used to seed secondary Markov Chain generators.
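A minimal Python sketch of that seeding step, not the program listed at the end of the post. It assumes paragraphs are separated by blank lines, words are separated by whitespace, and each generator looks back one item (order 1). The names build_chain and parse_paragraphs are hypothetical.

```python
from collections import defaultdict

def build_chain(items, order=1):
    """Map each run of `order` consecutive items to the items seen after it."""
    chain = defaultdict(list)
    for i in range(len(items) - order):
        state = tuple(items[i:i + order])
        chain[state].append(items[i + order])
    return chain

def parse_paragraphs(text):
    """Split on blank lines (an assumption), then split paragraphs into words."""
    paragraphs = [p.split() for p in text.split("\n\n")]
    return [p for p in paragraphs if p]

# Tiny stand-in corpus; the real input would be the joined source texts.
text = "the cat sat\n\nthe dog sat down\n\na cat ran"
paragraphs = parse_paragraphs(text)

# Primary generator: seeded with every word, in order.
words = [w for p in paragraphs for w in p]
primary = build_chain(words)

# Secondary generators: seeded with paragraph first words and lengths.
first_words = build_chain([p[0] for p in paragraphs])
lengths = build_chain([len(p) for p in paragraphs])
```

Keeping repeated successors in the lists means a uniform random pick from a list already follows the observed frequencies, so no separate probability table is needed.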

The program uses the secondary generators to decide on the first word and the length for each generated word chain. The program produces a random number of chains for each section (between 10 and 30 chains), with each chain output as a paragraph.
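One way this could look in Python, as a sketch under assumptions rather than the actual program: each secondary generator is read here as an order-1 chain over the sequence of paragraph first words or paragraph lengths, and a dead end restarts from a random known state. The helpers step and generate_section are hypothetical names.

```python
import random
from collections import defaultdict

def build_chain(items):
    """Order-1 chain: map each item to the items that followed it."""
    chain = defaultdict(list)
    for current, following in zip(items, items[1:]):
        chain[current].append(following)
    return chain

def step(chain, state, rng):
    """Pick a successor of `state`; restart from a random state at dead ends."""
    options = chain.get(state)
    if not options:
        return rng.choice(list(chain))
    return rng.choice(options)

def generate_section(primary, first_words, lengths, start, length, rng):
    """Return (paragraphs, last first word, last length) for one section."""
    paragraphs = []
    for _ in range(rng.randint(10, 30)):          # 10 to 30 chains per section
        start = step(first_words, start, rng)     # secondary: first word
        length = step(lengths, length, rng)       # secondary: chain length
        words = [start]
        while len(words) < length:
            words.append(step(primary, words[-1], rng))  # primary: the rest
        paragraphs.append(" ".join(words))
    return paragraphs, start, length

# Tiny stand-in corpus, already split into paragraphs of words.
corpus = [p.split() for p in ["the cat sat", "the dog sat down", "a cat ran"]]
primary = build_chain([w for p in corpus for w in p])
first_words = build_chain([p[0] for p in corpus])
lengths = build_chain([len(p) for p in corpus])

rng = random.Random(0)
section, last_start, last_length = generate_section(
    primary, first_words, lengths, "the", 3, rng)
print("\n\n".join(section))
```

Returning the last first word and length lets the next section continue the secondary chains instead of starting them over.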

The program produces sections of chains/paragraphs until it reaches a target word count.
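The outer loop might be sketched like this in Python. It assumes the section generator is passed in as a function returning a list of paragraph strings, and that the final section is allowed to overshoot the target. generate_text and demo_section are hypothetical names, and demo_section is only a fixed stand-in for real generated output.

```python
def generate_text(make_section, target_words):
    """Collect sections until the running word count reaches the target."""
    sections = []
    total = 0
    while total < target_words:
        section = make_section()  # must return at least one word to make progress
        sections.append(section)
        total += sum(len(paragraph.split()) for paragraph in section)
    return sections, total

def demo_section():
    # Stand-in for a real section generator: always five words.
    return ["one two three", "four five"]

sections, total = generate_text(demo_section, 12)
print(len(sections), total)
```

For a 250,000-word run, the same loop would be called with the real section generator and target_words=250_000.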

My hope with the secondary generators is that the output text will better reproduce larger patterns in the text than simply outputting chains independent of each other. These generators don't capture overall topic evolution in a text or plot structure, but for now I want to do that through editing.