Here's the code I used to extract the PDF content using pdfminer.six:
from pdfminer.high_level import extract_text
import pyttsx3
text = extract_text(pdf_file_path, page_numbers=[1, 3])  # page_numbers is zero-indexed
# the extracted text is shown below
# a regex needs to be applied to this text to turn it into proper paragraphs
engine = pyttsx3.init()
engine.say(text)
engine.runAndWait()
The extracted text is shown below:
Introduction
The Book of Secrets became an Osho “classic” shortly after it was first
published. And no wonder – it contains not only a comprehensive
overview of Osho’s unique, contemporary take on the eternal human quest
for meaning, but also the most comprehensive set of meditation
techniques available to help find that meaning within our own lives.
As Osho explains in the first chapter:
These are the oldest, most ancient techniques. But you can call them
the latest, also, because nothing can be added to them. They have taken in
all the possibilities, all the ways of cleaning the mind, transcending the
mind. Not a single method could be added to [these] one hundred and
twelve methods. It is the most ancient and yet the latest, yet the newest.
Old like old hills – the methods seem eternal – and they are new like a
dewdrop before the sun, because they are so fresh. These one hundred and
twelve methods constitute the whole science of transforming mind.
Issue: engine.say(text) does its work, but while speaking it makes a long pause after every line break (e.g. after "first", "comprehensive", "quest", ...), for an interval that matches the pause after a full stop (.). So, in order for the text to be read smoothly, I first want to convert these paragraphs into a proper format.
Solution approaches: Since the reader pauses equally at the end of a sentence and at the end of a paragraph, we can choose one of the following approaches:
- Either join each paragraph into a single line and pass that to the reader.
- Or pass every single sentence (the text between two full stops) to the reader.
Expected text (approach 1 - Preferable) :
Introduction
The Book of Secrets became an Osho “classic” shortly after it was first published. And no wonder – it contains not only a comprehensive overview of Osho’s unique, contemporary take on the eternal human quest for meaning, but also the most comprehensive set of meditation techniques available to help find that meaning within our own lives.
As Osho explains in the first chapter:
These are the oldest, most ancient techniques. But you can call them the latest, also, because nothing can be added to them. They have taken in all the possibilities, all the ways of cleaning the mind, transcending the mind. Not a single method could be added to [these] one hundred and twelve methods. It is the most ancient and yet the latest, yet the newest. Old like old hills – the methods seem eternal – and they are new like a dewdrop before the sun, because they are so fresh. These one hundred and twelve methods constitute the whole science of transforming mind.
Expected text (approach 2):
Introduction
The Book of Secrets became an Osho “classic” shortly after it was first published.
And no wonder – it contains not only a comprehensive overview of Osho’s unique, contemporary take on the eternal human quest for meaning, but also the most comprehensive set of meditation techniques available to help find that meaning within our own lives.
As Osho explains in the first chapter:
These are the oldest, most ancient techniques. But you can call them the latest, also, because nothing can be added to them.
They have taken in all the possibilities, all the ways of cleaning the mind, transcending the mind.
Not a single method could be added to [these] one hundred and twelve methods. It is the most ancient and yet the latest, yet the newest.
Old like old hills – the methods seem eternal – and they are new like a dewdrop before the sun, because they are so fresh.
These one hundred and twelve methods constitute the whole science of transforming mind.
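For reference, a minimal sketch of how the second approach could be produced from one hard-wrapped paragraph. The helper name split_into_sentences is hypothetical, and the sentence rule (split after ".", "!" or "?" followed by whitespace) is a naive assumption that will mis-split abbreviations such as "e.g.". Unlike the hand-written expectation above, it puts every sentence on its own line.

```python
import re

def split_into_sentences(paragraph: str) -> list:
    """Join hard-wrapped lines, then split the result into sentences.

    Hypothetical helper: the sentence boundary rule is deliberately naive.
    """
    # Collapse each line break (and any spaces around it) into one space.
    flat = re.sub(r"\s*\n\s*", " ", paragraph).strip()
    # Split at whitespace that follows sentence-ending punctuation.
    return re.split(r"(?<=[.!?])\s+", flat)

paragraph = (
    "These are the oldest, most ancient techniques. But you can call them\n"
    "the latest, also, because nothing can be added to them. They have taken in\n"
    "all the possibilities."
)

for sentence in split_into_sentences(paragraph):
    print(sentence)
```

Each returned string could then be passed to engine.say() one at a time; whether that actually shortens the pauses depends on the speech driver, which this sketch does not test.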
I am fairly new to regex and unable to come up with a pattern that removes the newlines inside a paragraph but retains the paragraph structure.
For the first approach, you could use re.sub(r"(?<!\n|:)\n(?!\n)", " ", text). There is probably a better way to do it, but it does work for the sample text. It functions by checking that a newline is:
- not preceded by a newline or a ":" character
- not followed by a newline
and replacing the matches with a space. However, it does add a single space to the first line and after some "." characters, which is less than ideal, but it may not have a significant impact on the reader, which I am unfamiliar with.
Note: The use of file.write was for copying and display purposes, since the output would be too long to display in my terminal; it isn't a necessary part of the step.
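To see the substitution in isolation, here is a small sketch run on a shortened piece of the sample. It assumes the extracted text separates paragraphs with a blank line (the sample as pasted does not show this, so treat it as an assumption); the function name join_wrapped_lines is made up for illustration.

```python
import re

def join_wrapped_lines(text: str) -> str:
    """Replace a newline with a space unless it follows a newline or ':',
    or is itself followed by another newline."""
    return re.sub(r"(?<!\n|:)\n(?!\n)", " ", text)

# Assumed layout: a blank line between paragraphs, hard wraps inside them.
sample = (
    "Introduction\n"
    "\n"
    "As Osho explains in the first chapter:\n"
    "These are the oldest, most ancient techniques. But you can call them\n"
    "the latest, also, because nothing can be added to them."
)

print(join_wrapped_lines(sample))
```

Only the line break after "call them" is replaced here; the blank line and the break after the colon are kept. A newline at the very end of the text is also turned into a space, so calling .strip() on the result may be worthwhile.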