Python/RegEx: Converting bad paragraph to good paragraph

Here's the code I used to extract PDF content using pdfminer.six

from pdfminer.high_level import extract_text
import pyttsx3

# pdf_file_path is the path to the PDF file (defined elsewhere)
text = extract_text(pdf_file_path, page_numbers=[1, 3])
# The extracted text is shown below.
# A RegEx needs to be applied to it to turn it into proper paragraphs.

engine = pyttsx3.init()
engine.say(text)
engine.runAndWait()

The text content is shown below:

Introduction

The Book of Secrets became an Osho “classic” shortly after it was first
published.  And  no  wonder  –  it  contains  not  only  a  comprehensive
overview of Osho’s unique, contemporary take on the eternal human quest
for  meaning,  but  also  the  most  comprehensive  set  of  meditation
techniques available to help find that meaning within our own lives.

As Osho explains in the first chapter:
These  are  the  oldest,  most  ancient  techniques.  But  you  can  call  them
the latest, also, because nothing can be added to them. They have taken in
all  the  possibilities,  all  the  ways  of  cleaning  the  mind,  transcending  the
mind.  Not  a  single  method  could  be  added  to  [these]  one  hundred  and
twelve  methods.  It  is  the  most ancient  and yet  the  latest, yet  the  newest.
Old  like  old  hills  –  the  methods  seem  eternal  –  and  they  are  new  like  a
dewdrop before the sun, because they are so fresh. These one hundred and
twelve methods constitute the whole science of transforming mind.

Issue: engine.say(text) does its work, but while speaking it makes a long pause at every line break (e.g. after "first", "comprehensive", "quest", ...), and that pause is as long as the pause after a full stop (.). So, in order for the text to be read smoothly, I first want to convert these paragraphs into a proper format.

Solution approaches: Since the reader pauses equally at the end of a sentence and at the end of a paragraph, we can choose one of the following approaches:

  1. Either convert each entire paragraph into a single unbroken line and pass it to the reader.
  2. Or pass every single sentence (the text between two full stops) to the reader.

Expected text (approach 1 - Preferable):

Introduction

The Book of Secrets became an Osho “classic” shortly after it was first published.  And  no  wonder  –  it  contains  not  only  a  comprehensive overview of Osho’s unique,  contemporary take on the eternal human quest for  meaning,  but  also  the  most  comprehensive  set  of  meditation techniques available to help find that meaning within our own lives.

As Osho explains in the first chapter:
These  are  the  oldest,  most  ancient  techniques.  But  you  can  call  them     the latest, also, because nothing can be added to them. They have taken in     all  the  possibilities,  all  the  ways  of  cleaning  the  mind,  transcending  the     mind.  Not  a  single  method  could  be  added  to  [these]  one  hundred  and     twelve  methods.  It  is  the  most ancient  and yet  the  latest, yet  the  newest.    Old  like  old  hills  –  the  methods  seem  eternal  –  and  they  are  new  like  a     dewdrop before the sun, because they are so fresh. These one hundred and     twelve methods constitute the whole science of transforming mind.

Expected text (approach 2):

Introduction    
The Book of Secrets became an Osho “classic” shortly after it was first published.
And  no  wonder  –  it  contains  not  only  a  comprehensive overview of Osho’s unique,  contemporary take on the eternal human quest for  meaning,  but  also  the  most  comprehensive  set  of  meditation techniques available to help find that meaning within our own lives.
As Osho explains in the first chapter:
These  are  the  oldest,  most  ancient  techniques.  But  you  can  call  them     the latest, also, because nothing can be added to them. 
They have taken in     all  the  possibilities,  all  the  ways  of  cleaning  the  mind,  transcending  the     mind.  
Not  a  single  method  could  be  added  to  [these]  one  hundred  and     twelve  methods.  It  is  the  most ancient  and yet  the  latest, yet  the  newest.    
Old  like  old  hills  –  the  methods  seem  eternal  –  and  they  are  new  like  a     dewdrop before the sun, because they are so fresh. 
These one hundred and     twelve methods constitute the whole science of transforming mind.

I am fairly new to RegEx and have been unable to come up with a pattern that removes the newlines inside a paragraph but retains the paragraph structure.

There are 2 best solutions below

Shorn On

For the first approach, you could use re.sub(r"(?<!\n|:)\n(?!\n)", " ", text). There is probably a better way to do it, but it does work for the sample text. It matches a newline that:

  1. Is a single newline (neither preceded nor followed by another newline)
  2. Does not come after a : character

and replaces each match with a space. However, it does add a single space before the first line and after some . characters, which is less than ideal but may not have a significant impact on the reader, which I am unfamiliar with.

text = """
Introduction

The Book of Secrets became an Osho “classic” shortly after it was first
published.  And  no  wonder  –  it  contains  not  only  a  comprehensive
overview of Osho’s unique, contemporary take on the eternal human quest
for  meaning,  but  also  the  most  comprehensive  set  of  meditation
techniques available to help find that meaning within our own lives.

As Osho explains in the first chapter:
These  are  the  oldest,  most  ancient  techniques.  But  you  can  call  them
the latest, also, because nothing can be added to them. They have taken in
all  the  possibilities,  all  the  ways  of  cleaning  the  mind,  transcending  the
mind.  Not  a  single  method  could  be  added  to  [these]  one  hundred  and
twelve  methods.  It  is  the  most ancient  and yet  the  latest, yet  the  newest.
Old  like  old  hills  –  the  methods  seem  eternal  –  and  they  are  new  like  a
dewdrop before the sun, because they are so fresh. These one hundred and
twelve methods constitute the whole science of transforming mind.
"""

import re

with open("blank_txt.txt", mode="w", encoding="utf-8") as f:
    f.write(re.sub(r"(?<!\n|:)\n(?!\n)", " ", text))

Note: file.write is used only for copying and display purposes, since the output would be too long to display in my terminal; it isn't a necessary part of the step.

 Introduction

The Book of Secrets became an Osho “classic” shortly after it was first published.  And  no  wonder  –  it  contains  not  only  a  comprehensive overview of Osho’s unique, contemporary take on the eternal human quest for  meaning,  but  also  the  most  comprehensive  set  of  meditation techniques available to help find that meaning within our own lives.

As Osho explains in the first chapter:
These  are  the  oldest,  most  ancient  techniques.  But  you  can  call  them the latest, also, because nothing can be added to them. They have taken in all  the  possibilities,  all  the  ways  of  cleaning  the  mind,  transcending  the mind.  Not  a  single  method  could  be  added  to  [these]  one  hundred  and twelve  methods.  It  is  the  most ancient  and yet  the  latest, yet  the  newest. Old  like  old  hills  –  the  methods  seem  eternal  –  and  they  are  new  like  a dewdrop before the sun, because they are so fresh. These one hundred and twelve methods constitute the whole science of transforming mind. 
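The stray leading and trailing spaces come from the newlines at the very start and end of the triple-quoted string. Here is a minimal sketch of one way to avoid them, assuming it is acceptable to strip the text first and to collapse the doubled spaces left over from the PDF layout. The helper name join_wrapped_lines and the short sample string are made up for illustration.

```python
import re

def join_wrapped_lines(text: str) -> str:
    """Join hard-wrapped lines into one line per paragraph (approach 1)."""
    # Strip first so the newline at the very start/end of the text
    # is not turned into a stray space.
    text = text.strip()
    # Same pattern as above: a newline that does not follow another
    # newline or a colon, and is not followed by another newline.
    text = re.sub(r"(?<!\n|:)\n(?!\n)", " ", text)
    # Collapse runs of two or more spaces into a single space.
    return re.sub(r" {2,}", " ", text)

# Hypothetical sample with the same shape as the extracted text.
sample = "\nIntro\n\nfirst line\nsecond  line.\n\nHeading:\nbody one\nbody two.\n"
print(join_wrapped_lines(sample))
```

Whether collapsing the doubled spaces matters for the speech engine is not something I have checked; the second re.sub can be dropped if it does not.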
Reilas On

For the second approach, you'll need two passes.

First, find the following pattern and replace it with either 1 or 2 spaces.

The pattern effectively says: match a new-line delimiter that is followed by a lowercase letter.

(?:\r?\n|\r)(?=^[a-z])
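One thing to watch when using this pattern with Python's re module: the ^ inside the lookahead only matches right after a newline when re.MULTILINE is set, so without that flag the pattern matches nothing. The small sketch below shows the difference on a made-up sample string, along with an equivalent pattern that drops the ^ and needs no flag.

```python
import re

pattern = r"(?:\r?\n|\r)(?=^[a-z])"
# Hypothetical sample: one wrapped line, one line starting with a capital.
sample = "it was first\npublished.  And no wonder\nOld like old hills"

# Without re.MULTILINE, ^ only matches at the very start of the string,
# so the lookahead can never succeed after a newline.
no_flag = re.sub(pattern, "  ", sample)

# With re.MULTILINE, ^ also matches immediately after each "\n".
with_flag = re.sub(pattern, "  ", sample, flags=re.MULTILINE)

# Dropping ^ from the lookahead gives the same result without the flag.
simplified = re.sub(r"(?:\r?\n|\r)(?=[a-z])", "  ", sample)

print(no_flag == sample)
print(with_flag)
print(with_flag == simplified)
```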

This will produce the following.

Introduction

The Book of Secrets became an Osho “classic” shortly after it was first  published.  And  no  wonder  –  it  contains  not  only  a  comprehensive  overview of Osho’s unique, contemporary take on the eternal human quest  for  meaning,  but  also  the  most  comprehensive  set  of  meditation  techniques available to help find that meaning within our own lives.

As Osho explains in the first chapter:
These  are  the  oldest,  most  ancient  techniques.  But  you  can  call  them  the latest, also, because nothing can be added to them. They have taken in  all  the  possibilities,  all  the  ways  of  cleaning  the  mind,  transcending  the  mind.  Not  a  single  method  could  be  added  to  [these]  one  hundred  and  twelve  methods.  It  is  the  most ancient  and yet  the  latest, yet  the  newest.
Old  like  old  hills  –  the  methods  seem  eternal  –  and  they  are  new  like  a  dewdrop before the sun, because they are so fresh. These one hundred and  twelve methods constitute the whole science of transforming mind.

From there, you can find the start of each sentence and replace the spaces before it with a new-line delimiter.

(?<=\.) +(?=[A-Z])

This will produce the following.

Introduction

The Book of Secrets became an Osho “classic” shortly after it was first  published.
And  no  wonder  –  it  contains  not  only  a  comprehensive  overview of Osho’s unique, contemporary take on the eternal human quest  for  meaning,  but  also  the  most  comprehensive  set  of  meditation  techniques available to help find that meaning within our own lives.

As Osho explains in the first chapter:
These  are  the  oldest,  most  ancient  techniques.
But  you  can  call  them  the latest, also, because nothing can be added to them.
They have taken in  all  the  possibilities,  all  the  ways  of  cleaning  the  mind,  transcending  the  mind.
Not  a  single  method  could  be  added  to  [these]  one  hundred  and  twelve  methods.
It  is  the  most ancient  and yet  the  latest, yet  the  newest.
Old  like  old  hills  –  the  methods  seem  eternal  –  and  they  are  new  like  a  dewdrop before the sun, because they are so fresh.
These one hundred and  twelve methods constitute the whole science of transforming mind.
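
Put together in Python, the two passes might look like the sketch below. The function name split_into_sentences and the sample string are made up for illustration, and the first pattern is written without ^ so that no re.MULTILINE flag is needed.

```python
import re

def split_into_sentences(text: str) -> str:
    """Approach 2: unwrap lines, then put each sentence on its own line."""
    # Pass 1: a line break followed by a lowercase letter is treated as
    # a wrapped line, so it is replaced with a space.
    text = re.sub(r"(?:\r?\n|\r)(?=[a-z])", " ", text)
    # Pass 2: spaces after a full stop and before a capital letter are
    # treated as a sentence boundary and replaced with a newline.
    return re.sub(r"(?<=\.) +(?=[A-Z])", "\n", text)

# Hypothetical sample with the same shape as the extracted text.
sample = (
    "It was first\n"
    "published.  And no wonder.\n"
    "Old like old hills. These one hundred and\n"
    "twelve methods."
)
print(split_into_sentences(sample))
```

This is only a heuristic. A wrapped line that happens to start with a capital letter or a digit is not joined by the first pass, and the second pass will also split after abbreviations such as "Dr. Smith".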