I am trying to remove "\r\n-" in a text which I extracted from a PDF file using readtext() from readtext package in R Studio. Below is my code in R:
library(readtext)
jd <- readtext("C:/Users/HomeUser/Documents/Sales Manager.pdf")
jd_text <- jd$text
jd_text2 <- gsub(pattern = "\r\n-?|•", replacement = " ", jd_text)
Below is the original extracted text jd_text:
"Sales Manager\r\nCFB Bots is a technology service provider specializing in Intelligent Automation (IA). We partner with\r\nlarge enterprises in their Digital Transformation journey and help them and their employees thrive\r\nin the Future of Work. Our mission is to co-create the Digital Workforce of the Future, and our vision\r\nis to make work enjoyable. For more information, please visit www.cfb-bots.com.\r\nWe are looking for a high performing frontrunner to blaze the trail and make new connections for\r\nour growing business. As a Sales Manager, you will play a vital role in keeping the Company\r\ncompetitive by achieving our customer acquisition and revenue growth targets. You will be the key\r\nliaison in every stage of the sales process, from planning to closing the sales.\r\nIf you are passionate about technology and are motivated by a hunger to solve our clients’\r\nchallenges, read on to find out more.\r\nYou can gain:\r\n− Incentive for achieving sales targets\r\n− Exposure to the latest industry trends and technologies\r\n− Endless learning and growth opportunities\r\n− Sharpen sales planning, analytical and management skills\r\n− Flexible work-life benefits\r\nYou will do:\r\nSales Strategy\r\n- Develop ..."
I was able to remove many "\r\n-" in jd_text using gsub(). Output from jd_text2 below:
"Sales Manager CFB Bots is a technology service provider specializing in Intelligent Automation (IA). We partner with large enterprises in their Digital Transformation journey and help them and their employees thrive in the Future of Work. Our mission is to co-create the Digital Workforce of the Future, and our vision is to make work enjoyable. For more information, please visit www.cfb-bots.com. We are looking for a high performing frontrunner to blaze the trail and make new connections for our growing business. As a Sales Manager, you will play a vital role in keeping the Company competitive by achieving our customer acquisition and revenue growth targets. You will be the key liaison in every stage of the sales process, from planning to closing the sales. If you are passionate about technology and are motivated by a hunger to solve our clients’ challenges, read on to find out more. You can gain: − Incentive for achieving sales targets − Exposure to the latest industry trends and technologies − Endless learning and growth opportunities − Sharpen sales planning, analytical and management skills − Flexible work-life benefits You will do: Sales Strategy Develop ..."
As you can see, I was able to remove "\r\n-" occurring after "Flexible work-life benefits" while "-" from those first few "\r\n-" still remained. However, when I pasted the original text extract directly from the display of jd_text in R Studio console into a new variable jd_test, applied gsub() again, I was able to accomplish my goal:
jd_test <- "Sales Manager\r\nCFB Bots is a technology service provider specializing in Intelligent Automation (IA). We partner with\r\nlarge enterprises in their Digital Transformation journey and help them and their employees thrive\r\nin the Future of Work. Our mission is to co-create the Digital Workforce of the Future, and our vision\r\nis to make work enjoyable. For more information, please visit www.cfb-bots.com.\r\nWe are looking for a high performing frontrunner to blaze the trail and make new connections for\r\nour growing business. As a Sales Manager, you will play a vital role in keeping the Company\r\ncompetitive by achieving our customer acquisition and revenue growth targets. You will be the key\r\nliaison in every stage of the sales process, from planning to closing the sales.\r\nIf you are passionate about technology and are motivated by a hunger to solve our clients’\r\nchallenges, read on to find out more.\r\nYou can gain:\r\n− Incentive for achieving sales targets\r\n− Exposure to the latest industry trends and technologies\r\n− Endless learning and growth opportunities\r\n− Sharpen sales planning, analytical and management skills\r\n− Flexible work-life benefits\r\nYou will do:\r\nSales Strategy\r\n- Develop ..."
jd_test2 <- gsub(pattern = "\r\n-?|•", replacement = " ", jd_test)
Output from jd_test2:
Sales Manager CFB Bots is a technology service provider specializing in Intelligent Automation (IA). We partner with large enterprises in their Digital Transformation journey and help them and their employees thrive in the Future of Work. Our mission is to co-create the Digital Workforce of the Future, and our vision is to make work enjoyable. For more information, please visit www.cfb-bots.com. We are looking for a high performing frontrunner to blaze the trail and make new connections for our growing business. As a Sales Manager, you will play a vital role in keeping the Company competitive by achieving our customer acquisition and revenue growth targets. You will be the key liaison in every stage of the sales process, from planning to closing the sales. If you are passionate about technology and are motivated by a hunger to solve our clients’ challenges, read on to find out more. You can gain: Incentive for achieving sales targets Exposure to the latest industry trends and technologies Endless learning and growth opportunities Sharpen sales planning, analytical and management skills Flexible work-life benefits You will do: Sales Strategy Develop ..."
Anyone has any idea what is the problem and how do I go about it? I have tried using another function pdf_text() from pdftools package but it yielded the same frustrating result. At first I thought "-" for the first few "\r\n-" is slightly longer than the latter ones but the direct copy-paste attempt seems to contradict this observation. Is there something "hidden" in the object which is not migrated during the copy-paste action? Any suggestions is greatly appreciated!
I found a likely answer to my question. It seems the original extracted text from the PDF document is not in an encoding that R Studio could recognise. This would explain why for the first few "-"s were not removed. After I apply
jd_text <-iconv(jd_text,"UTF-8")to coerce the encoding to UTF-8, my problem was solved, and I am able to remove "\r\n-" completely.