I need to transform survey results from a comma-separated variable file (originally Excel) where each of the few hundred records contains an arbitrary number of quotes about a location (preceded by "Story xx: " and followed by a new line) to a JSON file that I can use as the data source for a website that will map the responses. A typical record looks like this (I had to remove some URLs or the post was flagged as spam):
1,Alethia Tanner Park,227 Harry Thomas Way NE,Washington,DC,20002,3,"Story 1: First quote of someone commenting on this location.
Story 2: Second quote from a resident that likes this place.
Story 3: Third quote about this park in the city.
",38.91143761,-77.00165721
I want to transform the above to this where each quote is surrounded by HTML left/right double quotes and followed by a horizontal line:
{
"latitudeLongitude": [38.91143761, -77.00165721],
"displayName": "Alethia Tanner Park",
"quote": "“First quote of someone commenting on this location.”<hr />“Second quote from a resident that likes this place.”<hr />“Third quote about this park in the city.”<hr />",
"numberOfCitations": 3
},
I am using regex101.com (which uses PCRE2 processor by default) and my input regular expression to extract the components of each record is:
^(?'groupId'[0-9]{1,3}),(?'displayName'[^,]+),(?'street'\?\?\?|[0-9 a-zA-Z]*),(?'city'\?\?\?|[ a-zA-Z]*),(?'state'\?\?\?|[A-Z]{2}),(?'zipCode'\?\?\?|[0-9]{5}),(?'numberOfCitations'[0-9]{1,2}),"(?:Story [0-9]{1,2}: (?'quote'[ 0-9a-zA-Z\(\)\-\+,.'&!?]*)\n)+\n*",(?'latitude'38\.[0-9]{4,8}),(?'longitude'-7[6|7]\.[0-9]{4,8})$
Note the part trying to capture all the individual quotes inside the string with a plus sign outside the group to indicate there is at least one:
"(?:Story [0-9]{1,2}: (?'quote'[ 0-9a-zA-Z\(\)\-\+,.'&!?]*)\n)+\n*"
My substitution is
{\n\t"latitudeLongitude": [${latitude}, ${longitude}], \n\t"displayName": "${displayName}", \n\t"quote": "(“${quote}”<hr />)+", \n\t"numberOfCitations": ${numberOfCitations}\n},
but I capture only the last of the three quotes so the output is
{
"latitudeLongitude": [38.91143761, -77.00165721],
"displayName": "Alethia Tanner Park",
"quote": "“Third quote about this park in the city.”<hr />",
"numberOfCitations": 3
},
I hoped that adding parentheses and a plus sign (“${quote}”<hr />)+ to indicate multiple instances would work but clearly it does not as they just appear in the output as you cn see here https://regex101.com/r/pYaN92/1.
Is there a way to capture all the quotes and process them? I can achieve what I need by processing the quotes separately but I am curious to know if what I want to achieve is possible in a single pass.This post (Regex: Is it possible to do a substitution within a capture group?) suggests not but that is somewhat different as he is trying to match just on quote marks which exist elsewhere; I have a match on a "quote" within the overall capture but only one is processed.