Regex pattern to select boundary words excluding words inside double quotes


I give up! I have tried so many different things here, and I'm losing what little hair I have left, so if someone could help me out I'd be most grateful.

I have some badly formed JSON:

friend_in_need: {
    id: 3
    
    possible: {
        is_ironman: yes
        difficulty: {">": 1}
        has_start_date: {"<": "1936-01-02"}
        has_any_custom_difficulty_setting: no
        game_rules_allow_achievements: yes
    }
    
    happened: {
        has_country_flag: "achievement has joined faction"
    }
}

... that I am trying to clean up in several steps using regex replacement statements.

The next step is to match each key and each unquoted string value so that I can wrap them in double quotes, like the following:

"friend_in_need": {
    "id": 3
    
    "possible": {
        "is_ironman": "yes"
        "difficulty": {">": 1}
        "has_start_date": {"<": "1936-01-02"}
        "has_any_custom_difficulty_setting": "no"
        "game_rules_allow_achievements": "yes"
    }
    
    "happened": {
        "has_country_flag": "achievement has joined faction"
    }
}

I have tried various methods and had some success handling the keys first. However, I cannot find a way to select the string values while excluding the values that are already in quotes. I'd be more than happy to do this in multiple steps if necessary.

For example, I know these parts get me closer:

(?:\".+?\") matches everything between double quotes
\b([a-zA-Z0-9@_]+)\b matches boundary words (\S might be better)

But I can't combine the two so that the second does not match what the first matches. I thought this would work, but it didn't:

(?!(?:\".+?\")))\b([a-zA-Z0-9@_]+)\b

Any help would be greatly appreciated.

Thanks in advance!!!


There are 2 best solutions below

Answer by Karan Shishoo (best answer)

You can achieve this with negative lookbehind and negative lookahead assertions, using something along the lines of:

(?<![\"\w])\w+(?![\"\w])

It matches every run of word characters that is neither preceded nor followed by another word character or a double quote (").

You can replace the \w+ in the middle to better fit your use case as needed.
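As one possible illustration of swapping out the middle part, the sketch below replaces \w+ with [A-Za-z_]\w* so that bare numbers such as 3 are left unquoted. The names BARE_WORD and quote_bare_words are made up for this example, and the variant is an assumption about what fits the question's data, not part of the pattern above. Quoted strings that contain spaces would still be affected, because the inner words are not adjacent to a double quote.

```python
import re

# Hypothetical variant: the middle part only matches words that start with
# a letter or underscore, so bare numbers are not wrapped in quotes.
BARE_WORD = re.compile(r'(?<!["\w])[A-Za-z_]\w*(?!["\w])')


def quote_bare_words(text):
    """Wrap every match of BARE_WORD in double quotes."""
    return BARE_WORD.sub(r'"\g<0>"', text)


sample = 'possible: { is_ironman: yes difficulty: {">": 1} id: 3 }'
print(quote_bare_words(sample))
# "possible": { "is_ironman": "yes" "difficulty": {">": 1} "id": 3 }
```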

Answer by Mr. Irrelevant

You can use this Python snippet to get this done:

import re

# The sample input as one single-line string (adjacent literals are concatenated).
myjson = (
    'friend_in_need: { id: 3      possible: {       is_ironman: yes       '
    'difficulty: {">": 1}       has_start_date: {"<": "1936-01-02"}      '
    'has_any_custom_difficulty_setting: no       game_rules_allow_achievements: yes   '
    '}   happened: {        has_country_flag: "achievement has joined faction"   }}'
)
# Wrap every word that is neither next to a double quote nor part of a longer word.
x = re.sub(r"(?<![\"\w])\w+(?![\"\w])", r'"\g<0>"', myjson)
print(x)

Result (note that numbers, and words in the middle of an already quoted string, get quoted as well):

"friend_in_need": { "id": "3"      "possible": {       "is_ironman": "yes"       
"difficulty": {">": "1"}       "has_start_date": {"<": "1936-"01"-02"}      
"has_any_custom_difficulty_setting": "no"       
"game_rules_allow_achievements": "yes"   }   "happened": {        
"has_country_flag": "achievement "has" "joined" faction"   }}
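The output above shows the lookaround pattern reaching into strings that are already quoted. A different approach, sketched here as an assumption rather than as either answerer's method, is to match complete quoted strings first and pass them through unchanged, quoting only the bare words that remain. The names TOKEN and quote_bare_words are hypothetical, and the sketch assumes the quoted strings contain no escaped double quotes.

```python
import re

# First alternative: a complete double-quoted string (assumes no escaped quotes).
# Second alternative: a bare word starting with a letter, underscore or @.
TOKEN = re.compile(r'"[^"]*"|(?<![\w@])[A-Za-z_@][\w@]*')


def quote_bare_words(text):
    """Quote bare words; leave existing quoted strings and numbers untouched."""
    def fix(match):
        token = match.group(0)
        # Already-quoted strings are consumed whole and returned as they are.
        return token if token.startswith('"') else f'"{token}"'
    return TOKEN.sub(fix, text)


sample = ('happened: { has_country_flag: "achievement has joined faction" '
          'has_start_date: {"<": "1936-01-02"} id: 3 is_ironman: yes }')
print(quote_bare_words(sample))
```

Because the regex engine scans left to right, a quoted string is consumed as a single token before any word inside it can be considered, which is what keeps "achievement has joined faction" and "1936-01-02" intact.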