GNU sed and newlines with multiple scripts

Asked by TeejMonster on 04 April 2020 at 22:04

Suppose we start with this string:

echo "1:apple:fruit.2:banana:fruit.3:cucumber:veggie.4:date:fruit.5:eggplant:veggie.">list.tmp

and want to end up with this result:

1-apple:fruit
2-banana:fruit
3-cucumber:veggie
4-date:fruit
5-eggplant:veggie

Why does this work:

sed -e 's/\./\n/g' -i list.tmp
sed -e 's/:/-/' list.tmp

But not this:

sed -e 's/\./\n/g' -e 's/:/-/' list.tmp

The second command yields this, apparently ignoring the new newlines when looking for the first occurrence of ':' on each line.

1-apple:fruit
2:banana:fruit
3:cucumber:veggie
4:date:fruit
5:eggplant:veggie

With an extended version of the input:

echo "one:apple:fruit.two:banana:fruit.three:cucumber:veggie.four:date:fruit.five:eggplant:veggie.">list.tmp

I want to end up with this result:

one-apple:fruit
two-banana:fruit
three-cucumber:veggie
four-date:fruit
five-eggplant:veggie

There are 2 best solutions below

potong On 05 April 2020 at 00:04

This might work for you (GNU sed):

sed -E 'y/./\n/;s/^([^:]*):/\1-/mg' file

Translate all periods to newlines.

Using the GNU-specific m (multiline) flag, ^ matches both at the start of the pattern space and immediately after each embedded newline. From each such line start, the substitution matches any run of non-colon characters followed by a colon, and replaces it with the same non-colon characters followed by a dash. This effectively replaces the first colon on each line with a dash.
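As a quick check of the explanation above, here is a minimal sketch that runs the same command on a shortened version of the question's extended input. It assumes GNU sed; the variable names `input` and `result` are just for illustration.

```shell
# GNU sed only: the m/M (multiline) flag is a GNU extension.
input='one:apple:fruit.two:banana:fruit.three:cucumber:veggie.'

# y/./\n/ turns every period into a newline; the s command then changes
# the first colon after each line start (^ in multiline mode) to a dash.
# Note: $(...) strips the trailing blank line that sed would print.
result=$(printf '%s\n' "$input" | sed -E 'y/./\n/;s/^([^:]*):/\1-/mg')

printf '%s\n' "$result"
# one-apple:fruit
# two-banana:fruit
# three-cucumber:veggie
```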

**Jonathan Leffler** · Accepted Answer · 2020-04-04T22:44:30.137000

Transferring key comment into an answer.

Original data

You forgot the g modifier on the second command in the double -e formulation. When the first -e completes, all the lines are still in the pattern space (the main working area in sed) — they do not become 5 separately read lines. You read one line; you're still processing it. Mind you, you'll need to use a modified pattern:

s/\([0-9]\):/\1-/g

Combining these, in GNU sed (as stipulated in the question title), you get:

sed -e 's/\./\n/g' -e 's/\([0-9]\):/\1-/g' list.tmp

Note that POSIX sed and other versions of sed have different rules about the newline substitution in the first -e expression.
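If you want to convince yourself that the pattern space is still a single buffer after the first -e, one way (a small sketch, assuming GNU sed; the variable name `marked` is made up for this example) is to prefix the "start of line" with a marker in the second expression and see how many markers appear:

```shell
# After the first -e, the pattern space holds "a\nb\nc\n" as ONE buffer.
# Without the m flag, ^ matches only at the start of that buffer,
# so only the first output line gets the ">" marker.
marked=$(printf '%s\n' 'a.b.c.' | sed -e 's/\./\n/g' -e 's/^/>/')

printf '%s\n' "$marked"
# >a
# b
# c
```

Only one marker shows up, which is the same reason the unanchored s/:/-/ without g changes only the first colon in the whole buffer.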

Consider using `awk`

If changing tools from sed to awk is an option, you can do it more simply in awk, as shown by Ed Morton in a comment. Since that solution doesn't need to change to address the revised data, it clearly has advantages — the disadvantage is that it is not using sed. In 'the real world', you use the best tool for the job — and in this example, that's awk.
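The comment itself isn't reproduced here, so the following is only a sketch of how an awk approach could look, not necessarily the command Ed Morton posted. It assumes a single input line, as in the question, and uses the period as the record separator; `input` and `awk_out` are illustrative names.

```shell
input='1:apple:fruit.2:banana:fruit.3:cucumber:veggie.'

# RS='.' makes awk read each period-terminated chunk as one record.
# sub() replaces only the first colon in the record and returns 1 when
# it made a replacement, so used as a bare pattern it prints the record.
# The final newline-only record has no colon and is therefore skipped.
awk_out=$(printf '%s\n' "$input" | awk -v RS='.' 'sub(/:/, "-")')

printf '%s\n' "$awk_out"
# 1-apple:fruit
# 2-banana:fruit
# 3-cucumber:veggie
```

Because it replaces the first colon in each record regardless of what precedes it, the same command should handle the 'one:', 'two:' style of input without modification.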

Extended data

With the 'extended' input, where there aren't convenient single digit numbers but you want to change the first colon on each line to a dash, you have to work harder:

sed -e 's/\./\n/g' \
    -e  's/^\([^:]*\):/\1-/' \
    -e 's/\(\n[^:]*\):/\1-/g' \
    list.tmp

The first -e is unchanged.
The second looks for a sequence of non-colons followed by a colon at the start of the pattern space and replaces it with the sequence of non-colons and a dash. The g modifier is irrelevant here.
The third -e looks for a newline followed by a sequence of non-colons followed by a colon, and replaces it with the newline, the non-colon sequence and a dash. The g modifier is very relevant here.

You can flatten that all onto one line, but it is easier to see the similarities between the last two -e options if they're laid out on separate lines.

You can also experiment with ERE (extended regular expressions) with the -E option, and group the two separate replacements into one:

{
echo "1:apple:fruit.2:banana:fruit.3:cucumber:veggie.4:date:fruit.5:eggplant:veggie."
echo "one:apple:fruit.two:banana:fruit.three:cucumber:veggie.four:date:fruit.five:eggplant:veggie."
} |
sed -E -e 's/\./\
/g' -e 's/((^|\n)[^:]+):/\1-/g'

That yields:

1-apple:fruit
2-banana:fruit
3-cucumber:veggie
4-date:fruit
5-eggplant:veggie

one-apple:fruit
two-banana:fruit
three-cucumber:veggie
four-date:fruit
five-eggplant:veggie

If you don't want the extra blank line, remove the final newline:

{
echo "1:apple:fruit.2:banana:fruit.3:cucumber:veggie.4:date:fruit.5:eggplant:veggie."
echo "one:apple:fruit.two:banana:fruit.three:cucumber:veggie.four:date:fruit.five:eggplant:veggie."
} |
sed -E -e 's/\./\
/g' -e 's/((^|\n)[^:]+):/\1-/g' -e 's/\n$//'

The backslash-newline notation works correctly in both GNU sed and POSIX (including BSD and macOS) sed; you can replace it with \n in GNU sed. The \n in the replacement part of the s/// command doesn't work in BSD (macOS) sed. POSIX sed requires that you use a backslash to escape a literal newline in the replacement text:

A line can be split by substituting a <newline> into it. The application shall escape the <newline> in the replacement by preceding it by a <backslash>.

GNU sed is more flexible.
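A small sketch comparing the two notations side by side, assuming GNU sed (the second command is the one that is not expected to behave the same way under BSD/macOS sed). The variable names are illustrative only.

```shell
input='a:x.b:y.'

# Portable form: a backslash followed by a literal newline in the replacement.
portable=$(printf '%s\n' "$input" | sed -e 's/\./\
/g')

# GNU form: \n in the replacement text.
gnu=$(printf '%s\n' "$input" | sed -e 's/\./\n/g')

# Under GNU sed the two results should be identical.
if [ "$portable" = "$gnu" ]; then
    echo "same output"
else
    echo "outputs differ"
fi
```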

Also (according to potong's answer), there is a GNU-specific modifier m that you can use to do the multi-line matching in one operation.
