I have a large file of mostly space-delimited data that I want to parse into a hash. The problem is that it is only mostly space-delimited: some fields contain spaces inside brackets, so a simple String#split isn't going to work.
Here's a simplified example of one of the lines in the file:
field0 field1 [ [field2a] [field2b] ] field3
Everything contained by the outer brackets (including the outer brackets themselves) needs to end up as a single hash member.
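Concretely, the example line should produce four fields, with everything between the outermost brackets (brackets included, inner spacing preserved) kept as one element. A sketch of the intended result, inferred from the description:

```ruby
line     = "field0 field1 [ [field2a] [field2b] ] field3"
expected = ["field0", "field1", "[ [field2a] [field2b] ]", "field3"]
```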
I wrote the following function, which works, but is very slow:
# row = String to be split
# fields = Integer indicating expected number of fields
def mysplit(row, fields)
  # Bracket nesting depth
  b = 0
  # Current field index
  i = 0
  rowsplit = Array.new(fields)
  rowsplit[0] = ""
  row.each_char do |char|
    case char
    when ' '
      if b == 0
        # Top-level space: start a new field
        i += 1
        rowsplit[i] = ""
      else
        # Space inside brackets belongs to the current field
        rowsplit[i] += char
      end
    when '['
      b += 1
      rowsplit[i] += char
    when ']'
      b -= 1
      rowsplit[i] += char
    else
      rowsplit[i] += char
    end
  end
  if i != fields - 1
    raise StandardError,
          "Resulting fields do not match expected fields: #{rowsplit}",
          caller
  elsif b != 0
    raise StandardError, "Bracket never closed.", caller
  else
    return rowsplit
  end
end
It takes 36 seconds to run this over a 7 MB file that is 6600 lines long. It's worth mentioning that my environment is Ruby 1.8.7, which I have no control over.
Is it possible to make this faster?
You want squeeze and strip. Calling squeeze(' ') compresses any run of spaces down to a single space, and strip removes leading and trailing whitespace from the line.
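For example (a minimal illustration; note that squeeze with no argument would also collapse repeated letters inside the data, so passing ' ' is safer):

```ruby
line = "  field0   field1  field3  "

# squeeze(' ') collapses only runs of spaces; strip drops the outer whitespace
clean = line.squeeze(' ').strip
# => "field0 field1 field3"
```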
From there you should be able to use regex pattern matching to parse each line into the data structure you're trying to create, but I can't help with that without knowing how to parse the data.
You should also try to raise exceptions sooner; there's no need to iterate over the entire line once you know it's malformed.
If you know your lines will match the pattern in your example, you can check each line against a regex first; if it matches, then you can split or parse it however you like.
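As a hedged sketch of that idea (assuming brackets never nest more than one level deep, as in the example line), a single scan with an alternating regex can replace the per-character loop:

```ruby
line = "field0 field1 [ [field2a] [field2b] ] field3"

# Match either a bracketed group (allowing one level of nested brackets,
# as in the example) or a run of non-space characters.
fields = line.scan(/\[[^\[\]]*(?:\[[^\[\]]*\][^\[\]]*)*\]|\S+/)
# => ["field0", "field1", "[ [field2a] [field2b] ]", "field3"]
```

Because the regex contains no capture groups (the inner group is non-capturing), scan returns the full match strings, and deeper nesting than the example shows would need a different approach.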