How to use yajl-ruby to parse and filter data from a big JSON file (2 GB)

I need to filter some data out of a JSON file (about 2 GB in size). The JSON looks like this:

{
  "dataName": "staff",
  "version": 5,
  "data": [
    {
      "name": "Fred",
      "team": "football",
      "hobby": "climbing"
    },
    {
      "name": "Tony",
      "team": "basketball",
      "hobby": "fishing"
    },
    {
      "name": "alex",
      "team": "soccer",
      "hobby": "movies"
    }
  ]
}

After some research on parsing huge JSON files in Ruby, I found https://github.com/dgraham/json-stream and https://github.com/brianmario/yajl-ruby. I tried json-stream, which takes about 20 minutes, and https://github.com/dgraham/yajl-ffi#performance says that

yajl-ruby is faster

With json-stream, I can use callbacks such as start_object/end_object/key/value to know when an object has been parsed, process that object, and then continue.

But with yajl-ruby, the only callback I can find is "on_parse_complete". Its documentation (https://www.rubydoc.info/github/brianmario/yajl-ruby/Yajl/Parser) says:

"#on_parse_complete= ⇒ Object
call-seq: on_parse_complete = Proc.new { |obj| … }

This callback setter allows you to pass a Proc/lambda or any other object that responds to #call.

It will pass a single parameter, the ruby object built from the last parsed JSON object"

Then I wrote a piece of code like this:


require 'yajl'

# Feed the file to the parser in chunks and yield each object the parser completes.
def parse_farquaad(f, chunk_size)
  parser = Yajl::Parser.new

  # Called whenever the parser finishes a complete JSON object.
  parser.on_parse_complete = Proc.new do |obj|
    yield obj
  end

  f.each(chunk_size) { |chunk| parser << chunk }
end

File.open("big_file.json") do |f|
  parse_farquaad(f, 8092) do |current_data_unit|
    puts "obj is:"
    puts current_data_unit
  end
end

I tested it on a small sample JSON file (the example given at the beginning), but the output is the whole JSON object, dumped all at once. What I want instead is to output each object in the "data" array one by one, as I could with json-stream, so that after each object in "data" is parsed I can do something with it, such as checking whether it is the data I want.

My expected output is:

first, obj {"name":"Fred", "team":"football", "hobby":"climbing"}, then do something with this obj

then obj {"name":"Tony", "team":"basketball", "hobby":"fishing"}, then do something with this obj

then obj {"name":"alex", "team":"soccer", "hobby":"movies"}, then do something with this obj, and so on.
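To make the desired per-object flow concrete, here is a minimal sketch that runs on the small sample only. It uses Ruby's standard json library rather than yajl-ruby, and it loads the whole document into memory first, so it shows the intended output shape but is not a solution for the 2 GB file. The helper name each_data_record and the "soccer" filter are hypothetical, chosen only for illustration.

```ruby
require 'json'

# Small sample with the same shape as the file described above.
SAMPLE = <<~JSON
  { "dataName": "staff",
    "version": 5,
    "data": [
      {"name":"Fred", "team":"football",   "hobby":"climbing"},
      {"name":"Tony", "team":"basketball", "hobby":"fishing"},
      {"name":"alex", "team":"soccer",     "hobby":"movies"}
    ]
  }
JSON

# Hypothetical helper: yields each element of "data" one at a time.
# It parses the whole document up front, so it only suits small inputs.
def each_data_record(json_text)
  return enum_for(:each_data_record, json_text) unless block_given?
  JSON.parse(json_text).fetch("data").each { |record| yield record }
end

# Handle the records one by one, as in the expected output.
each_data_record(SAMPLE) do |record|
  puts "obj is: #{record}"
end

# Hypothetical filter: keep only the records whose team is "soccer".
wanted = each_data_record(SAMPLE).select { |record| record["team"] == "soccer" }
puts "kept: #{wanted.map { |r| r["name"] }.inspect}"
```

The block passed to each_data_record is where the per-object check would go; a streaming version would need to call the same block without first building the full document.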

Maybe I have misunderstood the sentence "It will pass a single parameter, the ruby object built from the last parsed JSON object" about the "on_parse_complete" callback in the documentation quoted above.
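One reading that would be consistent with the output I see is that "the last parsed JSON object" means a complete top-level JSON value, in which case a file holding one big document triggers the callback only once. The sketch below illustrates that reading with Ruby's standard json library, not yajl-ruby; the helper each_top_level_value is hypothetical and simply treats every non-empty line as one top-level value. It is an assumption about the semantics, not a confirmed description of the library.

```ruby
require 'json'
require 'stringio'

# Hypothetical helper: treats each non-empty line as one complete
# top-level JSON value and yields the parsed result once per value.
def each_top_level_value(io)
  return enum_for(:each_top_level_value, io) unless block_given?
  io.each_line do |line|
    next if line.strip.empty?
    yield JSON.parse(line)
  end
end

# One top-level document wrapping all records (shape of the big file).
ONE_DOCUMENT = '{"dataName":"staff","version":5,"data":[{"name":"Fred"},{"name":"Tony"},{"name":"alex"}]}'

# The same records written as three separate top-level values, one per line.
ONE_PER_LINE = <<~RECORDS
  {"name":"Fred"}
  {"name":"Tony"}
  {"name":"alex"}
RECORDS

single = each_top_level_value(StringIO.new(ONE_DOCUMENT)).count
many   = each_top_level_value(StringIO.new(ONE_PER_LINE)).count

puts "callbacks for one wrapped document: #{single}"
puts "callbacks for one record per line:  #{many}"
```

Under this reading, getting one callback per element of "data" would require either a file with one record per top-level value or a parser that exposes lower-level events such as start_object/end_object.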

Does anyone know how to do this with yajl-ruby? Any help is appreciated.
