I am currently thinking about improving the performance of Rails.cache.write when using dalli to write items to the memcachier cloud. 
The stack, as it relates to caching, is currently:
heroku, memcachier heroku addon, dalli 2.6.4, rails 3.0.19
I am using newrelic for performance monitoring.
I am currently fetching "active students" for a given logged in user, represented by a BusinessUser instance, when its active_students method is called from a controller handling a request that requires a list of "active students":
class BusinessUser < ActiveRecord::Base
  ...
  def active_students
    Rails.cache.fetch("/studio/#{self.id}/students") do
      customer_users.active_by_name
    end
  end
  ...
end
After looking at newrelic, I've basically narrowed down one big performance hit for the app in setting key values on memcachier. It takes an average of 225ms every time. Further, it looks like setting memcache key values blocks the main thread and eventually disrupts the request queue. Obviously this is undesirable, especially when the whole point of the caching strategy is to reduce performance bottlenecks.
In addition, I've benchmarked the cache storage with plain dalli, and Rails.cache.write for 1000 cache sets of the same value:
heroku run console -a {app-name-redacted}
irb(main):001:0> require 'dalli'
=> false
irb(main):002:0> cache = Dalli::Client.new(ENV["MEMCACHIER_SERVERS"].split(","),
irb(main):003:1*                     {:username => ENV["MEMCACHIER_USERNAME"],
irb(main):004:2*                      :password => ENV["MEMCACHIER_PASSWORD"],
irb(main):005:2*                      :failover => true,
irb(main):006:2*                      :socket_timeout => 1.5,
irb(main):007:2*                      :socket_failure_delay => 0.2
irb(main):008:2>                     })
=> #<Dalli::Client:0x00000006686ce8 @servers=["server-redacted:11211"], @options={:username=>"username-redacted", :password=>"password-redacted", :failover=>true, :socket_timeout=>1.5, :socket_failure_delay=>0.2}, @ring=nil>
irb(main):009:0> require 'benchmark'
=> false
irb(main):010:0> n = 1000
=> 1000
irb(main):011:0> Benchmark.bm do |x|
irb(main):012:1*   x.report { n.times do ; cache.set("foo", "bar") ; end }
irb(main):013:1>   x.report { n.times do ; Rails.cache.write("foo", "bar") ; end }
irb(main):014:1> end
       user     system      total        real
 Dalli::Server#connect server-redacted:11211
Dalli/SASL authenticating as username-redacted
Dalli/SASL: username-redacted
  0.090000   0.050000   0.140000 (  2.066113)
Dalli::Server#connect server-redacted:11211
Dalli/SASL authenticating as username-redacted
Dalli/SASL: username-redacted
  0.100000   0.070000   0.170000 (  2.108364)
With plain dalli cache.set, we are using 2.066113s to write 1000 entries into the cache, for an average cache.set time of 2.06ms.
With Rails.cache.write, we are using 2.108364s to write 1000 entries into the cache, for an average Rails.cache.write time of 2.11ms.
⇒ It seems like the problem is not with memcachier, but simply with the amount of data that we are attempting to store.
According to the docs for the #fetch method, it looks like it would not be the way I want to go, if I want to throw cache sets into a separate thread or a worker, because I can't split out the write from the read - and self-evidently, I don't want to be reading asynchronously.
Is it possible to reduce the bottleneck by throwing Rails.cache.write into a worker, when setting key values? Or, more generally, is there a better pattern to do this, so that I am not blocking the main thread every time I want to perform a Rails.cache.write?
 
                        
There are two factors that would contribute to overall latency under normal circumstances: client side marshalling/compression and network bandwidth.
Dalli mashalls and optionally compresses the data, which could be quite expensive. Here are some benchmarks of Marshalling and compressing a list of random characters (a kind of artificial list of user ids or something like that). In both cases the resulting value is around 200KB. Both benchmarks were run on a Heroku dyno - performance will obviously depend on the CPU and load of the machine:
So we're basically seeing anywhere from a 40ms to 150ms performance hit just for Marshaling and/or zipping data. Marshalling a String will be much cheaper, while marshalling something like a complex object will be more expensive. Zipping depends on the size of the data, but also on the redundancy of the data. For example, zipping a 1MB string of all "a" characters takes merely about 10ms.
Network bandwidth will play some of a role here, but not a very significant one. MemCachier has a 1MB limit on values, which would take approximately 20ms to transfer to/from MemCachier:
This amounts to about 400Mbps (1MB * 8MB/Mb * (1000ms/s / 20ms)), which makes sense. However, for even a relatively large, but still smaller value of 200KB, we'd expect a 5x speedup:
So, there are several things you might be able to do to get some speedup:
Use a faster marshalling mechanism. For example, using
Array#pack("L*")to encode a list of 50,000 32-bit unsigned integers (like in the very first benchmark) into a string of length 200,000 (4 bytes for each integer), takes only 2ms rather than 40ms. Using compression with the same marshalling scheme, to get a similar sized value is also very fast (about 2ms as well), but the compression doesn't do anything useful on random data anymore (Ruby's Marshal produces a fairly redundant String even on a list of random integers).Use smaller values. This would probably require deep application changes, but if you don't really need the whole list, you should be setting it. For example, the memcache protocol has
appendandprependoperations. If you are only ever adding new things to a long list, you could use those operations instead.Finally, as suggested, removing the set/gets from the critical path would prevent any delays from affecting HTTP request latency. You still have to get the data to the worker, so it's important that if you're using something like a work queue, the message you send to the worker should only contain instructions on which data to construct rather than the data itself (or you're in the same hole again, just with a different system). A very lightweight (in terms of coding effort) would be to simply fork a process:
I haven't benchmarked a full example, but since you're not
execing, and Linux fork is copy-on-write, the overhead of the fork call itself should be minimal. On my machine, it's about 500us (that's micro-seconds not milliseconds).