Moving from Redis to Couchbase

Feature Overview

Publisher Frequency Capping is a feature which allows publishers on our platform to limit how often a Content Card is shown to a user.

A Content Card is a container for advertisements. When a Content Card is shown, an ad of one of several types (e.g. video, full-screen interstitial, marketing message) is presented to the end user.

A Placement is a container for various Content Cards. Placements usually represent an event triggered by the publisher app, e.g. Game Over, New Level, Level Up, and so on. Content Cards are served in a waterfall manner based on their respective requirements.

Currently there are two frequency capping options:

Ad Pacing: Number of times a Content Card can be shown to a user per time interval (minutes, hours, days)
Lifetime Maximum: Number of times a Content Card can be shown to a user overall

Within a Placement, Content Cards are evaluated in order: if the requirements of the first Content Card are not met, the next one is checked, and so on. Each Content Card has its own frequency capping settings.

It has been four years since we implemented the Publisher Frequency Capping feature, and up until recently we’ve been using Redis to provide the backend data store. This specific implementation, however, has become technical debt over time, for reasons including infrastructure cost, operational complexity, and the negative impact of a data loss scenario. In order to pay down this technical debt, we have recently migrated from Redis to Couchbase.


Why Move?

  • Persistence

Adoption of our frequency capping feature was low early on, so data loss wouldn't have had a significant impact so long as we could recover from a relatively recent backup. Redis' snapshot capability allowed us to recover sufficiently, with end users potentially only seeing an extra ad or two beyond what publishers wanted.

Additionally, when we started this process, Couchbase had not yet been fully integrated and battle-tested within Tapjoy's engineering tooling, from both an application and an operations standpoint. Simply put, it didn't warrant the additional engineering effort that would have been required at the time. Redis was then the best choice, with its atomic operations, low latency, and the availability of fully managed Redis services in the cloud provider ecosystem.

  • Scale & Cost

As the Publisher Frequency Capping dataset grew over the last four years, the in-memory nature of Redis forced us to scale vertically by upgrading the memory of the machines running Redis. This approach has become increasingly expensive compared to Couchbase, especially since Couchbase uses both memory and disk to provide data services, so we decided that switching was the best approach moving forward. In terms of cost, the switch resulted in a savings of approximately 66% over the course of one year.

  • Internal Organizational Knowledge

When considering whether to further scale Redis vs. adopting another backend store, we opted for Couchbase because of our engineering organization's prior experience scaling Couchbase to high throughput with cross-zone redundancy and replication. That experience came in the form of more comprehensive tooling, documentation, and integration patterns within our codebase, which lets us maintain a more consistent implementation throughout the codebase along with a better understanding of traffic patterns.


Our Approach

Publisher Frequency Capping has the following requirements:
1. Content Card level capping per device served (~500 million devices)
2. Publisher self-service without delay
3. Maintaining a throughput of 500k reads and 50k writes per minute

To accomplish this with the two forms of frequency capping mentioned earlier (Lifetime Maximum and Ad Pacing), we store two types of keys in Redis.

The device level data is stored in the following format:

device_key => { total: int, count: int, time: timestamp }

We also store a key for the window the publisher configures, after which the cap resets (e.g. once every 5 days). This maps the Content Card ID to the duration in milliseconds:

durations => { content_card_id: duration_in_ms }

The ‘durations’ key points to a hash that contains all Content Card IDs that have Frequency Capping enabled. Content Card ID keys are removed from the hash if the filter is turned off for that particular Content Card.
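
As a rough illustration, here is how these two key types might be written with the redis-rb client; the key names and the 5-day window are hypothetical:

require 'redis'

redis = Redis.new

# Hypothetical device-level key; the real key is derived from the device
# and Content Card identifiers.
device_key = 'fcap:device:abc123:cc:42'

# Device-level hash: lifetime total, count within the current window, and
# the timestamp at which the current window started.
redis.hmset(device_key, 'total', 0, 'count', 0, 'time', Time.now.to_i)

# Global 'durations' hash: Content Card ID => capping window in milliseconds.
# Enabling frequency capping on Content Card 42 with a 5-day window:
redis.hset('durations', '42', 5 * 24 * 60 * 60 * 1000)

# Turning the filter off removes the Content Card from the hash:
redis.hdel('durations', '42')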

With these two sets of keys, we can easily determine on reads whether a piece of content is frequency capped by looking at the device-level key and checking the following (a sketch follows the list):

1. Did we hit the total limit? If so, don’t show it
2. Is the count above the count limit per time interval? If so, don’t show it
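
A minimal sketch of that read-side check, assuming redis-rb, the key layout above, and hypothetical limit_total / limit_per_interval values for the Lifetime Maximum and Ad Pacing settings (the exact comparisons depend on how the limits are defined):

# Returns true when the Content Card should NOT be shown to this device.
def frequency_capped?(redis, device_key, limit_total, limit_per_interval)
  total, count = redis.hmget(device_key, 'total', 'count').map(&:to_i)

  return true if limit_total && total >= limit_total               # lifetime maximum reached
  return true if limit_per_interval && count >= limit_per_interval # pacing limit reached in this window

  false
end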

With regard to writes, whenever we receive an impression for any given piece of content, we check whether we need to update the device-level key as follows (see the sketch after this list):

1. Check if the content served is present in the `durations` key hash; if so, increment the device total by 1
2. Check if the device timestamp + duration is less than the current time; if so, reset the timestamp and set count to 0
3. Increment the count
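
Here is a sketch of that write path against the Redis layout, again using redis-rb with hypothetical key names; unlike the real implementation it makes no attempt to keep the steps atomic, it simply illustrates the logic:

def record_impression(redis, device_key, content_card_id)
  # 1. Only update if frequency capping is enabled for this Content Card
  duration_ms = redis.hget('durations', content_card_id.to_s)
  return unless duration_ms

  redis.hincrby(device_key, 'total', 1)

  # 2. If the capping window has elapsed, start a new one
  window_start = redis.hget(device_key, 'time').to_i
  if window_start + (duration_ms.to_i / 1000) < Time.now.to_i
    redis.hmset(device_key, 'time', Time.now.to_i, 'count', 0)
  end

  # 3. Count this impression within the current window
  redis.hincrby(device_key, 'count', 1)
end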

When migrating to Couchbase, this data translates easily to JSON documents with a few tweaks.

Couchbase document lookups tend to be more expensive than hash lookups in Redis because you have to pull the entire document into application memory to parse it rather than having the data store pull the specific key out of the hash.

Due to the above, we break the Redis `durations` hash apart into multiple keys in Couchbase:

content_card_id => duration_in_ms
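
For illustration, the resulting layout might look like the following, using the raw 1.x couchbase-ruby-client API rather than our internal wrapper; the bucket and key names are hypothetical:

require 'couchbase'

bucket = Couchbase.connect(bucket: 'frequency_capping')

# One small document per Content Card holding its capping window in
# milliseconds, replacing the single Redis 'durations' hash:
bucket.set('duration:42', 5 * 24 * 60 * 60 * 1000)

# One JSON document per device/Content Card pair:
bucket.set('fcap:device:abc123:cc:42',
           { 'count' => 0, 'time' => Time.now.to_i, 'total' => 0 })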

Whereas Redis supports atomic increments on hash integer values, Couchbase relies on Check and Set (CAS) functionality.

Quoting Couchbase’s documentation: CAS is acronym for “Check and Set” and is useful for ensuring that a mutation of a document by one user or thread does not override another near simultaneous mutation by another user or thread. The CAS value is returned by the server with the result when you perform a read on a document using Get or when you perform a mutation on a document using Insert, Upsert, Replace or Remove.

This value can be provided on any Couchbase update operation to ensure that the data hasn't changed since the last operation (in this case, a read). If the CAS values do not match, the library we use (https://github.com/couchbase/couchbase-ruby-client) returns an error that we can catch. For our use case, we re-read the value and retry the operation, up to a limit of two attempts.

Instead of a one-off data migration, we opted for a fallthrough approach. When we write out updates, we read the data from both Redis and Couchbase. If the data differs between the two (e.g. it doesn't exist in Couchbase but does in Redis), we treat Redis as the source of truth and write the update to both stores using the Redis data as the base.

Using these ideas, here is the basic outline of the update function:

def update
  return unless duration_key.present?

  # We read Redis here to get a value to compare our Couchbase value against
  count, time, total = redis.hmget(redis_fcap_key, 'count', 'time', 'total').map(&:to_i)
  redis_values = {
    'count' => count,
    'time'  => time,
    'total' => total
  }

  old_values, _, cas = couchbase.get(cb_bucket, cb_key, extended: true)
  old_values = { 'count' => 0, 'time' => 0, 'total' => 0 } if old_values.blank?

  if old_values != redis_values
    # If our values do not align, treat the Redis value as the truth and proceed
    old_values = redis_values
  end

  new_values = update_data(old_values)
  success = couchbase_update(new_values, cas)
  log if success
ensure
  # Always keep Redis updated as well, since it remains the source of truth
  # until the cutover is complete
  redis_update(redis_values) if duration_key.present?
end

and the CAS re-fetch and retry code:

def couchbase_update(new_values, cas)
  CB_ATTEMPT_LIMIT.times do |attempt|
    begin
      CouchbaseWrapper.set(cb_bucket, cb_key, new_values, cas: cas, ttl: time)
      return true
    rescue Couchbase::Error::KeyExists
      # Don't do a GET since we will just drop the update on the last loop
      break if CB_ATTEMPT_LIMIT == attempt + 1

      # Get updated CB values and the fresh CAS, then rebuild the update
      refreshed_values, _, cas = couchbase.get(cb_bucket, cb_key, extended: true)
      new_values = update_data(refreshed_values)
    end
  end
  false
end

When reading, we simply check Couchbase and, if the document is missing, fall through to Redis.

def key_values
  values = couchbase.key_values(key)

  # Fall through to Redis if the document hasn't made it into Couchbase yet
  if values.blank?
    redis_values = redis.key_values(key).map(&:to_i)
    values = {
      'count' => redis_values[0],
      'time'  => redis_values[1],
      'total' => redis_values[2]
    }
  end

  [values['count'], values['time'], values['total']]
end

This avoids the risk of mistakes in a one-off migration and lets us monitor the behavior of the new system while retaining the ability to roll back easily in case of unforeseen problems.

Within the next couple of months, once the number of Couchbase misses has dropped to an acceptable level, we will cut over fully by removing the reads and writes to Redis.

Performance-wise, Redis and Couchbase were roughly equal, with Couchbase response times occasionally increasing by 1-2 ms during TTL key cleanup.
