Netflix's Distributed Counter Abstraction

(netflixtechblog.com)

125 points | by benocodes 66 days ago

11 comments

  • notfried 66 days ago
    Netflix Engineering is probably far ahead of any other competitor streaming service, but I wonder how much the ROI is on that effort and cost. As a user, I don’t see much difference in reliability between Netflix, Disney, HBO, Hulu, Peacock and Paramount+. They all error out every now and then. Maybe 5-10 years ago you needed to be much more sophisticated because of lower bandwidth and less mature tech. Ultimately, the only real difference that makes me go for one service over the other is the content.
    • Galaco 65 days ago
      All the money they spend still fails at basic UX, creating a finely crafted, polished turd of engineering perfection. I don't see the Netflix experience to be superior in any way to the other big services; they all have their own issues that engineering haven't solved properly, most likely some different ones.

      An example (in this case ONE PIECE on Netflix Japan region): My Netflix language is set to English, but I am not in an English speaking country. And yet, all shows have an English description and title, as well as episode lists with descriptions also in English. And yet, the program itself is not in English, nor does it have English audio options or subtitles. And the UI does not indicate if there are English subtitles until I play an episode and open the subtitles menu. This is compounded to be even worse when there are so many shows that are only half subtitled (different rights holders I assume). Why does the UI lie to me by showing everything about the series/movie in my native language except the actual content itself? This is really common in non-English speaking regions, and it looks like a basic engineering failure of looking up their global database content in MY language while ignoring whether the actual CONTENT is available in my language. I suppose this could be a UX issue, but it also looks like an engineering one to me. And aren't those intertwined to some extent anyway?

      • flakeoil 65 days ago
        I agree with your example and that this functionality sucks. However it cannot be an engineering issue as it must be technically very easy to implement in a better way. It must be a product/UX/marketing kind of decision behind it or just plain ignorance.
        • zoover2020 65 days ago
          This. There are no technical limitations behind offering an A-Z catalog experience; it's all decided by that VP of user experience who wants to squeeze the most screen / device time out of each user.

          Am I really alone to notice that each year, our collective UX across consumer (SaaS) software gets worse?

      • aprilthird2021 65 days ago
        Netflix has far better UI/UX and features than its competitors. I'm sure your specific issue is a legit one, but Netflix as a whole is much better.
        • agos 65 days ago
          I don't know if Netflix has caught up recently, but during the pandemic times Disney+'s group viewing was a very good feature with good UX, and I haven't seen anything like it on other services
    • com2kid 65 days ago
      I worked at HBO Max when it launched. Less than 300 engineers, backend services written in either Node or Go. Mostly plain REST calls between services, fancier stuff was only used when there was a performance bottleneck.

      The engineering was very thoughtful and the internal tooling was really good.

      One underappreciated aspect of simply engineered systems is that they are simple to think about, simple to scale up, and simple to debug when something goes wrong.

    • brundolf 66 days ago
      I notice their front-end engineering (and design). The app experience has always been noticeably better than every other streaming service, which matters to me

      It's also possible that savings on infrastructure costs, reduced engineering hours spent on debugging, etc give them a quiet competitive advantage

      Most engineering is not visible to the end-user

      • mark_and_sweep 65 days ago
        Lately, the user experience has gotten much worse for Windows users: In July they removed the ability to download content for offline viewing. So everyone who wants to watch content while travelling, for example, has no other option but to use a competitor like Prime Video.

        I would also like to point out that those "savings on infrastructure costs" seemingly do not benefit their users: They have repeatedly increased prices over the past few years.

        I'm also unsure whether they are using their competitive advantage properly. Anecdotally, it used to be that almost everyone I knew was only streaming on Netflix. These days, the streaming services that people use seem much more diverse.

      • lofaszvanitt 66 days ago
        Their front end is just plain garbage. It does everything besides helping discovery of content. And people look up to Netflix like it's some miracle thing.
        • thereisnospork 65 days ago
          It is well engineered and meticulously manicured garbage though. In your example, the harder it is to discover content the easier it is to realize there is nothing new to watch. (Hence the rotating thumbnails, for a trivial example)
          • alexander2002 65 days ago
            One thing I hate about Netflix is that category recommendations also include stuff from other categories, like Trending or Recently Watched. I just want to watch some classic movies, so please show me your whole catalog of such movies. I don't use Netflix often, so this issue is more prominent for me at least, since I am not used to the UI bloat.
          • lofaszvanitt 65 days ago
            There is always a lot of good things to watch, but you have to dance around the idiotic recommendation system to find the hidden pearls. I'm sure there are a lot of good, talented people working there, but for what, when the frontend team kneecaps their work with the abysmal user interface.

            And the best part... you chat with the support and they try to force a how to use our webpage shit on you when you want to report an annoying UI issue. It's really mind boggling how a congregation of idiots with their heads up their asses runs that place.

        • assimpleaspossi 65 days ago
          That you don't like how you have to use the Netflix interface has nothing to do with how it's engineered.
    • paxys 65 days ago
      All of the services you mention have large engineering/ops teams and similar monitoring systems in place. There is absolutely an ROI on spending on reliability.
    • dheerkt 66 days ago
      Not an expert by any means but streaming HQ video is pretty expensive (even more so for live content), seems like the only providers that can do so profitably are YouTube and Netflix. I'm sure a big reason for that is the engineering (esp. CDN)
      • akutlay 64 days ago
        This is actually not true nowadays. Streaming HQ video is pretty cheap (check out per GB pricing from Cloudfront or Fastly and divide that by 5-10 to get a realistic number)
    • TZubiri 66 days ago
      The value of netflix is probably not only on its technical prowess and app experience, but it seems they are pretty involved in content direction through metrics. ¹ ²

      Sources:

      [1] https://youtu.be/xL58d1l-6tA

      [2] https://youtu.be/uFpK_r-jEXg

      • rhplus 65 days ago
        Jury’s out on their content direction. They cancel great shows before the first season has even had a chance to permeate. It’s clear they have little interest in long term artistic investments.
        • autoexec 65 days ago
          It also shows that they have no interest in building a library of quality content. They could invest in shows that have proper endings but instead they pollute their library with unfinished works that are certain to either be avoided by netflix customers who know that the show was canceled before it concluded, or piss off any netflix customer who doesn't.

          Netflix doesn't care though because they just want people watching the newest thing they shovel at us. They go out of their way to hide a lot of older content from users to keep people's attention on the new stuff. Half of the categories they show you are just new ways to show you the latest things they're pushing (recently added, trending, top 10, new on netflix, your next watch, top picks for you, we think you'll love these, etc.) and every other category just repeats the same shows.

          Who cares if some show leaves you disappointed because it got you hooked but then was canceled after 3 weeks, Netflix will just push some other newer show to keep you watching until they cancel that too.

    • l33t7332273 66 days ago
      I’ve thought this for a while, and it’s sad because I want to reward good tech. I usually think about this while I’m waiting for Paramount+ to load for the third time because picture-in-picture has gone black again when I tried to go full screen.
    • parhamn 66 days ago
      Although I agree with the general point, I am thankful for the bit of extra work they do when I travel abroad. The quality and availability of Netflix is far superior (many don't even work outside the US).
  • millipede 66 days ago
    > EVCache

    EVCache is a disaster. The code base has no concept of a threading model. The code is almost completely untested* too. I was on call at least 2 times when EVCache blew up on us. I tried root causing it and the code is a rat's nest. Avoid!

    * https://github.com/Netflix/EVCache

    • jolynch 66 days ago
      (I work at Netflix on these Datastores)

      EVCache definitely has some sharp edges and can be hard to use, which is one of the reasons we are putting it behind these gRPC abstractions like this Counter one or e.g. KeyValue [1] which offer CompletableFuture APIs with clean async and blocking modalities. We are also starting to add proper async APIs to EVCache itself e.g. getAsync [2] which the abstractions are using under-the-hood.

      At the same time, EVCache is the cheapest (by about 10x in our experiments) caching solution with global replication [3] and cache warming [4] we are aware of. Every time we've tried alternatives like Redis or managed services they either fail to scale (e.g. cannot leverage flash storage effectively [5]) or cost waaay too much at our scale.

      I absolutely agree though EVCache is probably the wrong choice for most folks - most folks aren't doing 100 million operations / second with 4-region full-active replication and applications that expect p50 client-side latency <500us. Similar I think to how most folks should probably start with PostgreSQL and not Cassandra.

      [1] https://netflixtechblog.com/introducing-netflixs-key-value-d...

      [2] https://github.com/Netflix/EVCache/blob/11b47ecb4e15234ca99c...

      [3] https://www.infoq.com/articles/netflix-global-cache/

      [4] https://netflixtechblog.medium.com/cache-warming-leveraging-...

      [5] https://netflixtechblog.com/evolution-of-application-data-ca...

      • dormando 65 days ago
        Throwing out a clarification: EVcache is effectively a complex memcached client + an internal ecosystem at Netflix. You can get much of its benefits with other systems (such as the memcached internal proxy: https://docs.memcached.org/features/proxy/).

        For plugging into other apps they may only need a small slice of EVCache; just the fetch from local-then-far, copy sets to multiple zones, etc. A greenfield client with the same backing store could be trivial to do.

        That all said I wouldn't advise people copy their method of expanding cache clusters: it's possible to add or remove one instance at a time without rebuilding and re-warming the whole thing.
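The "fetch from local-then-far, copy sets to multiple zones" slice mentioned above can be sketched in a few lines. This is only an illustration: `ZoneAwareCache` and the zone names are hypothetical, and plain dicts stand in for the per-zone memcached clusters. It is not EVCache's actual client API.

```python
class ZoneAwareCache:
    """Toy local-then-far cache client; dicts stand in for per-zone memcached clusters."""

    def __init__(self, local_zone, zones):
        self.local_zone = local_zone
        self.stores = {zone: {} for zone in zones}

    def set(self, key, value):
        # Copy every write to all zones so each zone holds a full replica.
        for store in self.stores.values():
            store[key] = value

    def get(self, key):
        # Read the local zone first.
        local = self.stores[self.local_zone]
        if key in local:
            return local[key]
        # On a miss, fall back exactly once to the first remote zone.
        for zone, store in self.stores.items():
            if zone != self.local_zone:
                return store.get(key)
        return None
```

A greenfield client along these lines would still need connection pooling, timeouts, and cache warming, none of which are modeled here.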

      • lksajdl3lfkjasl 65 days ago
        Curious how you're getting <500us latencies. Connection pooling, gRPC?
        • jolynch 65 days ago
          Every zone has a copy, and clients always read their local zone copy (via pooled memcached connections) first and fall back only once to another zone on a miss. The key is staying in zone, plus the memcached protocol and super fast server latencies. It's been a little while since we measured, but memcached has a time to first byte of around 10us and then scales sublinearly with payload size [1]. Single zone latency is variable but generally between 150 and 250us roundtrip; cross AZ is terrible at up to a millisecond [2].

          So you put 200us network with 30us response time and get about 250us average latency. Of course the P99 tail is closer to a millisecond and you have to do things like hedges to fight things like the hard coded eternity 200ms TCP packet retry timer ... But that's a whole other can of worms to talk about.

          [1] https://github.com/Netflix-Skunkworks/service-capacity-model...

          [2] https://jolynch.github.io/pdf/wlllb-apachecon-2022.pdf
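The "hedges" mentioned above can be illustrated with a small asyncio sketch. Everything here is an assumption made for illustration: `hedged_get`, the `fetch` callable, and the replica names are hypothetical, and the delays are simulated with `asyncio.sleep` rather than real network calls.

```python
import asyncio


async def hedged_get(fetch, replicas, hedge_after):
    """Ask replicas[0]; if it has not answered within hedge_after seconds,
    also ask replicas[1] and return whichever reply arrives first."""
    primary = asyncio.create_task(fetch(replicas[0]))
    done, _ = await asyncio.wait({primary}, timeout=hedge_after)
    if done:
        return primary.result()
    # Primary is slow: send the hedge request and race the two.
    backup = asyncio.create_task(fetch(replicas[1]))
    done, pending = await asyncio.wait(
        {primary, backup}, return_when=asyncio.FIRST_COMPLETED
    )
    for task in pending:
        task.cancel()  # the losing request is no longer needed
    return done.pop().result()


def make_fake_fetch(delays):
    """Build a fetch coroutine whose latency per replica is simulated."""
    async def fetch(replica):
        await asyncio.sleep(delays[replica])
        return f"value from {replica}"
    return fetch
```

Hedging like this is only safe when the request can be repeated without side effects, which is why reads are the usual candidates.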

    • jedberg 66 days ago
      I'm surprised it's still there! It was built over a decade ago when I was still there. At the time there were no other good solutions.

      But Momento exists now. It solves every problem EVCache was supposed to solve.

      There are other options too. They should retire it by now.

    • Alupis 66 days ago
      Can you elaborate?

      From the looks of it, each module has plenty of tests - and the codebase is written in a spring/boot style, making it fairly intuitive to navigate.

  • vlovich123 66 days ago
    It's a bit weird to not compare this to HyperLogLog & similar techniques that are designed to solve exactly this problem but much more cheaply (at least as far as I understand).
    • rshrin 65 days ago
      Added a note on why we chose EvCache for the "Best-Effort" use-case instead of probabilistic data structures like HLL/CMS. Appreciate the discussion.
    • zug_zug 66 days ago
      I came here to write the same thing. Getting an estimate accurate to at least 5 digits on all Netflix video watches worldwide can be done with intelligent sampling (like HyperLogLog) and likely one MacBook Air as the backend. And aside from the compute savings, the complexity and implementation time would be much lower too.
      • rshrin 66 days ago
        Fwiw, we didn't mention any probabilistic data structures because they don't satisfy some of the basic requirements we had for the Best-Effort counter. HyperLogLog is designed for cardinality estimation, not for incrementing or decrementing specific counts (which in our case could be any arbitrary +ve/-ve number per key). AFAIK, both Count-Min Sketch and HyperLogLog do not support clearing counts for specific keys. I believe Count-Min Sketch cannot support decrement as well. The core EvCache solution for the Best-Effort counter is like 5 lines of code. And EvCache can handle millions of operations/second relatively cheaply.
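For readers who want to see the shape of the requirements listed above (arbitrary signed deltas per key, clearing a key, optional expiry), here is a minimal in-memory sketch. It is not the EVCache API: `BestEffortCounter`, `add_and_get`, and `clear` are hypothetical names, and a dict stands in for the cache.

```python
import time


class BestEffortCounter:
    """In-memory stand-in for a cache with an atomic increment; not the EVCache API."""

    def __init__(self, clock=time.monotonic):
        self._values = {}  # key -> (count, expires_at or None)
        self._clock = clock

    def add_and_get(self, key, delta, ttl_seconds=None):
        count, expires_at = self._values.get(key, (0, None))
        if expires_at is not None and self._clock() >= expires_at:
            count, expires_at = 0, None  # entry expired: start from zero
        if ttl_seconds is not None:
            expires_at = self._clock() + ttl_seconds
        count += delta  # delta may be any positive or negative integer
        self._values[key] = (count, expires_at)
        return count

    def clear(self, key):
        # Clearing a single key is the operation the sketches discussed here lack.
        self._values.pop(key, None)
```

A retried `add_and_get` would be counted twice in this sketch, which is one reason such a counter is only "best effort".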
        • vlovich123 66 days ago
          Including this in the blog would have been helpful, although I don’t think the decrement problem is unsolvable - just have a second field for decrements that is incremented when you want to decrement, and then the final result is the difference of the two.
          • rshrin 66 days ago
            True. You could do decrements that way. We trimmed this article as the post is already quite long. But considering the multiple threads on this, we might add a few lines. There is also something to be said on operating data stores that support HLL or similar probabilistic data structures. Our goal is to build taller on what we already operate and deploy (like EvCache)
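The "second field for decrements" idea from this subthread can be sketched with two increment-only Count-Min Sketches. The class names, widths, and hashing scheme below are assumptions chosen for illustration; the estimate is approximate whenever keys collide in every row, and clearing a single key is still not supported.

```python
import hashlib


class CountMinSketch:
    """Increment-only Count-Min Sketch with `depth` rows of `width` cells."""

    def __init__(self, width=1024, depth=4):
        self.width = width
        self.depth = depth
        self.rows = [[0] * width for _ in range(depth)]

    def _cells(self, key):
        # One independent-looking hash per row, derived by salting with the row index.
        for row in range(self.depth):
            digest = hashlib.sha256(f"{row}:{key}".encode()).digest()
            yield row, int.from_bytes(digest[:8], "big") % self.width

    def add(self, key, amount):
        if amount < 0:
            raise ValueError("this sketch only accepts non-negative amounts")
        for row, col in self._cells(key):
            self.rows[row][col] += amount

    def estimate(self, key):
        return min(self.rows[row][col] for row, col in self._cells(key))


class SignedSketchCounter:
    """Routes negative deltas to a second sketch and reports the difference."""

    def __init__(self):
        self.increments = CountMinSketch()
        self.decrements = CountMinSketch()

    def add(self, key, delta):
        if delta >= 0:
            self.increments.add(key, delta)
        else:
            self.decrements.add(key, -delta)

    def estimate(self, key):
        return self.increments.estimate(key) - self.decrements.estimate(key)
```

Because each sketch can only overestimate, the difference can err in either direction, so the usual one-sided Count-Min error bound no longer applies.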
    • willcipriano 66 days ago
      Can you do a distributed HyperLogLog? Wouldn't you have to have a single instance of it somewhere?
      • adgjlsfhk1 66 days ago
        hyperloglog distributes perfectly. Each node keeps track of the "best" hash, and then to query the global maximum, you just ask each node for their value.
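A small sketch of why HyperLogLog merges cleanly across nodes: in the standard formulation each node keeps an array of registers, and the global sketch is the register-wise maximum. The precision `P`, the SHA-1 based hash, and the function names are choices made for this example, not anything taken from the article.

```python
import hashlib
import math

P = 12        # precision: 2**12 = 4096 registers
M = 1 << P


def _hash64(item):
    return int.from_bytes(hashlib.sha1(item.encode()).digest()[:8], "big")


def new_sketch():
    return [0] * M


def add(registers, item):
    h = _hash64(item)
    idx = h >> (64 - P)                    # top P bits select the register
    rest = h & ((1 << (64 - P)) - 1)       # remaining 52 bits
    # Rank = 1-based position of the leftmost 1-bit in the remaining bits.
    rank = (64 - P) - rest.bit_length() + 1
    registers[idx] = max(registers[idx], rank)


def merge(a, b):
    # Merging two nodes' sketches is a register-wise maximum.
    return [max(x, y) for x, y in zip(a, b)]


def estimate(registers):
    alpha = 0.7213 / (1 + 1.079 / M)
    raw = alpha * M * M / sum(2.0 ** -r for r in registers)
    zeros = registers.count(0)
    if raw <= 2.5 * M and zeros:
        return M * math.log(M / zeros)     # small-range (linear counting) correction
    return raw
```

Merging per-node sketches yields exactly the registers a single node would have built from the union of the inputs, so the merge itself adds no error. As noted elsewhere in the thread, this estimates distinct items; it does not support signed per-key counts.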
        • rshrin 65 days ago
          Querying each instance can lead to availability and latency challenges. Moreover, HLL is not suited for tasks like increments/decrements, TTLs on counts, and clearing of counts. Count-Min Sketch could work if we're willing to forgo certain requirements like TTLs, but relying solely on in-memory data structures on every instance isn't ideal (think about instance crashes, restarts, new code deployments etc.) Instead, using data stores that support HLL or Count-Min Sketch, like Redis, would offer better reliability. That said, we prefer to build on top of data stores we already operate. Also, the "Best-Effort" counter is literally 5 lines of code for EvCache. The bulk of the post focuses on "Eventually Consistent Accurate" counters, along with a way to track which systems sent the increments, which probabilistic counting is not ideal for.
          • rshrin 65 days ago
            Just to add, we are also trying to support idempotency wherever possible to enable safe hedging and retries. This is mentioned a bunch in the article on Accurate counters. So take that into consideration.
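To make the link between idempotency and safe retries concrete, here is a minimal sketch in which every increment carries a caller-supplied token and duplicates are ignored. `IdempotentCounter` and the token format are hypothetical; a real service would presumably bound how long tokens are remembered, which this sketch does not do.

```python
class IdempotentCounter:
    """Applies each (token, delta) at most once, so retries and hedges are safe."""

    def __init__(self):
        self._seen = set()   # tokens already applied (unbounded in this sketch)
        self._total = 0

    def add(self, token, delta):
        if token in self._seen:
            return self._total   # duplicate delivery: do not count it again
        self._seen.add(token)
        self._total += delta
        return self._total

    def get(self):
        return self._total
```

With this shape, a client that times out can resend the same token without risking a double count.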
  • mannyv 66 days ago
    I wonder how they're going to go about purging all the counters that end up unused once the employee and/or team leaves?

    I can see someone setting up a huge number of counters then leaving...and in a hundred years their counters are taking up TB of space and thousands of requests-per-second.

    • singron 66 days ago
      There is a retention policy, so the raw events aren't kept very long. The rollups probably compress really well in their time series database, which I'm guessing also has a retention policy.

      If you have high cardinality metrics, it can still be really painful, although I think you will feel the pain initially and it won't take years. Usually these systems have a way to inspect what metrics or counters are using the most resources and then they can be reduced or purged from time to time.

      • rshrin 66 days ago
        Yes, once the events are aggregated (and optionally moved to a cost-effective storage for audits), we don't need them anymore in the primary storage. You can check the retention section in the article. The rollups themselves can have TTL if the users wish to set that on a namespace. Although doing that, they have to be fine with certain timing issues on when the rollups expire and new events are aggregated. We also have automation to truncate/delete namespaces.
  • fire_lake 65 days ago
    I think the design could have been simpler with Kafka (which they touched on briefly):

    - Write counter changes to a Kafka topic with many partitions. The partition key is derived from the counter name.

    - Use Kafka connect to push all counter events to S3 for audit and analysis.

    - Write a Kafka consumer that reads events in batches and updates a persistent store with the current count.

    - Pick a good Kafka message lifetime to ensure that topic size is kept under control, but data is not lost.

    This gives us:

    - Fast reads (count is precomputed, but potentially stale)

    - Fast writes (Kafka)

    - Correctness (every counter is assigned exactly one consumer)

    - Durability (all state is in Kafka or the persistent store)

    - Scalable storage and compute requirements over time

    If I were to really go crazy with this, I would shard each counter further and use CRDTs to compute the total across all shards.
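The partition-by-counter-name design above can be simulated without Kafka to show why each counter ends up with exactly one writer. This is an in-memory sketch only: `Topic`, `PartitionConsumer`, and `NUM_PARTITIONS` are hypothetical stand-ins, and no real Kafka client API is used.

```python
import zlib
from collections import defaultdict

NUM_PARTITIONS = 8


def partition_for(counter_name):
    # Stable hash of the counter name, so one counter always maps to one partition.
    return zlib.crc32(counter_name.encode()) % NUM_PARTITIONS


class Topic:
    """Append-only logs, one per partition."""

    def __init__(self):
        self.partitions = [[] for _ in range(NUM_PARTITIONS)]

    def publish(self, counter_name, delta):
        self.partitions[partition_for(counter_name)].append((counter_name, delta))


class PartitionConsumer:
    """Sole reader of one partition; folds batches into the persistent store."""

    def __init__(self, topic, partition, store):
        self.log = topic.partitions[partition]
        self.offset = 0
        self.store = store

    def poll(self, max_batch=100):
        batch = self.log[self.offset:self.offset + max_batch]
        totals = defaultdict(int)
        for name, delta in batch:
            totals[name] += delta          # pre-aggregate within the batch
        for name, total in totals.items():
            self.store[name] = self.store.get(name, 0) + total
        self.offset += len(batch)          # advance only after applying the batch
        return len(batch)
```

In a real deployment the store update and the offset commit would need to be atomic (or the update idempotent) to avoid double counting after a consumer restart; the sketch sidesteps that by running in one process.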

    • rshrin 65 days ago
      Yes, this is one of the approaches mentioned in the article and is indeed a valid approach. One thing to keep in mind is that we are already operating the TimeSeries service for a lot of other high ROI use cases within Netflix. There already exists a lot of automation to self-provision, configure, deploy and scale TimeSeries. There already exists automation to move data from Cassandra to S3/Iceberg. We somewhat get all that for free. The Counter service is really just the Rollup layer on top of it. The Rollup operational nuances are just to give it that extra edge when it comes to accuracy and reliability.
      • fire_lake 65 days ago
        Did you also consider AWS managed services? Like a direct write to Dynamo?
        • rshrin 65 days ago
          Not for this use case. Other use cases at Netflix use AWS Managed service when it makes sense from a use-case and cost perspective. In this case, using TimeSeries opens the door to a lot of other potential future use cases:

          1. What was the count for counter X between times T1 and T2?

          2. "I am going to re-run my batch job again from yesterday. Adjust the increments for this window and re-compute the final count".

          Although the #2 use-case requires a lot of other nuances around Recounting, which we allude to but don't expand upon in the article (adjustable retention, multiple rollup checkpoints per counter, pushing back accept-limit for backfills etc.)
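Both use cases fall out naturally when increments are kept as timestamped events rather than a single running total. The sketch below is a simplified illustration with hypothetical names (`CounterEvent`, `count_between`, `replace_window`); it ignores the retention, checkpoint, and accept-limit nuances mentioned above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CounterEvent:
    timestamp: int  # epoch seconds
    delta: int


def count_between(events, t1, t2):
    """Use case 1: sum of deltas with t1 <= timestamp < t2."""
    return sum(e.delta for e in events if t1 <= e.timestamp < t2)


def replace_window(events, t1, t2, corrected):
    """Use case 2: drop the events in [t1, t2) and substitute corrected ones."""
    kept = [e for e in events if not (t1 <= e.timestamp < t2)]
    return sorted(kept + list(corrected), key=lambda e: e.timestamp)
```

After `replace_window`, the final count is simply `count_between` over the whole range, recomputed from the adjusted events.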

  • bob1029 65 days ago
    This seems a bit overcooked to me.

    I suspect if I were to recursively ask "why?", we may eventually wind up at some triviality (to the actual business/customer) that could have easily gone another way and obviated the need for this abstraction in the first place.

    Just thinking about the raw information, I don't see how the average streaming media consumer produces more than a few hundred KB of useful data per month. I get that the delivery of the content is hard, but I don't see why we need to torture ourselves over the gathering of basic statistics.

  • dopamean 66 days ago
    Why would netflix put their blog on medium?
    • jedberg 66 days ago
      Better distribution. More people will read it there.
      • paxys 65 days ago
        That was probably true 5 years ago, but today a large chunk of readers are going to be driven away by the intrusive popups and auth walls.
        • dopamean 65 days ago
          I didn't read it, and that was the reason I asked.
    • lmm 65 days ago
      Because it's the least bad way to do a blog when you want to focus on the content.
  • Dylan16807 65 days ago
    Well okay, that's some neat implementation stuff.

    But what in the world are they using a global counter for that needs "low millisecond latencies"? I don't see a single example in the entire article, and I can't think of any.

    • rshrin 65 days ago
      Use cases fetching counts directly in the path of Netflix users/streaming, e.g. user-personalization, feature-gating > what features are shown when you load the home page, dictated by how many times these have been shown before for a given device. The article hints at this in the beginning. Also, there were some initial use cases related to interactive titles, details of which can't be publicly shared [although that is winding down now]
      • Dylan16807 65 days ago
        > user-personalization, feature-gating > what features are shown when you load the home page, dictated by how many times these have been shown before for a given device

        I don't see how any of those would suffer if the numbers took seconds to update instead of milliseconds.

        We're talking about having updated numbers in milliseconds, right? Not just "the database responds in a reasonable amount of time" because that's been solved many many times over and the article specifically says "this category requires near-immediate access to the current count at low latencies".

        > some initial use cases related to interactive titles

        Maybe 1 second of latency for a group interaction? That's still orders of magnitude more slack. And I'd expect only moderate accuracy requirements.

        • rshrin 65 days ago
          For the Eventually Consistent counter, the low millisecond requirement is for reads and writes, not for the convergence of counts. For this category, the convergence is in the order of seconds (user-personalization, feature-gating fall in this category). For the "Best-effort" category, there are some use cases that run experiments in a single-region and need access to current counts at low latencies (they basically add increments and read the value back in the same call i.e. AddAndGet), but are willing to sacrifice "some degree" of accuracy for it. See the table in the 2nd section. There are multiple dimensions in terms of Read/Write Latency, Staleness, Global reads/writes etc. mapped to the two kinds of use cases. Maybe you are conflating a few things. Finally, there is the experimental type of Accurate counters that can get the current count with high degree of accuracy (but the latency there depends on a few things as explained in the article). The last type is more like what can be done using this approach, no current use case for it.
  • leakyabstxns 66 days ago
    Given the complexity of the system, I'm curious to know how many people maintain this service
    • rshrin 66 days ago
      5 people (who also maintain a lot of other services like the linked TimeSeries service). The self-service to create new namespaces is pretty much autonomous (see attached link in the article on "Provisioning"). The stateless layer auto-scales up and down based on attached CPU/Network-based scaling policies. The exports to audit stores can be scheduled at a cadence. The only intervention is when we have to scale the storage layer (although parts of it are also automated using the same Provisioning workflow). I guess the other intervention is when we decide to change the configs (like number of queues) and trigger a re-deploy. But that's about it. So far, we have spent a very small percentage of our support budget on this.
  • est 65 days ago
    If anyone wants a non-distributed but still very powerful counter service, I'd recommend Telegraf from InfluxData or https://vector.dev/
  • ilrwbwrkhv 66 days ago
    Looks a bit overengineered due to Netflix's own microservices nonsense.

    I would be more interested in how a higher traffic video company like Pornhub handles things like this.

    • Alupis 66 days ago
      > Netflix's own microservices nonsense

      How many times has Netflix been entirely down over the years?

      Seems it's not "nonsense".

    • philjohn 66 days ago
      Have you worked with distributed counters before? It's a hard problem to solve. Typical tradeoffs are lower cardinality for exact counters.

      The queue solution is pretty elegant.

      • ElevenLathe 66 days ago
        The queuing reminds me of old "tote board" technology. I can't find a reference easily but these were machines used to automatically figure and display parimutuel payouts at horse tracks. One particular kind of them would have the cashiers' terminals emit ball bearings on a track back at the central (digital but electromechanical) counter for each betting pool. This arrangement allowed the cashiers to sell as fast as they liked without jamming the works, and then allowed the calculation/display of odds to be "eventually consistent".
    • oreoftw 66 days ago
      How would you design it to support mentioned use cases?