JoelFernandes.org

Age is of no importance unless you're a cheese -- Billie Burke

Hello! I’m Joel, and this is my personal website, built with Jekyll! I currently work at Google. My interests include the scheduler, RCU, tracing, synchronization, memory models, and other kernel internals. I also love contributing to the upstream Linux kernel and other open source projects.

Connect with me on Twitter and LinkedIn. Or, drop me an email at: joel at joelfernandes dot org

Look for my name in the kernel git log to find my upstream kernel patches. Check out my resume for full details of my work experience. I also actively present at conferences; see a list of my past talks, presentations, and publications.

Full list of all blog posts on this site:
  • 25 Jun 2023   SVM and vectors for the curious
  • 10 Jun 2023   SELinux Debugging on ChromeOS
  • 28 Apr 2023   Understanding Hazard Pointers
  • 25 Apr 2023   PowerPC stack guard false positives in Linux kernel
  • 24 Feb 2023   Getting YouCompleteMe working for kernel development
  • 29 Jan 2023   Figuring out herd7 memory models
  • 13 Nov 2022   On workings of hrtimer's slack time functionality
  • 25 Oct 2020   C++ rvalue references
  • 06 Mar 2020   SRCU state double scan
  • 18 Oct 2019   Modeling (lack of) store ordering using PlusCal - and a wishlist
  • 02 Sep 2019   Making sense of scheduler deadlocks in RCU
  • 23 Dec 2018   Dumping User and Kernel stacks on Kernel events
  • 15 Jun 2018   RCU and dynticks-idle mode
  • 10 Jun 2018   Single-stepping the kernel's C code
  • 10 May 2018   RCU-preempt: What happens on a context switch
  • 11 Feb 2018   USDT for reliable Userspace event tracing
  • 08 Jan 2018   BPFd- Running BCC tools remotely across systems
  • 01 Jan 2017   ARMv8: flamegraph and NMI support
  • 19 Jun 2016   Ftrace events mechanism
  • 20 Mar 2016   TIF_NEED_RESCHED: why is it needed
  • 25 Dec 2015   Tying 2 voltage sources/signals together
  • 04 Jun 2014   MicroSD card remote switch
  • 08 May 2014   Linux Spinlock Internals
  • 25 Apr 2014   Studying cache-line sharing effects on SMP systems
  • 23 Apr 2014   Design of fork followed by exec in Linux
Most Recent Post:

    SRCU state double scan


    The SRCU flavor of RCU uses per-cpu counters to detect that every CPU has passed through a quiescent state for a particular SRCU lock instance (srcu_struct).

    There are a total of four counters per CPU: one pair for locks and another for unlocks. You can think of the SRCU instance as being split into two parts. A reader samples srcu_idx and decides which part to use; each part corresponds to one pair of lock and unlock counters. The reader increments a part’s lock counter when entering its read-side critical section and the matching unlock counter when leaving.
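
    Here is a minimal sketch of what those per-CPU counters and the reader-side operations might look like. The names (my_srcu, my_srcu_data, my_srcu_read_lock, my_srcu_read_unlock) are simplified stand-ins for illustration, not the kernel’s actual SRCU definitions:

        /* Simplified sketch of the per-CPU SRCU counters described above. */
        struct my_srcu_data {
                unsigned long lock_count[2];    /* one lock counter per part */
                unsigned long unlock_count[2];  /* one unlock counter per part */
        };

        struct my_srcu {
                unsigned int srcu_idx;              /* which part new readers use */
                struct my_srcu_data __percpu *sda;  /* the 4 counters, per CPU */
        };

        /* Reader entry: sample the index, then bump that part's lock counter. */
        static int my_srcu_read_lock(struct my_srcu *ssp)
        {
                int idx = READ_ONCE(ssp->srcu_idx) & 0x1;

                this_cpu_inc(ssp->sda->lock_count[idx]);
                smp_mb();   /* order the increment before the critical section */
                return idx; /* the reader remembers which part it used */
        }

        /* Reader exit: bump the matching unlock counter of the same part. */
        static void my_srcu_read_unlock(struct my_srcu *ssp, int idx)
        {
                smp_mb();   /* order the critical section before the increment */
                this_cpu_inc(ssp->sda->unlock_count[idx]);
        }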

    During an update, the updater flips srcu_idx (thus attempting to force new readers to use the other part) and waits for the lock/unlock counters corresponding to the previous value of srcu_idx to match. Once the sum of the lock counters across all CPUs matches that of the unlock counters, the system knows all pre-existing read-side critical sections have completed.
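
    Continuing the same simplified sketch, the updater side might look roughly like this: flip srcu_idx so new readers move to the other part, then sum the old part’s lock and unlock counters across all CPUs until they match. (The paragraphs below explain why a real updater must scan both parts, not just one.)

        /* Sum the given part's counters across CPUs; done when they match. */
        static bool my_srcu_readers_done(struct my_srcu *ssp, int idx)
        {
                unsigned long locks = 0, unlocks = 0;
                int cpu;

                for_each_possible_cpu(cpu) {
                        struct my_srcu_data *sdp = per_cpu_ptr(ssp->sda, cpu);

                        locks   += READ_ONCE(sdp->lock_count[idx]);
                        unlocks += READ_ONCE(sdp->unlock_count[idx]);
                }
                return locks == unlocks;
        }

        /* Updater: flip the index, then drain readers of the old part. */
        static void my_synchronize_srcu(struct my_srcu *ssp)
        {
                int idx = READ_ONCE(ssp->srcu_idx) & 0x1;

                WRITE_ONCE(ssp->srcu_idx, idx ^ 1); /* steer new readers away */
                smp_mb();
                while (!my_srcu_readers_done(ssp, idx))
                        cpu_relax(); /* wait for pre-existing readers */
        }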

    Things are not that simple, however. It is possible that a reader samples srcu_idx but, before it can increment the corresponding lock counter, undergoes a long delay. We thus end up in a situation where there are readers in both srcu_idx = 0 and srcu_idx = 1.

    To prevent such a situation, a writer has to wait for readers corresponding to both srcu_idx = 0 and srcu_idx = 1 to complete. This is depicted with ‘A MUST’ in the pseudo-code below:

            reader 1        writer                        reader 2
            -------------------------------------------------------
            // read_lock
            // enter
            Read: idx = 0;
            <long delay>    // write_lock
                            // enter
                            wait_for lock[1]==unlock[1]
                            idx = 1; /* flip */
                            wait_for lock[0]==unlock[0]
                            done.
                                                          Read: idx = 1;
            lock[0]++;
                                                          lock[1]++;
                            // write_lock
                            // return
            // read_lock
            // return
            /**** NOW BOTH lock[0] and lock[1] are non-zero!! ****/
                            // write_lock
                            // enter
                            wait_for lock[0]==unlock[0] <- A MUST!
                            idx = 0; /* flip */
                            wait_for lock[1]==unlock[1] <- A MUST!
    

    NOTE: QRCU has a similar issue. However, it overcomes such a race on the reader side by retrying the sampling of its srcu_idx equivalent.

    Q: If you have to wait for readers of both srcu_idx = 0 and srcu_idx = 1, then why not just have a single counter and do away with the “flipping” logic?

    Ans: Because of updater forward progress. If we had a single counter, new readers could keep incrementing the lock counter, so updaters might end up waiting forever. With the ‘flip’ logic, pre-existing readers on the now-inactive part of srcu_idx drain in bounded time: the number of readers on a ‘flipped’ part can only monotonically decrease, since new readers go to its counterpart.
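
    To make the forward-progress argument concrete, here is what the wait would look like with a single counter pair, reusing the hypothetical my_srcu_readers_done() helper from the sketch above; with no inactive part to drain, the comparison can be postponed indefinitely by new readers:

        /* Hypothetical single-counter variant: no flip, only part 0 exists. */
        static void single_counter_synchronize(struct my_srcu *ssp)
        {
                /*
                 * New readers keep incrementing lock_count[0] while we sum it,
                 * so locks == unlocks may never hold: there is no bound on the
                 * wait. The flip avoids this because the old part's reader
                 * count can only decrease once new readers are steered to the
                 * other part.
                 */
                while (!my_srcu_readers_done(ssp, 0))
                        cpu_relax();
        }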

    2023 update: I have more detailed notes with diagrams and such on this and other cases. Just reach out to me if you want to take a look at those.