BTree Index Prefetch

This document outlines how we decide to pre-read extra nodes in the btree index.

Rationale

Because of the latency involved in making a request, it is often better to make a few large requests rather than many small ones, even if some of the extra data will be wasted.

Example

Using my connection as an example, I have a max bandwidth of 160kB/s and a latency of 100-400ms to London; I'll use 200ms for this example. With this connection, in 200ms you can download 32kB. So if you make 10 requests for 4kB of data, you spend 10*.2s = 2s sending the requests, and 4*10/160 = .25s actually downloading the data, for 2.25s total. If, instead, you made 3 requests for 32kB of data each, you would take 3*.2s = .6s for requests, and 32*3/160 = .6s for downloading the data, for 1.2s total. So you save 2.25 - 1.2 = 1.05s, even though you downloaded 32*3 - 4*10 = 56kB of data that you probably don't need. On the other hand, if you made 1 request for 480kB, you would take .2s for the request, and 480/160 = 3s for the data. So you end up taking 3.2s because of the 440kB of wasted data, which is slower than either of the other options.
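
The same arithmetic as a small Python sketch; the numbers are just the ones from this example::

  # Rough cost model from the example: each request costs one round trip
  # of latency plus the time to download its data.
  def transfer_time(num_requests, kb_per_request,
                    latency=0.2, bandwidth_kb_per_s=160.0):
      request_time = num_requests * latency
      download_time = num_requests * kb_per_request / bandwidth_kb_per_s
      return request_time + download_time

  print(transfer_time(10, 4))    # 2.25s: ten small requests
  print(transfer_time(3, 32))    # 1.2s:  three larger requests
  print(transfer_time(1, 480))   # 3.2s:  one huge, mostly wasted request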

BTree Structure

This is meant to give a basic feeling for how the btree index is laid out on disk, not to give a rigorous discussion. For that, look elsewhere[ref?].

The basic structure is that we have pages of 4kB. Each page is either a leaf, which holds the final information we are interested in, or an internal node, which contains a list of references to the next layer of nodes. The layers are structured such that all nodes for the top layer come first, then the nodes for the next layer, linearly in the file.
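
As a rough illustration of that layout, here is a hypothetical sketch; the per-layer page counts are invented for the example::

  # Map a page number to its layer, given how many pages each layer
  # holds (root first). The counts below are invented for illustration.
  PAGE_SIZE = 4096

  def layer_of_page(page_number, nodes_per_layer):
      for layer, count in enumerate(nodes_per_layer):
          if page_number < count:
              return layer
          page_number -= count
      raise ValueError('page number beyond the end of the index')

  # e.g. a 3-layer index: 1 root page, 4 internal pages, 350 leaf pages.
  assert layer_of_page(0, [1, 4, 350]) == 0    # the root
  assert layer_of_page(3, [1, 4, 350]) == 1    # an internal node
  assert layer_of_page(5, [1, 4, 350]) == 2    # the first leaf
  # The bytes for page N simply start at offset N * PAGE_SIZE.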

Example 1 layer

In the simplest example, all the data fits into a single page, the root node. This means the root node is a leaf node.

Example 2 layer

As soon as the data cannot fit in a single node, we create a new internal node, make that the root, and start to create multiple leaf nodes. The root node then contains the keys which divide the leaf pages. (So if leaf node 1 ends with 'foo' and leaf node 2 starts with 'foz', the root node would hold the key 'foz' at position 0).
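
A minimal sketch of how those dividing keys direct a lookup (illustrative only, not the actual implementation)::

  import bisect

  def leaf_for(key, dividing_keys):
      # Keys below the first divider fall in the first leaf, keys at or
      # above it fall in the second leaf, and so on.
      return bisect.bisect_right(dividing_keys, key)

  # With the example above, the root holds just ['foz']:
  assert leaf_for('foo', ['foz']) == 0    # ends up in leaf node 1
  assert leaf_for('foz', ['foz']) == 1    # starts leaf node 2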

Example 3 layer

It is possible for enough leaf nodes to be created that we cannot fit all their references in a single node. In this case, we again split, creating another layer, and setting that as the root. This layer then references the intermediate layer, which references the final leaf nodes.

In all cases, the root node is a single page wide. The next layer can have 2-N nodes.

Current Info

Empirically, we've found that the number of references that can be stored on a page varies from about 60 to about 180, depending on how much we compress, and how similar the keys are. Internal nodes also achieve approximately the same compression, though they seem to be closer to 80-100 and not as variable. For most of this discussion, we will assume each page holds 100 entries, as that makes the math nice and clean.

So the idea is that if you have <100 keys, they will probably all fit on the root page. If you have 100 - 10,000 keys, we will have a 2-layer structure; if you have 10,000 - 1,000,000 keys, you will have a 3-layer structure. 10^6 - 10^8 keys will give a 4-layer structure, etc.
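
A quick sanity check of those ranges, assuming exactly 100 entries per page at every layer::

  def num_layers(num_keys, entries_per_page=100):
      layers = 1
      while entries_per_page ** layers < num_keys:
          layers += 1
      return layers

  assert num_layers(100) == 1        # fits entirely in the root page
  assert num_layers(10000) == 2
  assert num_layers(1000000) == 3
  assert num_layers(10 ** 8) == 4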

Data and Request

It is important to be aware of what sort of data requests will be made on these indexes, so that we know how to optimize them. This is still a work in progress, but generally we are searching through ancestry. The final information (in the leaf nodes) is stored in sorted order. Revision ids are generally of the form "prefix:committer@email-timestamp-randomtail". This means that revisions made by the same person around the same time will be clustered, but revisions made by different people at the same time will not be clustered. For files, the keys are (file-id, revision-id) tuples, and file-ids are generally of the form basename-timestamp-random-count (depending on the converter). This means that all revisions for a given file-id will be grouped together, and that files with similar names will be grouped together. However, files committed in the same revision will not be grouped together in the index.[1]_

[1] One interesting possibility would be to change file-ids from being 'basename-...' to being 'containing-dirname-filename-...', which would group files in similarly named directories together.
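
To make the clustering described above concrete, here is a tiny hypothetical example; the ids are invented, but follow the shapes described above::

  # Keys for a file index, sorted as they would appear in the leaf nodes.
  keys = sorted([
      ('readme-20080101-abcd-1', 'alice@example.com-20080601-r1'),
      ('readme-20080101-abcd-1', 'bob@example.com-20080715-r9'),
      ('readme-20080101-abcd-1', 'carol@example.com-20080920-r14'),
      ('setup.py-20080102-efgh-2', 'bob@example.com-20080715-r9'),
  ])
  # All three revisions of 'readme-...' sort next to each other, while
  # the two entries committed in revision 'bob@...-r9' are separated by
  # another readme entry.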

In general, we always start with a request for the root node of the index, as it tells us the final structure of the rest of the index: how many pages there are in total, which pages are internal nodes (and at what layer), and which ones are leaves. Even before this point, we do know the size of the index, because that is stored in the pack-names file.

Thoughts on expansion

This section collects things to consider when expanding a request.

Algorithm

This is the basic outline of the algorithm.

  1. If we don't know the size of the index, don't expand, as we don't know what is available. (This only really applies to the pack-names file, which is unlikely to ever become larger than 1 page anyway.)

  2. If a request is already wide enough to be greater than the number of recommended pages, don't bother trying to expand. This only really happens with LocalTransport, which recommends a single page.

  3. Determine what pages have already been read (if any). If the pages left to read can fit in a single request, just request them. This tends to happen on medium-sized indexes (ones with low hundreds of revisions), and near the end when we've already read most of the index.

  4. If we haven't read the root node yet, and we can't fit the whole index into a single request, only read the root node. We don't know where the layer boundaries are anyway.

  5. If we haven't read "tree depth" pages yet, and are only requesting a single new page, don't expand. This is meant to handle the 'lookup 1 item in the index' case. In a large pack file, you'll read only a single page at each layer and then be done. When spidering out in a search, this will cause us to take a little bit longer to start expanding, but once we've started we'll be expanding at full velocity. This could be improved by having indexes inform each other that they have already entered the 'search' phase, or by having a hint from above to indicate the same.

    However, remember the 'bi-modal' distribution. Most indexes will either be very small or very large, so either we'll read the whole thing quickly, or we'll end up spending a lot of time in the index. This makes a small number of extra round trips to large indexes a small overhead. For 2-layer indexes, this only 'wastes' one round trip.

  6. Now we are ready to expand the requests. Expand by looking for more pages next to the ones requested that fit within the current layer. If you run into a cached page or a layer boundary, search further only in the opposite direction. This gives us proper locality of reference, and also helps because when a search goes in a single direction, we will continue to prefetch pages in that direction. (A rough sketch of these steps in code follows this list.)
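
Putting the steps together, a condensed sketch of the expansion logic; the parameters and the 'layer_of' helper are illustrative, not the actual implementation::

  def expand_request(requested, total_pages, recommended_pages,
                     cached, tree_depth, layer_of):
      # 1. Unknown index size: nothing safe to expand into.
      if total_pages is None:
          return sorted(requested)
      # 2. Already as wide as the transport recommends.
      if len(requested) >= recommended_pages:
          return sorted(requested)
      # 3. If everything still unread fits in one request, read it all.
      unread = [p for p in range(total_pages) if p not in cached]
      if len(unread) <= recommended_pages:
          return unread
      # 4. Read the root alone first; we don't yet know layer boundaries.
      if 0 not in cached:
          return [0]
      # 5. Don't expand single-page requests until we have read roughly
      #    one page per layer (the 'lookup 1 item' case).
      if len(cached) < tree_depth and len(requested) == 1:
          return sorted(requested)
      # 6. Expand sideways within the same layer, stopping at cached
      #    pages and layer boundaries.
      expanded = set(requested)
      for page in requested:
          layer = layer_of(page)
          for step in (1, -1):
              neighbour = page + step
              while (len(expanded) < recommended_pages
                     and 0 <= neighbour < total_pages
                     and neighbour not in cached
                     and layer_of(neighbour) == layer):
                  expanded.add(neighbour)
                  neighbour += step
      return sorted(expanded)

  # e.g. a 3-layer index: page 0 is the root, 1-4 internal, the rest leaves.
  layer = lambda p: 0 if p == 0 else (1 if p <= 4 else 2)
  print(expand_request({37}, 405, 10, {0, 2, 120}, 3, layer))
  # -> [37, 38, 39, 40, 41, 42, 43, 44, 45, 46]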