Introducing Indexer to SP

Indexer Motivation:

Filecoin has a lot of content stored through storage providers (SPs). It would be great to make that content useful when people are looking for it, i.e. to look up a CID and learn where to retrieve it. This requires both content discoverability and content transferability.

Content Discovery:

IPFS uses a DHT for lookups: the index of who has what data is spread across about 10K nodes.

We can’t put all CIDs from Filecoin deals into the DHT (it would be overwhelmed), so the plan is to spin off index nodes that store high-volume indices. This was recently merged into lotus-master, with support in the market process for publishing to indexers. We are test-running this process with a subset of storage providers right now, and will be expanding to more SPs.

IPFS currently uses the DHT for lookups, spreading lookup information between participating nodes.
Filecoin has a few orders of magnitude more content than IPFS. If we put all that routing information in the current DHT, we’d risk overloading current IPFS DHT participants.
The network indexers are an alternative that does a similar job: a few entities will run big machines that hold the whole content routing table, and offer that service in a federated way.
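
To make the idea of a content routing table concrete, here is a minimal sketch in Python. It is not the indexer's actual data structure or API: the class name, method names, and the string identifiers are all hypothetical, and a real indexer works at a far larger scale. It only shows the basic mapping from a CID to the providers that hold it.

```python
from collections import defaultdict


class ContentRoutingTable:
    """Toy stand-in for an indexer's routing table (hypothetical names)."""

    def __init__(self):
        # Maps each CID to the set of provider IDs that advertise it.
        self._providers_by_cid = defaultdict(set)

    def put(self, provider_id, cids):
        """Record that provider_id can serve every CID in cids."""
        for cid in cids:
            self._providers_by_cid[cid].add(provider_id)

    def find_providers(self, cid):
        """Return the providers known for cid, sorted for stable output."""
        return sorted(self._providers_by_cid.get(cid, ()))
```

A client would call find_providers with the CID it wants and then contact one of the returned providers for retrieval.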

There are two things visible to SPs for indexing:

The indexer library is integrated with the storage provider's market node. It uses the same market libp2p host (with a new libp2p subscription added) that SPs already expose for storage and retrieval deals, and the same graphsync libp2p handler that is already being connected to. What is new is a type of voucher. (Vouchers are how the market tracks payments; they are already used to exchange payments arising from market deals and for free retrievals of verified data.) The new voucher requests that indexes be provided: an indexer node connects to the SP's market node to request indexes. An index is a list of CIDs, stored as a linked chain of deals that expands as deals come in. The communication between the SP and the indexer uses the same port and the same structure. The indexer updates over graphsync: once a day it connects and asks for the delta of new CIDs that have become available on the node since the previous day. When the same CIDs are stored with multiple providers, the indexer only needs to fetch them from one of them. This deduplication saves bandwidth, so indexing takes a relatively small amount of bandwidth compared to the actual data (SPs can expect minimal bandwidth impact), with at most one or two additional short-lived connections for indexing. For the providers we are seeing, backlog ingestion of historic data takes ~15-20 min over the network, and after that each delta takes only a very short time.
There will be a gossipsub message when you make deals. It goes from the market process to the lotus daemon, then out over the same lotus gossipsub network that is already used for blocks. The message says that a new index is available. It is sent when unsealed data is put into storage and becomes available to the retrieval market, and it lets indexers on the network notice that there is new content (when a new deal is available) and pull it with lower latency.
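
The linked chain and the delta sync described above can be sketched roughly as follows. This is an illustration only, written in plain Python with no graphsync or gossipsub involved. The names IndexEntry, ProviderChain, publish_deal, and sync_delta are hypothetical, and the assumption that each chain entry links to the previous one by ID is ours, not a statement about the real wire format.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional, Tuple


@dataclass(frozen=True)
class IndexEntry:
    """One link in the chain: the CIDs from one deal plus a back-link."""
    entry_id: str
    cids: Tuple[str, ...]
    previous: Optional[str]  # ID of the older entry, None for the first one


class ProviderChain:
    """Hypothetical SP-side chain that grows as deals come in."""

    def __init__(self):
        self.entries: Dict[str, IndexEntry] = {}
        self.head: Optional[str] = None

    def publish_deal(self, entry_id: str, cids: List[str]) -> str:
        # The new entry links back to the current head, then becomes the head.
        self.entries[entry_id] = IndexEntry(entry_id, tuple(cids), self.head)
        self.head = entry_id
        # The returned head ID is what a "new index available" message
        # would announce in this sketch.
        return entry_id


def sync_delta(chain: ProviderChain, last_seen: Optional[str]) -> List[IndexEntry]:
    """Walk back from the head until last_seen; return new entries oldest first."""
    new_entries = []
    cursor = chain.head
    while cursor is not None and cursor != last_seen:
        entry = chain.entries[cursor]
        new_entries.append(entry)
        cursor = entry.previous
    new_entries.reverse()
    return new_entries
```

Calling sync_delta with last_seen set to None corresponds to the one-time backlog ingestion. Later calls pass the last head the indexer saw, so only the entries added since then are transferred.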

Content Transfer:

There will be a bridge to the current IPFS DHT, since SP nodes don't speak bitswap, which is what IPFS clients speak. (There might later be an optional extension that allows providing the index over bitswap; IPFS nodes will not start contacting SP nodes until that is turned on.)

The retrieval gateways are coming in a month or two. They will watch for bitswap requests from IPFS clients and translate them into graphsync requests, in order to retrieve content and provide it to IPFS clients and the gateway (ipfs.io). Note: we have not yet figured out the payment policy details for paid retrieval.
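
As a rough picture of the translation step, the sketch below handles a single bitswap-style request for a CID by asking a lookup function for providers and then fetching from one of them through a graphsync-style call. Nothing here implements bitswap or graphsync. Both callables, their signatures, and the function name handle_bitswap_want are hypothetical stand-ins, and the assumption that the gateway consults an indexer to find providers is ours. Payment is left out entirely.

```python
from typing import Callable, List, Optional

# Hypothetical stand-ins: look up providers for a CID, and fetch a CID
# from one provider (returning None if that provider cannot serve it).
FindProviders = Callable[[str], List[str]]
GraphsyncFetch = Callable[[str, str], Optional[bytes]]


def handle_bitswap_want(
    cid: str,
    find_providers: FindProviders,
    graphsync_fetch: GraphsyncFetch,
) -> Optional[bytes]:
    """Serve one request for cid by trying each known provider in turn."""
    for provider in find_providers(cid):
        data = graphsync_fetch(provider, cid)
        if data is not None:
            return data  # First successful retrieval wins.
    return None  # No provider known, or none could serve the content.
```

Passing the lookup and fetch steps in as functions keeps the sketch independent of any particular network stack, which also makes it easy to exercise with fakes.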

Overall, what is expected from SPs is pretty minimal. With the Lotus update, SPs will be opted in by default to the indexer network to provide indexes (with the option to opt out).

Indexer Demo:

https://drive.google.com/drive/folders/1a69KcgrG1CVOSjYJOvcrJJd7RuCmWt8z

(17:44 - 23:08)