Purpose
Make sure we’re aware of all operational readiness items we need to complete before passive testing and activation. This helps drive https://github.com/filecoin-project/go-f3/issues/722 to close.
2025-03-24
- Review the recovery steps in F3 Failure Scenarios and Disaster Recovery
- Do they seem right?
- Which ones do we need runbooks for?
- Desired output: list of runbooks we need to create
- Notes
- (Assuming a post activation world)
- Producing snapshots
- What data is missing from snapshots today?
1.
2. If f3 freezes
1. People have a habit of wiping Lotus nodes
2. We’ll need to provide a way for SPs to get this data (power tables)
1. This is a new tool we’d need to create
3. We think it would take a few hours to create the tool
4. The impact if we don’t have the tool in advance
1. F3 is already frozen, and this is just pushing out the time to resolution
5. Cascade of things that would need to fail
1. Ohshitstore: activates when 900 epochs behind and keeps forever until we recover
2. Users need to wipe their nodes
- Summary: edge case and not going to hit a cliff. We’ll just be reactive here.
- Disabling/pausing F3
- Use environment variables to disable F3 (e.g., LOTUS_DISABLE_F3_ACTIVATION=ContractAddress)
- TODO: create this small runbook
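One thing the runbook could include is a quick preflight check that the override is actually present in the environment the node will start in. The sketch below only reads the variable named in the notes and reports its value; it does not assume anything about the accepted value format or about how Lotus interprets it. `describeF3Override` is a hypothetical helper, not part of Lotus.

```go
package main

import (
	"fmt"
	"os"
)

// disableVar is the environment variable named in the notes.
const disableVar = "LOTUS_DISABLE_F3_ACTIVATION"

// describeF3Override reports whether the override is visible through the
// given lookup function. Taking the lookup as a parameter keeps the helper
// testable without touching the real process environment.
func describeF3Override(lookup func(string) (string, bool)) string {
	v, ok := lookup(disableVar)
	if !ok || v == "" {
		return disableVar + " is not set"
	}
	return disableVar + " is set to " + v
}

func main() {
	// Run this in the same shell or service environment as the node.
	fmt.Println(describeF3Override(os.LookupEnv))
}
```

Running it under the same service unit or shell as the Lotus daemon helps catch the case where the variable was exported in one session but the node was started from another.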
- Adjusting parameters
- Depends on what is the root cause / what happened
- Generally these are going to need to be a coordinated effort to upgrade together
- Topic of how to “emergency upgrade” (very slim scope)
- Snapshot discussion
- Use cases
- Distribute cert chain
- Where you need to have some state in order to progress
- With headlookback = 4, the possibility of “By the time finality processing ends a heavier chain exists that is selected by non-participating SPs” is low
- Today
- download snapshot
- sync cert chain
- Future (https://github.com/filecoin-project/go-f3/issues/480)
2025-03-20
- Review F3 Launch Planning Notes from 2024 and see if there’s anything we need to carry over and do for this latest round.
- Example: Make a new node release with the power table that the first F3 instance is based on (TODO: we need operator docs on how we expect people to update their nodes and/or configure their nodes with an env var)
- 🎬 Create a v1.32.1 release placeholder issue that will have initial power table CID and the activation parameters
- ✅ https://github.com/filecoin-project/lotus/issues/12970
- Discussion on power table metric
- Total power in bytes as seen by F3 for the instance currently being processed
- Purpose: for detecting power swings
- 🎬 Create task to add this metric
- ✅ https://github.com/filecoin-project/go-f3/issues/925
- Discussion on how we can do cleanup after the activation (but this doesn’t have to get rushed out right after activation. It should be done right away but can go out with the next normal Lotus release)
- 🎬 Create cleanup issue to start accumulating the things to delete! w00t w00t!
- ✅ https://github.com/filecoin-project/lotus/issues/12971
- Ensure we’re good with how we’ll start an incident
- Start an incident
- Who else do we need to communicate this to?
- What are the potential things that could go wrong?
- Starter list: ‣
- For each item
- How will we know?
- What would we do?
- F3 Failure Scenarios and Disaster Recovery
- Discussion around CPU usage
- Idea: run a DDoS against our devnet
- Knobs that we have at our disposal?
- None
- 🎬 Need to update based on openings from chain exchange
- ✅ Tracking item: https://github.com/orgs/filecoin-project/projects/114/views/2?pane=issue&itemId=102954830
- Discussion of next steps
- Metrics
- Unit tests
- L/R side hashing
- DDoS testing our infra network (this informs grouping)
- Grouping (planning to punt this - too complex of a change too close to the release)
- On the attack
- We have opened ourselves to Sybil
- To handle messages that anyone can send
- Reduce potency by reducing CPU consumption
- Kuba’s work (merkle tree optimization)
- We did 2x reduction
- We think we can get another 2x squeeze
- Also, reduce the amount you need to hash (grouping)
- Goal here is to not make spam attack vector worse
- Not critical right now but we want to have it ready to release at the minimum
- Continuing with release train items
- We’ll get data by having the adversary