What happened?

On January 13th at 22:10 UTC, at block 18906878, the Neutron chain unexpectedly halted due to a non-deterministic query in Neutron's built-in oracle, Slinky. The Neutron core team immediately received alerts from their monitoring system, began investigating the underlying issue, and asked validators to pause their nodes.

The chain halted on block 18906878, and validators could not agree on the next block (18906879) due to the LastResultsHash error:

ERR prevote step: consensus deems this block invalid; prevoting nil err="wrong Block.Header.LastResultsHash.  Expected 3B3107D3C0279673500EABA54F63B2946808267CFE4623AE4AB5157075906917, got A775933550BBED5C79A0D5B5671E858285DFF617F539CDE878DA4A22A23E6A6B" height=18906879 module=consensus round=73

<aside> 💡

What is LastResultsHash? The LastResultsHash of a block is a hash of the deterministic results of the transactions in the previous block.

</aside>

That means validators can’t reach quorum on block 18906879 because validators get a different LastResultsHash when validating results from the previous block (18906878).
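To illustrate why a small difference in gas usage is enough to break consensus, here is a minimal sketch in Go. It is a deliberate simplification: the `TxResult` type, the `ResultsHash` function, and the gas values are hypothetical, and the real consensus engine builds the hash differently. The only point is that any difference in a hashed field produces a completely different hash.

```go
package main

import (
	"crypto/sha256"
	"encoding/binary"
	"encoding/hex"
	"fmt"
)

// TxResult is a hypothetical, simplified stand-in for a transaction result.
type TxResult struct {
	Code    uint32
	GasUsed int64
}

// ResultsHash hashes the results in order. This is an illustrative
// simplification, not the construction used by the real consensus engine.
func ResultsHash(results []TxResult) string {
	h := sha256.New()
	buf := make([]byte, 12)
	for _, r := range results {
		binary.BigEndian.PutUint32(buf[0:4], r.Code)
		binary.BigEndian.PutUint64(buf[4:12], uint64(r.GasUsed))
		h.Write(buf)
	}
	return hex.EncodeToString(h.Sum(nil))
}

func main() {
	// Made-up gas values: the same transaction, executed on two nodes,
	// differing by only a few units of gas.
	nodeA := []TxResult{{Code: 0, GasUsed: 250000}}
	nodeB := []TxResult{{Code: 0, GasUsed: 250017}}

	fmt.Println("node A:", ResultsHash(nodeA))
	fmt.Println("node B:", ResultsHash(nodeB))
	fmt.Println("hashes match:", ResultsHash(nodeA) == ResultsHash(nodeB))
}
```

Because each validator compares the hash in the proposed block header against the one it computed locally, validators whose results differ will reject each other's proposals.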

By comparing the block results of block 18906878 (neutrond q block-results 18906878) across different nodes, the team discovered that a single transaction in the block consumed a different amount of gas on different nodes (the gas_used field in the command's response varied from node to node). The most suspicious detail was that gas_used differed by only 3-30 units of gas.

This indicated non-determinism in the execution flow of the transaction. The transaction was a simple contract instantiation, which should not normally cause any problems.

Fortunately, the core team was able to quickly identify the owner/deployer of the contract, which made the debugging process easier. The team started to look at the contract’s code and research the contract’s calls via https://tracing.cosmwasm.com/.

The tracing node and the source code showed that during the instantiation, the contract makes three gRPC queries to the Slinky Oracle:

The team started looking at their implementations, starting with the MarketMap query, which turned out to be the cause of the halt. The GetAllMarkets call uses Go's built-in map type to store markets and then returns that map as the result of the query (meaning it is passed to the contract, since the contract called the MarketMap query).

It’s unsafe to rely on Go’s native map type here because the iteration order of keys and values in a map is not specified and is not guaranteed to be the same from one call to the next (let alone on different machines).
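The behavior described above can be observed with a small Go program. This is only a sketch: the market names are hypothetical and `DistinctOrders` is a helper introduced for illustration. It ranges over the same map repeatedly and counts how many different key orders it sees.

```go
package main

import (
	"fmt"
	"strings"
)

// DistinctOrders ranges over m n times and returns how many distinct
// key orders were observed.
func DistinctOrders(m map[string]int, n int) int {
	seen := make(map[string]struct{})
	for i := 0; i < n; i++ {
		var keys []string
		for k := range m {
			keys = append(keys, k)
		}
		seen[strings.Join(keys, ",")] = struct{}{}
	}
	return len(seen)
}

func main() {
	// Hypothetical market identifiers, used for illustration only.
	markets := map[string]int{"ATOM/USD": 1, "BTC/USD": 2, "ETH/USD": 3, "NTRN/USD": 4}

	// The Go specification leaves map iteration order unspecified, so the
	// count printed here is typically greater than one and can vary per run.
	fmt.Println("distinct iteration orders:", DistinctOrders(markets, 100))
}
```

Any code path that turns such a map into bytes by ranging over it inherits this variability.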

That’s exactly what led to different gas consumption and the halt:

  1. The contract calls the /slinky.marketmap.v1.Query/MarketMap query.
  2. The markets are read from storage into a map.
  3. The map is returned from the query and encoded into a protobuf response inside Cosmos-SDK (since the query was made via gRPC). Keep in mind that the order of keys and values in a map is not deterministic, so the resulting protobuf bytes are not the same on different validators.
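One common way to avoid the problem in the steps above is to sort the keys before serializing. The sketch below shows that general technique only; it is not necessarily the fix the Slinky or Neutron teams shipped. The `Market` type, the `EncodeSorted` function, and the text encoding are hypothetical stand-ins for the real protobuf types.

```go
package main

import (
	"bytes"
	"fmt"
	"sort"
)

// Market is a hypothetical, simplified market record.
type Market struct {
	Decimals uint8
}

// EncodeSorted serializes the markets in ascending key order, so every
// node produces identical bytes regardless of map iteration order.
// The "key:decimals;" text format is illustrative, not protobuf.
func EncodeSorted(markets map[string]Market) []byte {
	keys := make([]string, 0, len(markets))
	for k := range markets {
		keys = append(keys, k)
	}
	sort.Strings(keys)

	var buf bytes.Buffer
	for _, k := range keys {
		fmt.Fprintf(&buf, "%s:%d;", k, markets[k].Decimals)
	}
	return buf.Bytes()
}

func main() {
	// Two "validators" hold the same markets, inserted in different orders.
	a := map[string]Market{}
	a["BTC/USD"] = Market{Decimals: 8}
	a["ATOM/USD"] = Market{Decimals: 6}

	b := map[string]Market{}
	b["ATOM/USD"] = Market{Decimals: 6}
	b["BTC/USD"] = Market{Decimals: 8}

	fmt.Println(string(EncodeSorted(a)))
	fmt.Println("identical bytes:", bytes.Equal(EncodeSorted(a), EncodeSorted(b)))
}
```

Returning an ordered slice instead of a map from the query would achieve the same goal, since a slice has a fixed element order.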