# Decentralized Storage

In the InterLink ID system, decentralized storage securely manages the encrypted biometric embeddings on which identity verification depends. As depicted in \<Figure>, intermediate nodes form a distributed layer between clients and the aggregator, enhancing security, privacy, and operational continuity. These nodes store encrypted biometric embeddings (feature vectors extracted from biometric data such as facial images or fingerprint scans) used in the identity verification process, potentially augmented by federated learning to refine verification models. To ensure resilience against failures, attacks, or data corruption, InterLink ID implements a backup mechanism comprising redundant storage, real-time monitoring, automatic failover, data integrity checks, and node recovery protocols. This section describes each component and gives the mathematical formulation where one applies.

### Redundant Storage

To guarantee high availability and fault tolerance, each encrypted biometric embedding is replicated across multiple intermediate nodes. The system adopts a replication factor of $$k = 3$$, meaning three copies of each embedding are stored on distinct nodes. This configuration ensures that the system can tolerate up to two node failures without losing access to any embedding, providing robust resilience for identity verification services.

The distribution of replicas is managed using a consistent hashing algorithm, which spreads load evenly and minimizes data reassignment when nodes are added or removed. Formally, let $$\mathcal{N} = \{n_1, n_2, \dots, n_m\}$$ denote the set of intermediate nodes, and $$\mathcal{E} = \{e_1, e_2, \dots, e_p\}$$ the set of encrypted embeddings. A hash function $$h: \mathcal{E} \cup \mathcal{N} \rightarrow [0, 1)$$ maps both embedding identifiers and nodes onto a unit circle (the hash ring). For an embedding $$e$$, its replicas are stored on the $$k$$ distinct nodes $$n_{i_1}, n_{i_2}, \dots, n_{i_k}$$ whose hash values $$h(n_{i_j})$$ are the first $$k$$ encountered when moving clockwise from $$h(e)$$. This approach supports efficient data retrieval and load balancing across the network.
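The placement rule above can be illustrated with a minimal Python sketch. It is not the InterLink ID implementation: the `HashRing` class, the `ring_position` helper, the node names, and the use of a truncated SHA-256 digest as $$h$$ are all assumptions made for illustration, and each node gets a single ring position (no virtual nodes).

```python
import bisect
import hashlib


def ring_position(key: str) -> float:
    """Map a key to [0, 1) using the first 6 bytes of its SHA-256 digest."""
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    # 48 bits fit exactly in a float, so the result is always < 1.0.
    return int.from_bytes(digest[:6], "big") / 2**48


class HashRing:
    """Hypothetical hash ring with one position per node."""

    def __init__(self, nodes, k=3):
        self.k = k
        # The sorted (position, node) pairs form the ring.
        self.ring = sorted((ring_position(n), n) for n in nodes)

    def replicas(self, embedding_id: str):
        """Return the first k distinct nodes clockwise from h(embedding_id)."""
        if len(self.ring) < self.k:
            raise ValueError("fewer nodes than the replication factor")
        positions = [pos for pos, _ in self.ring]
        start = bisect.bisect_left(positions, ring_position(embedding_id))
        # Modular indexing wraps around the end of the ring.
        return [self.ring[(start + i) % len(self.ring)][1] for i in range(self.k)]


nodes = [f"node-{i}" for i in range(6)]
ring = HashRing(nodes, k=3)
print(ring.replicas("embedding-42"))
```

Removing a node that does not hold a given embedding leaves that embedding's replica set unchanged, which is the property that keeps reassignment small when membership changes.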

The reliability of this replication strategy can be quantified by considering node failure probabilities. If each node has an independent failure probability of $$p = 0.01$$, the probability of losing all three replicas of an embedding is $$p^k = 0.01^3 = 10^{-6}$$, indicating a highly reliable storage system. Alternatively, the system could employ erasure coding, such as a (5, 3) Reed-Solomon code, where data is split into three fragments and encoded into five, allowing reconstruction from any three fragments. This reduces storage overhead to $$\frac{5}{3} \approx 1.67$$ compared to replication’s overhead of 3, while still tolerating two failures. However, replication is preferred for its simplicity and faster data access, critical for real-time identity verification.
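The loss probabilities for both strategies can be checked numerically. The sketch below assumes, as the text does, that node failures are independent with probability $$p$$; the function names are illustrative. For the (5, 3) code, data is lost when more than two of the five fragments fail.

```python
from math import comb


def replication_loss(p: float, k: int) -> float:
    """Probability that all k replicas fail."""
    return p ** k


def erasure_loss(p: float, n: int, m: int) -> float:
    """Probability that more than n - m of n fragments fail,
    leaving fewer than the m fragments needed for reconstruction."""
    return sum(
        comb(n, f) * p**f * (1 - p) ** (n - f)
        for f in range(n - m + 1, n + 1)
    )


p = 0.01
print(f"replication (k=3):   {replication_loss(p, 3):.3e}")
print(f"erasure coding (5,3): {erasure_loss(p, 5, 3):.3e}")
```

Under this model the (5, 3) code loses data with probability of roughly $$9.85 \times 10^{-6}$$, about ten times the figure for triple replication, because there are $$\binom{5}{3} = 10$$ ways for three of the five fragments to fail. Both schemes tolerate two failures, but they are not equally durable at the same $$p$$.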

### Real-time Monitoring

Continuous monitoring of intermediate nodes is essential to detect failures or security threats promptly, ensuring the system’s reliability. A dedicated monitoring subsystem collects metrics including node uptime, response times, resource utilization (CPU, memory, disk), and network performance (latency, bandwidth). Security logs are analyzed for signs of unauthorized access, unusual access patterns, or other anomalies indicative of potential attacks.

Anomaly detection employs a combination of heuristic rules and machine learning algorithms. For example, a rule-based system might trigger an alert if a node’s response time exceeds a threshold (e.g., 500 ms) or if error rates increase beyond 5%. Machine learning models, trained on historical node performance data, can identify subtle deviations from normal behavior, such as unexpected spikes in CPU usage or irregular data access patterns. These alerts are evaluated to determine whether they indicate a genuine failure or security threat, prompting responses ranging from notifications to administrators to initiating failover procedures.
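As a rough illustration of the rule-based layer, the following sketch applies the two example thresholds from the text. The `NodeMetrics` fields and function names are hypothetical, and the z-score check is only a simple statistical stand-in for the machine learning models mentioned above, not a description of them.

```python
from dataclasses import dataclass
from statistics import mean, pstdev

RESPONSE_TIME_LIMIT_MS = 500.0  # example threshold from the text
ERROR_RATE_LIMIT = 0.05         # 5% example threshold from the text


@dataclass
class NodeMetrics:
    node_id: str
    response_time_ms: float
    error_rate: float  # fraction of failed requests in the sampling window


def rule_alerts(m: NodeMetrics) -> list[str]:
    """Return one alert string per violated rule."""
    alerts = []
    if m.response_time_ms > RESPONSE_TIME_LIMIT_MS:
        alerts.append(f"{m.node_id}: response time {m.response_time_ms:.0f} ms")
    if m.error_rate > ERROR_RATE_LIMIT:
        alerts.append(f"{m.node_id}: error rate {m.error_rate:.1%}")
    return alerts


def is_outlier(history: list[float], value: float, z_limit: float = 3.0) -> bool:
    """Flag a value more than z_limit standard deviations from the historical mean."""
    mu, sigma = mean(history), pstdev(history)
    if sigma == 0:
        return value != mu
    return abs(value - mu) / sigma > z_limit


print(rule_alerts(NodeMetrics("node-3", response_time_ms=730.0, error_rate=0.02)))
print(is_outlier([20, 22, 21, 19, 20], 95))  # CPU % history vs. a sudden spike
```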

### Automatic Failover

To maintain service continuity in the face of node failures or security breaches, InterLink ID implements an automatic failover mechanism. When a node is detected as unresponsive—via monitoring metrics like heartbeat signals or timeout thresholds—the system updates the network’s routing information to exclude the affected node. Requests for embeddings stored on the failed node are redirected to nodes holding the replicas, leveraging the consistent hashing scheme to locate alternative sources.
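The redirection step can be sketched as follows. This is an assumption-laden illustration rather than the actual routing layer: the `Router` class, the `is_unresponsive` helper, and the 5-second timeout are hypothetical, and the replica lists are taken as given (for example, produced by the consistent hashing scheme).

```python
def is_unresponsive(last_heartbeat: float, now: float, timeout_s: float) -> bool:
    """A node is treated as unresponsive once its last heartbeat is older than the timeout."""
    return now - last_heartbeat > timeout_s


class Router:
    """Hypothetical router that skips excluded nodes when serving an embedding."""

    def __init__(self, replica_map: dict[str, list[str]]):
        self.replica_map = replica_map  # embedding id -> ordered replica nodes
        self.excluded: set[str] = set()

    def mark_unresponsive(self, node: str) -> None:
        self.excluded.add(node)

    def restore(self, node: str) -> None:
        self.excluded.discard(node)

    def route(self, embedding_id: str) -> str:
        """Return the first replica holder that is not excluded."""
        for node in self.replica_map[embedding_id]:
            if node not in self.excluded:
                return node
        raise LookupError(f"no healthy replica for {embedding_id}")


router = Router({"emb-1": ["node-a", "node-b", "node-c"]})
if is_unresponsive(last_heartbeat=100.0, now=106.0, timeout_s=5.0):
    router.mark_unresponsive("node-a")
print(router.route("emb-1"))  # served by the next replica holder
```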

In the case of a security breach, such as a node exhibiting signs of compromise (e.g., unauthorized access attempts), the system isolates the node by revoking its network access. Embeddings stored on the compromised node are marked as potentially untrustworthy, and the system relies on replicas from verified nodes. To prevent data corruption, each embedding is associated with a version number or timestamp, ensuring that only the most recent and verified versions are used. This failover process is designed to be seamless, with minimal disruption to the identity verification service, typically achieving recovery within seconds.
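One way to picture the version rule is the sketch below, which picks the newest copy among replicas held by nodes that have not been isolated. The `ReplicaCopy` type, the integer version field, and the `quarantined` set are illustrative assumptions; the text leaves open whether versions are numbers or timestamps.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ReplicaCopy:
    node_id: str
    version: int       # monotonically increasing version number (assumed)
    ciphertext: bytes


def select_trusted_copy(copies: list[ReplicaCopy], quarantined: set[str]) -> ReplicaCopy:
    """Ignore copies from quarantined nodes, then take the highest version."""
    trusted = [c for c in copies if c.node_id not in quarantined]
    if not trusted:
        raise LookupError("no copy available from a verified node")
    return max(trusted, key=lambda c: c.version)


copies = [
    ReplicaCopy("node-a", 7, b"suspect"),  # higher version, but node is isolated
    ReplicaCopy("node-b", 6, b"good"),
    ReplicaCopy("node-c", 6, b"good"),
]
print(select_trusted_copy(copies, quarantined={"node-a"}).node_id)
```

Filtering before comparing versions matters here: a compromised node could otherwise advertise an inflated version number and win the selection.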

### Data Integrity Checks

Ensuring the authenticity and consistency of stored embeddings is critical for secure identity verification. Each encrypted embedding $$e$$ is accompanied by a cryptographic hash, computed using SHA-256: $$h = \text{SHA-256}(e)$$. This hash is stored alongside the embedding or in a separate integrity database. When an embedding is accessed, or during periodic audits, the hash of the retrieved copy $$e'$$ is recomputed as $$h' = \text{SHA-256}(e')$$ and compared to the stored $$h$$. A mismatch indicates potential tampering or corruption, prompting the system to discard that copy and retrieve a valid one from a replica node.
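The check-and-fall-back flow can be sketched with Python's standard `hashlib` module. The helper names and the in-memory list of `(node, ciphertext)` pairs are assumptions for illustration; how the stored digest is itself protected (for example, in the separate integrity database) is outside this sketch.

```python
import hashlib
import hmac


def digest(ciphertext: bytes) -> str:
    """SHA-256 of the encrypted embedding, as a hex string."""
    return hashlib.sha256(ciphertext).hexdigest()


def verify(ciphertext: bytes, stored_digest: str) -> bool:
    """Recompute the hash and compare it to the stored value."""
    return hmac.compare_digest(digest(ciphertext), stored_digest)


def read_verified(replicas: list[tuple[str, bytes]], stored_digest: str) -> tuple[str, bytes]:
    """Return the first replica whose recomputed hash matches the stored one."""
    for node_id, blob in replicas:
        if verify(blob, stored_digest):
            return node_id, blob
    raise ValueError("no replica passed the integrity check")


original = b"\x8f\x02encrypted-embedding-bytes"
stored = digest(original)
replicas = [("node-a", original[:-1] + b"X"), ("node-b", original)]
print(read_verified(replicas, stored)[0])  # the corrupted copy on node-a is skipped
```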

This hash-based verification leverages the collision resistance of SHA-256: the probability that a given pair of distinct embeddings produces the same hash is negligible (approximately $$2^{-256}$$ if the output is modeled as uniformly random). Periodic integrity checks are scheduled to proactively identify issues, complementing on-access verification to ensure continuous data reliability. In cases where the encryption scheme itself provides integrity (e.g., using AES-GCM for authenticated encryption), the hash serves as an additional layer of assurance, particularly for detecting errors introduced by hardware failures or network issues.

### Node Recovery and Data Resynchronization

When an intermediate node recovers from a failure or a new node is added to the network, it must synchronize its data to hold the correct replicas of encrypted embeddings. The recovering node queries the distributed hash table (DHT) to determine the range of hash values it is responsible for, based on the consistent hashing scheme. It then requests the corresponding embeddings from other nodes holding replicas of those embeddings.

To optimize resynchronization and reduce bandwidth usage, the node employs Merkle trees to efficiently identify missing or outdated data. A Merkle tree organizes the hashes of stored embeddings into a hierarchical structure, allowing the node to compare its data with that of its peers and download only the differing portions. If the root hashes match, the two datasets are identical and nothing is transferred. For an embedding set of size $$n$$, locating a single differing embedding takes $$O(\log n)$$ hash comparisons, rather than the $$O(n)$$ needed to compare every embedding individually, which significantly improves efficiency for large datasets.
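A minimal sketch of the comparison is shown below. It assumes both nodes hold the same number of embeddings in the same order, which a real system would have to arrange (for example, by sorting on embedding identifier); the function names and the odd-node duplication rule are illustrative choices, not details taken from InterLink ID.

```python
import hashlib


def _h(data: bytes) -> bytes:
    return hashlib.sha256(data).digest()


def build_levels(leaves: list[bytes]) -> list[list[bytes]]:
    """Return tree levels from leaf hashes (index 0) up to the root (last level)."""
    if not leaves:
        raise ValueError("at least one leaf is required")
    levels = [[_h(leaf) for leaf in leaves]]
    while len(levels[-1]) > 1:
        cur = levels[-1]
        if len(cur) % 2:  # duplicate the last hash when the level is odd
            cur = cur + [cur[-1]]
        levels.append([_h(cur[i] + cur[i + 1]) for i in range(0, len(cur), 2)])
    return levels


def diff_leaves(a: list[list[bytes]], b: list[list[bytes]]) -> list[int]:
    """Indices of differing leaves, descending only into subtrees whose hashes differ."""
    if len(a[0]) != len(b[0]):
        raise ValueError("trees must cover the same number of leaves")
    differing = []
    stack = [(len(a) - 1, 0)]  # start at the root
    while stack:
        level, idx = stack.pop()
        if a[level][idx] == b[level][idx]:
            continue  # identical subtree: skip it entirely
        if level == 0:
            differing.append(idx)
            continue
        for child in (2 * idx, 2 * idx + 1):
            if child < len(a[level - 1]):
                stack.append((level - 1, child))
    return sorted(differing)


local = [b"e0", b"e1", b"stale", b"e3", b"e4"]
peer = [b"e0", b"e1", b"e2", b"e3", b"e4"]
print(diff_leaves(build_levels(local), build_levels(peer)))  # only index 2 needs a download
```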

To maintain consistency during resynchronization, the system uses versioning or locking mechanisms to prevent concurrent modifications from interfering with the recovery process. Once synchronized, the node becomes fully operational, capable of serving requests and storing new embeddings. This dynamic resynchronization ensures the decentralized storage network remains scalable and adaptable, supporting the addition of new nodes or recovery from disruptions without compromising data availability.

### Integration with Federated Learning

As illustrated in \<Figure>, the intermediate nodes not only store encrypted biometric embeddings but also support the federated learning process, which may be used to refine identity verification models. In this context, clients access embeddings from the intermediate nodes to perform local computations, sending encrypted model updates to the aggregator. The backup mechanism ensures that these embeddings are consistently available, enabling uninterrupted learning cycles. The encrypted data layers shown in \<Figure> correspond to these intermediate nodes, emphasizing their role in maintaining data privacy and security during both storage and computation phases.

### Summary and Implications

The decentralized storage architecture of InterLink ID, fortified by its backup mechanism, ensures that encrypted biometric embeddings remain secure, accessible, and intact. By employing replication with consistent hashing, real-time monitoring with advanced anomaly detection, automatic failover with rapid recovery, rigorous data integrity checks, and efficient node resynchronization, the system achieves high reliability and resilience. These features are critical for maintaining the trustworthiness of the identity verification service, particularly under adversarial conditions or hardware failures.

The use of mathematical formulations, such as consistent hashing and cryptographic hashing, provides a rigorous foundation for the system’s design, balancing efficiency with fault tolerance. While replication is currently favored for its simplicity, future enhancements could explore erasure coding to optimize storage overhead, potentially reducing costs while maintaining reliability. This robust infrastructure supports InterLink ID’s mission to deliver a privacy-preserving, scalable, and dependable identity verification platform.

#### Table: Comparison of Redundancy Strategies

| **Property**         | **Replication (k=3)**           | **Erasure Coding (5,3)**              |
| -------------------- | ------------------------------- | ------------------------------------- |
| **Fault Tolerance**  | Up to 2 node failures           | Up to 2 node failures                 |
| **Storage Overhead** | 3x (three copies)               | 1.67x (5 fragments for 3 data pieces) |
| **Access Speed**     | Fast (direct replica access)    | Slower (requires reconstruction)      |
| **Complexity**       | Low (simple to implement)       | Higher (encoding/decoding required)   |
| **Use Case**         | Real-time identity verification | Storage-efficient archival            |

This table highlights the trade-offs between replication and erasure coding, with replication chosen for its speed and simplicity in InterLink ID.

