February 2024
Lecture Notes in Computer Science
Coded computing has proven efficient at tolerating stragglers in distributed computing. After computing, workers return their sub-computation results to the master, which recovers the final computation result by decoding. However, workers may return incorrect results, which leads to a wrong final result. It is therefore worthwhile to improve the resilience of coded computing against errors. Most existing verification schemes use only the workers’ fully correct computations to recover the final result and do not consider defective computations for decoding. In this paper, we focus on matrix multiplication and design a general Test-and-Decode (TD) scheme to recover the final result efficiently. We divide each sub-computation result into multiple parts and make full use of the correct parts for partial recovery, which improves tolerance to errors in computations. Decoding is performed only when the verification result permits it, which avoids repeated decoding. We conduct extensive simulation experiments to evaluate the TD scheme's probability of successful recovery and its computation time. We also compare the TD scheme with other verification schemes; the results show that it outperforms current schemes in the efficiency of verifying and recovering computational results.
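The idea of checking a worker's result part by part, so that correct parts remain usable, can be illustrated with a minimal sketch. The abstract does not state which verification test the TD scheme uses, so the sketch below assumes a Freivalds-style randomized check applied to row blocks of a claimed product C = AB. The function names (`matmul`, `matvec`, `verify_parts`), the row-block partition, and the parameters `num_parts` and `rounds` are all hypothetical choices made for illustration, not the paper's method.

```python
import random


def matmul(a, b):
    """Plain-Python product of two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def matvec(m, v):
    """Matrix-vector product."""
    return [sum(x * y for x, y in zip(row, v)) for row in m]


def verify_parts(a, b, c, num_parts, rounds=2, rng=None):
    """Check a claimed product c = a @ b one row block at a time.

    Assumed approach (not from the paper): a Freivalds-style test that
    compares A(Br) with Cr for a random vector r, which avoids recomputing
    the full product. Returns one boolean per row block of c.
    """
    rng = rng or random.Random()
    n_rows = len(a)
    # Split the rows of c into num_parts contiguous blocks.
    bounds = [(i * n_rows // num_parts, (i + 1) * n_rows // num_parts)
              for i in range(num_parts)]
    ok = [True] * num_parts
    for _ in range(rounds):
        r = [rng.randrange(1, 2**31) for _ in range(len(b[0]))]
        expected = matvec(a, matvec(b, r))  # A (B r)
        claimed = matvec(c, r)              # C r
        for p, (lo, hi) in enumerate(bounds):
            if expected[lo:hi] != claimed[lo:hi]:
                ok[p] = False
    return ok


# Usage: a worker returns a result with one wrong entry in the second block.
a = [[1, 2], [3, 4], [5, 6], [7, 8]]
b = [[1, 0, 2], [0, 1, 3]]
c = matmul(a, b)
c[2][1] += 1  # simulated error in row 2
flags = verify_parts(a, b, c, num_parts=2, rng=random.Random(0))
print(flags)
```

In this sketch the master would keep the blocks flagged as passing and treat only the failing block as missing, rather than discarding the worker's whole result. How the retained parts feed into decoding depends on the code construction, which the abstract does not describe.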