Over the past decade, single-cell datasets have grown in both size and complexity, enabling the construction of large-scale cell atlases. However, technical variability in data generation, known as batch effects, hinders meaningful comparisons across datasets. Although numerous batch-correction algorithms have been developed, they often overcorrect or undercorrect. Here we review commonly used data cleaning and integration methods. We envision that future frameworks will learn interpretable gene and cell representations and will model technical and biological variation in an informed way.
