Self-supervised learning has great potential for the remote sensing domain, where unlabelled observations are abundant, but labels are hard to obtain. This work leverages unlabelled multi-modal remote sensing data for augmentation-free contrastive self-supervised learning. Deep neural network models are trained to maximize the similarity of latent representations obtained with different sensing techniques from the same location, while distinguishing them from other locations. We showcase this idea with two self-supervised data fusion methods and compare against standard supervised and self-supervised learning approaches on a land-cover classification task. Our results show that contrastive data fusion is a powerful self-supervised technique to train image encoders that are capable of producing meaningful representations: Simple linear probing performs on par with fully supervised approaches and fine-tuning with as little as 10% of the labelled data results in higher accuracy than supervised training on the entire dataset.