While it's not publicly available yet, I strongly suspect that multimodal GPT-4 may actually be SOTA in image-to-text. The examples shown in the Sparks of AGI paper were extremely impressive imo, though of course those are cherry-picked, so it's unclear how well the model will perform on arbitrary images.