You would need the accuracy of these models to be way, way better than it is today (so you don't lie to a user who has no idea the output is a lie, because they can't see the input and so can't compare it with the model's output), and their hardware requirements to be way, way lower (so you don't need a gaming rig for useful computing in these groups). So no, I can't agree: this A.I. approach will not save you from needing correct semantic markup.
What evidence do you have that the accuracy right now is bad? UI recognition is generally much easier than photographic recognition. Additionally, per the original point re: boiling the ocean, it is more likely that a person has a CV-based screen reader than that every website they visit has ARIA tags.