Method and system for aligning natural and synthetic video to speech synthesis