Quantifiers in a Multimodal World: Hallucinating Vision with Language and Sound