March 2025
With the rise of smartphone use in web surveys, voice (oral) answers have become a promising method for collecting rich data. Voice answers not only facilitate broader and more detailed narratives but also provide additional metadata, such as voice amplitude and pitch, that can be used to assess respondent engagement. Despite these advantages, challenges persist, including high item non-response rates, mixed respondent preferences for voice input, and the labor-intensive manual transcription and coding of answers. This study addresses the last two of these challenges, transcription and coding, by evaluating two critical aspects of processing voice answers. First, it compares the transcription performance of three leading Automatic Speech Recognition (ASR) tools (the Google Cloud Speech-to-Text API, OpenAI Whisper, and Vosk) using voice answers collected from an open-ended question on nursing home transparency administered in an opt-in online panel in Spain. Second, it evaluates the efficiency and quality of coding these transcriptions using human coders and GPT-4o, a Large Language Model (LLM) developed by OpenAI. We found that each ASR tool has distinct merits and limitations: Google sometimes fails to provide a transcription, Whisper produces hallucinations (false transcriptions), and Vosk has clarity issues and high rates of incorrect words. Human and LLM-based coding also differ significantly. We therefore recommend using several ASR tools to transcribe voice answers and combining human and LLM-based coding, as the latter offers additional information at minimal added cost.
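For readers who want to reproduce a comparison of this kind, the sketch below shows how a single recorded voice answer could be passed through the three ASR tools named in the abstract using their standard Python interfaces. It is a minimal illustration, not the authors' pipeline; the file name, Whisper model size, and Vosk model path are assumptions.

```python
# Minimal sketch: transcribe one voice answer with Whisper, Vosk, and
# Google Cloud Speech-to-Text. Paths and model choices are illustrative.
import json
import wave

import whisper                      # openai-whisper package (runs locally)
from vosk import Model, KaldiRecognizer
from google.cloud import speech     # requires Google Cloud credentials

AUDIO = "answer.wav"                # hypothetical 16 kHz mono WAV recording

# --- OpenAI Whisper ---
whisper_model = whisper.load_model("small")              # model size is an assumption
whisper_text = whisper_model.transcribe(AUDIO, language="es")["text"]

# --- Vosk (offline, with a downloaded Spanish model) ---
wf = wave.open(AUDIO, "rb")
rec = KaldiRecognizer(Model("vosk-model-es"), wf.getframerate())
while True:
    chunk = wf.readframes(4000)
    if not chunk:
        break
    rec.AcceptWaveform(chunk)
vosk_text = json.loads(rec.FinalResult())["text"]

# --- Google Cloud Speech-to-Text (remote API call) ---
client = speech.SpeechClient()
with open(AUDIO, "rb") as f:
    audio = speech.RecognitionAudio(content=f.read())
config = speech.RecognitionConfig(language_code="es-ES")
response = client.recognize(config=config, audio=audio)
google_text = " ".join(r.alternatives[0].transcript for r in response.results)

print({"whisper": whisper_text, "vosk": vosk_text, "google": google_text})
```

Running all three tools on the same recording makes the kinds of errors described above (missing transcriptions, hallucinations, incorrect words) directly comparable for each answer.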
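The coding step could likewise be automated through the OpenAI API. The sketch below is only an illustration of the mechanics: the codebook prompt and categories are invented for the example and do not reflect the study's actual coding scheme.

```python
# Minimal sketch: code one transcribed voice answer with GPT-4o via the
# OpenAI Python SDK. The codebook and categories below are assumptions.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

CODEBOOK = (
    "You are coding open-ended survey answers about nursing home transparency. "
    "Assign exactly one code: POSITIVE, NEGATIVE, MIXED, or OFF_TOPIC, "
    "and return only the code."
)

def code_answer(transcript: str) -> str:
    """Return the GPT-4o code assigned to one transcribed voice answer."""
    response = client.chat.completions.create(
        model="gpt-4o",
        temperature=0,                      # favor consistent coding decisions
        messages=[
            {"role": "system", "content": CODEBOOK},
            {"role": "user", "content": transcript},
        ],
    )
    return response.choices[0].message.content.strip()

print(code_answer("Creo que deberían publicar las inspecciones de cada residencia."))
```

Codes produced this way can then be compared against human coders' decisions to gauge agreement and the added value of LLM-based coding.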