Audio Transcription functionality

Hello! I am currently building a voice recording app, and I think it would really benefit from giving our users the option to have their audio transcribed instead of saved.
I thought one way to do this would be to start an infinite loop, in which the voice recognition extension continually adds to a saved text variable, and the loop is broken when the user indicates they are done recording.
These recordings could go up to an hour long, in theory. Unfortunately, when trying to test this, I am noticing that the app crashes almost instantly when this loop begins.

Has anyone had any success using the voice recognition tool for longer than 1 sentence? If you have any experience with the tool and can help me out, I’d really appreciate it.

(What I would REALLY appreciate is some sort of extension that takes a pre-recorded audio file, and then extracts the text from it. Does an extension like this already exist? I would be able to pay for it.)


The challenge has to do with this requirement:

The speech recognizer will automatically stop if it detects silence so unless you are either talking all the time or making some noise to keep the recognizer active, this will be quite complicated to implement, if at all possible. How “strict” is this requirement?

EDIT: Would pressing the “Listen” button before each transcription be acceptable supposing there is a way to notify the user when the previous transcription has stopped?

Yeep, that’s my predicament precisely. It’s a pretty strict requirement, in that it is 50% of the app’s main functionality, haha.
I currently have the idea to use some sort of open-source transcription tool, like, after the audio has already been recorded.
Could there be any way to take a Thunkable AudioFile object and run that through a web tool like that, then get the text variable back in Thunkable? I am currently saving the audio files in the user’s local database.

Curious, what does this loop look like? Can you please take a screenshot of your blocks?

Unfortunately I have decoupled these blocks from the main, active blocks as I wanted to make sure everything else was still working. But basically, what I had was that if the record button was clicked, a while loop conditioned on “ContinueListening” would begin that continuously updates a “transcription” variable. “ContinueListening” becomes false if the stop button is pressed, and this loop breaks.
Do you think this implementation works? It feels sound in theory. I’m wondering if mine crashes because there’s so much other stuff going on at the same time, and this ensemble actually functions without my extraneous image files loading in, etc. :thinking:

Aaah!!! Your logic is almost the same as the one I started building on a test app to assist you. But I think I know why your app crashes. With the “while” loop, you spawn a new instance of the Speech Recognizer on every pass through the loop, without waiting for the previous one to finish, so that definitely leads to a crash.
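
To illustrate the difference, here is a small Python simulation (not Thunkable blocks). `FakeRecognizer`, `busy_loop` and `restart_on_result` are made-up names for this sketch, and it assumes that starting the recognizer returns immediately instead of waiting for speech, which is my guess at what happens inside the “while” loop.

```python
class FakeRecognizer:
    """Hypothetical stand-in for a recognizer whose start call does not block."""

    def __init__(self):
        self.sessions_started = 0

    def start_listening(self):
        # A real recognizer would open the microphone here.
        self.sessions_started += 1


def busy_loop(recognizer, passes):
    """Mimics a 'while ContinueListening' loop: one new session per pass."""
    for _ in range(passes):
        recognizer.start_listening()


def restart_on_result(recognizer, phrases):
    """Starts one session per phrase and only restarts after a result arrives."""
    transcript = []
    for phrase in phrases:
        recognizer.start_listening()
        transcript.append(phrase)  # the result delivered when that session ends
    return " ".join(transcript)


busy = FakeRecognizer()
busy_loop(busy, 10000)

calm = FakeRecognizer()
text = restart_on_result(calm, ["hello", "world"])
print(busy.sessions_started, calm.sessions_started, text)
```

The busy loop opens one session per pass, while the second version opens only one session per recognized phrase.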

My logic adds the following to the recipe:

  • Two buttons, one “Start” and one “Stop” for starting and stopping the transcription
  • We know that the Speech Recognizer’s timeout is about 3 seconds (I can calculate with more precision if needed)
  • We can compare the current transcription text length with the length from 3 seconds earlier, using two variables and a timer that fires every 3 seconds
  • If after 3 seconds the length of the text has remained the same then we definitely know that the Speech Recognizer has timed out. When that happens, we copy the speech transcribed in the previous cycle to a variable that holds all previous transcriptions and re-spawn the Speech Recognizer to continue (re-start) transcribing; then we apply the same logic over and over.
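
The steps above can be sketched roughly as follows, in Python instead of blocks. This is only an outline of the idea under my own assumptions: `TranscriptionWatchdog`, `on_text`, `on_timer` and the `restart_recognizer` callback are hypothetical names, and the 3-second timer is simulated by calling `on_timer` by hand.

```python
class TranscriptionWatchdog:
    """Restarts the recognizer when the text stops growing between timer ticks."""

    def __init__(self, restart_recognizer):
        self.restart_recognizer = restart_recognizer  # hypothetical callback
        self.finished_text = []   # transcriptions from earlier sessions
        self.current_text = ""    # text of the active session so far
        self.previous_length = 0  # length seen at the previous timer tick

    def on_text(self, text):
        """Called whenever the recognizer reports the session's text so far."""
        self.current_text = text

    def on_timer(self):
        """Called by a timer that fires about every 3 seconds."""
        if len(self.current_text) == self.previous_length:
            # No growth since the last tick: assume the recognizer timed out.
            if self.current_text:
                self.finished_text.append(self.current_text)
            self.current_text = ""
            self.previous_length = 0
            self.restart_recognizer()
        else:
            self.previous_length = len(self.current_text)

    def full_transcript(self):
        parts = list(self.finished_text)
        if self.current_text:
            parts.append(self.current_text)
        return " ".join(parts)


restarts = []
watchdog = TranscriptionWatchdog(lambda: restarts.append("restart"))
watchdog.on_text("hello")
watchdog.on_timer()            # text grew, keep listening
watchdog.on_text("hello there")
watchdog.on_timer()            # text grew again
watchdog.on_timer()            # no growth: store the text and restart
watchdog.on_text("again")
watchdog.on_timer()
print(watchdog.full_transcript(), len(restarts))
```

The “Stop” button would simply stop the timer and read `full_transcript()`. Note that a stretch of pure silence also counts as “no growth”, so the recognizer gets restarted then as well.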

I would be happy to work with you on this but I see that you are already on the right track. I hope that the above helps. Ping me if not!


Thank you so much for the enthusiastic support, I really appreciate it! I didn’t know about the 3 second grace period with the voice recognizer - that’s definitely good to know!
The more I play with this, though, the more I’m thinking I should definitely go the route of transcribing a pre-recorded audio file to text in batches, since I cannot have an audio recorder going at the same time as this VoiceRecognizer loop. I would like to have both the audio file and text file when the user is done talking, if possible. If you have any ideas on how I could do that, I’d greatly appreciate it.
Thanks again!

I checked Cloudinary’s (MediaDB) capabilities to see whether it can transcribe a recording to text, but no such feature is offered. So you may want to leverage Watson via the API it provides.
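
For a rough idea of what such a call involves, here is a Python sketch based on my understanding of the Watson Speech to Text HTTP interface: a POST to `/v1/recognize` with the audio as the request body, basic auth with the user name `apikey`, and a JSON reply with `results[].alternatives[].transcript`. Please check this against the current IBM documentation. The service URL and API key below are placeholders, and `build_recognize_request` and `extract_transcript` are names made up for this example.

```python
import base64
import json
import urllib.request


def build_recognize_request(service_url, api_key, audio_bytes, content_type="audio/wav"):
    """Builds (but does not send) a POST request carrying the raw audio."""
    token = base64.b64encode(f"apikey:{api_key}".encode()).decode()
    return urllib.request.Request(
        url=f"{service_url}/v1/recognize",
        data=audio_bytes,
        method="POST",
        headers={
            "Content-Type": content_type,
            "Authorization": f"Basic {token}",
        },
    )


def extract_transcript(response_body):
    """Joins the top alternative of each result into a single string."""
    payload = json.loads(response_body)
    pieces = [
        result["alternatives"][0]["transcript"].strip()
        for result in payload.get("results", [])
        if result.get("alternatives")
    ]
    return " ".join(pieces)


# Placeholder values; a real service URL and key come from your own account.
request = build_recognize_request("https://example.invalid", "YOUR_API_KEY", b"fake-audio")

# A hand-written reply in the shape I expect the service to return.
sample_reply = json.dumps({
    "results": [
        {"alternatives": [{"transcript": "hello there "}], "final": True},
        {"alternatives": [{"transcript": "how are you "}], "final": True},
    ]
})
print(extract_transcript(sample_reply))
```

Actually sending it would be `urllib.request.urlopen(request).read()`. In Thunkable, the same URL, headers and body would go into a Web API call, and the transcript would be picked out of the JSON response in the same way.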


You might also consider recording the audio in Thunkable and uploading it to Google’s Speech to Text. It’s not free but it does look like it can handle large audio files. I’m not sure how quick it is at transcribing, though.


Oohh! This is an interesting lead. Do you have any guidance on how exactly one could do something like this, and then retrieve any resulting output from the API? As far as I can tell, Google Cloud is not an integrated extension for Thunkable at the moment. I suppose I could save the media files in Firebase or Cloudinary, but I’m not sure of how I would run it through an API or webapp without manual involvement. :thinking:

It appears that with Watson you can upload up to 200 MB for free. I am not sure what sample rate Thunkable uses to record audio, but if it is anything below 22.5 kHz and in mono, then 200 MB would work, at least at first :slight_smile:

EDIT: Even if it is recorded at a higher sample rate, you can then use Cloudinary’s audio transformations and use the output as the source for Watson.
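
As a quick sanity check of the 200 MB figure, here is the arithmetic in Python for a one-hour recording. It assumes uncompressed 16-bit mono PCM and a decimal 200 MB limit; I do not know what format Thunkable actually records in, and a compressed format would be much smaller. `pcm_size_bytes` is just a helper name for this example.

```python
def pcm_size_bytes(sample_rate_hz, duration_s, bits_per_sample=16, channels=1):
    """Size of uncompressed PCM audio, ignoring the small file header."""
    return sample_rate_hz * duration_s * (bits_per_sample // 8) * channels


LIMIT_BYTES = 200 * 1000 * 1000  # assumed meaning of "200 MB"
ONE_HOUR_S = 3600

sizes = {rate: pcm_size_bytes(rate, ONE_HOUR_S) for rate in (16000, 22050, 44100)}
for rate, size in sizes.items():
    verdict = "fits" if size <= LIMIT_BYTES else "too large"
    print(f"{rate} Hz: {size / 1e6:.2f} MB ({verdict})")
```

Under those assumptions, an hour at around 22 kHz comes to about 159 MB and fits, while an hour at 44.1 kHz comes to about 318 MB and would need to be downsampled or compressed first.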

Ooh, this is awesome actually. Ok, it looks like I should go the Web API route + Watson/Cloudinary!! Thanks so much for the assistance, hope you have a great rest of your week!