Interview Setup and Video Upload
To create a user-friendly interface for organizing interviews and providing video links, I used Google Colab's forms functionality. It allows the creation of text fields, sliders, dropdowns, and more. The code is hidden behind the form, making it very accessible for non-technical users.
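For reference, a Colab form is just a code cell annotated with #@param comments, which Colab renders as input widgets. A minimal sketch (the field names here are illustrative, not necessarily the exact ones used in the notebook):

```python
#@title Interview Selection { display-mode: "form" }
# Colab renders each #@param line as a form widget and, with
# display-mode: "form", hides the underlying code by default.
video_url = "https://www.youtube.com/watch?v=VIDEO_ID"  #@param {type:"string"}
interview_name = "interview_01"  #@param {type:"string"}
language = "english"  #@param ["english", "portuguese", "spanish"]
```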
Audio download and conversion
I used the yt-dlp library to download only the audio from a YouTube video and convert it to the mp3 format. It is very straightforward to use, and you can check its documentation here.
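Roughly, the download step can look like the sketch below (URL and output name are placeholders; yt-dlp delegates the mp3 conversion to FFmpeg):

```python
import yt_dlp

# Grab the best audio-only stream and convert it to mp3 via FFmpeg.
ydl_opts = {
    "format": "bestaudio/best",
    "outtmpl": "interview.%(ext)s",
    "postprocessors": [{
        "key": "FFmpegExtractAudio",
        "preferredcodec": "mp3",
        "preferredquality": "192",
    }],
}

with yt_dlp.YoutubeDL(ydl_opts) as ydl:
    ydl.download(["https://www.youtube.com/watch?v=VIDEO_ID"])
```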
Audio transcription
To transcribe the meeting, I used Whisper from OpenAI. It's an open-source speech recognition model trained on more than 680K hours of multilingual data.
The model runs incredibly fast; a one-hour audio clip takes around 6 minutes to transcribe on a 16GB T4 GPU (offered for free on Google Colab), and it supports 99 different languages.
Since privacy is a requirement for the solution, the model weights are downloaded and all the inference happens inside the Colab instance. I also added a Model Selection form to the notebook so the user can choose different models based on the precision they're looking for.
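The core of the transcription step, sketched with the openai-whisper package (model size and file name are placeholders):

```python
import whisper

# Weights are downloaded once and cached; inference runs locally on
# the Colab GPU, so the audio never leaves the instance.
model = whisper.load_model("medium")  # "tiny", "base", "small", "medium", "large"

result = model.transcribe("interview.mp3")
print(result["text"])         # full transcript
print(result["segments"][0])  # segment dict with "start"/"end" timestamps
```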
Speaker Identification
Speaker identification is done through a technique called Speaker Diarization. The idea is to identify and segment the audio into distinct speech segments, where each segment corresponds to a particular speaker. With that, we can identify who spoke and when.
Since the videos uploaded from YouTube have no metadata identifying who is speaking, the speakers are labeled Speaker 1, Speaker 2, etc. Later, the user can use find and replace in Google Docs to add the speakers' identities.
For the diarization, we will use a model called the Multi-Scale Diarization Decoder (MSDD), which was developed by NVIDIA researchers. It is a sophisticated approach to speaker diarization that leverages multi-scale analysis and dynamic weighting to achieve high accuracy and flexibility.
The model is known for being quite good at identifying and properly categorizing moments where multiple speakers are talking, something that happens frequently during interviews.
The model can be used through the NVIDIA NeMo framework, which allowed me to get the MSDD checkpoints and run the diarization directly in the Colab notebook with just a few lines of code.
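A sketch of how that can look, assuming one of NeMo's example inference configs (diar_infer_telephonic.yaml) and a manifest file pointing at the audio; the exact config used in the notebook may differ:

```python
from omegaconf import OmegaConf
from nemo.collections.asr.models.msdd_models import NeuralDiarizer

# The config drives the whole pipeline (VAD -> speaker embeddings ->
# MSDD); diar_infer_telephonic.yaml ships with NeMo's examples.
cfg = OmegaConf.load("diar_infer_telephonic.yaml")
cfg.diarizer.manifest_filepath = "input_manifest.json"  # points to the mp3
cfg.diarizer.out_dir = "diarization_output"

NeuralDiarizer(cfg=cfg).diarize()
# Results land in out_dir as RTTM files: one speaker label per time span.
```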
Looking into the diarization results from MSDD, I noticed that the punctuation was quite poor, with long sentences, and interjections such as "hmm" and "yeah" were treated as speaker interruptions, making the text hard to read.
So I decided to add a punctuation model to the pipeline to improve the readability of the transcribed text and facilitate human review. I used the punctuate-all model from Hugging Face, a very precise and fast solution that supports the following languages: English, German, French, Spanish, Bulgarian, Italian, Polish, Dutch, Czech, Portuguese, Slovak, and Slovenian.
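One convenient way to run it (an assumption on my part, not necessarily how the notebook calls it) is through the deepmultilingualpunctuation package, pointed at the kredor/punctuate-all checkpoint:

```python
from deepmultilingualpunctuation import PunctuationModel

# kredor/punctuate-all is a token-classification model: it predicts
# which punctuation mark (if any) should follow each word.
model = PunctuationModel(model="kredor/punctuate-all")

raw = "so yeah i think we can ship the new onboarding flow next week right"
print(model.restore_punctuation(raw))
# e.g. "so yeah, i think we can ship the new onboarding flow next week, right?"
```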
Video Synchronization
Among the industry solutions I benchmarked, a strong requirement was that every phrase should be linked to the moment in the interview when the speaker said it.
Whisper transcriptions carry metadata indicating the timestamps at which phrases were said; however, this metadata is not very precise.
Therefore, I used a model called Wav2Vec2 to perform this matching more accurately. It is essentially a neural network designed to learn representations of audio and perform speech recognition alignment. The process involves finding the exact timestamps in the audio signal where each segment was spoken and aligning the text accordingly.
With the transcription <> timestamp matching properly done, a bit of simple Python code creates hyperlinks pointing to the moment in the video where each phrase starts being said.
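The link construction itself is trivial. A minimal sketch, assuming a standard watch URL and a word-level start time in seconds from the alignment step:

```python
def youtube_link(video_url: str, start_seconds: float) -> str:
    """Build a link that opens the video at the moment a word is spoken."""
    return f"{video_url}&t={int(start_seconds)}s"

# Example with a placeholder video id and an alignment timestamp:
print(youtube_link("https://www.youtube.com/watch?v=VIDEO_ID", 83.4))
# -> https://www.youtube.com/watch?v=VIDEO_ID&t=83s
```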
The LLM Model
This step of the pipeline runs a large language model locally to analyze the text and provide insights about the interview. By default, I added a Gemma 1.1 model with a prompt to summarize the text. If the user opts in to summarization, it appears as a bullet list at the top of the document.
Also, by clicking on "Show code," users can change the prompt and ask the model to perform a different task.
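Under the hood this can be as simple as a transformers pipeline. The sketch below assumes the instruction-tuned google/gemma-1.1-2b-it checkpoint and a hard-coded prompt, both of which are my assumptions and easy to swap:

```python
from transformers import pipeline

# Requires the Gemma license to be acknowledged on Hugging Face and an
# HF_TOKEN available in the environment (see the setup section below).
summarizer = pipeline("text-generation", model="google/gemma-1.1-2b-it")

transcript_text = "Speaker 0: Welcome, everyone..."  # placeholder transcript
prompt = f"Summarize the following interview as a bullet list:\n\n{transcript_text}"

output = summarizer(prompt, max_new_tokens=256, return_full_text=False)
print(output[0]["generated_text"])
```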
Document generation for tagging, highlights, and comments
The last task performed by the solution is to generate a Google Docs file with the transcription and hyperlinks into the interview. This was done through the Google API Python client library.
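A minimal sketch of the document creation, assuming the standard Colab auth flow; the real notebook also styles the text and attaches the hyperlinks:

```python
from google.colab import auth
import google.auth
from googleapiclient.discovery import build

auth.authenticate_user()          # Colab's OAuth prompt
creds, _ = google.auth.default()
docs = build("docs", "v1", credentials=creds)

# Create an empty document, then push the transcript into it.
doc = docs.documents().create(body={"title": "interview_01"}).execute()
docs.documents().batchUpdate(
    documentId=doc["documentId"],
    body={"requests": [{"insertText": {
        "location": {"index": 1},
        "text": "Speaker 0: Hello and welcome...\n",
    }}]},
).execute()
```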
Since the product has become extremely useful in my day-to-day work, I decided to give it a name for easier reference. I called it the Insights Gathering Open-source Tool, or iGot.
When using the solution for the first time, some initial setup is required. Let me guide you through a real-world example to help you get started.
Open the iGot notebook and install the required libraries
Click on this link to open the notebook and run the first cell to install the required libraries. It will take around 5 minutes.
If you get a prompt asking you to restart the notebook, just cancel it. There is no need.
If everything runs as expected, you will get the message "All libraries installed!".
Getting the Hugging Face User Access Token and model access
(This step is required only the first time you execute the notebook.)
To run the Gemma and punctuate-all models, we download weights from Hugging Face. To do so, you must request a user token and model access.
You need to create a Hugging Face account and follow these steps to get a token with read permissions.
Once you have the token, copy it and return to the Colab notebook. Go to the Secrets tab and click on "Add new secret."
Name your secret HF_TOKEN and paste the key you got from Hugging Face.
Next, click this link to open the Gemma model page on Hugging Face. Then, click on "Acknowledge license" to get access to the model.
Sending the interview
To send an interview to iGot, you first need to upload it as an unlisted video on YouTube. For the purposes of this tutorial, I took a section of Andrej Karpathy's interview with Lex Fridman and uploaded it to my account. It's the part of the conversation where Andrej gives advice to machine learning beginners.
Then, you need to get the video URL, paste it into the video_url field of the Interview Selection notebook cell, define a name for it, and indicate the language spoken in the video.
Once you run the cell, you will receive a message indicating that an audio file was generated.
Model selection and execution
In the next cell, you can select the size of the Whisper model you want to use for the transcription. The bigger the model, the higher the transcription precision.
By default, the largest model is selected. Make your choice and run the cell.
Then, run the model execution cell to run the pipeline of models shown in the previous section. If everything goes as expected, you should receive the message "Punctuation done!" at the end.
If you get prompted with a message asking for access to the Hugging Face token, grant it.
Configuring the transcript output
The final step is to save the transcription to a Google Docs file. To accomplish this, you need to specify the file path, provide the interview name, and indicate whether you want Gemma to summarize the meeting.
When executing the cell for the first time, you will be prompted with a message asking for access to your Google Drive. Click Allow.
Then, give Colab full access to your Google Drive workspace.
If everything runs as expected, you will see a link to the Google Docs file at the end. Just click on it, and you will have access to your interview transcription.
Gathering insights from the generated document
The final document contains the transcription, with each phrase linked to the corresponding moment in the video where it starts. Since YouTube doesn't provide speaker metadata, I recommend using Google Docs' find and replace tool to substitute "Speaker 0," "Speaker 1," and so on with the actual names of the speakers.
With that, you can work on highlights, notes, reactions, and so on, as envisioned at the start:
The tool is just in its first version, and I plan to evolve it into a more user-friendly solution, perhaps by hosting a website so users don't have to interact directly with the notebook, or by creating a plugin for Google Meet and Zoom.
My main goal with this project was to create a high-quality meeting transcription tool that can be helpful to others while demonstrating how freely available open-source tools can match the capabilities of commercial solutions.
I hope you find it useful! Feel free to reach out to me on LinkedIn if you have any feedback or are interested in collaborating on the evolution of iGot 🙂