Dataset
Of course, the very first thing I needed was a song lyrics dataset. Fortunately, I found one on Kaggle! This dataset is released under a Creative Commons (CC0: Public Domain) license.
This dataset contains about 60K song lyrics along with the title and artist name. I know 60K may not cover all the songs you love, but I think it's a good starting point for LyRec.
import pandas as pd

songs_df = pd.read_csv(f"{root_dir}/spotify_millsongdata.csv")
songs_df = songs_df.drop(columns=["link"])  # the link column is not needed
songs_df["song_id"] = songs_df.index + 1  # 1-based ID for each song
I didn't need to perform any pre-processing on this data. I just removed the link column and added an ID for each song.
Models
I needed to select two LLMs: one for computing the embeddings and another for generating the song summaries. Choosing the right LLM for your task can be a little tricky because of the sheer number of them! It's a good idea to look at a leaderboard to find the current best ones. For the embedding model, I checked the MTEB leaderboard hosted by HuggingFace.
I was looking for a smaller model (obviously!) without compromising too much accuracy; hence, I decided on GTE-Qwen2-1.5B-Instruct.
from sentence_transformers import SentenceTransformer
import torch

# Load the embedding model in half precision to reduce memory use
model = SentenceTransformer(
    "Alibaba-NLP/gte-Qwen2-1.5B-instruct",
    model_kwargs={"torch_dtype": torch.float16}
)
For the summarizer, I just needed a small enough instruction-following LLM, so I went with Gemma-2-2b-It. In my experience, it's one of the best small models as of now.
import torch
from transformers import pipeline

# Text-generation pipeline for the summarizer, loaded in bfloat16 on the GPU
pipe = pipeline(
    "text-generation",
    model="google/gemma-2-2b-it",
    model_kwargs={"torch_dtype": torch.bfloat16},
    device="cuda",
)
Pre-computing the Embeddings
Computing the lyrics embeddings was pretty straightforward. I just used the .encode(…) method with a batch_size of 32 for faster processing.
import numpy as np

song_lyrics = songs_df["text"].values

lyrics_embeddings = model.encode(
    song_lyrics,
    batch_size=32,
    show_progress_bar=True
)
np.save(f"{root_dir}/60k_song_lyrics_embeddings.npy", lyrics_embeddings)
At this point, I saved these embeddings in a .npy file. I could have used a more structured format, but this did the job for me.
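As a quick check that a .npy round trip preserves the embeddings, here is a minimal sketch using random stand-in vectors. The array size, the temporary file name, and the variable names are assumptions for illustration, not values from the project.

```python
import os
import tempfile

import numpy as np

# Stand-in for real embeddings: 8 vectors of dimension 16 (illustrative sizes only).
rng = np.random.default_rng(0)
fake_embeddings = rng.standard_normal((8, 16)).astype(np.float32)

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "demo_embeddings.npy")
    np.save(path, fake_embeddings)  # stores the values together with shape and dtype
    restored = np.load(path)        # reads everything back into memory

# The restored array should match the original exactly.
round_trip_ok = (
    restored.shape == fake_embeddings.shape
    and restored.dtype == fake_embeddings.dtype
    and np.array_equal(restored, fake_embeddings)
)
print(round_trip_ok)
```

Because the file keeps the dtype, embeddings saved in half precision come back in half precision, which is why a cast to float32 is still needed before handing them to an index that expects float32.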
Coming to the summary embeddings, I first needed to generate the summaries. I had to ensure that the summary captured the emotion and the song's theme while not being too lengthy. So, I came up with the following prompt for Gemma-2.
You are an expert song summarizer.
You will be given the full lyrics to a song.
Your task is to write a concise, cohesive summary that
captures the central emotion, overarching theme, and
narrative arc of the song in 150 words.

{song lyrics}
Here's the code snippet for summary generation. For simplicity, the following shows sequential processing. I've included the batch-processing version in the GitHub repo.
from tqdm import tqdm

tqdm.pandas()  # registers .progress_apply on pandas objects

def get_summary(song_lyrics):
    messages = [
        {"role": "user",
         "content": f'''You are an expert song summarizer.
You will be given the full lyrics to a song.
Your task is to write a concise, cohesive summary that
captures the central emotion, overarching theme, and
narrative arc of the song in 150 words.\n\n{song_lyrics}'''},
    ]
    # The pipeline returns the whole chat; the last message is the model's reply
    outputs = pipe(messages, max_new_tokens=256)
    assistant_response = outputs[0]["generated_text"][-1]["content"].strip()
    return assistant_response

songs_df["summary"] = songs_df["text"].progress_apply(get_summary)
Unsurprisingly, this step took the most time. Luckily, this needs to be done only once, and of course again whenever we want to update the database with new songs.
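Since this step dominates the runtime, batching is the natural optimization. The repo's batch version is not reproduced here; the following is only a sketch of the general pattern. The helper names `iter_batches` and `summarize_all` are hypothetical, and `fake_summarize_batch` is a stand-in for a real LLM call that accepts a list of lyrics and returns one summary per item.

```python
from typing import Callable, Iterator, List, Sequence


def iter_batches(items: Sequence[str], batch_size: int) -> Iterator[List[str]]:
    """Yield consecutive slices of `items` with at most `batch_size` elements."""
    for start in range(0, len(items), batch_size):
        yield list(items[start:start + batch_size])


def summarize_all(
    lyrics: Sequence[str],
    summarize_batch: Callable[[List[str]], List[str]],
    batch_size: int = 8,
) -> List[str]:
    """Run `summarize_batch` over the lyrics batch by batch, preserving order."""
    summaries: List[str] = []
    for batch in iter_batches(lyrics, batch_size):
        summaries.extend(summarize_batch(batch))
    return summaries


# Hypothetical stand-in for the LLM call: it just truncates each lyric.
def fake_summarize_batch(batch: List[str]) -> List[str]:
    return [text[:10] for text in batch]


demo_lyrics = [f"lyrics of song number {i}" for i in range(5)]
demo_summaries = summarize_all(demo_lyrics, fake_summarize_batch, batch_size=2)
print(len(demo_summaries))
```

Keeping the output order identical to the input order matters here, because the summaries are later written back into the DataFrame row by row.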
Then, I computed and saved the embeddings just like the last time.
song_summary = songs_df["summary"].values

summary_embeddings = model.encode(
    song_summary,
    batch_size=32,
    show_progress_bar=True
)
np.save(f"{root_dir}/60k_song_summary_embeddings.npy", summary_embeddings)
Vector Search
With the embeddings in place, it was time to implement the semantic search based on embedding similarity. There are a lot of awesome open-source vector databases available for this job. I decided to use a simple one called FAISS (Facebook AI Similarity Search). It just takes two lines to add the embeddings to the database. First, we create a FAISS index. Here, we need to specify the similarity metric we want to use for searching and the dimension of the vectors. I used the dot product (inner product) as the similarity measure. Then, we add the embeddings to the index.
Note: Our database is small enough to do an exhaustive search using the dot product. For larger databases, it's recommended to perform an approximate nearest neighbor (ANN) search. FAISS has support for that.
import faiss

# Flat inner-product index over the lyrics embeddings (FAISS expects float32)
lyrics_embeddings = np.load(f"{root_dir}/60k_song_lyrics_embeddings.npy")
lyrics_index = faiss.IndexFlatIP(lyrics_embeddings.shape[1])
lyrics_index.add(lyrics_embeddings.astype(np.float32))

# Same for the summary embeddings
summary_embeddings = np.load(f"{root_dir}/60k_song_summary_embeddings.npy")
summary_index = faiss.IndexFlatIP(summary_embeddings.shape[1])
summary_index.add(summary_embeddings.astype(np.float32))
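For intuition, the result of an exhaustive inner-product search can be reproduced in plain NumPy. This is a sketch of the idea, not how FAISS is implemented internally; the function name `exhaustive_ip_search` and the toy vectors are made up for illustration.

```python
import numpy as np


def exhaustive_ip_search(database: np.ndarray, queries: np.ndarray, k: int):
    """Return (scores, ids) of the k highest inner products for each query row."""
    all_scores = queries @ database.T                 # shape: (n_queries, n_database)
    top_ids = np.argsort(-all_scores, axis=1)[:, :k]  # best k database rows per query
    top_scores = np.take_along_axis(all_scores, top_ids, axis=1)
    return top_scores, top_ids


# Tiny hand-made database of three 2-D vectors.
demo_db = np.array([[1.0, 0.0], [0.0, 1.0], [0.6, 0.8]], dtype=np.float32)
demo_query = np.array([[0.0, 1.0]], dtype=np.float32)

demo_scores, demo_ids = exhaustive_ip_search(demo_db, demo_query, k=2)
print(demo_ids)
```

Every query is compared against every stored vector, so the cost grows linearly with the database size. That is fine for about 60K songs and is the reason approximate search becomes attractive for much larger collections.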
To find the most similar songs given a query, we first need to generate the query embedding and then call the .search(…) method on the index. Under the hood, this method computes the similarity between the query and every entry in our database and returns the top k entries and the corresponding scores. Here's the code performing a semantic search on lyrics embeddings.
query_lyrics = 'Imagine the last song you fell in love with'
query_embedding = model.encode(
    "Instruct: Given the lyrics, retrieve relevant songs\n"
    f"Query: {query_lyrics}"
)
# FAISS expects a 2-D float32 array: one row per query
query_embedding = query_embedding.reshape(1, -1).astype(np.float32)
lyrics_scores, lyrics_ids = lyrics_index.search(query_embedding, 10)
Notice that I added a simple prompt to the query. This is recommended for this model. The same applies to the summary embeddings.
query_description = 'Describe the type of music you wanna listen to'
query_embedding = model.encode(
    "Instruct: Given a description, retrieve relevant songs\n"
    f"Query: {query_description}"
)
query_embedding = query_embedding.reshape(1, -1).astype(np.float32)
summary_scores, summary_ids = summary_index.search(query_embedding, 10)
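The query format used in both searches can be wrapped in a small helper so the two task descriptions stay consistent. This is a sketch; `build_instructed_query` and the two constant names are hypothetical and not part of the model's library.

```python
def build_instructed_query(task: str, query: str) -> str:
    """Format a query as 'Instruct: <task>', a newline, then 'Query: <query>'."""
    return f"Instruct: {task}\nQuery: {query}"


# Task descriptions matching the lyrics search and the description search.
LYRICS_TASK = "Given the lyrics, retrieve relevant songs"
DESCRIPTION_TASK = "Given a description, retrieve relevant songs"

demo_prompt = build_instructed_query(DESCRIPTION_TASK, "upbeat songs about road trips")
print(demo_prompt)
```

Only the query side gets this prefix; the lyrics and summaries stored in the indexes were encoded as plain text.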
Pro tip: How do you do a sanity check?
Just put any entry from the database in the query and see if the search returns that same entry as the top-scoring result!
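The sanity check can be tried on stand-in data first. The sketch below assumes the vectors are L2-normalized: with inner-product scoring, an entry is only guaranteed to rank itself first when all vectors have the same length. The random data and the chosen entry are illustrative.

```python
import numpy as np

rng = np.random.default_rng(42)

# Stand-in database: 100 random vectors, L2-normalized so each has unit length.
vectors = rng.standard_normal((100, 32)).astype(np.float32)
vectors /= np.linalg.norm(vectors, axis=1, keepdims=True)

# Use entry 17 itself as the query.
entry_id = 17
query = vectors[entry_id]

scores = vectors @ query          # inner product with every database entry
best_id = int(np.argmax(scores))  # position of the top-scoring entry

print(best_id == entry_id)
```

If the embeddings were not normalized, a longer vector pointing in a similar direction could outscore the entry itself, so a failed check would not necessarily mean the index is broken.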
Implementing the Features
At this stage, I had the building blocks of LyRec. Now, it was time to put them together. Remember the three goals I set at the beginning? Here's how I implemented them.
To keep things tidy, I created a class named LyRec that has a method for each feature. The first two features are pretty straightforward to implement.
The method .get_songs_with_similar_lyrics(…) takes a song (lyrics) and a whole number k as input and returns a list of the k most similar songs based on lyrics similarity. Each element in the list is a dictionary containing the artist's name, song title, and lyrics.
Similarly, .get_songs_with_similar_description(…) takes free-form text and a whole number k as input and returns a list of the k most similar songs based on the description.
Here's the relevant code snippet.
class LyRec:
    def __init__(self, songs_df, lyrics_index, summary_index, embedding_model):
        self.songs_df = songs_df
        self.lyrics_index = lyrics_index
        self.summary_index = summary_index
        self.embedding_model = embedding_model

    def get_records_from_id(self, song_ids):
        songs = []
        for _id in song_ids:
            # FAISS returns 0-based row positions; song_id is 1-based
            songs.extend(self.songs_df[self.songs_df["song_id"]==_id+1].to_dict(orient='records'))
        return songs

    def get_songs_with_similar_lyrics(self, query_lyrics, k=10):
        query_embedding = self.embedding_model.encode(
            f"Instruct: Given the lyrics, retrieve relevant songs\n Query: {query_lyrics}"
        ).reshape(1, -1).astype(np.float32)
        scores, song_ids = self.lyrics_index.search(query_embedding, k)
        return self.get_records_from_id(song_ids[0])

    def get_songs_with_similar_description(self, query_description, k=10):
        query_embedding = self.embedding_model.encode(
            f"Instruct: Given a description, retrieve relevant songs\n Query: {query_description}"
        ).reshape(1, -1).astype(np.float32)
        scores, song_ids = self.summary_index.search(query_embedding, k)
        return self.get_records_from_id(song_ids[0])
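The `_id+1` in `get_records_from_id` bridges two numbering schemes: the index reports 0-based row positions, while `song_id` starts at 1. Here is a dependency-free sketch of that mapping; the three-song table and the function name `records_from_positions` are made up for illustration.

```python
# Hypothetical three-song table; song_id starts at 1, as in the dataset step.
demo_songs = [
    {"song_id": 1, "artist": "Artist A", "song": "First"},
    {"song_id": 2, "artist": "Artist B", "song": "Second"},
    {"song_id": 3, "artist": "Artist C", "song": "Third"},
]


def records_from_positions(positions):
    """Map 0-based index positions to records whose song_id is position + 1."""
    by_song_id = {row["song_id"]: row for row in demo_songs}
    # Positions with no matching song_id (for example -1) are skipped.
    return [by_song_id[pos + 1] for pos in positions if pos + 1 in by_song_id]


# Positions come back ordered by score, so the output keeps that order.
demo_records = records_from_positions([2, 0])
print([row["song"] for row in demo_records])
```

FAISS pads its result with -1 when it finds fewer than k neighbors; in this sketch such positions simply produce no record.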
The final feature was a little tricky to implement. Recall that we need to first retrieve the top songs based on lyrics and then re-rank them based on the textual description. The first retrieval was easy. For the second, we only need to consider the top-scoring songs. I decided to create a temporary FAISS index with the top songs and then search for the songs with the highest summary similarity scores. Here's my implementation.
    def get_songs_with_similar_lyrics_and_description(self, query_lyrics, query_description, k=10):
        query_lyrics_embedding = self.embedding_model.encode(
            f"Instruct: Given the lyrics, retrieve relevant songs\n Query: {query_lyrics}"
        ).reshape(1, -1).astype(np.float32)

        # Stage 1: retrieve the 500 best matches by lyrics similarity
        scores, song_ids = self.lyrics_index.search(query_lyrics_embedding, 500)
        top_k_indices = song_ids[0]

        # Stage 2: collect the summary embeddings of those candidates
        summary_candidates = []
        for idx in top_k_indices:
            emb = self.summary_index.reconstruct(int(idx))
            summary_candidates.append(emb)
        summary_candidates = np.array(summary_candidates, dtype=np.float32)

        # Temporary index that holds only the candidates
        temp_index = faiss.IndexFlatIP(summary_candidates.shape[1])
        temp_index.add(summary_candidates)

        query_description_embedding = self.embedding_model.encode(
            f"Instruct: Given a description, retrieve relevant songs\n Query: {query_description}"
        ).reshape(1, -1).astype(np.float32)
        scores, temp_ids = temp_index.search(query_description_embedding, k)

        # Map positions in the temporary index back to positions in the full index
        final_song_ids = [top_k_indices[i] for i in temp_ids[0]]
        return self.get_records_from_id(final_song_ids)
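The same retrieve-then-re-rank idea can also be expressed without a temporary index by scoring the candidates directly with a matrix product. This is an alternative sketch in plain NumPy, not the implementation above; the function name `rerank_by_summary` and the toy data are assumptions for illustration.

```python
import numpy as np


def rerank_by_summary(lyrics_scores, summary_embeddings, description_embedding,
                      n_candidates, k):
    """Keep the n_candidates best lyrics matches, then return the k of them
    whose summary embeddings score highest against the description."""
    # Stage 1: positions of the best lyrics matches, highest score first.
    candidate_ids = np.argsort(-lyrics_scores)[:n_candidates]
    # Stage 2: score only those candidates against the description.
    candidate_scores = summary_embeddings[candidate_ids] @ description_embedding
    order = np.argsort(-candidate_scores)[:k]
    # Translate positions within the candidate list back to database positions.
    return candidate_ids[order]


# Toy data: four songs with 2-D summary embeddings.
demo_lyrics_scores = np.array([0.9, 0.1, 0.8, 0.7], dtype=np.float32)
demo_summary_embeddings = np.array(
    [[1.0, 0.0], [0.0, 1.0], [0.0, 1.0], [0.6, 0.8]], dtype=np.float32
)
demo_description = np.array([0.0, 1.0], dtype=np.float32)

demo_result = rerank_by_summary(
    demo_lyrics_scores, demo_summary_embeddings, demo_description,
    n_candidates=3, k=2,
)
print(demo_result)
```

In the toy data, song 1 matches the description perfectly but is never returned, because its lyrics score keeps it out of the candidate set. That is the trade-off of any two-stage design: the first stage bounds what the second stage can find.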
Voilà! Finally, LyRec is ready. You can find the complete implementation in this repo. Please leave a star if you find this helpful! 😃