I’ve been working on weekend LLM projects. When thinking about what to work on, two ideas struck me:
- There are few resources for practicing data analytics interviews, in contrast to other roles like software engineering and product management. I relied on friends in the industry to make up SQL and Python interview questions when I practiced interviewing for my first data analyst job.
- LLMs are really good at generating synthetic datasets and writing code.
As a result, I’ve built the AI Data Analysis Interviewer, which automatically creates a unique dataset and generates Python interview questions for you to solve!
This article provides an overview of how it works and its technical implementation. You can check out the repo here.
When I launch the web app, I’m prompted to provide details on the type of interview I want to practice for, specifically the company and a dataset description. Let’s say I’m interviewing for a data analyst role at Uber which focuses on analyzing ride data:
After clicking Submit and waiting for GPT to do its magic, I receive the AI-generated questions, answers, and an input field where I can execute code on the AI-generated dataset:
Awesome! Let’s try to solve the first question: calculate the total distance traveled each day. As is good analytics practice, let’s start with data exploration:
It looks like we need to group by the ride_date field and sum the distance_miles field. Let’s write and submit that Pandas code:
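A minimal sketch of that approach, using a made-up three-row stand-in for the generated dataset (the real data comes from the LLM, but the ride_date and distance_miles columns match the exploration above):

import pandas as pd

# Hypothetical mini version of the generated rides dataset
df = pd.DataFrame({
    "ride_date": ["2024-01-01", "2024-01-01", "2024-01-02"],
    "distance_miles": [3.2, 5.1, 7.4],
})

# Group by day and sum the miles ridden
daily_distance = df.groupby("ride_date")["distance_miles"].sum().reset_index()
print(daily_distance)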
Looks good to me! Does the AI answer agree with our approach?
The AI answer uses a slightly different method but solves the problem in essentially the same way.
I can rinse and repeat as much as needed to feel confident before heading into an interview. Interviewing at Airbnb? This tool has you covered. It generates the questions:
Along with a dataset you can execute code on:
Check out the readme of the repo here to run the app locally. Unfortunately I didn’t host it, but I might in the future!
The rest of this article will cover the technical details of how I created the AI Data Analysis Interviewer.
LLM architecture
I used OpenAI’s gpt-4o as it’s currently my go-to LLM model (it’s fairly easy to swap this out for another model, though).
There are 3 types of LLM calls made, chained together as sketched after this list:
- Dataset generation: we ask an LLM to generate a dataset suitable for an analytics interview.
- Question generation: we ask an LLM to generate a few analytics interview questions from that dataset.
- Answer generation: we ask an LLM to generate the answer code for each interview question.
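Here is a rough sketch of how those three calls fit together, using the class and method names covered in the rest of the article (the exact wiring in the repo may differ slightly):

# Rough orchestration sketch; see the sections below for the actual classes
llm_manager = LLMManager()                    # wraps the OpenAI API call
data_generator = DataGenerator(llm_manager)   # assumed constructor; the repo may wire this differently

# 1. Dataset generation: one LLM call producing csv text
dataset_csv = data_generator.generate_interview_dataset(
    company="Uber", description="ride data", mock_data=False
)
df = data_generator.convert_str_to_df(dataset_csv)

# 2. Question generation: one LLM call producing 3 questions
questions = data_generator.generate_interview_questions(dataset_csv)

# 3. Answer generation: one LLM call per question (sketched in the Answer generation section)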
Front-end
I built the front-end using Flask. It’s simple and not very interesting, so I’ll focus on the LLM details below. Feel free to check out the code in the repo, though!
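For context, a Flask front-end for something like this can be as small as a single route; this is just an illustrative sketch with hypothetical form field and template names, not the actual routes from the repo:

from flask import Flask, render_template, request

app = Flask(__name__)

@app.route("/", methods=["GET", "POST"])
def index():
    # Hypothetical route: collect the company and dataset description,
    # then kick off the generation steps described below
    if request.method == "POST":
        company = request.form["company"]
        description = request.form["description"]
        # ... call the dataset/question/answer generation and render the results
    return render_template("index.html")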
LLM manager
LLMManager is a simple class which handles making LLM API calls. It gets our OpenAI API key from a local secrets file and makes an OpenAI API call to pass a prompt to an LLM model. You’ll see some form of this in every LLM project.
import os

from dotenv import load_dotenv
from openai import OpenAI
from openai.types.chat import ChatCompletion


class LLMManager():
    def __init__(self, model: str = 'gpt-4o'):
        self.model = model
        # Load the OpenAI API key from a local secrets file
        load_dotenv("secrets.env")
        openai_api_key = os.getenv("OPENAI_API_KEY")
        self.client = OpenAI(api_key=openai_api_key)

    def call_llm(self, system_prompt: str, user_prompt: str, temperature: float) -> str:
        print(f"Calling LLM with system prompt: {system_prompt}\n\nUser prompt: {user_prompt}")
        response: ChatCompletion = self.client.chat.completions.create(
            messages=[
                {"role": "system", "content": system_prompt},
                {"role": "user", "content": user_prompt}
            ],
            model=self.model,
            temperature=temperature
        )
        message = response.choices[0].message.content
        print(response)
        return message
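Used on its own, a call looks roughly like this (assuming secrets.env holds a valid OPENAI_API_KEY):

llm_manager = LLMManager()
reply = llm_manager.call_llm(
    system_prompt="You are a helpful assistant.",
    user_prompt="Say hello in one word.",
    temperature=0
)
print(reply)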
Dataset generation
Here is where the fun begins!
We first prompt an LLM to generate a dataset with the following prompt:
SYSTEM_TEMPLATE = """You are a senior staff data analyst at a world class tech company.
You are designing a data analysis interview for hiring candidates."""

DATA_GENERATION_USER_TEMPLATE = """Create a dataset for a data analysis interview that contains interesting insights.
Specifically, generate comma delimited csv output with the following characteristics:
- Relevant to company: {company}
- Dataset description: {description}
- Number of rows: 100
- Number of columns: 5
Only include csv data in your response. Do not include any other information.
Start your output with the first header of the csv: "id,".
Output: """
Let’s break it down:
- Many LLM models follow a prompt structure where the LLM accepts a system and a user message. The system message is meant to define general behavior and the user message is meant to give specific instructions. Here we prompt the LLM to be a world class interviewer in the system message. It feels silly, but hyping up an LLM is a proven prompt hack to get better performance.
- We pass the user inputs about the company and dataset they want to practice interviewing with into the user template via the string variables {company} and {description}.
- We prompt the LLM to output data in csv format. This seems like the simplest tabular data format for an LLM to produce, which we can later convert to a Pandas DataFrame for code analysis. JSON would probably also work but may be less reliable given its more complex and verbose syntax.
- We want the LLM output to be parseable csv, but gpt-4o tends to generate extra text, likely because it was trained to be very helpful. The end of the user template strongly instructs the LLM to only output parseable csv data, but even so we need to post-process it.
The DataGenerator class handles all things data generation and contains the generate_interview_dataset method, which makes the LLM call to generate the dataset:
def generate_interview_dataset(self, company: str, description: str, mock_data: bool) -> str:
    if not mock_data:
        data_generation_user_prompt = DATA_GENERATION_USER_TEMPLATE.format(company=company, description=description)
        dataset = self.llm_manager.call_llm(
            system_prompt=SYSTEM_TEMPLATE,
            user_prompt=data_generation_user_prompt,
            temperature=0
        )
        dataset = self.clean_llm_dataset_output(dataset)
        return dataset
    return MOCK_DATASET
def clean_llm_dataset_output(self, dataset: str) -> str:
    cleaned_dataset = dataset[dataset.index("id,"):]
    return cleaned_dataset

Note that the clean_llm_dataset_output method does the light post-processing mentioned above. It removes any extraneous text before "id,", which denotes the start of the csv data.
LLMs can only output strings, so we need to transform the string output into an analyzable Pandas DataFrame. The convert_str_to_df method takes care of that:
def convert_str_to_df(self, dataset: str) -> pd.DataFrame:
    csv_data = StringIO(dataset)
    try:
        df = pd.read_csv(csv_data)
    except Exception as e:
        raise ValueError(f"Error in converting LLM csv output to DataFrame: {e}")
    return df
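Putting the two helpers together, a standalone mini version of the clean-and-convert step looks like this (the raw string below is a made-up example of the kind of preamble gpt-4o sometimes adds):

from io import StringIO

import pandas as pd

# Hypothetical raw LLM output with an unwanted preamble before the csv
raw_output = """Sure! Here is your dataset:
id,ride_date,distance_miles
1,2024-01-01,3.2
2,2024-01-02,7.4"""

# Strip everything before the "id," header, then parse the csv into a DataFrame
cleaned = raw_output[raw_output.index("id,"):]
df = pd.read_csv(StringIO(cleaned))
print(df)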
Question generation
We can prompt an LLM to generate interview questions off of the generated dataset with the following prompt:

QUESTION_GENERATION_USER_TEMPLATE = """Generate 3 data analysis interview questions that can be solved with Python pandas code based on the dataset below:

Dataset:
{dataset}

Output the questions in a Python list where each element is a question. Start your output with "[".
Do not include question indexes like "1." in your output.
Output: """
To break it down once again:
- The same system prompt is used here as we still want the LLM to embody a world-class interviewer when writing the interview questions.
- The string output from the dataset generation call is passed into the {dataset} string variable. Note that we have to maintain 2 representations of the dataset: 1. a string representation that a LLM can understand to generate questions and answers and 2. a structured representation (i.e. DataFrame) that we can execute code over.
- We prompt the LLM to return a list. We need the output to be structured so we can iterate over the questions in the answer generation step to generate an answer for every question.
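For a concrete (made-up) example, the raw LLM output we are aiming for is a string that literal_eval can turn into a Python list:

from ast import literal_eval

# Hypothetical raw LLM output that follows the 'Start your output with "["' instruction
raw_questions = (
    '["What is the total distance traveled per day?", '
    '"Which day had the most rides?", '
    '"What is the average ride distance?"]'
)

questions_list = literal_eval(raw_questions)
print(questions_list[0])  # -> What is the total distance traveled per day?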
The LLM call is made with the generate_interview_questions method of DataGenerator:
def generate_interview_questions(self, dataset: str) -> InterviewQuestions:
    question_generation_user_prompt = QUESTION_GENERATION_USER_TEMPLATE.format(dataset=dataset)
    questions = self.llm_manager.call_llm(
        system_prompt=SYSTEM_TEMPLATE,
        user_prompt=question_generation_user_prompt,
        temperature=0
    )
    try:
        questions_list = literal_eval(questions)
    except Exception as e:
        raise ValueError(f"Error in converting LLM questions output to list: {e}")
    questions_structured = InterviewQuestions(
        question_1=questions_list[0],
        question_2=questions_list[1],
        question_3=questions_list[2]
    )
    return questions_structured
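InterviewQuestions itself isn’t shown in this article; a minimal stand-in that matches how it’s constructed above could be a simple dataclass (the repo may define it differently, e.g. as a Pydantic model):

from dataclasses import dataclass


@dataclass
class InterviewQuestions:
    """Container for the three generated interview questions."""
    question_1: str
    question_2: str
    question_3: str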
Answer generation
With both the dataset and the questions available, we finally generate the answers with the following prompt:

ANSWER_GENERATION_USER_TEMPLATE = """Generate an answer to the following data analysis interview Question based on the Dataset.

Dataset:
{dataset}

Question: {question}

The answer should be executable Pandas Python code where df refers to the Dataset above.
Always start your answer with a comment explaining what the following code does.
DO NOT DEFINE df IN YOUR RESPONSE.
Answer: """
- We make as many answer generation LLM calls as there are questions, so 3, since we hard coded the question generation prompt to ask for 3 questions. Technically you could ask an LLM to generate all 3 answers for all 3 questions in 1 call, but I think performance would worsen. We want to maximize the ability of the LLM to generate accurate answers. A (perhaps obvious) rule of thumb is that the harder the task given to an LLM, the less likely the LLM will perform it well.
- The prompt instructs the LLM to refer to the dataset as "df" because our interview dataset in DataFrame form is called "df" when the user code is executed by the CodeExecutor class below.
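The method that loops over the questions isn’t shown in this article; a hedged sketch of what that per-question loop might look like inside DataGenerator:

def generate_interview_answers(self, dataset: str, questions: InterviewQuestions) -> list[str]:
    # Hypothetical helper: one LLM call per question, reusing the same system prompt
    answers = []
    for question in [questions.question_1, questions.question_2, questions.question_3]:
        answer_generation_user_prompt = ANSWER_GENERATION_USER_TEMPLATE.format(
            dataset=dataset, question=question
        )
        answer = self.llm_manager.call_llm(
            system_prompt=SYSTEM_TEMPLATE,
            user_prompt=answer_generation_user_prompt,
            temperature=0
        )
        answers.append(answer)
    return answers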
class CodeExecutor():
    def execute_code(self, df: pd.DataFrame, input_code: str):
        local_vars = {'df': df}
        code_prefix = """import pandas as pd\nresult = """
        try:
            exec(code_prefix + input_code, {}, local_vars)
        except Exception as e:
            return f"Error in code execution: {e}\nCompiled code: {code_prefix + input_code}"
        execution_result = local_vars.get('result', None)
        if isinstance(execution_result, pd.DataFrame):
            return execution_result.to_html()
        return execution_result
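A quick usage example of the executor, assuming a toy DataFrame and the same expression we submitted earlier (note the submitted code needs to be a single expression, since it gets prefixed with result =):

import pandas as pd

executor = CodeExecutor()
df = pd.DataFrame({
    "ride_date": ["2024-01-01", "2024-01-01", "2024-01-02"],
    "distance_miles": [3.2, 5.1, 7.4],
})

# Returns an HTML table string because the executed result is a DataFrame
html_result = executor.execute_code(df, "df.groupby('ride_date')['distance_miles'].sum().reset_index()")
print(html_result)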
I hope this article sheds light on how to build a simple and useful LLM project which uses LLMs in a variety of ways!
If I continued to develop this project, I would focus on:
1. Adding more validation on structured output from LLMs (i.e. parseable csv or lists). I already covered a few edge cases, but LLMs are very unpredictable so this needs hardening.
2. Adding more features like:
- Generating multiple relational tables and questions requiring joins
- SQL interviews in addition to Python
- Custom dataset upload
- Difficulty setting