# 70 Best Data Scientist Interview Questions And Answers

Harvard Business Review has called data scientist “the sexiest job of the 21st century,” and Glassdoor has ranked it #1 on its list of the 25 best jobs. Data scientists are in high demand as companies look for ways to use customer data to improve their products and services.

Finding a qualified data scientist can be tough, but it doesn’t have to be! To help you find the best candidate from day one, we’ve put together this list of 70 interview questions for evaluating your potential candidates.

# Best Data Scientist Interview Questions

A data scientist applies scientific, mathematical, and computational knowledge to extract insights from raw data. Data scientists use algorithms and machine learning to discover hidden patterns in that data, either to explain how something works or to predict what might happen next.

This list of 70 interview questions will help you get a better understanding of candidates’ skills and potential.

The interview questions are grouped into the following categories:

- Problem-Solving Skills Questions
- Statistics Questions
- Programming Questions
- Data Modeling Questions

**Data Science Problem-Solving Skills Interview Questions**

Problem-solving is an integral part of the data scientist’s repertoire. One way to gauge a candidate’s problem-solving abilities is to ask these questions at the start of an interview.

## 1. What is the most difficult challenge you have faced and how did you solve it?

This question can tell you a lot about the candidate’s thinking process and problem-solving abilities: how quickly they identified a solution and what their general approach was.

## 2. What did you learn from your most recent data science project?

This question helps gauge their knowledge of topics that matter in data science: how deep their understanding is, how well-read they are, and which skill gaps still need filling.

## 3. You want to send out millions of emails. What is your strategy for ensuring delivery without overwhelming the response?

This question will help you understand how well the candidate thinks on their feet and whether they can think through logistical problems.

## 4. You have a data set of 100,000 rows and 100 columns, and the first column is the dependent variable you want to predict. Which two techniques would you use to narrow down your work quickly?

This question will help you understand how well the candidate can think through a problem and gather their thoughts to articulate ideas in an effective way. There are no right or wrong answers since there are many ways to skin a cat – the idea is to have the candidate explain to you how they would approach the problem.

## 5. What is your favorite data set?

This question helps gauge the candidate’s interests as well as how much knowledge they have on specific topics. They should be able to speak at length about this topic or else it may indicate that they lack interest or confidence in themselves.

## 6. How would you handle a data set that is not distributed evenly?

This question will help you understand how well the candidate can think through and articulate solutions to problems with messy datasets. One of the most common unevenly distributed data sets in the social sciences is survey data, where respondents often skip questions. For example, if someone answered only five out of ten survey questions, their answers might be weighted differently from those of somebody who answered all ten. In this circumstance, it’s important for your model to know which questions were asked or skipped so everything can be normalized appropriately.

## 7. What are some common data mining techniques?

- Association analysis: finds associations between different variables in a dataset, e.g., identifying which items tend to be purchased together
- Correlation analysis: measures the degree of association between two variables, often using Pearson’s correlation coefficient, which captures the strength of a linear relationship
- Regression analysis: fits an equation describing the relationship between one or more independent (explanatory) variables and one dependent (response) variable by estimating coefficients such as the slope and intercept; a fitted regression does not by itself establish causality
- Collaborative filtering: predicts a user’s interests from the preferences of users who are known to be similar
- K-NN (k-nearest neighbors): a classification algorithm that labels an object of unknown class by looking at the labels of its k closest data points
- Clustering analysis: an exploratory data mining technique that groups observations so that members of the same group share common characteristics. The resulting clusters identify natural groups within seemingly unorganized data and provide insights about how it should be analyzed further; also referred to as cluster analysis
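To make the correlation item concrete, here is a minimal sketch of Pearson’s correlation coefficient in plain Python. The `pearson` helper and the advertising/sales numbers are made up for illustration; in practice a candidate would more likely reach for a library routine.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of products of deviations (proportional to the covariance).
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    # Sums of squared deviations (proportional to the variances).
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

# Hypothetical data: advertising spend vs. units sold.
ad_spend = [1.0, 2.0, 3.0, 4.0, 5.0]
units_sold = [2.0, 4.1, 5.9, 8.2, 9.8]
r = pearson(ad_spend, units_sold)
print(f"r = {r:.3f}")  # close to +1: strong positive linear relationship
```

A value near +1 or -1 indicates a strong linear relationship, while a value near 0 indicates little or none; it says nothing about which variable causes the other.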

## 8. What do you mean by a social network?

A social network is a web-based service, platform, or website that focuses on creating and sharing content among members.

## 9. What do you mean by market segmentation?

Market segmentation is the process of dividing the total population of customers into subsets with similar needs. Related concepts a candidate may mention include:

- Concept drift: a phenomenon in machine learning in which the relationship a model has learned changes over time as new data arrives, so a model trained on older data gradually becomes less accurate
- Supervised learning: “supervised” means someone has already gone through the data beforehand and provided labels for it. A typical case has two distinct categories (e.g., spam vs. not spam), but the labels can also span many classes or be numeric values, depending on the problem being solved
- Unsupervised learning: the data has no labels, so there’s no one telling you what you’re looking at and the algorithm must find structure on its own
- Clustering: the process of grouping data into clusters that are similar in some way, for example, based on age or income levels. Sometimes the number of clusters is specified beforehand, but often clustering involves choosing a model (e.g., which type of algorithm to use) and then finding out how many clusters it finds
- Dimensionality reduction: this technique reduces the dimensionality of data by projecting it onto a lower-dimensional space (e.g., from 100 features down to just 20). One popular approach is Principal Component Analysis (PCA)

## 10. What is the most popular data analysis software?

The most popular tools for data analysis are Python and R, with packages such as pandas (Python) and dplyr (R); Apache Spark is commonly used from both languages for larger data sets.

## 11. Is there any difference between a “data scientist” versus a “statistician”?

The two roles overlap heavily. Data scientists typically use data to find patterns and build models that predict future events, often with a strong programming and engineering component, while statisticians concentrate on collecting data, quantifying uncertainty, and drawing sound inferences from it. Both can build predictive models. As a rough guide, if your priority is deploying predictions in a product, a data scientist is usually the better fit; if your priority is understanding why something happened and how confident you can be in that conclusion, a statistician’s training is especially valuable.

**Statistics Interview Questions**

## 12. What’s your favorite machine learning algorithm?

Machine learning algorithms use a variety of strategies to identify patterns in data sets. They include, but are not limited to: supervised classification, unsupervised clustering, dimensionality reduction techniques such as Principal Component Analysis (PCA), support vector machines, Bayesian networks, and Hidden Markov Models; decision tree methods including CART and Chi-squared Automatic Interaction Detection (CHAID); and rule induction systems. A good answer names one algorithm and explains why it suits a particular kind of problem.

## 13. What are the differences between supervised and unsupervised learning?

This question is good for gauging how well the candidate understands their field. Unsupervised learning deals with identifying patterns in data without any labels, and supervised learning uses labeled information to make predictions about new observations.

## 14. What are some advantages of using a code library?

This question will help you understand which tools the candidate relies on, as many data scientists depend heavily on libraries such as Python’s NumPy or packages from R and Julia, which provide well-tested methods that can be used quickly. It may also show which language they work best in and its corresponding frameworks. It will likely reveal whether they have experience with other languages such as Java, MATLAB, or SQL.

## 15. How is logistic regression done?

Logistic regression models the relationship between a categorical dependent variable and one or more independent variables.

It passes a linear combination of the independent variables through the logistic (sigmoid) function to produce a probability, and each coefficient measures how the corresponding variable changes the log-odds of the outcome.

In its basic form, logistic regression is used for scenarios where there are only two outcomes: success (win) and failure (lose).
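As a rough illustration of how a two-outcome logistic regression can be fitted, here is a minimal sketch using batch gradient descent on a single feature. The data (hours studied vs. pass/fail), the learning rate, and the number of epochs are all assumptions chosen for the example; real projects would normally use a library implementation.

```python
import math

def sigmoid(z):
    """Logistic function: maps any real number to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def fit_logistic(xs, ys, lr=0.1, epochs=5000):
    """Fit p(y=1 | x) = sigmoid(w*x + b) by gradient descent on the log loss."""
    w, b = 0.0, 0.0
    n = len(xs)
    for _ in range(epochs):
        # Gradient of the average log loss with respect to w and b.
        grad_w = sum((sigmoid(w * x + b) - y) * x for x, y in zip(xs, ys)) / n
        grad_b = sum((sigmoid(w * x + b) - y) for x, y in zip(xs, ys)) / n
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b

# Hypothetical data: hours studied -> passed (1) or failed (0).
hours = [0.5, 1.0, 1.5, 2.0, 2.5, 3.0, 3.5, 4.0]
passed = [0, 0, 0, 1, 0, 1, 1, 1]

w, b = fit_logistic(hours, passed)
p_low = sigmoid(w * 1.0 + b)   # predicted pass probability after 1 hour
p_high = sigmoid(w * 4.0 + b)  # predicted pass probability after 4 hours
print(f"w={w:.2f}, b={b:.2f}, p(1h)={p_low:.2f}, p(4h)={p_high:.2f}")
```

The positive coefficient `w` means each extra hour increases the log-odds of passing, which is exactly how logistic regression coefficients are read.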

## 16. Explain the steps in making a decision tree.

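A decision tree is built by placing a question about one feature at each internal node of the tree. The questions are either yes/no or multiple-choice, and the branch taken depends on the answer. At each step, the algorithm picks the question that best separates the classes, and the process repeats on each branch until the leaves are sufficiently pure or a stopping rule is reached.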
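The core step of growing a tree, choosing which question to ask, can be sketched as follows. This example uses Gini impurity as the splitting criterion, which is one common choice (information gain is another); the `ages`/`bought` data and helper names are invented for illustration.

```python
def gini(labels):
    """Gini impurity of a list of class labels (0.0 means perfectly pure)."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((labels.count(c) / n) ** 2 for c in set(labels))

def best_threshold(values, labels):
    """Return (threshold, weighted impurity) for the best 'value <= t?' question."""
    best = (None, float("inf"))
    n = len(values)
    # Try every distinct value except the largest as a candidate threshold.
    for t in sorted(set(values))[:-1]:
        left = [l for v, l in zip(values, labels) if v <= t]
        right = [l for v, l in zip(values, labels) if v > t]
        score = (len(left) * gini(left) + len(right) * gini(right)) / n
        if score < best[1]:
            best = (t, score)
    return best

# Hypothetical data: customer age and whether they bought the product.
ages = [22, 25, 31, 38, 45, 52]
bought = ["no", "no", "no", "yes", "yes", "yes"]

threshold, impurity = best_threshold(ages, bought)
print(f"Best question: age <= {threshold}? (weighted impurity {impurity})")
```

A full tree builder would apply the same search recursively to the rows on each side of the chosen split.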

## 17. Python or R – Which one would you prefer for text analytics?

A common answer is Python, because its pandas library provides easy-to-use data structures and high-performance data analysis tools, and it has mature text-processing libraries such as NLTK.

## 18. Which technique is used to predict categorical responses?

Classification techniques are used to predict categorical responses; common examples include logistic regression, decision trees, and k-nearest neighbors.

## 19. Explain the central limit theorem.

The central limit theorem states that the average (arithmetic mean) of many independent, identically distributed random variables with finite variance tends toward a normal distribution as the sample size increases, regardless of the shape of the original distribution.
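A quick simulation can make the theorem tangible. The sketch below repeatedly averages draws from a uniform distribution (which is flat, not bell-shaped) and checks that the averages cluster around 0.5 with the spread the theorem predicts. The sample size of 30 and the 5,000 repetitions are arbitrary choices for the example.

```python
import random
import statistics

random.seed(0)  # fixed seed so the run is repeatable

def sample_mean(n):
    """Mean of n draws from a uniform(0, 1) distribution."""
    return sum(random.random() for _ in range(n)) / n

n = 30
means = [sample_mean(n) for _ in range(5000)]

center = statistics.mean(means)
spread = statistics.stdev(means)
# Uniform(0, 1) has variance 1/12, so the CLT predicts a standard
# deviation of sqrt(1/12) / sqrt(n) for the sample means.
expected_spread = (1 / 12) ** 0.5 / n ** 0.5

print(f"mean of sample means: {center:.4f}")
print(f"observed spread: {spread:.4f}, predicted: {expected_spread:.4f}")
```

Plotting a histogram of `means` would show the familiar bell shape even though the underlying data is uniform.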

## 20. What is the relevance of the central limit theorem to a class of freshmen in the social sciences who hardly have any knowledge about statistics?

The central limit theorem is important for students in the social sciences because it helps to establish that a sampling distribution of averages will tend towards normality as the sample size increases.

## 21. What are some advantages and disadvantages of using classification?

Some advantages include simplicity, speed, and easy-to-understand results. There are also several potential disadvantages including its limited applicability to data sets with many dimensions or few samples, which makes generalization difficult.

## 22. What other techniques do you recommend when dealing with large data sets? Give two examples besides classification.

Two additional techniques that can be used to deal with large data sets would be clustering and regression analysis. Clustering involves splitting up your dataset into groups based on similar attributes, while regression analysis involves fitting an equation to your data.

## 23. Given a dataset, show me how Euclidean Distance works in three dimensions.

The Euclidean distance between two points is the length of the straight line connecting them. In three dimensions, for points A = (x₁, y₁, z₁) and B = (x₂, y₂, z₂), it is the square root of the sum of the squared differences along each axis:

$$d(A, B) = \sqrt{(x_2 - x_1)^2 + (y_2 - y_1)^2 + (z_2 - z_1)^2}$$

For example, suppose A has x = -0.25, y = 0, z = 0.75, while B has x = 0.625, y = 0.375, z = 0.5. First, calculate the difference along each dimension: 0.625 - (-0.25) = 0.875, 0.375 - 0 = 0.375, and 0.5 - 0.75 = -0.25. Next, square these values: 0.765625, 0.140625, and 0.0625. Finally, add them up and take the square root:

$$d(A, B) = \sqrt{0.765625 + 0.140625 + 0.0625} = \sqrt{0.96875} \approx 0.984$$

Note that this is not the same thing as variance: variance averages squared deviations from a mean within one set of values, whereas Euclidean distance measures how far apart two specific points are. With a whole dataset, the same formula is applied to each pair of points, which is the basis of distance-based methods such as k-NN and k-means.
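For reference, Euclidean distance in three dimensions is the square root of the sum of the squared coordinate differences. A minimal sketch, using two illustrative points (the coordinates are example values, not data from a real set):

```python
import math

def euclidean_distance(p, q):
    """Straight-line distance between two points of the same dimension."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

# Illustrative 3-D points (x, y, z).
point_a = (-0.25, 0.0, 0.75)
point_b = (0.625, 0.375, 0.5)

d = euclidean_distance(point_a, point_b)
print(f"distance = {d:.3f}")
```

Python 3.8 and later also provide `math.dist(p, q)`, which computes the same quantity.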

**Programming Interview Questions**

These programming interview questions can be asked verbally, with the candidate describing how they would solve a problem without writing code. For candidates who want to code during the interview, you can offer whiteboard exercises to complete on the spot.

## 24. What is your favorite programming language and why?

You can use this question to gauge their knowledge of programming languages, as well as the type of work environment in which they thrive best.

## 25. What is the difference between storage, retrieval, query time?

This question will help you determine if the candidate is familiar with data warehousing concepts and terminology.

## 26. What are your thoughts on NoSQL databases?

This question gives you a sense of how well-versed in data storage and retrieval technologies they are, as well as their opinion on these types of database designs.

## 27. How would you optimize a query that returns too many rows?

This question will help you find out if they know how to optimize a SQL-based data extraction.

## 28. How would you design an API?

The best answer is one that speaks about RESTful APIs, which are widely used and understood by many developers. The response should also include discussions on security considerations, data serialization formats like JSON or XML, hypertext, and response codes.

## 29. How do you get data off of a relational database?

This question will help you understand the candidates’ knowledge of SQL queries to extract data from databases.

## 30. What are some common use cases for TensorFlow?

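TensorFlow is one of many machine learning frameworks that let developers build, train, and deploy machine learning models on large data sets. Common use cases include image recognition, natural language processing, speech recognition, and recommendation systems.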

## 31. What are the tradeoffs between MySQL and PostgreSQL?

Candidates should be able to discuss the benefits of each database engine as well as its drawbacks when compared to one another. They should also provide insight into why they prefer one over the other if necessary.

## 32. What are the two main components of the Hadoop framework?

The two main components of the Hadoop framework are HDFS and MapReduce.

## 33. What is indexing?

Indexing is the process of building an auxiliary data structure on one or more columns of a database table in order to make retrieval more efficient. For example, you could add an index on the column that stores emails so that whenever someone enters their email into the search bar, the database can quickly find rows containing that particular value as opposed to scanning through every row looking for matches.
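The email example can be tried out with Python’s built-in `sqlite3` module. This is a sketch under assumptions: the `users` table, its columns, and the index name `idx_users_email` are invented for the example, and the exact wording of the query plan varies between SQLite versions.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, email TEXT, name TEXT)")
conn.executemany(
    "INSERT INTO users (email, name) VALUES (?, ?)",
    [(f"user{i}@example.com", f"User {i}") for i in range(1000)],
)

def query_plan(sql, params=()):
    """Return SQLite's description of how it would execute the query."""
    rows = conn.execute("EXPLAIN QUERY PLAN " + sql, params).fetchall()
    return " ".join(row[-1] for row in rows)  # last column holds the detail text

lookup = "SELECT name FROM users WHERE email = ?"

# Without an index, SQLite has to scan every row.
plan_before = query_plan(lookup, ("user42@example.com",))

# With an index on email, it can jump straight to the matching rows.
conn.execute("CREATE INDEX idx_users_email ON users (email)")
plan_after = query_plan(lookup, ("user42@example.com",))

print("before:", plan_before)
print("after: ", plan_after)
```

The trade-off is that every index takes extra storage and slightly slows down inserts and updates, so indexes are usually added only on columns that are searched often.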

## 34. How do you normalize databases?

Normalizing databases involves creating tables with columns representing entities or attributes of interest and linking them together through common identifiers (e.g., IDs). This helps reduce redundancy among different datasets while also taking up less space on disk because many values might exist only once rather than being duplicated.

## 35. What are the downsides of normalizing databases?

Normalization is a great way to ensure data quality and reduce redundancy, but it comes with some drawbacks. Queries become more complicated because you need to know which tables hold the data you want, and retrieving a complete record often requires joining several tables, which can slow down reads.

Changes to the schema can also ripple through every table and query that references the affected attribute. These issues should not stop anyone from normalizing a database, since the benefits are substantial: each fact is stored once, so redundant data is reduced and an update only has to be made in one place.

## 36. How would you sort a large list of numbers?

This question tests your candidate’s knowledge of sorting algorithms and of data structures such as heaps.

How do they go about solving the problem? Do they make an assumption about which sorting algorithm is needed, or do they ask for more information (for example, whether the list fits in memory) before answering? It also tests whether they can come up with solutions without being told how it needs to be done.
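One answer a strong candidate might give, when the list is too large to sort in a single pass, is to sort manageable chunks and then merge them with a heap. The sketch below keeps everything in memory for simplicity; the `chunked_sort` name and the chunk size are assumptions, and a real external sort would write each sorted chunk to disk.

```python
import heapq
import random

def chunked_sort(numbers, chunk_size):
    """Sort fixed-size chunks independently, then k-way merge them with a heap."""
    chunks = [
        sorted(numbers[i:i + chunk_size])
        for i in range(0, len(numbers), chunk_size)
    ]
    # heapq.merge lazily yields the smallest remaining item across all chunks.
    return list(heapq.merge(*chunks))

random.seed(1)
numbers = [random.randint(0, 10_000) for _ in range(1_000)]
result = chunked_sort(numbers, chunk_size=100)
print(result[:5], "...", result[-5:])
```

Because `heapq.merge` only needs the current head of each chunk, the merge step uses memory proportional to the number of chunks rather than the size of the list.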

## 37. What is ETL?

ETL stands for “Extract, Transform and Load”. It’s an important process in Data Science that enables companies to extract raw data from one system (e.g., Salesforce), transform it by adding or removing fields as necessary, and load it into another system (e.g., SQL Server) where analytics can take place on top of the transformed data set. This way new insights about business habits can be extracted without having to endlessly read through terabytes of unprocessed data sets stored elsewhere within your organization.
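The three stages can be sketched in a few lines with Python’s standard library. Everything here is hypothetical: the inline CSV stands in for an export from a source system, the cleaning rules are examples, and an in-memory SQLite database stands in for the analytics store.

```python
import csv
import io
import sqlite3

# Hypothetical raw export from a source system.
RAW_CSV = """name,email,amount
Ada, ADA@Example.com ,120.50
Grace,grace@example.com,
Linus,linus@example.com,75
"""

def extract(text):
    """Extract: parse the raw CSV text into dictionaries."""
    return list(csv.DictReader(io.StringIO(text)))

def transform(rows):
    """Transform: drop rows with no amount, tidy emails, convert types."""
    cleaned = []
    for row in rows:
        if not row["amount"].strip():
            continue  # skip incomplete records
        cleaned.append(
            (row["name"].strip(), row["email"].strip().lower(), float(row["amount"]))
        )
    return cleaned

def load(rows, conn):
    """Load: write the cleaned rows into the target table."""
    conn.execute("CREATE TABLE IF NOT EXISTS orders (name TEXT, email TEXT, amount REAL)")
    conn.executemany("INSERT INTO orders VALUES (?, ?, ?)", rows)
    conn.commit()

conn = sqlite3.connect(":memory:")
load(transform(extract(RAW_CSV)), conn)

row_count = conn.execute("SELECT COUNT(*) FROM orders").fetchone()[0]
total = conn.execute("SELECT SUM(amount) FROM orders").fetchone()[0]
print(row_count, total)
```

Keeping the three stages as separate functions makes each one easy to test and to swap out when the source or target system changes.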

**R Programming Language Used in Data Science**

## 38. What are the different types of sorting algorithms available in R language?

This question tests the candidate’s knowledge of data structures and algorithms. They should be able to name common sorting algorithms, such as quicksort and merge sort, and know that R’s built-in sorting functions let you choose among methods such as radix, quick, and shell. It also tests whether they can work through a problem without being told how it needs to be done.

## 39. What are the different data objects in R?

Data objects are used for data storage and manipulation. The most common are vectors, matrices, lists, and data frames, depending on the type of information being stored.

- Vectors: contain elements of a single data type (e.g., integer)
- Matrices: two-dimensional, with rows and columns holding elements of one data type (e.g., integer); arrays generalize matrices to more than two dimensions
- Lists: ordered collections whose elements can be of different types, including other lists; duplicate values are allowed
- Data frames: tables of rows and columns in which each column is a vector of a single data type, although different columns can have different types

## 40. What is the command used to store R objects in a file?

The command used to store R objects in a file is “save”.

## 41. What is the best way to use Hadoop and R together for analysis?

One common way to use Hadoop and R together for analysis is the RHIPE package.

## 42. What are some data types used in R?

R has six atomic data types: logical, integer, double (numeric), complex, character, and raw; each represents a specific kind of value, so choosing the right one ensures the information is stored properly.

## 43. Why should I store my R object as .RData file instead of just saving it as an .R script?

An .RData file stores the R objects themselves in a binary format, so others can restore them with load() without rerunning the code that created them. An .R script stores only code, so anyone using it must rerun it, which can fail or give different results if their package versions or file paths differ.

## 44. How do you use R for data analysis?

R can be used to analyze data by loading the relevant packages with library() and using their functions to import, clean, model, and visualize the data.

**Python Programming Language Used in Data Science**

## 45. What is Python’s major strength in data science?

Python’s major strength is its large set of libraries, which make it easier to work on different types of problems than in many other languages. For example, NumPy provides fast arrays and matrix operations, while SciPy builds on it with scientific routines such as optimization and statistics. Similarly, there are modules for social network analysis (SNA) and visualization packages such as Matplotlib and Bokeh for producing charts. A variety of machine learning algorithms are also available through libraries such as scikit-learn, TensorFlow, and PyBrain.

Python is also a general-purpose programming language, which means it can be used for more than data science alone – the same way that R has broader uses beyond statistics or MATLAB’s use in engineering. This wide applicability makes Python an attractive choice because it offers lots of flexibility when tackling different problems without needing to learn new languages for each one.

## 46. What are the supported data types in Python?

Python’s built-in data types include numbers (int, float, complex), booleans, strings, lists, tuples, sets, and dictionaries. A tuple can be thought of as an immutable list. You will often see tuples used to return multiple results from a function or to pass several values together as a single argument.

## 47. In Python, how is memory managed?

Python (specifically CPython, the reference implementation) primarily uses reference counting to manage memory. When you create a new object, the interpreter allocates space in memory and sets its reference count to one. Each time that object is assigned to another variable or stored in a container, the count goes up by one; when a reference goes away, the count goes down, and the memory is freed once it reaches zero. A separate garbage collector handles reference cycles that counting alone cannot free.
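Reference counts can be observed directly in CPython with `sys.getrefcount`. A small sketch (the absolute numbers are implementation details, and the call itself temporarily adds one reference, so only the differences are meaningful):

```python
import sys

data = [1, 2, 3]

# Baseline count while only the name `data` refers to the list.
count_one_name = sys.getrefcount(data)

alias = data  # a second name for the same object
count_two_names = sys.getrefcount(data)

del alias  # removing the name drops the extra reference
count_after_del = sys.getrefcount(data)

print(count_one_name, count_two_names, count_after_del)
```

When the last reference to an object disappears, CPython frees its memory immediately, which is why most Python programs never need to manage memory by hand.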

## 48. How do you use Python for data analysis?

Python can be used for data analysis by importing the Pandas library and using Pandas to manipulate data.

## 49. What’s the difference between a function and a stored procedure?

A function returns a value and can be used inside a query or expression, whereas a stored procedure executes an action (a series of statements) and does not have to return anything.

Both are easier to debug outside production: on a development machine you can run code repeatedly without its side effects touching live data.

A related Python question: what does it mean when a line of code doesn’t execute, and why might this happen? A Python script runs from top to bottom and will not stop before reaching the end unless it exits explicitly or an unhandled error occurs, for example a ZeroDivisionError raised by dividing zero by zero (0/0).

**SQL Interview Questions**

## 50. What is the purpose of the group functions in SQL?

The purpose of the group functions in SQL (GROUP BY, CUBE, ROLLUP) is to aggregate data based on a given set of columns.

- GROUP BY groups together rows that share a value in one or more columns so that we can compute totals, counts, and averages per group. GROUP BY does not control the order of the results; use ORDER BY with ASC/DESC for that.
- CUBE produces subtotals for every combination of the grouping columns (e.g., by city, by region, and by city and region together), plus a grand total.
- ROLLUP is used with the GROUP BY clause and produces subtotals along a hierarchy of the grouping columns (e.g., city within region within country), plus a grand total.
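Here is a minimal sketch of grouping at two levels, run through Python’s built-in `sqlite3` module. The `sales` table and its rows are made up for the example. SQLite is used only because it ships with Python; it does not implement ROLLUP or CUBE, so the coarser country-level subtotal is computed with a second query, which is what ROLLUP would add automatically in databases that support it.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (country TEXT, city TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?, ?)",
    [
        ("US", "Boston", 100.0),
        ("US", "Boston", 50.0),
        ("US", "Austin", 70.0),
        ("FR", "Paris", 30.0),
    ],
)

# Finest level: one row per (country, city) group.
by_city = conn.execute(
    "SELECT country, city, SUM(amount), COUNT(*) "
    "FROM sales GROUP BY country, city ORDER BY country, city"
).fetchall()

# Coarser level: one subtotal row per country.
by_country = conn.execute(
    "SELECT country, SUM(amount) FROM sales GROUP BY country ORDER BY country"
).fetchall()

print(by_city)
print(by_country)
```

Note that the ordering comes from the explicit ORDER BY clause, not from GROUP BY.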

## 51. Describe the difference between an inner join, left join/right join, and union.

- An **inner join** only returns records where there is a corresponding match between the left table and the right table.
- A **left join** returns all of the records from the left table, plus the matching records from the right table; where there is no match, the right-hand columns are NULL. A **right join** is the mirror image.
- A **full outer join** returns all of the rows from both tables, filling in NULL wherever one side has no match.
- A **union** is not a join: it stacks the rows of two result sets that have the same columns into one result set. UNION removes duplicate rows, while UNION ALL keeps them.
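The differences are easiest to see on a tiny data set. The sketch below uses Python’s built-in `sqlite3` module with two invented tables, `customers` and `orders`; one customer (Linus) has no orders, which is what separates an inner join from a left join.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE orders (id INTEGER PRIMARY KEY, customer_id INTEGER, item TEXT);
INSERT INTO customers VALUES (1, 'Ada'), (2, 'Grace'), (3, 'Linus');
INSERT INTO orders VALUES (10, 1, 'book'), (11, 1, 'pen'), (12, 2, 'lamp');
""")

# Inner join: only customers that have at least one matching order.
inner = conn.execute(
    "SELECT c.name, o.item FROM customers c "
    "INNER JOIN orders o ON o.customer_id = c.id ORDER BY o.id"
).fetchall()

# Left join: every customer; item is NULL (None) when there is no order.
left = conn.execute(
    "SELECT c.name, o.item FROM customers c "
    "LEFT JOIN orders o ON o.customer_id = c.id ORDER BY c.id, o.id"
).fetchall()

# UNION removes the duplicate 'Grace'; UNION ALL keeps it.
union = conn.execute(
    "SELECT name FROM customers WHERE id <= 2 "
    "UNION SELECT name FROM customers WHERE id >= 2 ORDER BY name"
).fetchall()
union_all = conn.execute(
    "SELECT name FROM customers WHERE id <= 2 "
    "UNION ALL SELECT name FROM customers WHERE id >= 2 ORDER BY name"
).fetchall()

print(inner)
print(left)
print(union, union_all)
```

A right join would give the same result as swapping the two tables in the left join.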

## 52. What is the difference between SQL and MySQL or SQL Server?

SQL (Structured Query Language) is the language developers use to access, manipulate, and store data in relational databases.

MySQL is an open-source relational database management system (RDBMS), one of the most popular examples of this class of software. It’s often used as a server for many types of web applications, in part because it supports multiple storage engines such as InnoDB. MySQL is owned by Oracle, although the source code remains open and community forks such as MariaDB exist. SQL Server is Microsoft’s commercial RDBMS. Both use their own dialects of SQL.

## 53. What does JOIN mean?

JOIN means “join” or “combine”. When two tables are joined, the JOIN keyword is used together with an ON condition that specifies which fields the tables have in common.

## 54. What’s a connection string?

A database connection string tells a client application where and how to connect to a database server. It typically includes the driver type (e.g., MySQL), hostname, port number, database name, and the user name and password for authentication, as well as any additional options. The command-line client takes the same information as flags; for example, the following connects as root, prompts for a password, and lists the available databases:

mysql -u root -p -e "SHOW DATABASES;"

A connection string typically specifies at least “driver”, “host”, “port”, “user”, and “password”.

**Data Modeling Interview Questions**

## 55. What is your favorite data visualization technique?

The best tool for the job! The more you know about what to use and when, the better. Data scientists should be comfortable with all of these tools in order to find the right one for any given task. Here are some popular techniques:

- Scatter plots
- Box plots
- Histograms
- Heatmaps
- Line graphs (also known as line charts)
- Stacked area charts
- Area maps
- Pie charts
- Density plots/dot plots

## 56. What do you think of Excel pivot tables?

Pivot tables can be very powerful for summarizing data and creating simple summaries of datasets. However, they can quickly become difficult to use if the goal is more advanced analysis or visualization.

## 57. Explain the 80/20 rule, and tell me about its importance in model validation.

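The 80/20 rule (the Pareto principle) states that roughly 80% of outcomes come from about 20% of causes; in data science, this often means that about 20% of the input features explain ~80% of the variation in the output values. In model validation, “80/20” usually refers to splitting the data so that 80% is used to train the model and the remaining 20% is held out to test it on data it has not seen.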
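In model validation, “80/20” is commonly used to mean holding out 20% of the data for testing and training on the other 80%. A minimal sketch of such a split in plain Python follows; the `train_test_split` helper is written here for illustration (libraries such as scikit-learn provide their own version), and the seed is an arbitrary choice.

```python
import random

def train_test_split(rows, test_fraction=0.2, seed=42):
    """Shuffle a copy of the rows and split it into train and test parts."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)  # seeded for a repeatable split
    cut = int(round(len(shuffled) * (1 - test_fraction)))
    return shuffled[:cut], shuffled[cut:]

rows = list(range(100))  # stand-in for 100 labeled examples
train, test = train_test_split(rows)
print(len(train), len(test))
```

Shuffling before splitting matters: if the rows are ordered (for example by date or by class), taking the last 20% without shuffling could produce a test set that does not resemble the training data.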

## 58. What do you mean by “data provenance?” What is its importance?

Data provenance refers to tracking and documenting where any given dataset came from and how it has been processed. It’s useful when attempting to reproduce results, because the original data sets and every processing step are documented and can be retraced later.

This lets us trust our analysis more, because we know which techniques were applied and how, instead of guessing based on an undocumented process. Reproducibility reduces risk as long as there is traceability back to known good sources (e.g., data sets) and processes.

## 59. What is the most important consideration when working with predictive modeling?

The most important consideration when working on a predictive model should be accuracy, although there are many other areas (such as cost) where you need to invest time. Accuracy comes first because without it everything else becomes useless.

However, we also want our models to have good coverage: they should generalize beyond the cases that are easy to predict. If you only make predictions about easy cases, your results won’t be very accurate for anything outside the base set of input features.

For example, if your model predicts the chance of a heart attack in people over 65 and you use only age as an input feature, ignoring things like gender or medications, it will not be able to accurately predict risk for people outside that narrow set.

## 60. What is root cause analysis?

Root cause analysis is the process of looking for factors that can be fixed or changed in order to improve a system.

## 61. How does root cause analysis work?

Root cause analysis is done by breaking down problems into smaller parts and understanding how each part contributed to the problem, what may have caused it, and who might be able to fix it. Root causes are then ranked based on which ones offer the most potential for improvement. This ranking helps data scientists target their efforts so they’re not wasting time solving things that will never get better.

## 62. How is k-NN different from k-means clustering?

k-NN is a supervised classification algorithm that assigns an observation to a category based on the labels of the k most similar previous observations. k-means, by contrast, is an unsupervised clustering algorithm that groups unlabeled data around k cluster centers. k-NN needs labeled training data and uses no cluster centers; instead, each prediction is made from the query point’s nearest neighbors.
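A minimal k-NN classifier fits in a few lines, which helps show why it needs labels and no cluster centers. The points, labels, and `knn_predict` helper below are invented for illustration; `math.dist` requires Python 3.8 or later.

```python
import math
from collections import Counter

def knn_predict(train_points, train_labels, query, k=3):
    """Label the query with the majority label among its k nearest neighbors."""
    by_distance = sorted(
        zip(train_points, train_labels),
        key=lambda pair: math.dist(pair[0], query),
    )
    votes = Counter(label for _, label in by_distance[:k])
    return votes.most_common(1)[0][0]

# Hypothetical labeled training data: two well-separated groups.
points = [(1.0, 1.0), (1.2, 0.8), (0.9, 1.1), (5.0, 5.0), (5.2, 4.9), (4.8, 5.1)]
labels = ["small", "small", "small", "large", "large", "large"]

prediction = knn_predict(points, labels, (1.1, 1.0))
print(prediction)
```

k-means would start from the same points but without the `labels` list, and would have to discover the two groups on its own by iteratively moving cluster centers.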

## 63. What are some techniques for anomaly detection?

Some techniques that can be employed in order to detect anomalies include:

- Quantile regression methods: model the normal range of values over time so that observations falling outside the expected quantiles can be identified as outliers
- Outlier analysis via standard deviation or variance: these measures show how much variation there is in our data set, so points far from the mean can be flagged
- Nonlinear transformations, such as logarithmic or power transformations, which change the scale of the data and can make hidden patterns visible
- Data mining to discover patterns in data

## 64. What are the advantages of using machine learning for anomaly detection?

Machine learning is a good technique because it can handle noisy and complicated data sets that humans may not be able to process as quickly or efficiently. Models can also be retrained on their past mistakes, which improves future predictions. They apply the same criteria consistently to every observation, although they can still inherit bias from their training data; furthermore, they save time on manual analysis tasks while providing automated insights into what might need attention next.

The disadvantages of this approach include the high up-front cost, the technical expertise required to build the models and maintain them over time, and the difficulty of assigning accurate costs to predictions that are based on probabilities.

## 65. What are hash table collisions?

A hash table is a data structure that uses a hash function to map each key to a slot (bucket) in a table. Hash tables provide very fast lookups and are flexible about key types (integers, strings, and so on). A collision occurs when two different keys hash to the same bucket; since that depends entirely on how the keys hash, collisions cannot be ruled out in advance. Common ways to handle them are chaining (each bucket stores a list of entries) and open addressing (probing for another free slot).
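A collision happens when two different keys land in the same bucket. The sketch below shows one common way to cope with that, separate chaining, where each bucket holds a small list of entries. The `ChainedHashTable` class is written for illustration only (Python’s built-in `dict` uses a different strategy), and the bucket count is kept tiny so that a collision is easy to trigger.

```python
class ChainedHashTable:
    """Toy hash table that resolves collisions by chaining entries in a list."""

    def __init__(self, num_buckets=4):
        self.buckets = [[] for _ in range(num_buckets)]

    def _index(self, key):
        # Map the key's hash onto one of the available buckets.
        return hash(key) % len(self.buckets)

    def put(self, key, value):
        bucket = self.buckets[self._index(key)]
        for i, (existing_key, _) in enumerate(bucket):
            if existing_key == key:
                bucket[i] = (key, value)  # overwrite an existing key
                return
        bucket.append((key, value))  # new key: chain it onto the bucket

    def get(self, key):
        for existing_key, value in self.buckets[self._index(key)]:
            if existing_key == key:
                return value
        raise KeyError(key)

table = ChainedHashTable(num_buckets=4)
table.put(1, "one")
table.put(5, "five")  # 1 % 4 == 5 % 4 == 1, so both keys share bucket 1

print(table.buckets[1])
print(table.get(1), table.get(5))
```

Lookups stay correct despite the collision because each entry stores its key, but long chains slow lookups down, which is why real implementations grow the table as it fills.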

## 66. What are your thoughts on sparse optimization and how do you deal with these types of problems?

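Sparse optimization problems are those in which the solution is expected to have only a few nonzero entries, or in which most of the data or constraint coefficients are zero. They usually involve finding the optimal value of the given variables subject to all of the constraints while encouraging or exploiting that sparsity, for example by adding an L1 penalty (as in Lasso regression) and by using sparse data structures so that computation scales with the number of nonzero entries.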
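One common way to obtain sparse solutions is an L1 penalty solved with proximal gradient descent (ISTA). The sketch below is an illustration under stated assumptions, not a production solver: the function names, the toy data, the penalty `lam` and the step size are all made up for the example, and it minimizes `0.5 * ||Xw - y||^2 + lam * ||w||_1` in plain Python.

```python
def soft_threshold(value, threshold):
    """Proximal operator of the L1 norm: shrink toward zero, small values become exactly 0."""
    if value > threshold:
        return value - threshold
    if value < -threshold:
        return value + threshold
    return 0.0


def lasso_ista(X, y, lam, step, iterations=100):
    """Minimize 0.5 * ||Xw - y||^2 + lam * ||w||_1 with proximal gradient steps."""
    n_features = len(X[0])
    w = [0.0] * n_features
    for _ in range(iterations):
        # Residuals Xw - y for every row.
        residuals = [sum(x * wj for x, wj in zip(row, w)) - target
                     for row, target in zip(X, y)]
        # Gradient of the smooth part: X^T (Xw - y).
        gradient = [sum(row[j] * r for row, r in zip(X, residuals))
                    for j in range(n_features)]
        # Gradient step followed by soft-thresholding.
        w = [soft_threshold(wj - step * gj, step * lam)
             for wj, gj in zip(w, gradient)]
    return w


# Toy data (assumed): only the first feature carries a strong signal.
X = [[1, 0, 0], [0, 1, 0], [0, 0, 1], [1, 0, 0]]
y = [3.0, 0.2, -0.1, 3.0]

weights = lasso_ista(X, y, lam=0.5, step=0.5)
print(weights)  # the two weak coefficients are driven to exactly zero
```

The step size must not exceed the reciprocal of the largest eigenvalue of `X^T X` (here 2, so a step of 0.5 is safe); larger steps can make the iteration diverge.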

## 67. What is data clustering?

Data clustering refers to any type of task where one tries to group (or cluster) data points based on shared characteristics, without the groups being labeled beforehand; the algorithm forms the groups according to a chosen measure of similarity. A classic example might be grouping students by the courses they took, because they share similar levels of interest in certain subjects; another would be grouping people together based on their age and hobbies/interests so we can better predict how they would react to advertising.
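As a minimal sketch of the idea, the following one-dimensional k-means groups people by age. The function name `kmeans_1d`, the sample ages and the starting centroids are assumptions made for illustration; in practice a library implementation would normally be used.

```python
def kmeans_1d(values, centroids, iterations=10):
    """Tiny 1-D k-means: alternate between assigning points and updating centroids."""
    centroids = list(centroids)
    clusters = []
    for _ in range(iterations):
        clusters = [[] for _ in centroids]
        for v in values:
            # Assign each value to its nearest centroid.
            nearest = min(range(len(centroids)), key=lambda i: abs(v - centroids[i]))
            clusters[nearest].append(v)
        # Move each centroid to the mean of its cluster (keep it if the cluster is empty).
        centroids = [sum(c) / len(c) if c else old
                     for c, old in zip(clusters, centroids)]
    return centroids, clusters


ages = [18, 19, 21, 22, 45, 47, 50, 52]  # hypothetical sample data
centroids, clusters = kmeans_1d(ages, centroids=[20, 40])

print(centroids)  # one centroid per discovered age group
print(clusters)   # the members of each group
```

Note that no labels are supplied: the two groups emerge purely from how close the values are to each other.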

## 68. What is data classification?

Data classification refers to any type of task where one tries to assign class labels (also known as tags) to observations based on their input variables. The model learns from examples whose labels are already known and then predicts the label of new, unseen observations.

Classification tasks have many real-world applications: labeling people based on age and sex so we can better predict their preference for a certain product; assigning grades such as A+, B+, etc.; predicting whether someone will develop chronic diseases later in life; and detecting fraud in financial transactions.
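A small k-nearest-neighbors classifier shows how labeled examples drive the prediction. This is only a sketch: the function name `knn_predict`, the two made-up features and the "ok"/"fraud" labels are assumptions for illustration, not real transaction data.

```python
from collections import Counter


def knn_predict(train_points, train_labels, query, k=3):
    """Predict the label of `query` by majority vote among its k nearest training points."""
    distances = sorted(
        (sum((a - b) ** 2 for a, b in zip(point, query)), label)
        for point, label in zip(train_points, train_labels)
    )
    votes = Counter(label for _, label in distances[:k])
    return votes.most_common(1)[0][0]


# Hypothetical labeled examples: (feature_1, feature_2) -> label.
train_points = [(1, 1), (1, 2), (2, 1), (8, 8), (9, 8), (8, 9)]
train_labels = ["ok", "ok", "ok", "fraud", "fraud", "fraud"]

print(knn_predict(train_points, train_labels, query=(2, 2)))
print(knn_predict(train_points, train_labels, query=(9, 9)))
```

Unlike clustering, the categories here are fixed in advance by the labels in the training data.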

## 69. What are some situations where a general linear model fails?

General Linear Models (GLMs for short) can fail when their assumptions do not hold: when the relationship between the inputs and the output is not linear, when there is a large number of input variables that are highly correlated or that outnumber the observations, or when the output variable is categorical rather than continuous. A common example is the prediction of weight in pounds from height in inches; while it might seem logical that a linear equation describes this relationship, the body mass index divides weight in kilograms by height squared, which suggests the relationship is not a straight line.

When the output is a category rather than a number, a better way to solve such problems is through classification models such as Logistic Regression (LR), Support Vector Machine (SVM), or Naïve Bayes Classifier (NBC). These tools assign class labels to observations based on their input variables.

## 70. When a new algorithm is developed, how do you know that it’s better than the old one?

In order to know whether your changes are an improvement, you need to measure the performance of both models on the same data. To do this we can calculate a confusion matrix for each model using a held-out validation dataset and compare them on metrics such as accuracy, precision, and recall, or compare their AUC.

In addition, one could train two different models, say an LR model and an SVM classifier, evaluate both on the same validation set, and generate a confusion matrix for each. If they have similar accuracies, it suggests there is little difference between the algorithms in terms of predicting outcomes, although cross-validation or a statistical test is needed to confirm this. When the models are scored on different metrics, such as ROC curves versus cross-entropy / cost function values, it becomes more difficult to draw conclusions about which method performs better.
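The comparison can be sketched in a few lines of Python. The labels and the two sets of predictions below are invented for illustration, and the helper names `confusion_matrix` and `accuracy` are assumptions rather than any particular library's API.

```python
def confusion_matrix(y_true, y_pred):
    """Return (tp, fp, fn, tn) counts for binary labels encoded as 0/1."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    return tp, fp, fn, tn


def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    return sum(1 for t, p in zip(y_true, y_pred) if t == p) / len(y_true)


# Hypothetical validation labels and predictions from the old and new models.
y_true   = [1, 0, 1, 1, 0, 0, 1, 0]
old_pred = [1, 0, 0, 1, 0, 1, 1, 0]
new_pred = [1, 0, 1, 1, 0, 0, 1, 1]

for name, pred in [("old", old_pred), ("new", new_pred)]:
    print(name, confusion_matrix(y_true, pred), accuracy(y_true, pred))
```

Both models are scored against the same `y_true`, which is what makes the numbers comparable. With a validation set this small, a difference of one or two predictions is not meaningful on its own.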

In closing …

These interview questions are meant to gauge a candidate’s knowledge in areas like programming, data preparation, modeling, and visualization. Candidates with at least two years’ worth of experience should have no trouble answering these examples.

This list of 70 must-know questions for interviewing a data scientist is designed to give you the confidence to make an informed decision about who you hire.

We hope this was helpful and that you will find these questions useful when conducting your next round of interviews.

Good luck!