

According to new data from Glassdoor, “data scientist” is the highest-paying entry-level job in the U.S., and LinkedIn lists data scientist among the top 10 jobs in the United States. Data science profiles have grown more than 400 times over the past year, as quoted by ‘The Economic Times’. With thousands of start-ups launched in India in the last 3 years, the demand for data scientists has shot up in India as well. It is fair to conclude that data science jobs are among the hottest jobs today and that data scientists are the rock stars of this era. Data science isn’t a simple field to get into, and this is something all data scientists will agree on. Apart from having a degree in mathematics, statistics or engineering, data scientists need to be proficient in programming, statistics and machine learning techniques.

So, if you would like to start your career as a data scientist, you must be wondering what kind of questions are asked in a data science interview. Here’s a list of the most popular data science interview questions you can expect to face. With the help of these questions you can aim to impress potential employers by knowing your subject and being able to show them the practical implications of data science.


1) What is Data Science?

According to Wikipedia, data science is an interdisciplinary field that uses scientific methods, processes, algorithms and systems to extract knowledge and insights from structured and unstructured data. It is closely related to data mining, deep learning and big data.

2) How can you differentiate between data science, artificial intelligence and machine learning?

Artificial Intelligence

  • Definition: A science that studies ways to develop intelligent programs that give computers the capability to solve problems which have been a human prerogative.
  • Popular tools: Caffe, scikit-learn, TensorFlow, Google Cloud AI Platform, Keras.
  • Skills required: Mathematics (statistics, algebra, probability, calculus, logic, Bayesian algorithms, etc.); science, including physics, mechanics and cognitive behaviour; computer science, including data structures, programming and efficiency.
  • Popular applications: Chatbots, voice assistants, immersive visual experiences, etc.

Machine Learning

  • Definition: Uses programs that learn from data without being explicitly programmed to do so. It fits within data science and is an application of artificial intelligence.
  • Popular tools: Amazon Machine Learning, IBM Watson Studio, Microsoft Azure ML Studio, Google TensorFlow.
  • Skills required: Knowledge of statistics and probability, expertise in computer fundamentals and programming, data modelling and evaluation.
  • Popular applications: Recommendation engines such as Netflix viewing suggestions and Spotify, self-driving cars such as Waymo cars (which use ML to understand their surroundings), and facial recognition.

Data Science

  • Definition: Includes sourcing, cleaning and processing data to extract meaningful insight from it for analytical purposes. It is a broader term and includes various data operations.
  • Popular tools: SAS, BigML, Excel, MATLAB, Tableau.
  • Skills required: Strong knowledge of SQL, data-based coding, programming languages like Python, R, SAS and Scala, machine learning, etc.
  • Popular applications: Targeted advertising, fraud detection, gaming, healthcare analysis, etc.

3) Explain what is meant by supervised and unsupervised learning. Also mention their types.

  • Supervised machine learning: This approach uses historical data to understand behaviour and predict outcomes for unseen data. Just like learning in the presence of a teacher, the algorithm learns from labelled data, that is, data already tagged with the correct answer, which makes it suitable for many real-world problems. For example, to estimate how long it takes to reach home from the office, the algorithm learns from past trips whose travel times are known, using features such as weather conditions, time of day and holidays. Regression and classification are the two types of supervised machine learning.
  • Unsupervised machine learning: This type of ML algorithm works on unclassified or unlabelled data and does not require us to supervise the model. It focuses on discovering hidden structures in unlabelled data and can take on more complex processing tasks than supervised learning. However, it can be more unpredictable than other learning techniques. Clustering and association are the two types of unsupervised learning.

4) Which language is most suitable for text analysis, Python or R?

Programmers often face a dilemma over whether to use Python or R. Python is generally preferred by developers for data analysis, text processing and applying statistical tools, whereas R is often chosen by engineers, statisticians and scientists who do not have a strong programming background.


5) What are the data types used in Python?

Python has the following built-in data types:

  • Python Numbers: Integers, floating-point numbers and complex numbers.
  • Python List: An ordered, mutable sequence of items.
  • Python String: A sequence of Unicode characters.
  • Python Tuple: An ordered sequence like a list, but one that cannot be modified.
  • Python Set: An unordered collection of unique items.
  • Python Dictionary: A collection of key-value pairs (insertion-ordered since Python 3.7), useful for retrieving values quickly by key from large amounts of data.
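
As a quick illustration of the types listed above, here is a minimal sketch. The variable names and values are hypothetical and chosen only for demonstration.

```python
# Illustrative values only; every name here is hypothetical.
count = 42                      # int
price = 3.14                    # float
z = 2 + 3j                      # complex
items = [1, 2, 3]               # list: ordered and mutable
name = "data"                   # str: sequence of Unicode characters
point = (1, 2)                  # tuple: ordered and immutable
unique = {1, 2, 2, 3}           # set: the duplicate 2 is dropped
ages = {"ann": 31, "bob": 27}   # dict: key-value pairs

items.append(4)                 # lists can be modified in place

# Collect the type name of each value to see which built-in type it is.
type_names = [
    type(v).__name__
    for v in (count, price, z, items, name, point, unique, ages)
]
print(type_names)
```

Trying `point[0] = 5` would raise a `TypeError`, which is the practical difference between a tuple and a list.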

6) What Are the Types of Biases That Can Occur During Sampling?

The various types of bias that can occur in research are:

  • Selection Bias 
  • Confirmation bias
  • Outliers
  • Overfitting and underfitting
  • Confounding variables.

7) What is Survivorship Bias?

Survivorship bias is a form of selection bias. It is the logical error of focusing on the people, businesses, strategies or aspects that succeeded and overlooking those that did not because of their lack of prominence. This can distort the facts and lead to wrong conclusions.

8) What is selection bias?

Selection bias, or sampling bias, is an experimental error that occurs when the sample data gathered and prepared for modelling is not representative of the true, future population of cases the model will see. That is, active selection bias occurs when a subset of the data is systematically (i.e., non-randomly) excluded from analysis. It is also called the selection effect.

There are various types of selection bias such as:

  • Sampling bias: A systematic error that occurs when a non-random sample of a population is selected, so that some members of the population are less likely to be included than others.
  • Time interval bias: A trial may be terminated early at an extreme value (often for ethical reasons), but the extreme value is likely to be reached by the variable with the largest variance, even if all variables have a similar mean.
  • Data: When specific subsets of data are chosen to support a conclusion, or data is rejected as bad on arbitrary grounds.
  • Attrition: A kind of selection bias caused by attrition (loss of participants), that is, by discounting tests that did not run to completion.

9) Explain how a ROC curve works.

A ROC (receiver operating characteristic) curve is a graphical representation of the trade-off between the true positive rate and the false positive rate at various classification thresholds. The area under the curve (AUC) measures separability, that is, how well the model can distinguish between classes: an AUC of 1 means perfect separation and 0.5 is no better than random guessing. ROC curves are used in many areas such as medicine, radiology, natural hazards and machine learning.
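
To make the idea of sweeping thresholds concrete, here is a minimal sketch that computes ROC points and the AUC in plain Python. The function names, labels and scores are made up for illustration, and the trapezoidal rule is just one simple way to compute the area.

```python
def roc_points(labels, scores):
    """Return (FPR, TPR) pairs, one per threshold, from strictest to loosest."""
    positives = sum(labels)
    negatives = len(labels) - positives
    points = [(0.0, 0.0)]  # a threshold above every score predicts no positives
    for threshold in sorted(set(scores), reverse=True):
        tp = sum(1 for y, s in zip(labels, scores) if s >= threshold and y == 1)
        fp = sum(1 for y, s in zip(labels, scores) if s >= threshold and y == 0)
        points.append((fp / negatives, tp / positives))
    return points


def auc(points):
    """Area under the ROC curve using the trapezoidal rule."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2
    return area


# Hypothetical true labels (1 = positive) and model scores.
labels = [1, 1, 0, 1, 0, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.2]

curve = roc_points(labels, scores)
area = auc(curve)
print(curve)
print(round(area, 4))
```

Here 8 of the 9 positive-negative pairs are ranked correctly by the scores, which is why the area comes out at 8/9.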

10) Explain Star Schema?

It is a traditional database schema with a central table. Satellite tables map IDs to physical names or descriptions and can be connected to the central fact table using the ID fields; these tables are known as lookup tables and are principally useful in real-time applications, as they save a lot of memory. Sometimes star schemas involve several layers of summarization to recover information faster.

11) Which technique is used to predict categorical responses?

Classification algorithms are the ML techniques used for predicting categorical outcomes.

12) What is logistic regression? Or State an example when you have used logistic regression recently?

Logistic regression, often referred to as the logit model, is a regression technique for predicting a binary outcome from a linear combination of predictor variables. For example, suppose you want to predict whether a particular political leader will win an election or not. In this case, the outcome is binary, i.e. 0 or 1 (lose/win). The predictor variables would be the amount of money spent on the candidate’s election campaign, the amount of time spent campaigning, etc.
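
The election example can be sketched as a sigmoid applied to a linear combination of the two predictors. The coefficients below are hypothetical and not fitted to any real data; in practice they would be estimated from past elections.

```python
import math


def predict_win_probability(spend, hours, weights, bias):
    """Logistic model: sigmoid of a linear combination of the predictors."""
    z = bias + weights[0] * spend + weights[1] * hours
    return 1.0 / (1.0 + math.exp(-z))


# Hypothetical coefficients (assumed, not learned from data).
weights = (0.8, 0.5)
bias = -3.0

# Hypothetical candidate: 2.0 units of money spent, 3.0 units of time.
p = predict_win_probability(spend=2.0, hours=3.0, weights=weights, bias=bias)
predicted_class = 1 if p >= 0.5 else 0  # 1 = win, 0 = lose
print(round(p, 3), predicted_class)
```

The 0.5 cut-off is a common default, but the threshold can be moved depending on the cost of each kind of error.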

13) Is it possible to perform logistic regression with Microsoft Excel?

Yes, it is possible to perform logistic regression with Microsoft Excel, for example by using the Solver add-in to maximise the log-likelihood of the model.

14) How can you assess a good logistic model?

There are various methods to assess the results of a logistic regression analysis-

  • Using Classification Matrix to look at the true negatives and false positives.
  • Concordance that helps identify the ability of the logistic model to differentiate between the event happening and not happening.
  • Lift helps assess the logistic model by comparing it with random selection.

15)  What are recommender systems?

A recommender system (or recommendation engine) is a subclass of information filtering systems that analyses users’ history and behaviour to predict the preference or rating that a user would give to a product. Recommender systems are widely used for movies, news, social tags, music, etc.

16) What is Collaborative filtering?

Collaborative filtering is the technique used by most recommender systems to find patterns or information by combining the viewpoints of many users, various data sources and multiple agents. Most websites such as Amazon, Netflix and YouTube use collaborative filtering for their recommendations.

17) What are various steps involved in an analytics project?

  • Understanding the business problem that is clearly defining what objectives the company wants to achieve with this project.
  • Obtaining and Extracting the necessary data and becoming familiar with it.
  • Polishing the data for modelling by detecting outliers, treating missing values, transforming variables, etc. in order to avoid distorted results.
  • After data preparation, the collected data is modelled, treated and analysed using regression formulas and algorithms to predict future results. This is an iterative step till the best possible outcome is achieved.
  • The next step is to interpret the data and gather insights to find generalisations and patterns to apply to future data.
  • Start implementing the model and track the result to analyse the performance of the model over the period of time.

18) Why does data cleaning play a vital role in analysis?

Cleaning data from multiple sources to transform it into a format that data analysts can work with is a cumbersome and time-consuming process, but it is crucial because data usually arrives in raw form and ML algorithms work poorly on raw, inconsistent data. As the number of data sources increases, the time taken to clean the data grows rapidly, and cleaning might take up to 80% of the total time, making it a critical part of the analysis task.

19) What do you understand by linear regression?

Linear regression is one of the most well-known and well-understood algorithms in ML. It models the linear relationship between a dependent variable and one or more independent variables.

It is a supervised learning algorithm and belongs to both statistics and ML. The independent variable is the predictor and the dependent variable is the response. In linear regression, we try to understand how the dependent variable changes with respect to the independent variable.

There are two types of linear regression: simple linear regression, with one independent variable, and multiple linear regression, with several.
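
For simple linear regression, the line can be fitted with ordinary least squares. The sketch below is a minimal version with made-up data; the function name `fit_line` is hypothetical.

```python
def fit_line(xs, ys):
    """Ordinary least squares fit for y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Covariance of x and y divided by the variance of x gives the slope.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Made-up data that lies exactly on y = 2x + 1.
xs = [1, 2, 3, 4, 5]
ys = [3, 5, 7, 9, 11]

slope, intercept = fit_line(xs, ys)
print(slope, intercept)
```

With real, noisy data the fitted line will not pass through every point, and the residuals are what the assumptions of linear regression are checked against.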

20) What are the assumptions required for linear regression?

There are four major assumptions:

1. There is a linear and additive relationship between the dependent variable (response) and the regressors, meaning the model you are creating actually fits the data.

2. The errors, or residuals, are normally distributed and independent of each other, that is, they are not correlated.

3. There is minimal multicollinearity between the explanatory variables.

4. Homoscedasticity, which means “having the same scatter”: the variance around the regression line is the same for all values of the predictor variable.

21) What is the difference between regression and classification ML techniques?

Both regression and classification come under supervised machine learning. Classification involves discovering a model or function that separates the data into multiple categorical classes, i.e. discrete values, whereas regression involves finding a model or function that maps the data to continuous real values. The predicted values in classification are unordered, whereas those in regression are ordered. Decision trees and logistic regression are examples of classification algorithms, whereas random forest and linear regression are examples of regression algorithms (tree-based methods such as decision trees and random forests can in fact be used for both tasks).

22) What are the drawbacks of the linear model?

  • It assumes a linear, straight-line relationship between the dependent and independent variables, which does not always hold.
  • It can’t be used for count outcomes or binary outcomes.
  • There are overfitting problems that it can’t solve.
  • It assumes that the observations are independent, which is often but not always true.

23) Explain what regularization is and why it is useful?

Regularisation is the process of modifying the learning algorithm to improve its generalisation performance rather than its performance on the training set. It regularises, or shrinks, the coefficients towards zero. In simple words, regularisation discourages learning an overly complex or flexible model in order to prevent overfitting.

24) Explain L1 and L2 Regularization?

Both L1 and L2 regularization are used to avoid overfitting. The key difference between them is the penalty term: L1 regularization (Lasso) adds the sum of the absolute values of the coefficients to the loss, whereas L2 regularization (Ridge) adds the sum of their squares. L1 can shrink some coefficients to exactly zero, which effectively removes those features from the model, while L2 shrinks coefficients towards zero without usually eliminating them. L1 therefore works much better for feature selection when we have a huge number of features.
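
The two penalty terms can be written out directly. In this minimal sketch the coefficient values and the strength `lam` are hypothetical; in a real model the penalty is added to the training loss before optimisation.

```python
def l1_penalty(coefs, lam):
    """Lasso penalty: lam times the sum of absolute coefficient values."""
    return lam * sum(abs(c) for c in coefs)


def l2_penalty(coefs, lam):
    """Ridge penalty: lam times the sum of squared coefficient values."""
    return lam * sum(c * c for c in coefs)


# Hypothetical model coefficients and regularisation strength.
coefs = [0.5, -2.0, 0.0, 3.0]
lam = 0.1

l1 = l1_penalty(coefs, lam)   # 0.1 * (0.5 + 2.0 + 0.0 + 3.0)
l2 = l2_penalty(coefs, lam)   # 0.1 * (0.25 + 4.0 + 0.0 + 9.0)
print(l1, l2)
```

Note how the squared term makes the L2 penalty grow much faster for large coefficients, while the L1 penalty treats every unit of coefficient size the same.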

25) Differentiate between univariate, bivariate and multivariate analysis?

Descriptive statistical analysis techniques that involve a single variable at a given point in time are called univariate analysis, one of the simplest forms of statistical analysis. A pie chart of sales by territory involves only one variable; bar charts and histograms are other examples of univariate analysis.

If the analysis attempts to understand the relationship between two variables, and its extent, it is called bivariate analysis. The variables are generally denoted x and y, where one is dependent and the other independent. Analysing the volume of sales against spending is an example of bivariate analysis. These analyses are often used in quality-of-life research.

Analysis that studies more than two variables, typically several independent variables and their effect on a dependent variable, is referred to as multivariate analysis. Because of the size and complexity of the underlying data sets, multivariate analysis requires much computational effort.

26) What do you mean by the law of large numbers?

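This is a principle of probability according to which, as an experiment is repeated a large number of times, the average of the results approaches the expected value. Equivalently, the observed frequencies of events that have the same likelihood even out after a significant number of trials.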
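
A small simulation shows the principle at work. The sketch below flips a simulated fair coin and reports the proportion of heads; the function name and the seed are arbitrary choices made for reproducibility.

```python
import random


def proportion_of_heads(n_flips, seed=0):
    """Simulate n_flips fair coin tosses and return the share of heads."""
    rng = random.Random(seed)  # seeded so the run is reproducible
    heads = 0
    for _ in range(n_flips):
        if rng.random() < 0.5:
            heads += 1
    return heads / n_flips


# The proportion should drift towards 0.5 as the number of flips grows.
for n in (10, 1_000, 100_000):
    print(n, proportion_of_heads(n))
```

Small runs can be far from 0.5, while the largest run should land very close to it, which is the evening-out the law describes.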

27) What is a statistical interaction?

It is a statistically established relation between two or more variables when the effect of one factor (input variable) on the dependent variable (output variable) differs among levels of another factor.

28) What do you understand by the term Normal Distribution?

A distribution in which most values cluster in the middle of the range and the rest taper off symmetrically to the left and right is called a normal distribution. Its graph is a symmetrical bell-shaped curve with the peak always in the middle. For example, if we measure height, most people are of average height and very small numbers of people are much taller or shorter than average. The mean, mode and median are the same in a normal distribution.

29) What is the goal of A/B Testing?

The goal of A/B testing is to identify which changes to a web page maximise or increase an outcome of interest, by comparing two versions (A and B) with a statistical hypothesis test. A/B testing is a great method for figuring out the best online promotional and marketing strategies for your business. It can be used to test everything from website copy to sales emails to search ads, and has thus become a staple for marketers.

An example of this could be identifying the click-through rate for a banner ad.

30) What is p-value?

When you perform a hypothesis test in statistics, the p-value is the probability of observing results at least as extreme as those actually observed, assuming the null hypothesis is true. It helps you determine the strength of your results and whether there is evidence to reject the null hypothesis. The p-value is a number between 0 and 1.

A low p-value (≤ 0.05) indicates strong evidence against the null hypothesis, so we reject the null hypothesis. A high p-value (> 0.05) indicates weak evidence against the null hypothesis, so we fail to reject it. A p-value very close to 0.05 is marginal and could go either way. In short, with high p-values your data are consistent with a true null, whereas with low p-values they are not.
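
As an illustration, the sketch below computes a two-sided p-value for a z-test statistic using the standard normal distribution from Python's `statistics` module. The z values are made up, and the 0.05 cut-off is the conventional threshold mentioned above rather than a universal rule.

```python
from statistics import NormalDist


def two_sided_p_value(z):
    """Probability of a standard normal value at least as extreme as z."""
    return 2 * (1 - NormalDist().cdf(abs(z)))


alpha = 0.05  # conventional significance level

for z in (0.5, 1.96, 3.0):  # hypothetical test statistics
    p = two_sided_p_value(z)
    decision = "reject the null" if p <= alpha else "fail to reject the null"
    print(z, round(p, 4), decision)
```

A z-test is assumed here for simplicity; other tests (t-test, chi-square and so on) use different distributions but the decision rule is the same.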

31) What is the difference between Point Estimates and Confidence Interval?

A point estimate gives statisticians a single value as the estimate of a given population parameter. Point estimates are subject to bias, where the bias is the difference between the expected value of the estimator and the true value of the population parameter. A well-defined formula is used to calculate point estimates; the method of moments and maximum likelihood estimation are used to derive point estimators for population parameters.

A confidence interval gives us a range of values that is likely to contain the population parameter. It is generally preferred, as its lower and upper limits serve as the bounds of the interval and it tells us how likely the interval is to contain the population parameter. This likelihood is called the confidence level or confidence coefficient and is represented by 1 - alpha, where alpha is the level of significance. How precise the interval is depends on the sample statistics and the margin of error.
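
The difference can be seen by computing both for the same sample. This is a minimal sketch that assumes a normal approximation for the sample mean; for small samples a t-distribution would usually be more appropriate. The sample values are made up.

```python
import math
from statistics import NormalDist, mean, stdev


def mean_confidence_interval(sample, confidence=0.95):
    """Return the point estimate of the mean and a normal-approximation interval."""
    alpha = 1 - confidence
    z = NormalDist().inv_cdf(1 - alpha / 2)        # critical value, about 1.96 for 95%
    point_estimate = mean(sample)
    margin = z * stdev(sample) / math.sqrt(len(sample))  # margin of error
    return point_estimate, (point_estimate - margin, point_estimate + margin)


# Hypothetical measurements.
sample = [10, 12, 11, 13, 9, 11, 12, 10]

estimate, (low, high) = mean_confidence_interval(sample)
print(estimate, round(low, 2), round(high, 2))
```

The point estimate is a single number, while the interval widens or narrows with the sample's spread, its size and the chosen confidence level.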

32) What is Interpolation and Extrapolation?

Interpolation is a very useful statistical and mathematical tool for estimating an unknown value that lies between two known values in a list of known values. Investors and stock analysts, for instance, often use line charts, which interpolate between recorded prices, to visualise changes in the price of securities. Extrapolation is approximating a value by extending a known set of values or facts beyond their range, that is, estimating something on the assumption that the present trend continues. It is an important concept not only in mathematics but also in other disciplines like psychology, sociology and statistics.
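
Both ideas can be shown with one straight-line formula: the same function interpolates when the query lies between the known points and extrapolates when it lies outside them. The data points and the function name are hypothetical, and a linear trend is assumed.

```python
def linear_estimate(x, x0, y0, x1, y1):
    """Estimate y at x from the straight line through (x0, y0) and (x1, y1)."""
    slope = (y1 - y0) / (x1 - x0)
    return y0 + slope * (x - x0)


# Two known observations: (2, 20) and (3, 30).
interpolated = linear_estimate(2.5, 2, 20, 3, 30)  # x lies between the known points
extrapolated = linear_estimate(5, 2, 20, 3, 30)    # x lies beyond the known points
print(interpolated, extrapolated)
```

Extrapolated values deserve more caution than interpolated ones, because nothing in the data confirms that the trend continues outside the observed range.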

33) What is power analysis?

Statistical power analysis is an experimental design technique for determining the probability of detecting an effect of a given size with a given level of confidence, under sample size constraints.

34)  What is K-means? How can you select K for K-means?

According to Wikipedia, k-means clustering is a method of vector quantization, originally from signal processing, that aims to partition n observations into k clusters in which each observation belongs to the cluster with the nearest mean (cluster centre or cluster centroid), serving as a prototype of the cluster.

You can choose the number of clusters visually, for example with the elbow method, but there is a lot of ambiguity. In practice you experiment on a sample data set, look at the results for different values of K, and pick the one that works best.
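
One common way to compare different values of K is to run the algorithm for each and look at the within-cluster sum of squares (inertia), a heuristic often called the elbow method. The sketch below is a simplified one-dimensional version with a deterministic initialisation chosen for this example; real implementations work in many dimensions and usually use random or k-means++ starts.

```python
def kmeans_1d(values, k, iterations=20):
    """Lloyd's algorithm on 1-D data; returns (centroids, inertia)."""
    ordered = sorted(values)
    # Deterministic start (an assumption of this sketch): spread the
    # initial centroids evenly across the sorted data.
    centroids = [ordered[(2 * i + 1) * len(ordered) // (2 * k)] for i in range(k)]
    for _ in range(iterations):
        # Assignment step: each value joins its nearest centroid.
        clusters = [[] for _ in range(k)]
        for v in values:
            nearest = min(range(k), key=lambda i: abs(v - centroids[i]))
            clusters[nearest].append(v)
        # Update step: each centroid moves to the mean of its cluster.
        centroids = [
            sum(c) / len(c) if c else centroids[i]
            for i, c in enumerate(clusters)
        ]
    inertia = sum(min((v - c) ** 2 for c in centroids) for v in values)
    return centroids, inertia


# Made-up data with three obvious groups around 1, 5 and 9.
data = [1.0, 1.2, 0.8, 5.0, 5.2, 4.8, 9.0, 9.2, 8.8]

inertia_by_k = {k: kmeans_1d(data, k)[1] for k in range(1, 5)}
for k, inertia in inertia_by_k.items():
    print(k, round(inertia, 2))
```

The inertia drops sharply up to K = 3 and barely improves afterwards, so the "elbow" of the curve suggests three clusters for this data.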

35) How is K-NN different from K-means clustering?

Both K-NN and K-means may seem similar and easy to confuse with one another, because they both involve comparing the distances of a given input data point to a set of other stored data points but they tackle different problems.

  1. K-NN is a supervised machine learning technique, whereas K-means is an unsupervised one.

  2. K-NN is a classification or regression algorithm, while K-means is a clustering algorithm.

  3. K-NN is a lazy learner, which means it does not have a training phase, whereas K-means is an eager learner, which means it has a model-fitting (training) step.

  4. K-NN is one of the simplest ML algorithms and performs much better if all of the features are on the same scale. Because K-means is also distance-based, it generally benefits from scaled features as well.
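
To contrast with clustering, here is a minimal K-NN classifier sketch. It needs labelled training points and does no work until a query arrives, which is the "lazy learner" behaviour described above. The points, labels and function name are made up for illustration.

```python
import math
from collections import Counter


def knn_predict(train_points, train_labels, query, k=3):
    """Classify query by majority vote among its k nearest labelled points."""
    distances = sorted(
        (math.dist(point, query), label)
        for point, label in zip(train_points, train_labels)
    )
    votes = Counter(label for _, label in distances[:k])
    return votes.most_common(1)[0][0]


# Hypothetical labelled data: two well-separated groups.
train_points = [(1, 1), (1, 2), (2, 1), (8, 8), (9, 8), (8, 9)]
train_labels = ["a", "a", "a", "b", "b", "b"]

print(knn_predict(train_points, train_labels, (2, 2)))
print(knn_predict(train_points, train_labels, (7, 8)))
```

K-means, by contrast, would be given only the points without labels and would have to discover the two groups itself.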

36) What is cluster sampling?

Cluster sampling is used in statistics when natural groups are present in a population. The whole population is subdivided into clusters, or groups, a random sample of those clusters is selected, and data is then collected from the selected clusters.

Cluster sampling is typically used in market research. It’s used when a researcher can’t get information about the population as a whole but can get information about the clusters. For example, a researcher may be interested in data about city taxes in Florida. The researcher would collect data from selected cities and combine it to get a picture of the state. The individual cities would be the clusters in this case. Cluster sampling is often more economical or more practical than stratified sampling or simple random sampling.

37)  What is the difference between Cluster and Systematic Sampling?

Cluster sampling is an economical and practical sampling technique used when it is difficult to study a target population spread across a wide area and simple random sampling cannot be applied. In this technique the researcher divides the population into multiple clusters with broadly similar characteristics, each of which has an equal chance of being part of the sample. Systematic sampling is a statistical technique in which elements are selected at regular intervals from an ordered sampling frame. Researchers calculate the sampling interval by dividing the population size by the desired sample size. The most common form of systematic sampling is the equal-probability method.
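
The sampling-interval calculation for systematic sampling can be sketched as follows. The population here is just the numbers 0 to 99 and the function name is hypothetical; a random starting point within the first interval is a common choice and is assumed here.

```python
import random


def systematic_sample(population, sample_size, seed=0):
    """Pick every interval-th element after a random start in the first interval."""
    interval = len(population) // sample_size      # population size / sample size
    start = random.Random(seed).randrange(interval)
    return [population[start + i * interval] for i in range(sample_size)]


population = list(range(100))   # an ordered sampling frame of 100 units
sample = systematic_sample(population, sample_size=10)
print(sample)
```

With 100 units and a sample of 10, the interval is 10, so every tenth unit after the starting point is selected.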

38) Explain the steps in making a decision tree.

1. Take the entire data set as input and identify the points of decision and the alternatives available at each point.

2. Identify the points of uncertainty and the type or range of alternative outcomes at each point. (We gain information on sorting different objects from each other.)

3. Calculate the values needed to make the analysis, especially the probabilities of different events or results of action and the costs and gains of various events and actions.

4. Analyse the alternative values to choose a course.

5. Exhibit the decision tree with financial data.

39) What is a random forest model? State its advantages and disadvantages?

A random forest model is built from many decision trees, each trained on a slightly different set of observations, and combines the predictions of the individual trees. If you split the data into different packages and build a decision tree on each group of data, the random forest brings all those trees together.

A random forest is considered a highly accurate algorithm because its results are derived from multiple decision trees. Since it aggregates the outputs of all the trees, it reduces variance and is much less prone to overfitting than a single decision tree. It also provides a measure of feature importance, which is useful in feature selection, and it can be used to build both classification and regression models.

Despite its several advantages, a random forest model is comparatively slow in generating predictions, as it has to evaluate multiple decision trees.

40) How can you avoid overfitting your model?

Overfitting refers to a model that is fitted so closely to its training data that it starts learning from the noise and inaccurate entries in the data set. It then fails to categorise new data correctly because it has picked up too many details and too much noise. We can reduce overfitting by:

  1. Increasing the training data.
  2. Keeping the model simple: take fewer variables into account, thereby removing some of the noise in the training data.
  3. Using regularization techniques, such as LASSO, that penalize certain model parameters if they’re likely to cause overfitting.
  4. Early stopping during the training phase, that is, monitoring the loss on a validation set over the training period and stopping as soon as that loss begins to increase.
  5. Using dropout for neural networks.
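
Early stopping from the list above can be sketched as a simple rule over a sequence of loss values recorded after each epoch. The loss values and the `patience` parameter are hypothetical; patience is a common refinement that tolerates a few bad epochs before stopping.

```python
def early_stopping_epoch(losses, patience=2):
    """Return the epoch with the best loss, stopping after `patience` bad epochs."""
    best_loss = float("inf")
    best_epoch = 0
    waited = 0
    for epoch, loss in enumerate(losses):
        if loss < best_loss:
            # Improvement: remember this epoch and reset the counter.
            best_loss, best_epoch, waited = loss, epoch, 0
        else:
            waited += 1
            if waited >= patience:
                break  # the loss has stopped improving
    return best_epoch


# Hypothetical per-epoch losses: improving at first, then getting worse.
losses = [0.90, 0.70, 0.60, 0.62, 0.65, 0.50]

print(early_stopping_epoch(losses))
```

Training halts after two consecutive epochs without improvement, so the later value of 0.50 is never reached; a larger patience trades extra training time for the chance to get past such plateaus.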

41) If your model gives 90% accuracy on the training data but only 60% on the test data, what is the problem with your model?

This is the problem of overfitting, which refers to a model that fits the training data too well. Overfitting is bad because the model learns the detail and noise in the training data and consequently makes poor predictions, which impairs its performance on new data. The problem can be reduced by methods such as regularisation, cross-validation, ensembling and early stopping.

42) What are the feature selection methods used to select the right variables?

Feature selection is one of the important concepts of ML and involves removal of irrelevant or partially relevant features which can negatively impact the performance of our model.

There are three feature selection methods –

1. Filter Methods- This method of feature selection is independent of any ML algorithm and is generally used as a pre-processing step. Features are selected on the basis of their scores in statistical tests.

This process involves:

  • Pearson’s correlation
  • Linear discriminant analysis
  • Chi-Square

Filter methods do not remove multicollinearity, so this must be dealt with before training our models.

2. Wrapper Methods

In this method we train a model on a subset of features and, based on the inferences we draw from that model, decide to add or remove features from the subset. Wrapper methods are computationally very expensive, and high-end computers are needed if a lot of data is analysed this way.

This involves:

  • Forward Selection: We test one feature at a time and keep adding them until we get a good fit.
  • Backward Selection: We test all the features and start removing them to see what works better.
  • Recursive Feature Elimination: Repeatedly fits the model, ranks the features and removes the weakest ones until the desired number of features remains.

3. Embedded Methods

They combine the qualities of both filter and wrapper methods and are implemented by algorithms that have their own built-in feature selection.

They include LASSO and Ridge regression, which have built-in penalisation functions to reduce overfitting.

43) How do you deal with missing data in a dataset? What percentage of missing data is acceptable?

Missing data can have a significant effect on the conclusions drawn from the data: it can reduce the representativeness of the sample, and the lost data can also bias the estimation of parameters.

As for what percentage is acceptable, some studies suggest that a missing rate of 5% or less is inconsequential.

The following are ways to handle missing data values:

  • Case deletion- If the number of cases with missing values is small, it is better to drop them.
  • Imputation- This means replacing the missing data with an estimated value obtained by some statistical method.
  • Imputation by mean, mode, median- For numerical data, missing values can be replaced by the mean of the complete cases of the variable. If the variable is suspected to have outliers, the median can be used instead. For a categorical feature, missing values can be replaced by the mode of the column.
  • Regression methods- The variable with missing values is treated as the dependent variable and the other variables as predictors. A linear equation is fitted on the complete cases and then used to predict values for the missing data points.
  • K-Nearest Neighbour imputation (KNN)- The k-nearest neighbour algorithm is used to estimate and replace missing data. It chooses the K nearest neighbours using some distance measure, and their average is used as the imputation estimate.
  • Multiple imputation- An iterative method in which multiple values are estimated for the missing data points using the distribution of the observed data.
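
The simplest of these approaches, imputation by mean, median or mode, can be sketched with Python's `statistics` module. Missing values are represented by `None` here, and the column contents and function name are made up for illustration.

```python
from statistics import mean, median, mode


def impute(values, strategy):
    """Replace None entries with the mean, median or mode of the observed values."""
    observed = [v for v in values if v is not None]
    fill_functions = {"mean": mean, "median": median, "mode": mode}
    fill = fill_functions[strategy](observed)
    return [fill if v is None else v for v in values]


# Hypothetical numeric column with one missing value and one outlier (95).
ages = [20, None, 22, 21, 95]
# Hypothetical categorical column with one missing value.
colours = ["red", None, "blue", "red"]

print(impute(ages, "mean"))      # the outlier pulls the mean upwards
print(impute(ages, "median"))    # the median is less affected by the outlier
print(impute(colours, "mode"))   # most frequent category
```

Comparing the first two results shows why the median is the safer choice when outliers are suspected.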

44) What is dimensionality reduction and what are its benefits?

In ML there are many factors, called features, on which the final classification is based. Sometimes many of these features are correlated and hence redundant. Dimensionality reduction is the process of removing this redundancy by converting a vast data set into one with fewer dimensions (fields) that conveys similar information concisely. It is a very important part of machine learning and predictive modelling.

This reduction helps compress data and reduce storage space, and it cuts computation time because fewer dimensions mean less computing. A typical use case is classifying whether an email is spam or not, where the number of raw features can be very large. Despite its benefits, it might lead to the loss of some information.

45) What are the feature vectors?

A vector is an array of numbers. In machine learning, these vectors are called feature vectors because each value represents a numeric or symbolic characteristic (called a feature) of an object in a mathematical form that’s easy to analyse.

46) What is sampling? How many sampling methods do you know?

Data sampling is a statistical analysis technique used to select, manipulate and analyse a representative subset of data points in order to identify patterns and trends in the larger data set being examined.

The main types of probability sampling methods are simple random sampling, stratified sampling, cluster sampling, multistage sampling, and systematic random sampling. The key benefit of probability sampling methods is that they make it much more likely that the sample chosen is representative of the population.

47) Why is resampling done?

When dealing with an imbalanced dataset, one possible strategy is to resample either the minority or the majority class to artificially generate a balanced training set that can be used to train a machine learning model. Resampling methods are easy to use and require little mathematical knowledge.

Resampling is also done to estimate the precision of sample statistics by using subsets of the available data, or by drawing randomly with replacement from a set of data points. Two common resampling methods that use random subsets are bootstrapping and cross-validation.
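As a minimal sketch of bootstrapping, the example below draws samples with replacement from a small, made-up data set and uses the spread of the resampled means to estimate the standard error of the mean. The function name `bootstrap_means` and the data values are illustrative assumptions, not part of any library.

```python
import random
import statistics

def bootstrap_means(data, n_resamples=1000, seed=0):
    """Return the means of n_resamples bootstrap samples drawn from data."""
    rng = random.Random(seed)
    means = []
    for _ in range(n_resamples):
        # Draw len(data) points with replacement from the observed data.
        resample = rng.choices(data, k=len(data))
        means.append(statistics.mean(resample))
    return means

data = [2.1, 3.4, 1.9, 5.0, 4.2, 3.3, 2.8]  # hypothetical observations
means = bootstrap_means(data)

# The spread of the resampled means estimates the standard error of the mean.
standard_error = statistics.stdev(means)
print(round(standard_error, 3))
```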

48) What is root cause analysis?

Root cause analysis was initially developed to analyse industrial accidents but is now widely used in other areas, especially medicine. It is a problem-solving technique that helps people answer the question of why a problem occurred in the first place. The analysis assumes that systems and events are interrelated, which means an action in one area triggers an action in another, and another, and so on. By tracing back these actions, we can discover where the problem started and how it grew into the symptom we are now facing. A factor is called a root cause if its removal from the problem-fault sequence prevents the final undesirable event from recurring.

49) Explain cross-validation?

Cross-validation is a model validation technique in which we train our model on a subset of the data set and then evaluate it on the complementary subset. This is done by reserving some portion of the data set, using the rest of the data to train the model, and finally testing the model on the reserved portion. It is mainly used in settings where the objective is to forecast and one wants to estimate how accurately a model will perform in practice.

The goal of cross-validation is to set aside a data set to test the model during the training phase (i.e. a validation data set) in order to limit problems like overfitting and gain insight into how the model will generalize to an independent data set.
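The reserve-and-train idea can be sketched as a k-fold split. The helper `k_fold_indices` below is a hypothetical illustration written for this article: it only produces the index splits and leaves model training and scoring to the reader.

```python
def k_fold_indices(n_samples, k):
    """Yield (train_indices, test_indices) for each of k folds."""
    indices = list(range(n_samples))
    # Spread any remainder over the first folds so sizes differ by at most 1.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0)
                  for i in range(k)]
    start = 0
    for size in fold_sizes:
        test = indices[start:start + size]                # reserved portion
        train = indices[:start] + indices[start + size:]  # rest of the data
        yield train, test
        start += size

folds = list(k_fold_indices(n_samples=10, k=3))
for train, test in folds:
    print("train:", train, "test:", test)
```

Each data point is used for testing exactly once, and the scores from the k test folds are usually averaged.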

50)  What are the confounding variables?

These are extraneous variables in a statistical model that correlate, directly or inversely, with both the dependent and the independent variable. Simply put, a confounding variable is an extra variable that was not accounted for and that can ruin an experiment and produce useless results, because the estimate fails to account for the confounding factor.

51) What is gradient descent?

Gradient descent is an optimization algorithm used best when the parameters cannot be calculated analytically. It is used to minimize some function by iteratively moving in the direction of steepest descent as defined by the negative of the gradient in order to reduce loss as quickly as possible. In machine learning, we use gradient descent to update the parameters of our model.
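A minimal sketch of the update rule, assuming a one-dimensional function whose gradient we can write down. The function `gradient_descent`, the learning rate and the example function f(x) = (x - 3)^2 are illustrative choices.

```python
def gradient_descent(grad, start, learning_rate=0.1, n_steps=100):
    """Repeatedly step against the gradient, starting from `start`."""
    x = start
    for _ in range(n_steps):
        # Move in the direction of steepest descent (negative gradient).
        x = x - learning_rate * grad(x)
    return x

# f(x) = (x - 3)^2 has gradient 2 * (x - 3) and its minimum at x = 3.
minimum = gradient_descent(lambda x: 2 * (x - 3), start=0.0)
print(minimum)
```

In a machine learning model, `x` would be the vector of parameters and `grad` the gradient of the loss with respect to those parameters.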

52) Why is gradient descent stochastic in nature?

The term stochastic means random, or governed by a probability distribution. In stochastic gradient descent (SGD), a single sample or a small batch is selected at random for each iteration, instead of using the whole data set. SGD often converges much faster than full (batch) gradient descent, but the error function is usually not minimized as well as with batch gradient descent.

53) Do gradient descent methods always converge to the same point?

No. Gradient descent converges to a local minimum if it starts close enough to that minimum. If there are multiple local minima, which one it converges to depends on where the iteration starts. It is very hard to guarantee convergence to a global minimum. Techniques such as simulated annealing address this by using probabilistic perturbations to escape local minima.

54) What do you understand by Hypothesis in concept learning of machine learning?

A hypothesis is a provisional idea, an educated guess which requires some evaluation. Supervised learning, specifically, can be described as using the available data to learn a function that best maps inputs to outputs. It can also be described as the problem of approximating an unknown target function (that we assume exists) that best maps inputs to outputs on all possible observations from the problem domain.

Therefore, hypothesis in machine learning can be rightly defined as a model that approximates the target function and performs mappings of inputs to outputs.

55) What does the cost parameter in SVM stand for?

The cost parameter in SVM controls the trade-off between training errors and margin width, and decides how much an SVM should be allowed to “bend” with the data. A low cost creates a large margin and allows more misclassifications, giving a smoother decision surface. A large cost creates a narrow margin (closer to a hard margin) and permits fewer misclassifications, so more training points are classified correctly, at a higher risk of overfitting.

56) What is skewed distribution? Differentiate between right skewed distribution and left skewed distribution?

When a distribution forms a curve that is not symmetrical, or in other words the data points cluster more on one side of the graph than the other, it is referred to as a skewed distribution. There are two types of skewed distribution: left skewed and right skewed.

A left skewed or negatively skewed distribution has fewer observations on the left, which means it has a long left tail. For example, marks obtained by students in an easy exam, where most students score high and only a few score low.

A distribution with fewer observations on the right, that is towards higher values, is said to be right skewed or positively skewed. In this distribution the graph has a long right tail. For example, the wealth of people in a country.

57) Why is vectorization considered a powerful method for optimizing numerical code?

Vectorization is a very powerful method for improving the performance of code running on modern CPUs, because one operation is applied to a whole array of values at once instead of looping over them one element at a time. Numerous problems in the real world can be reduced to a set of simultaneous equations. Any set of linear equations in several unknowns can be expressed as a vector equation (this remains true for differential equations). Vectors are not restricted to the physical idea of magnitude and direction; they are applied to problems in fields like statistics and economics. If you want a computer program to address such problems, it is much easier if the computer handles the vector manipulations rather than having the programmer code every detail of the calculation, which is what led to the rise of vector languages.
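To illustrate the difference, the sketch below computes a sum of squares twice: once with an explicit Python loop and once as a single NumPy array operation. It assumes NumPy is installed; the array contents are arbitrary.

```python
import numpy as np

values = np.arange(1000, dtype=np.float64)

# Explicit Python loop: one multiplication and one addition per iteration.
loop_total = 0.0
for v in values:
    loop_total += v * v

# Vectorized: the same sum of squares expressed as one array operation.
vector_total = float(np.dot(values, values))

print(loop_total == vector_total)
```

Both versions give the same result, but the vectorized form hands the whole loop to optimized compiled code, which is typically much faster on large arrays.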

58) What is deep learning, and how does it contrast with other machine learning algorithms?

Deep learning is a subset of machine learning that helps machines solve complex problems, even with very large data sets that are diverse, unstructured, semi-structured and inter-connected. It learns representations of data through neural networks with many layers, and it can be used in supervised, unsupervised and reinforcement learning settings.

In contrast, many other machine learning algorithms rely on features that are engineered by hand; they learn from past data described by those features and make decisions based on what they have learnt.

59) How is conditional random field different from hidden Markov models?

Conditional random fields (CRFs) and hidden Markov models (HMMs) are popular statistical modeling methods, often applied to pattern recognition and machine learning problems. CRFs are discriminative sequence models and, because of their flexible feature design, can accommodate any context information. The major drawback of this model is the complex computation at the training stage, which makes it difficult to re-train the model when new data is added. HMMs are strong statistical generative models which facilitate learning directly from raw sequence data. They are capable of performing a wide variety of operations such as multiple alignment, data mining and classification, structural analysis, and pattern discovery, and can be easily combined into libraries.

60) What are eigenvalue and eigenvector?

Both the eigenvalues and eigenvectors of a system are extremely important in data analysis. Eigenvectors are the directions along which a particular linear transformation acts by flipping, compressing, or stretching; an eigenvector is a vector whose direction remains unchanged (or is exactly reversed) when the linear transformation is applied to it.

Eigenvalues are the factors by which the transformation stretches or compresses along each eigenvector; a negative eigenvalue means the direction is flipped. In data analysis, we usually calculate the eigenvectors and eigenvalues of a correlation or covariance matrix.

Several analyses, such as principal component analysis (PCA), factor analysis and canonical analysis, are based on eigenvalues and eigenvectors.
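The defining relation A·v = λ·v can be checked by hand on a small example. The sketch below uses a 2x2 symmetric matrix chosen for illustration, with eigenvectors (1, 1) and (1, -1); the helper `mat_vec` is a hypothetical name.

```python
def mat_vec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(a * x for a, x in zip(row, vector)) for row in matrix]

A = [[2, 1],
     [1, 2]]

v1, lambda1 = [1, 1], 3    # stretched by a factor of 3
v2, lambda2 = [1, -1], 1   # left unchanged (factor of 1)

result1 = mat_vec(A, v1)
result2 = mat_vec(A, v2)

# A·v equals λ·v for each eigenpair: only the length changes, not the direction.
print(result1 == [lambda1 * x for x in v1])
print(result2 == [lambda2 * x for x in v2])
```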

61) What are the different types of joins? What are the differences between them?

  • There are four main types of joins used in different data-oriented languages:

(INNER) JOIN: Keeps only the rows where the merge “on” value exists in both the left and right dataframes, which is similar to the intersection of two sets. It is one of the most common types of join.

LEFT (OUTER) JOIN: Keeps every row in the left dataframe. Where a value of the “on” variable has no match in the right dataframe, empty / NaN values are added in the result.

RIGHT (OUTER) JOIN: Similar to a left outer join, except that every row of the right dataframe is kept and only the matching rows of the left dataframe are included.

FULL (OUTER) JOIN: Returns all rows from both tables, matching them where possible and filling in empty / NaN values where there is no match on either side.
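The four joins can be sketched with pandas `merge`, assuming pandas is installed. The two small dataframes and their column names are made up for illustration.

```python
import pandas as pd

left = pd.DataFrame({"key": [1, 2, 3], "left_val": ["a", "b", "c"]})
right = pd.DataFrame({"key": [2, 3, 4], "right_val": ["x", "y", "z"]})

inner = left.merge(right, on="key", how="inner")       # keys in both frames
left_join = left.merge(right, on="key", how="left")    # every left key
right_join = left.merge(right, on="key", how="right")  # every right key
full = left.merge(right, on="key", how="outer")        # keys from either frame

for name, frame in [("inner", inner), ("left", left_join),
                    ("right", right_join), ("full", full)]:
    print(name, sorted(frame["key"].tolist()))
```

Rows without a match on the other side carry NaN in the columns that come from that side.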

62)  What are the core components of descriptive data analytics?

Descriptive analytics is a preliminary stage of data processing that summarises historical data to describe what has happened. Data aggregation, summarisation and visualisation are the main pillars supporting the area of descriptive data analytics. A typical example is tracking the increase in Twitter followers after a particular tweet.

63)   How can outlier values be treated?

In both statistics and machine learning, outlier detection is crucial for building an accurate model. However, detecting outliers can be very difficult and is not always possible. There are three common methods for dealing with outliers.

  1. Univariate method- It is one of the simplest methods for detecting outliers; in this method we look for data points with extreme values on one variable.
  2. Multivariate method- Here we look for unusual combinations of values across all the variables.
  3. Minkowski error- This method does not remove outliers but reduces the contribution of potential outliers in the training process.

The most common ways to treat outlier values are:

     1) To change the value and bring it within a range.

     2) To simply remove the value.
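Bringing a value within a range can be sketched with a univariate rule. The example below assumes the common convention of fences at 1.5 times the interquartile range beyond the quartiles, which is one choice among several; `clip_outliers` and the data are hypothetical.

```python
import statistics

def clip_outliers(values, k=1.5):
    """Clip values to [Q1 - k*IQR, Q3 + k*IQR] instead of removing them."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [min(max(v, low), high) for v in values]

values = [10, 12, 11, 13, 12, 11, 100]  # 100 is an obvious outlier
clipped = clip_outliers(values)
print(clipped)
```

To remove outliers instead, the same fences can be used to filter the list rather than clip it.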

64) How can you iterate over a list and also retrieve element indices at the same time?

In programming, we use lists to store sequences of related data. In Python, we can iterate over a list and retrieve each element's index at the same time with the built-in enumerate function, which takes a sequence such as a list and pairs every element with its position, yielding (index, element) tuples.
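A short example of `enumerate`, using a made-up list. The optional `start` argument changes the first index from the default of 0.

```python
fruits = ["apple", "banana", "cherry"]

# Each item comes back as an (index, element) pair.
pairs = list(enumerate(fruits))

# Unpack the pair directly in the loop; start=1 begins counting at 1.
for position, fruit in enumerate(fruits, start=1):
    print(position, fruit)
```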

65)   Explain the Box-Cox transformation in regression models.

A Box-Cox transformation is a statistical technique often used to modify the distributional shape of the response variable so that the residuals are more normally distributed. It is done so that tests and confidence limits that require normality can be more appropriately used. It requires strictly positive data, and it is not suitable for data containing outliers, which may not be properly adjusted by this technique. Many statistical techniques assume normality, so when the given data is not normal, applying a Box-Cox transformation means that you can run a broader number of tests.

66)  Can you use machine learning for time series analysis?

Yes, it can be used for forecasting time series data like inventory, sales, volume etc but it depends on the applications. It is an important area of machine learning because there are multiple problems involving time components for making predictions. 

67) What is Precision and Recall in ML?

Precision and recall are both extremely important indicators of how well a model performs. Precision is the percentage of the results returned by your model that are actually relevant. Recall is the percentage of all relevant results that are correctly identified by your algorithm. The two usually cannot be maximised at the same time, as one comes at the cost of the other. For example, a web page has limited space and customers have an extremely limited attention span, so if a search shows us many irrelevant results and very few relevant ones (low precision), we will shift to other sites or platforms.
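In terms of counts, precision is TP / (TP + FP) and recall is TP / (TP + FN), where TP, FP and FN are the numbers of true positives, false positives and false negatives. A minimal sketch with made-up counts:

```python
def precision_recall(tp, fp, fn):
    """Compute precision and recall from prediction counts."""
    precision = tp / (tp + fp)  # share of returned results that are relevant
    recall = tp / (tp + fn)     # share of relevant results that were returned
    return precision, recall

# Hypothetical search: 10 results returned, 8 relevant, 4 relevant ones missed.
precision, recall = precision_recall(tp=8, fp=2, fn=4)
print(precision, recall)
```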

68) What is R-square and how is it calculated?

Once a machine learning model is built, it becomes necessary to evaluate the model to find out its efficiency.

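R-square is an important evaluation metric used for linear regression problems. It measures the degree to which the variance in the dependent variable (or target) can be explained by the independent variables (features), and is also known as the coefficient of determination. It is calculated as 1 minus the ratio of the residual sum of squares to the total sum of squares. For example, if the R-square value of a model is 0.8, it means 80% of the variation in the dependent variable is explained by the independent variables. The higher the R-square, the better the fit. For a linear regression with an intercept, evaluated on its training data, R-square lies between 0 and 100%; on new data it can be negative if the model predicts worse than simply using the mean.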
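A minimal sketch of the calculation, R-square = 1 - SS_res / SS_tot, using made-up actual and predicted values. The function name `r_squared` is illustrative.

```python
def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_y = sum(y_true) / len(y_true)
    # Residual sum of squares: variation the model fails to explain.
    ss_res = sum((y - p) ** 2 for y, p in zip(y_true, y_pred))
    # Total sum of squares: variation around the mean of the target.
    ss_tot = sum((y - mean_y) ** 2 for y in y_true)
    return 1 - ss_res / ss_tot

y_true = [1.0, 2.0, 3.0, 4.0]
y_pred = [1.1, 1.9, 3.2, 3.8]
r2 = r_squared(y_true, y_pred)
print(round(r2, 2))
```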

69) What is a gold standard dataset?

According to Wikipedia, a gold standard test refers to a diagnostic test or benchmark that is ‘the best available under reasonable conditions’. It is a standard that is often accepted as the most valid one and is the one most used by researchers. To take an example from medicine, doctors refer to a blood assay as the gold standard for checking patients for medication adherence.

70) What is Ensemble Learning?

Ensemble learning combines a diverse set of learners (individual models) to improve the stability and predictive power of the overall model, thereby helping to improve its accuracy.

71) Describe in brief any type of Ensemble Learning?

Ensemble learning has many types; two of the more popular techniques are described below.

Bagging: It trains similar learners on small sample populations and then takes the mean of all the predictions. This approach uses the same algorithm for every predictor (e.g. all decision trees), but trains each one on a different random subset of the training set, allowing for a more generalised result. The subsets can be created either with replacement or without replacement.

Boosting: It is also known as hypothesis boosting and is an ensemble method which adjusts the weight of an observation based on the last classification. It is a sequential process where each subsequent model attempts to fix the errors of its predecessor; that is, if an observation was classified incorrectly, the weight of this observation is increased, and vice versa. Boosting in general decreases the bias error and builds strong predictive models. However, boosted models may overfit the training data.

72) How Regularly Must an Algorithm be Updated?

There is no fixed schedule. You would generally want to update an algorithm when:

  • You want the model to evolve as data streams through infrastructure.
  • The underlying data source is changing.
  • There is a case of non-stationarity.
  • The algorithm lacks accuracy and precision.

73) What, in your opinion, is the reason for the popularity of Deep Learning in recent times?

Although deep learning has been around for many years, the major breakthroughs from these techniques came only in recent years, due to better neural network architectures, more computational power and the availability of huge amounts of data. Deep learning lies at the forefront of AI and typically relies on high-performance GPUs to achieve high levels of accuracy. Advances have pushed it to the point where deep learning outperforms humans in some tasks, like classifying objects in images. For example, driverless cars such as Tesla's need millions of images and thousands of hours of video before gaining the ability to drive you home.

74) What is the difference between Type I Error & Type II Error? Also, Explain the Power of the test?

When we perform hypothesis testing, we consider two types of error, Type I and Type II: sometimes we reject the null hypothesis when we should not, and sometimes we fail to reject the null hypothesis when we should.

A Type I error is committed when we reject a null hypothesis that is true; that is, a conclusive winner is declared although the test is actually inconclusive. A Type II error is committed when we fail to reject a null hypothesis that is false; that is, we dismiss the alternative hypothesis even though the observed effect is real and not due to chance. In other words, a Type I error occurs when there really is no difference (association, correlation) overall, but random sampling caused your data to show a statistically significant difference, whereas in a Type II error there really is a difference (association, correlation) overall, but random sampling caused your data to not show a statistically significant difference. A Type I error is often considered more serious, because you have wrongly rejected the null hypothesis and ultimately made a claim that is not true.

The probability of a Type I error is denoted by α and the probability of a Type II error is denoted by β. Simply put, power is the probability of not making a Type II error, that is 1 − β. In other words, it is the probability of making the correct decision (rejecting the null hypothesis) when the null hypothesis is false.

75) How is Data Science different from Big Data and Data Analytics?

Data science utilises algorithms and tools to draw meaningful and commercially useful insights from structured and unstructured data. It is a combination of mathematics, statistics, programming, problem solving etc. and comprises tasks like data modelling, data cleansing, analysis and pre-processing. Internet search, recommender systems and digital advertisements are some of the applications of data science.

Big data is an immense volume of structured, semi-structured and unstructured data in its raw form, which is generated through various channels and is almost impossible to store in the memory of a single computer. It is applied widely in the financial sector, communications, retail etc.

Data analytics is the process of applying algorithms to data sets in order to derive insights and look for meaningful correlations. It is very helpful in complex business situations and is applied in healthcare, gaming, travel etc. It also helps in predicting upcoming opportunities and threats for an organisation.

76) What is the use of Statistics in Data Science?

Statistics is foundational to data science, as it provides tools and methods to identify patterns and structures in data and gain deeper insight into it. Data scientists use popular statistical methods such as regression, classification and time series analysis to run experiments and interpret the results. Statistics plays a major role in data acquisition, exploration, analysis and validation, and is used to make estimates. Data science is a derived field formed from the overlap of statistics, probability and computer science. Many algorithms in data science are built on top of statistical formulae and processes, which is why statistics is such an important part of the field.

77) What is ‘Naive’ in a Naive Bayes?

A naive Bayes classifier is a simple yet powerful algorithm that assumes the presence (or absence) of a particular feature of a class is unrelated to the presence (or absence) of any other feature, given the class variable. For example, if we like pizza and we like ice cream, naive Bayes will assume the two are independent and offer us pizza ice cream. Basically, it is “naive” because it makes an independence assumption that may or may not turn out to be correct. The model nevertheless works well for many classification problems.

78) What is Bias-Variance tradeoff?

Bias is the error introduced in your model by the simplifying assumptions the algorithm makes so that the target function is easier to approximate. Variance is the error introduced by the model's sensitivity to the training data; that is, it is the amount by which the estimate of the target function would change if different training data were used. A high-variance model also learns the noise in the training data and performs poorly on the test dataset.

The bias-variance tradeoff is the search for the optimum balance between bias and variance in a machine learning model: models that are too complex tend to have high variance and low bias, while models that are too simple tend to have high bias and low variance. The best model keeps both bias and variance as low as possible.

79) What is confusion matrix?

A confusion matrix, or error matrix, is used to evaluate the performance of a classification model when the true values are already known. It can be used for binary classification as well as when the target class has more than two categories. It helps in visualising and evaluating the results of the classification process.

A binary classifier predicts every data instance of a test dataset as either positive or negative. This produces four outcomes:

  1. True positive(TP) — Correct positive prediction
  2. False-positive(FP) — Incorrect positive prediction
  3. True negative(TN) — Correct negative prediction
  4. False-negative(FN) — Incorrect negative prediction
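The four outcomes can be counted directly from the true and predicted labels. The sketch below assumes labels are encoded as 1 (positive) and 0 (negative); the function name and the label lists are made up for illustration.

```python
def confusion_counts(y_true, y_pred):
    """Count TP, FP, TN and FN for binary labels encoded as 1 and 0."""
    counts = {"TP": 0, "FP": 0, "TN": 0, "FN": 0}
    for truth, prediction in zip(y_true, y_pred):
        if prediction == 1:
            counts["TP" if truth == 1 else "FP"] += 1
        else:
            counts["TN" if truth == 0 else "FN"] += 1
    return counts

y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]
counts = confusion_counts(y_true, y_pred)
print(counts)
```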

80) What are Autoencoders?

An autoencoder is an unsupervised learning technique that uses an artificial neural network in which the target output is the same as the input. Autoencoders are a very useful dimensionality reduction technique: the network learns a compressed representation (encoding) of a set of data by being trained to ignore signal “noise”. From this reduced encoding, it then tries to generate a representation as close as possible to its original input.

81) How should you maintain a deployed model?

A deployed model needs to be retrained after a while in order to maintain or improve its performance. After deployment, the predictions made by the model should be tracked alongside the true values. This record can later be used to retrain the model with the new data. Dashboards and notifications are a good way to put this into action. Root cause analysis should also be done for wrong predictions.

82) Can you cite some examples where a false negative holds more importance than a false positive?

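A false negative is more important whenever missing a real case is costlier than raising a false alarm. In airport security, a false alarm occurs when ordinary items such as keys or coins get mistaken for weapons (the machine goes “beep”), which is only an inconvenience, whereas a false negative means a real weapon goes undetected. In antivirus software, flagging a normal file as a virus is a nuisance, but failing to detect a real virus is far more damaging. The same applies to disease prediction based on symptoms, for diseases like cancer, where a missed diagnosis delays treatment.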

 83) Explain the SVM machine learning algorithm in detail?

SVM (support vector machine) is a flexible ML algorithm which is used for classification and regression. For classification, it finds a hyperplane in a multi-dimensional space that separates the classes. SVMs use kernels, the most common being linear, polynomial and RBF. They are popular because of their ability to handle multiple continuous and categorical variables. They can be very accurate, work well in high-dimensional spaces and use little memory, but they are not well suited to very large data sets or to data sets with overlapping classes.

84) What are the support vectors in SVM?

Support vectors are the data points that lie closest to the hyperplane and influence its position and orientation. Using these support vectors, we maximise the margin of the classifier. Deleting the support vectors will change the position of the hyperplane. These are the points that help us build our SVM.

85) What is a Boltzmann Machine? 

Boltzmann machines were introduced in the mid-1980s by Geoffrey Hinton, who is often referred to as ‘The Godfather of Deep Learning’, together with Terrence Sejnowski. They are generative unsupervised models that learn a probability distribution from an original dataset and use it to make inferences about data that has never been seen before. A Boltzmann machine is a type of Markov random field and consists of a neural network with an input layer and one or several hidden layers, whose units make stochastic decisions about whether to be on or off. It is used to optimise the weights and quantities for a given problem. The learning algorithm is very slow in networks with many layers of feature detectors. The “Restricted Boltzmann Machine” has a single layer of feature detectors, which makes it faster to train than the rest.

86) What is the difference between a tree map and heat map?

A heat map is a two-dimensional visualisation tool that compares different categories with the help of colours. In this type of chart, a data table is represented with rows and columns denoting different sets of categories, and each cell is coloured according to its value. A tree map is a chart type that illustrates hierarchical data by using smaller rectangles nested within larger rectangles, where the size (and often the colour) of each rectangle reflects a value. Tree maps are widely used for visualising financial data and are economical with space, so they can display a large number of items within a limited area. However, they are not suitable when there is a big difference in the magnitude of the values.

87) Differentiate between NumPy and SciPy?

NumPy stands for Numerical Python and SciPy for Scientific Python; both are Python libraries with support for arrays and mathematical functions. NumPy provides the core array type and basic operations such as sorting, indexing and element-wise functions. SciPy is built on top of NumPy and adds higher-level routines such as optimisation, integration, statistics and a fuller set of linear algebra functions. NumPy's core is written in C, which makes its array operations fast, while SciPy is suited to more complex numerical computing. Both are very handy tools for data science.

88)  Whenever you exit Python, is all memory de-allocated?

No. Objects that have circular references are not always freed when Python exits. It is also impossible to free certain portions of memory that are reserved by the C library, so when we exit Python not all memory is necessarily deallocated.

 89) What are Lambda functions?

In Python, lambda functions are anonymous functions, that is, functions with no name. A lambda function can take any number of arguments but is restricted to a single expression, and it can be used to return function objects.
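A few small examples of lambda functions: assigned to a name, passed as a sort key, and returned from another function. The names `square`, `make_multiplier` and `triple` are illustrative.

```python
# A lambda with one argument and a single expression.
square = lambda x: x * x

# Lambdas are often passed inline, for example as a sort key.
words = ["pear", "fig", "banana"]
by_length = sorted(words, key=lambda word: len(word))

# A function can return a lambda, producing a new function object.
def make_multiplier(factor):
    return lambda x: x * factor

triple = make_multiplier(3)
print(square(4), by_length, triple(5))
```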

90) What Do You Mean by Tensor in Tensorflow?

Tensors are the underlying components of computation and a fundamental data structure in TensorFlow. They can be identified by three parameters: rank, shape and type. They are multidimensional arrays of data, with different dimensions and ranks, that are fed as input to the neural network.

91) What is the Computational Graph?

Everything in TensorFlow is based on creating a computational graph. It comprises nodes and edges, where each node represents a mathematical operation and each edge represents a tensor that gets transferred between the nodes. Since data flows in the form of a graph, it is also called a “Dataflow Graph.”

92) How can you create a series from dictionary in Pandas?

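A Series is a one-dimensional labelled array capable of holding any data type, such as integers, strings, floating point numbers or Python objects. To create a Series from a dictionary, pass the dictionary to the pd.Series constructor: the dictionary keys become the index labels and the values become the data. Unlike a Python list, a Series has a single data type (dtype) for all its values, although that dtype can be the generic object type.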
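A minimal sketch of building a Series from a dictionary, assuming pandas is installed. The dictionary contents are made up.

```python
import pandas as pd

prices = {"apple": 1.2, "banana": 0.5, "cherry": 3.0}

# Keys become the index labels; values become the data.
series = pd.Series(prices)

print(series)
print(series["banana"])  # look up a value by its label
```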

93) What is Pandas Index?

An index is like an address: it labels the rows (and columns) of a DataFrame. Indexing in Pandas is the process of selecting particular rows and columns of data from a DataFrame using these labels or positions.

94) What kind of data does Scatterplot matrices represent?

Scatterplot matrices are a type of data display most commonly used to visualise multidimensional data. They show the relationships between combinations of variables: whether there is a positive or negative association between them, whether the data pattern is linear or nonlinear, and whether unusual features such as outliers, clusters and gaps exist in the data set. A series of dots represents the positions of observations from the data set.

95) What is the hyperbolic tree?

A hyperbolic tree, or hypertree, is a dynamic representation of a hierarchical structure and is an effective way to display complex trees clearly. It is an information visualisation and graph drawing method inspired by hyperbolic geometry.

96) What is correlation and covariance in statistics?

Covariance and correlation are two mathematical concepts which measure the relationship and dependency between two variables. Both are widely applied in the field of data analytics. Covariance takes its units from the product of the units of the two variables, whereas correlation is dimensionless, which means it is a unit-free measure of the relationship between variables. The two are closely related mathematically (correlation is covariance divided by the product of the two standard deviations), but they are applied differently. A covariance matrix is used when the variables are on similar scales, whereas we use a correlation matrix when the scales of the variables differ.
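The effect of units can be seen in a small sketch. The functions below implement the sample covariance and the Pearson correlation by hand, and the height and weight values are made up; changing height from centimetres to metres rescales the covariance but leaves the correlation untouched.

```python
import math

def covariance(xs, ys):
    """Sample covariance (divides by n - 1)."""
    mean_x, mean_y = sum(xs) / len(xs), sum(ys) / len(ys)
    return sum((x - mean_x) * (y - mean_y)
               for x, y in zip(xs, ys)) / (len(xs) - 1)

def correlation(xs, ys):
    """Pearson correlation: covariance scaled by both standard deviations."""
    return covariance(xs, ys) / math.sqrt(covariance(xs, xs) * covariance(ys, ys))

heights_cm = [150, 160, 170, 180]
heights_m = [h / 100 for h in heights_cm]
weights_kg = [50, 58, 66, 74]

cov_cm = covariance(heights_cm, weights_kg)   # units: cm * kg
cov_m = covariance(heights_m, weights_kg)     # units: m * kg
corr_cm = correlation(heights_cm, weights_kg)
corr_m = correlation(heights_m, weights_kg)
print(cov_cm, cov_m, corr_cm, corr_m)
```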

97) What is the difference between “long” and “wide” format data?

In the wide format, a subject's repeated responses are in a single row, with each response or attribute held in a separate column. In the long format, each row is one time point per subject, so each subject has as many rows as there are repeated measurements, and each row contains the value of a particular attribute for a given subject.
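As an illustration, pandas can reshape a wide table into a long one with `melt`, assuming pandas is installed. The subjects, column names and scores below are invented for the example.

```python
import pandas as pd

# Wide format: one row per subject, one column per repeated measurement.
wide = pd.DataFrame({
    "subject": ["A", "B"],
    "visit_1": [5.0, 6.1],
    "visit_2": [5.4, 6.0],
})

# Long format: one row per subject per visit.
long = wide.melt(id_vars="subject", var_name="visit", value_name="score")
print(long)
```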

98) Explain the use of combinatorics in data science?

Combinatorics is one of the most useful fields of mathematics in data science, where it is used to obtain formulas and estimates for the analysis of algorithms. In short, it is the study of finite and countable structures, and it can improve efficiency in problems such as text classification. By utilizing a combinatorial approach using permutations, combinations and variations, it’s possible to obtain meaningful representations of text data with a feature space that is on average 50 times smaller than the original space.

99) What is SoftMax Function? What is the formula of Softmax Normalization?

The softmax function is a generalisation of the logistic function that normalizes its input into a probability distribution over the output classes. A classifier that uses it is sometimes referred to as multinomial logistic regression, as it can accommodate any number of classes in our neural network model and is not limited to binary classification. It is very useful in computing the loss when training on a data set. For an input vector $z$ with $K$ components, the formula for softmax normalization is:

$$\mathrm{softmax}(z)_i = \frac{e^{z_i}}{\sum_{j=1}^{K} e^{z_j}}, \qquad i = 1, \dots, K$$
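As a minimal sketch, softmax divides the exponential of each input by the sum of the exponentials of all inputs, so the outputs are positive and add up to 1. The version below subtracts the largest input first, a common trick that leaves the result unchanged but avoids overflow; the function name and input scores are illustrative.

```python
import math

def softmax(logits):
    """Turn a list of real-valued scores into probabilities that sum to 1."""
    shift = max(logits)  # subtracting the max does not change the result
    exps = [math.exp(z - shift) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

probabilities = softmax([1.0, 2.0, 3.0])
print(probabilities)
```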

100) What is curse of dimensionality in data mining?

The curse of dimensionality, a term coined by Richard E. Bellman, refers to the phenomena that arise when analysing and organizing data in high-dimensional spaces and that do not occur in low-dimensional settings such as the three-dimensional space of everyday experience. A suitable example is the healthcare sector, which has a vast number of variables such as diabetes, blood pressure, cholesterol levels and numerous other parameters. Dimensionality reduction is a way of converting high-dimensional variables into lower-dimensional variables while preserving as much of the information in the variables as possible.

This concludes our extensive list of data scientist job interview questions. Every company has a different approach to interviewing data scientists, but we hope that the data science technical interview questions above help with your interview preparation and give you an understanding of the type of questions asked when companies are hiring data people.



About Abhay Singh

7 + years of expertise of Cloud Platform(AWS) with Amazon EC2, Amazon S3, Amazon RDS, VPC, IAM, Amazon ELB, Scaling, CloudFront, CDN, CloudWatch, SNS, SQS, SES and other vital AWS services. Understand Infrastructure requirements, and propose design, and setup of the scalable and cost effective applications. Implement cost control strategies yet keeping at par performance. Configure High Availability Hadoop big data ecosystem, Teradata, HP Vertica, HDP, Cloudera on AWS, IBM cloud & other cloud services. Infrastructure Automation using Terraform, Ansible and Horton Cloud Break setups. 2+ Years of development experience with Big Data Hadoop cluster, Hive, Pig, Talend ETL Platforms, Apache Nifi. Familiar with data architecture including data ingestion pipeline design, Hadoop information architecture, data modeling, and data mining, machine learning, and advanced data processing. Experience at optimizing ETL workflows. Good knowledge of database concepts including High Availability, Fault Tolerance, Scalability, System, and Software Architecture, Security and IT infrastructure.
