
Data Science, Machine Learning, & Deep Learning Interview Questions

Machine Learning

1. What is Machine learning?

Machine learning is an application of artificial intelligence (AI) that provides systems the ability to automatically learn and improve from experience without being explicitly programmed.

Traditionally, software engineering combined human-created rules with data to produce answers to a problem. Machine learning instead uses data and answers to discover the rules behind a problem.

To learn the rules governing a phenomenon, machines go through a learning process, trying different rules and learning from how well they perform; hence the name Machine Learning.


2. What is Supervised and Unsupervised Learning?

Supervised Learning: Uses known, labeled data as input, with a feedback mechanism. The most commonly used supervised learning algorithms are decision trees, logistic regression, and support vector machines.

Unsupervised Learning: Uses unlabeled data as input, with no feedback mechanism. The most commonly used unsupervised learning algorithms are k-means clustering, hierarchical clustering, and the Apriori algorithm.


3. What is Logistic Regression?

Logistic regression measures the relationship between the dependent variable (the label we want to predict) and one or more independent variables (our features) by estimating probabilities with the underlying logistic (sigmoid) function.
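A minimal scikit-learn sketch (the synthetic dataset and settings are illustrative, not part of the original text) showing how the sigmoid maps the linear combination of features to a probability:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic binary classification data (illustrative only)
X, y = make_classification(n_samples=500, n_features=4, random_state=0)

model = LogisticRegression().fit(X, y)

# predict_proba applies the sigmoid 1 / (1 + exp(-z)) to the linear score z = w.x + b
probs = model.predict_proba(X[:5])[:, 1]
manual = 1 / (1 + np.exp(-(X[:5] @ model.coef_.ravel() + model.intercept_[0])))
print(probs, manual)  # the two arrays should match up to floating point
```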


Useful Links:

4. What is a Decision Tree?

  • It is a supervised machine learning technique that can perform both classification and regression tasks. It is also known as the Classification and Regression Trees (CART) algorithm. It formulates the learned knowledge into a hierarchical structure that is easy to interpret.

  • Decision trees are built in two steps (see the sketch after this list):

    • Induction: the process of building the tree.
    • Pruning: removes unnecessary branches, i.e. branches that do not contribute to the predictive power of the classifier; this also helps avoid overfitting and overly rigid decision boundaries.
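A minimal scikit-learn sketch (dataset and hyperparameters are illustrative) showing induction via fit and cost-complexity pruning via ccp_alpha:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Induction: grow the full tree on the training data
tree = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)

# Pruning: cost-complexity pruning removes branches that add little predictive power
pruned = DecisionTreeClassifier(ccp_alpha=0.02, random_state=0).fit(X_train, y_train)

print(tree.get_depth(), tree.score(X_test, y_test))
print(pruned.get_depth(), pruned.score(X_test, y_test))
```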

Useful Links:

5. What is Random Forest? A random forest is a model made up of many decision trees. Rather than simply averaging the predictions of independently grown trees (which we could call a "forest"), this model uses two key concepts that give it the name random:

  1. Random sampling of training data points when building trees: the samples are drawn with replacement, known as bootstrapping. At prediction time, the forest averages the predictions of each decision tree. This process of random sampling followed by aggregating the results is known as bootstrap aggregating, or bagging.

  2. Random subsets of features considered when splitting nodes: only a subset of all the features is considered for splitting each node in each decision tree. For classification this is typically set to sqrt(n_features), meaning that if there are 16 features, only 4 randomly chosen features are considered at each split.

The random forest combines hundreds or thousands of decision trees, trains each one on a slightly different bootstrap sample of the observations, and splits nodes in each tree considering only a limited number of the features. The final prediction of the random forest is made by averaging the predictions of the individual trees.
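A minimal scikit-learn sketch (synthetic data and settings are illustrative) that makes the two sources of randomness explicit:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=1000, n_features=16, random_state=0)

forest = RandomForestClassifier(
    n_estimators=500,      # number of decision trees
    bootstrap=True,        # random sampling of training points with replacement
    max_features="sqrt",   # sqrt(16) = 4 random features considered at each split
    random_state=0,
).fit(X, y)

# The final prediction averages the per-tree predictions (class probabilities)
print(forest.predict_proba(X[:3]))
```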

Useful links:

6. Bias-Variance tradeoff?

What is bias?

  • Bias is the difference between the average prediction of our model and the correct value we are trying to predict. A model with high bias pays very little attention to the training data and oversimplifies the problem, which leads to high error on both training and test data.

What is variance?

  • Variance is the variability of the model's prediction for a given data point; it tells us how spread out the predictions are. A model with high variance pays too much attention to the training data and does not generalize to data it hasn't seen before. Such models perform very well on training data but have high error rates on test data.

Bias-Variance?

  • If our model is too simple and has very few parameters, it may have high bias and low variance. On the other hand, if our model has a large number of parameters, it is likely to have high variance and low bias. So we need to find the right balance, neither overfitting nor underfitting the data. This tradeoff in complexity is why there is a tradeoff between bias and variance: an algorithm can't be more complex and less complex at the same time.
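For squared loss, this tradeoff is captured by the standard decomposition of expected test error (a general fact about squared error, stated here for reference):

Expected test error = Bias² + Variance + Irreducible error, i.e. E[(y − f̂(x))²] = (Bias[f̂(x)])² + Var[f̂(x)] + σ².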


Useful Links:

7. What is the Naive Bayes Algorithm? Using Bayes' theorem, we can find the probability of A happening given that B has occurred. Here, B is the evidence and A is the hypothesis. The assumption made is that the predictors/features are independent: the presence of one particular feature does not affect the others. Hence it is called naive.
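In symbols, Bayes' theorem gives P(A|B) = P(B|A) · P(A) / P(B), where P(A|B) is the posterior, P(B|A) the likelihood, P(A) the prior, and P(B) the evidence.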


Conclusion:

  • Naive Bayes algorithms are mostly used in sentiment analysis, spam filtering, recommendation systems, etc. They are fast and easy to implement, but their biggest disadvantage is the requirement that the predictors be independent. In most real-life cases the predictors are dependent, which hinders the performance of the classifier.
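A minimal scikit-learn sketch of a spam-filter-style classifier (the tiny corpus and labels are made up for illustration):

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

texts = ["win a free prize now", "meeting at noon tomorrow",
         "free offer click now", "lunch with the team"]
labels = [1, 0, 1, 0]  # 1 = spam, 0 = ham (illustrative)

# Word counts as features; MultinomialNB treats them as conditionally independent given the class
model = make_pipeline(CountVectorizer(), MultinomialNB()).fit(texts, labels)
print(model.predict(["free prize tomorrow"]))
```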

Useful Links:

8. Handling Imbalanced Data?

  1. Use the right evaluation metrics (e.g. precision, recall, F1 rather than accuracy).
  2. Resample the training set: under-sampling the majority class or over-sampling the minority class (see the sketch after this list).
  3. Use k-fold cross-validation in the right way (resample inside each fold, not before splitting).
  4. Ensemble models trained on different resampled datasets.
  5. Resample with different majority/minority ratios.
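A minimal sketch of over-sampling the minority class with scikit-learn's resample utility (the class sizes are illustrative; dedicated libraries such as imbalanced-learn offer further strategies like SMOTE):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.utils import resample

# Imbalanced data: roughly 95% class 0 and 5% class 1 (illustrative)
X, y = make_classification(n_samples=1000, weights=[0.95], random_state=0)

X_min, y_min = X[y == 1], y[y == 1]
X_maj, y_maj = X[y == 0], y[y == 0]

# Over-sample the minority class with replacement until the classes are balanced
X_min_up, y_min_up = resample(X_min, y_min, replace=True,
                              n_samples=len(y_maj), random_state=0)

X_bal = np.vstack([X_maj, X_min_up])
y_bal = np.hstack([y_maj, y_min_up])
print(np.bincount(y), np.bincount(y_bal))
```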

Useful Links:

9. What is k-fold cross-validation? Cross-validation is a resampling procedure used to evaluate machine learning models on a limited data sample. It is primarily used in applied machine learning to estimate the skill of a model on unseen data (a code sketch follows the procedure below).

The general procedure is as follows:

  • Shuffle the dataset randomly.
  • Split the dataset into k groups
  • For each unique group:
    • Take the group as a hold out or test data set
    • Take the remaining groups as a training data set
    • Fit a model on the training set and evaluate it on the test set
    • Retain the evaluation score and discard the model
  • Summarize the skill of the model using the sample of model evaluation scores
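A minimal scikit-learn sketch of 5-fold cross-validation (the model and dataset are illustrative):

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, cross_val_score

X, y = load_iris(return_X_y=True)

# Shuffle, split into k=5 folds, and evaluate the model on each held-out fold
kf = KFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=kf)

# Summarize the skill of the model using the sample of evaluation scores
print(scores, scores.mean(), scores.std())
```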

Useful Links:

10. What is Ensemble Learning? Explain Bagging and Boosting.

Ensemble Learning: the art of combining a diverse set of learners (individual models) to improve the stability and predictive power of the overall model.

Bagging and Boosting obtain N learners by generating additional data in the training stage: N new training data sets are produced by random sampling with replacement from the original set.

Bagging: a Bagging classifier is an ensemble meta-estimator that fits base classifiers, each on a random subset of the original dataset, and then aggregates their individual predictions (either by voting or by averaging) to form a final prediction. If the samples are drawn with replacement, the method is known as Bagging.

Boosting: the term 'Boosting' refers to a family of algorithms which convert weak learners into strong learners. Boosting is an ensemble method for improving the model predictions of any given learning algorithm. The idea is to train weak learners sequentially, each trying to correct the errors of its predecessor.
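A minimal scikit-learn sketch comparing the two ideas, using AdaBoost as one concrete boosting algorithm (both estimators default to decision trees as base learners; the data and settings are illustrative):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier, BaggingClassifier
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=1000, random_state=0)

# Bagging: trees trained independently on bootstrap samples, predictions voted/averaged
bagging = BaggingClassifier(n_estimators=100, random_state=0)

# Boosting: weak learners (decision stumps by default) trained sequentially,
# each focusing on the examples its predecessors got wrong
boosting = AdaBoostClassifier(n_estimators=100, random_state=0)

print(cross_val_score(bagging, X, y, cv=5).mean())
print(cross_val_score(boosting, X, y, cv=5).mean())
```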


Useful Links:

11. Explain Accuracy, Precision, Recall, ROC, F1, Confusion Matrix, RMSE?

  1. Accuracy: the proportion of all predictions that were correct.
  2. Positive Predictive Value or Precision: the proportion of predicted positive cases that are actually positive.
  3. Negative Predictive Value: the proportion of predicted negative cases that are actually negative.
  4. Sensitivity or Recall: the proportion of actual positive cases which are correctly identified.
  5. Specificity: the proportion of actual negative cases which are correctly identified.
  6. F1 Score: the harmonic mean of precision and recall for a classification problem.
  7. RMSE: the most popular evaluation metric for regression problems. It assumes that errors are unbiased and follow a normal distribution. Compared to mean absolute error, RMSE gives higher weight to, and therefore punishes, large errors.
  8. ROC: the ROC curve is a plot of sensitivity (TPR) against the false positive rate (FPR) at different classification thresholds (a computed example follows this list).
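A minimal scikit-learn sketch computing these metrics (the labels, predictions, and scores are made up for illustration):

```python
import numpy as np
from sklearn.metrics import (accuracy_score, confusion_matrix, f1_score,
                             mean_squared_error, precision_score,
                             recall_score, roc_auc_score)

y_true  = np.array([1, 0, 1, 1, 0, 0, 1, 0])
y_pred  = np.array([1, 0, 0, 1, 0, 1, 1, 0])
y_score = np.array([0.9, 0.2, 0.4, 0.8, 0.1, 0.6, 0.7, 0.3])  # predicted probabilities

print(confusion_matrix(y_true, y_pred))   # [[TN, FP], [FN, TP]]
print(accuracy_score(y_true, y_pred))
print(precision_score(y_true, y_pred))    # TP / (TP + FP)
print(recall_score(y_true, y_pred))       # TP / (TP + FN)
print(f1_score(y_true, y_pred))
print(roc_auc_score(y_true, y_score))     # area under the ROC curve

# RMSE, treating the scores as regression-style predictions of the true labels
print(np.sqrt(mean_squared_error(y_true, y_score)))
```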

Useful Links:

12. Explain the K-Means Algorithm. K-means is an iterative algorithm that tries to partition the dataset into K pre-defined, distinct, non-overlapping subgroups (clusters) where each data point belongs to only one group. It tries to make the intra-cluster data points as similar as possible while keeping the clusters as different (far apart) as possible.

The k-means algorithm works as follows (a code sketch follows the evaluation methods below):

  1. Specify the number of clusters K.
  2. Initialize centroids by shuffling the dataset and then randomly selecting K data points as centroids, without replacement.
  3. Keep iterating until the centroids stop changing: assign each data point to its closest centroid, then recompute each centroid as the mean of the points assigned to it.

Evaluation Method:

  • Elbow method
  • Silhouette analysis
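A minimal scikit-learn sketch of fitting k-means and checking cluster quality with the silhouette score (the data and choice of K are illustrative):

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

X, _ = make_blobs(n_samples=500, centers=3, random_state=0)

kmeans = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)

print(kmeans.cluster_centers_)   # final centroids
print(kmeans.inertia_)           # within-cluster sum of squares (used for the elbow method)
print(silhouette_score(X, kmeans.labels_))
```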

Useful Links:

13. What is a word embedding and how does Word2Vec work?

Before we understand what a word embedding is, we need to understand what an embedding is and why we need it.

  • At a very high level, a word embedding is a vector representation of a word where each value in the vector carries some weight. It can also be described as a learned representation of text in which words with similar meanings have similar representations.
  • One-hot encoding is another way to encode words, but each word occupies its own dimension and has nothing to do with the rest of the words. For example, "Hello" and "Hi" end up as different from each other as "day" and "country", which is not what we want.
  • One-hot encoding limitations:
    • High-dimensional and sparse: the feature vector grows with the vocabulary size.
  • Benefits of word embeddings:
    • Low-dimensional and dense: usually between 50 and 600 dimensions.
  • The main objective is to have words with similar contexts occupy close spatial positions. Mathematically, the cosine of the angle between such vectors should be close to 1, i.e. the angle should be close to 0.
  • Intuitively, we introduce some dependence of one word on other words; the words in the context of the current word get a greater share of this dependence. In one-hot encoded representations, all words are independent of each other.

How does Word2Vec work?

  • It is a method for constructing the embeddings described above. The embeddings can be obtained using two approaches, both involving a shallow neural network (see the code sketch at the end of this question):
    • Continuous Bag of Words (CBOW): this method takes the context of each word as the input and tries to predict the word corresponding to that context. It can take either a single context word or multiple context words to predict the target. In this model the hidden-layer neurons just copy the weighted sum of the inputs to the next layer; there is no activation like sigmoid, tanh or ReLU. The only non-linearity is the softmax in the output layer.

    • Skip-Gram: we use the target word (whose representation we want to generate) to predict its context words, and in the process we produce the representations. As in CBOW, the hidden layer is linear; the non-linearity again comes from the softmax at the output.

Who Wins?

  • Both have their own advantages and disadvantages. According to Mikolov, Skip-Gram works well with a small amount of data and represents rare words well. CBOW, on the other hand, is faster to train and has better representations for more frequent words.
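A minimal sketch using the gensim library (assuming gensim ≥ 4.0, where the dimensionality parameter is called vector_size; the tiny corpus is made up for illustration):

```python
from gensim.models import Word2Vec

# Toy corpus: each sentence is a list of tokens (illustrative only)
sentences = [["the", "king", "rules", "the", "kingdom"],
             ["the", "queen", "rules", "the", "kingdom"],
             ["dogs", "and", "cats", "are", "pets"]]

# sg=1 selects Skip-Gram; sg=0 (the default) selects CBOW
model = Word2Vec(sentences, vector_size=50, window=2, min_count=1, sg=1, epochs=100)

print(model.wv["king"].shape)                # 50-dimensional dense vector
print(model.wv.similarity("king", "queen"))  # cosine similarity between embeddings
```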

Useful Links:

14. Understanding LSTM

Useful Link:

15. What is feature scaling, which algorithms are affected by it, and what is the difference between normalization and standardization?

Why do we need Feature Scaling?

  • Some machine learning algorithms are sensitive to feature scaling while others are virtually invariant to it.

  • Gradient based Algorithms

    • Machine learning algorithms like linear regression, logistic regression, neural network, etc. that use gradient descent as an optimization technique require data to be scaled.
    • Having features on a similar scale can help the gradient descent converge more quickly towards the minima.
  • Distance based Algorithms

    • Distance-based algorithms like KNN, K-means, and SVM are most affected by the range of the features. This is because, behind the scenes, they use distances between data points to determine similarity.
    • Therefore, we scale our data before employing a distance-based algorithm so that all the features contribute equally to the result.
  • Tree based Algorithms

    • Tree-based algorithms, on the other hand, are fairly insensitive to the scale of the features. Think about it: a decision tree only splits a node on a single feature, choosing the feature that increases the homogeneity of the node. This split is not influenced by the other features.
    • Therefore, the scale of the remaining features has no effect on the split, which is what makes tree-based algorithms invariant to feature scale.
  • What is Normalization?

Normalization is a scaling technique in which values are shifted and rescaled so that they end up ranging between 0 and 1. It is also known as Min-Max scaling.
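In symbols, min-max scaling maps each value X to X' = (X − X_min) / (X_max − X_min), so the minimum of the feature becomes 0 and the maximum becomes 1.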


  • What is Standardization?

Standardization is another scaling technique where the values are centered around the mean with a unit standard deviation. This means that the mean of the attribute becomes zero and the resultant distribution has a unit standard deviation.
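In symbols, standardization maps each value X to X' = (X − μ) / σ, where μ is the mean and σ the standard deviation of the feature.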


Note: in this case, the values are not restricted to a particular range

  • When to use Normalization and Standardization?

    • Normalization is good to use when you know that the distribution of your data does not follow a Gaussian distribution. This can be useful for algorithms that do not assume any distribution of the data, like K-Nearest Neighbors and neural networks.
    • Standardization, on the other hand, can be helpful when the data follows a Gaussian distribution, although this does not have to be strictly true. Also, unlike normalization, standardization does not have a bounding range, so outliers are not squashed into a fixed interval the way they are with min-max scaling.
  • How to scale train and test data?

    • It is a good practice to fit the scaler on the training data and then use it to transform the testing data. This would avoid any data leakage during the model testing process. Also, the scaling of target values is generally not required.
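A minimal scikit-learn sketch of this practice: fit the scaler on the training split only, then apply the same transformation to the test split (the data is illustrative):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler, StandardScaler

X, y = make_classification(n_samples=500, n_features=5, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Standardization: fit on the training data only to avoid data leakage
scaler = StandardScaler().fit(X_train)
X_train_std = scaler.transform(X_train)
X_test_std = scaler.transform(X_test)   # reuse the training mean and std

# Normalization (min-max) works the same way
minmax = MinMaxScaler().fit(X_train)
X_train_mm = minmax.transform(X_train)
X_test_mm = minmax.transform(X_test)

print(X_train_std.mean(axis=0).round(2), X_train_std.std(axis=0).round(2))
```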

Useful Link:
