100 Questions that will help you Smash your Data Analysis and Data Science Job Interviews
Here is a compiled list of 100 common Data Analysis and Data Science questions, with answers, to help guide you in your data career and knowledge acquisition. There is no doubt that Data Analysis and Data Science skills are in high demand, and as that demand increases year after year, there is also a need for people to be equipped with the right skills and knowledge for job interviews and the other screening processes used to fill those roles.
These questions were compiled to help Data Scientists Network (DSN), Port Harcourt community members leverage their data skills. They are aimed at both freshers and experienced candidates who want to improve their knowledge and prepare for their dream jobs.
The 100 questions are structured as objective (multiple-choice) questions, with answers in bold font to help readers grasp the answers. Readers can also copy out these questions and restructure them for personal practice:
Which of the following is a characteristic of Raw Data? A) Data is ready for analysis B) Original version of data C) Easy to use for data analysis D) None of the mentioned
Which of the following is not a step in data analysis? A) Obtain the data B) Clean the data C) EDA D) None of the mentioned
Which of the following programming languages is used for Data Analysis? A) Tkinter B) R C) HTML D) None of the mentioned
The main goal of Data Science is _____? A) To raise AI talents B) To solve mathematical problems C) To get insight D) To build machines
Data Science is equal to Coding? True or False
Which of the following skills is not needed for Data Science? A) Coding B) Statistics C) Domain knowledge D) None of the mentioned
In the field of Data Science, we can say that presenting is not the same as exploring. True or False
Which of the following does not signify Web Data? A) HTML B) JSON C) XLSX D) XML
Which of the following is not a visualization tool for data analysis? A) Tableau B) YPlot C) Power BI D) None of the mentioned
Which of these is not a major challenge of data science and analysis? A) Insufficient data B) Poor tools C) Poor quality of data D) Irrelevant features
Which of these is not a major programming language used in Data Science? A) R B) Python C) MS Excel D) SQL
In machine learning, a confusion matrix has two dimensions. What are they? A) Actual and Mixed B) Actual and Predicted C) Precise and Specific D) Prediction and Confusion
Deep Learning involves? A) Systems that involve labelled datasets B) Using artificial natural networks C) Using Sklearn for Machine Learning D) Using artificial neural networks
The following are applications of Supervised Machine Learning in modern businesses except: A) Sentiment analysis B) Healthcare diagnosis C) Pattern detection D) Fraud detection
Amazon is able to recommend products to its customers because? A) The customers need quality references B) The association algorithm identifies patterns C) It has a big platform that meets all customers' needs D) Numerous customers visit Amazon to buy and then inform others about its services
The following are regression instances except? A) When variables are continuous in nature B) To estimate the sale of a product C) To estimate the amount of rainfall D) To estimate the gender of a person
The algorithm that operates by constructing multiple decision trees during the training phase is? A) Support Vector Machine B) Decision Tree C) Random Forest D) Clustering
The following algorithms can be used for categorical output except? A) Random Forest B) K Means Clustering C) KNN D) Naive Bayes
Given a database of customer data, you are asked to automatically discover market segments and group customers into them. Which type of Machine Learning approach will you consider? A) Classification B) Unsupervised Learning algorithm C) Test Validation D) Regression approach
Which of these algorithms can be used for clustering problems? A) PCA B) KNN C) Random Forest D) Decision Tree
You are running a company and you want to develop a learning algorithm for software that examines individual customer accounts and decides whether each one has been hacked or compromised. How will you treat this problem? A) Treat as a Classification problem B) Treat as both Classification and Regression C) Treat as a Regression problem D) Treat as an unsupervised problem
A type of machine learning in which an agent learns how to behave in an environment by performing actions and seeing the results is referred to as? A) Deep Learning B) Supervised Learning C) Reinforcement Learning D) Unsupervised Learning
You want to predict how many items will sell over the next 3 months. What kind of problem is it? A) Classification and Continuous problem B) Selection problem C) Decision tree problem D) Regression problem
Given a dataset of patients diagnosed as having diabetes or not, what kind of Machine Learning algorithm can we use to develop a model for this? A) Principal Component Analysis B) Linear Regression C) Logistic Regression D) Confusion Matrix
Suppose you are to build a program that filters your emails by marking them as spam or not spam, learning from given answers. What is the task in this setting? A) This is not a Machine Learning problem B) Fitting the model answers to an algorithm C) Classifying emails as spam or not spam D) The number of emails correctly classified
The following are performance metrics for classification problems except? A) Confusion Matrix B) F1 Score C) RMSE D) AUC
The following are performance metrics for regression problems except? A) Mean Absolute Error B) Standard Error C) Mean Square Error D) R Squared
When building a Machine Learning model, after getting your data, which of the following steps is the most important? A) Model Evaluation B) Model Training C) Data pre-processing D) Data ingestion
Some of the most important applications of classification algorithms are as follows except? A) Speech recognition B) Forecasting oil prices C) Handwriting recognition D) Biometric identification
Which of these statements is true about the performance of a machine learning model in relation to the data features? A) The use of relevant features can decrease the accuracy of your model B) Performing feature selection before data modeling will decrease the model accuracy C) The performance of a Machine Learning model is directly proportional to the data features D) Data features cause over-fitting in the model
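Several of the machine learning questions above refer to the confusion matrix and to classification metrics. As a minimal sketch in plain Python, using made-up spam labels that do not come from any real dataset, the two dimensions of a confusion matrix and a few common metrics could be computed like this:

```python
from collections import Counter

# Hypothetical labels for illustration only: 1 = spam, 0 = not spam.
actual = [1, 0, 1, 1, 0, 0, 1, 0]
predicted = [1, 0, 0, 1, 0, 1, 1, 0]

# Count each (actual, predicted) pair: these are the two dimensions.
pairs = Counter(zip(actual, predicted))
tp = pairs[(1, 1)]  # spam correctly flagged
fn = pairs[(1, 0)]  # spam that was missed
fp = pairs[(0, 1)]  # legitimate mail flagged as spam
tn = pairs[(0, 0)]  # legitimate mail correctly passed

precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)
accuracy = (tp + tn) / len(actual)

print("rows = actual, columns = predicted")
print([[tn, fp], [fn, tp]])
print(f"precision={precision:.2f} recall={recall:.2f} f1={f1:.2f} accuracy={accuracy:.2f}")
```

In practice a library such as scikit-learn would usually do this counting for you; the hand-rolled version is only meant to show where the numbers come from.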
In a DataFrame, each variable can be seen as? A) Tuple B) Column C) Rows D) Entity
Which of the following statements about a DataFrame is not correct? A) It is the core pandas data structure B) Different value types can exist within a single column C) Different columns can contain different data types D) Values within a single column are of the same data type
Which of these methods subsets a DataFrame using row and column numbers? A) .loc[ ] B) df[ ] C) .iloc[ ] D) None of the mentioned
The following will give the first 5 observations of the DataFrame df except? A) df.head() B) df.head(6) C) df.head(5) D) print(df.head())
This method will return summary statistics for numeric columns? A) df.summary() B) df.count() C) df.describe() D) df.stats()
This attribute returns a tuple of the number of rows followed by the number of columns? A) df.columns B) df.('rows','columns') C) df.columns() D) df.shape
This will extract the data values in the form of a 2D NumPy array? A) df.extract() B) df.np() C) df.values D) None of the mentioned
This will return the column names? A) df.columns B) df.column_names() C) df(columns) D) None of the mentioned
To subset multiple columns, column1 and column2, of df? A) df[(columns) B) df.subset(column1, column2) C) df[['column1', 'column2']] D) df[column1, column2]
To add a new column, column3, to the DataFrame df by adding column1 and column2, we have? A) column3 = df[column1 + column2] B) df[column1 + column2] C) df['column3'] = df['column1'] + df['column2'] D) None of the mentioned
To drop rows with duplicate values in column1, we use? A) df.drop_duplicates(subset='column1') B) df.drop(column1) C) df.drop.duplicates.column1 D) None of the mentioned
To count unique values in column1 of the DataFrame df? A) df.value.counts.column1 B) df['column1'].value_counts() C) df.values['column1'].count() D) None of the mentioned
To set column1 as the index column of df? A) df=column1.set_index B) df=index_column() C) df.set_index('column1') D) None of the mentioned
The correct way of importing matplotlib is? A) import matplotlib as plt B) Import Matplotlib.pyplot C) import matplotlib.pyplot as plt D) None of the mentioned
Which of these will give the number of missing values in each column of df? A) df.count_missing_values() B) df.isna().sum() C) df.sum(missing_values) D) None of the mentioned
You can load a csv into a DataFrame using this pandas function? A) pandas=load(csv) B) pd.load_csv() C) pd.read_csv() D) None of the mentioned
You can write to a csv file using? A) df.to_csv() B) df.write_to() C) df.write_to_csv() D) None of the mentioned
This method allows variables to be grouped, similar to the groupby() method? A) .sum() B) .pivot_table() C) .avg() D) None of the mentioned
What is the function of plt.show()? A) To display class values B) To show null points C) To display the plot D) None of the mentioned
This takes a value as an argument and replaces each missing value with it? A) df.ffillna() B) df.fill_na() C) df.replace() D) None of the mentioned
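The pandas questions above are easiest to check by running the calls on a tiny DataFrame. The sketch below assumes pandas is installed; the data and the column names column1, column2 and column3 are placeholders chosen to match the wording of the questions, not a real dataset.

```python
import io

import pandas as pd

# Small made-up DataFrame; values are illustrative only.
df = pd.DataFrame({
    "column1": [1, 2, 2, 4],
    "column2": [10.0, None, 30.0, 40.0],
})

print(df.head())        # first 5 rows (only 4 exist here)
print(df.shape)         # tuple of (rows, columns)
print(df.columns)       # column labels
print(df.describe())    # summary statistics for numeric columns
print(df.isna().sum())  # number of missing values per column

subset = df[["column1", "column2"]]            # select columns by name
first_cell = df.iloc[0, 0]                     # select by row/column position
df["column3"] = df["column1"] + df["column2"]  # new column from two others

deduped = df.drop_duplicates(subset="column1")  # keep first row per column1 value
counts = df["column1"].value_counts()           # occurrences of each value
filled = df.fillna(0)                           # replace missing values with 0
indexed = df.set_index("column1")               # use column1 as the index
array_2d = df.values                            # underlying 2D NumPy array

# Round trip through CSV text instead of a file on disk.
csv_text = df.to_csv(index=False)               # no path given: returns a string
reloaded = pd.read_csv(io.StringIO(csv_text))
```

Most of these calls return a new object rather than changing df in place, which is why each result is assigned to its own name.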
Important characteristics of Structured Data are? A) Generality B) Dimensionality C) Resolution D) All of the above
What are some examples of data quality problems? A) Noise and outliers B) Duplicate data C) Missing values D) All of the above
In standardization, the features will be rescaled with? A) Mean 0 and Variance 0 B) Mean 0 and Variance 1 C) Mean 1 and Variance 0 D) Mean 1 and Variance 1
Which one is a feature extraction example? A) Constructing a bag of words model B) Imputation of missing values C) Principal component analysis D) All of the above
Why do we need feature transformation? A) Converting non-numeric features into numeric B) Resizing inputs to a fixed size C) Both A and B D) None
The correct way of preprocessing the data should be? A) Imputation -> feature scaling -> training B) Feature scaling -> imputation -> training C) Feature scaling -> label encoding -> training D) None
Some of the imputation methods are? A) Imputation with mean/median B) Imputing with random numbers C) Imputing with one D) All of the above
What is a Dummy Variable Trap? A) Multicollinearity among the dummy variables B) One variable predicts the value of other C) Both A and B D) None of the above
Which of the following is/are feature scaling techniques? A) Standardization B) Normalization C) Min-Max Scaling D) All of the above
How can missing values in a dataset be handled? A) Dropping the missing rows or columns B) Imputation with mean/median/mode value C) Taking missing values into a new row or column D) All of the above
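To make the preprocessing questions above more concrete, here is a minimal sketch in plain Python of one common ordering: impute missing values first, then standardize. The feature values are invented for illustration, and mean imputation is only one of the imputation methods mentioned above.

```python
from statistics import mean, pstdev

# Hypothetical feature with one missing value (None).
raw = [2.0, 4.0, None, 6.0, 8.0]

# Step 1: impute the missing value with the mean of the observed values.
observed = [v for v in raw if v is not None]
fill = mean(observed)
imputed = [fill if v is None else v for v in raw]

# Step 2: standardize (z-scores) so the feature has mean 0 and variance 1.
mu = mean(imputed)
sigma = pstdev(imputed)  # population standard deviation
scaled = [(v - mu) / sigma for v in imputed]

print(imputed)
print(scaled)
print(mean(scaled), pstdev(scaled))
```

Scaling before imputing would not work here as written, because the mean and standard deviation cannot be computed while a None is still in the list.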
PANDAS stands for _______? A) Panel Data B) Panel Dashboard C) Panel Data Analyst D) Panel Data Analysis
Pandas' key data structure is called? A) DataFrame B) KeyFrame C) Statistics D) Econometrics
Pandas is an open-source _______ library? A) Java B) Python C) jQuery D) JavaScript
NumPy stands for? A) Numerical Python B) Number In Python C) Numbering Python D) None of the above
NumPy was developed by? A) Jim Hugunin B) Wes McKinney C) Travis Oliphant D) Guido van Rossum
Which of the following NumPy operations is/are correct? A) Operations related to linear algebra B) Mathematical and logical operations on arrays C) Fourier transforms and routines for shape manipulation D) All of the above
NumPy is often used along with packages like? A) Node.js B) SciPy C) Matplotlib D) Both B and C
Which of the following is contained in the NumPy library? A) Fourier transform B) n-dimensional array object C) Tools for integrating C/C++ and Fortran code D) All of the mentioned
Which of the following attributes should be used when checking the input and output type combinations? A) .types B) .class C) .type D) None of the above
Which of the following functions stacks 1D arrays as columns into a 2D array? A) column_stack B) com_stack C) row_stack D) All of the above
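As a small illustration of the last two NumPy questions above, and assuming NumPy is installed, the sketch below stacks two made-up 1D arrays as columns and looks up the input/output type combinations of a universal function (np.add is used here only as an example).

```python
import numpy as np

a = np.array([1, 2, 3])
b = np.array([4, 5, 6])

# Each 1D array becomes one column of the 2D result.
stacked = np.column_stack((a, b))
print(stacked)
print(stacked.shape)

# A ufunc lists its supported input->output type combinations as strings.
type_combinations = np.add.types
print(type_combinations[:3])
```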
What is the result of the following: int(3.99)? A) 3.99 B) 3 C) 3.9 D) 3.0
What is the result of the following operation: 11//2? A) 5.5 B) 5 C) 5.6 D) 5.0
What is the result of the following: "hello Mike".find("Mike")? A) 6,7,8 B) 5 C) 6 D) 4,4
Consider the following tuple: say_what = ('say', 'what', 'you', 'will'). What is the result of this: say_what[-1]? A) 'will' B) 'say' C) 'what' D) 'you'
Consider the following tuple: A = (1,2,3,4,5). What is the result of this line of code: A[1:4]? A) (2,3,4,5) B) (2,3,4) C) (3,4,5) D) (1,2,3,4)
Consider the following tuple: A = (1,2,3,4,5). What is the result of the following: len(A)? A) 4 B) 6 C) 5 D) 6
Consider the following list: B = [1, 2, [3, 'a'], [4, 'b']]. What is the result of the following: B[3][1]? A) [4, 'b'] B) "c" C) "b" D) [a, b]
Dict = {"A": 1, "B": "2", "C": [3, 3, 3], "D": (4, 4, 4), "E": 5, "F": 6}. What is the result of the following operation: Dict["D"]? A) 1 B) [3,3,3] C) (4,4,4) D) 4
Consider the following set: {"A", "A"}. What will be the result when the set is created? A) {"A"} B) {"A", "A"} C) ("A", "A") D) {}
What is the result of the following: type(set([1,2,3]))? A) set B) list C) str D) dict
What method do you use to add an element to a set? A) append B) extend C) add D) merge
What is the result of the following operation: {'a', 'b'} & {'a'}? A) {'a', 'b'} B) {'a', 'b', 'a'} C) {'a'} D) {}
Consider the tuple A = ((1), [2,3], [4]), which contains a number and two lists. What is the result of the following operation: A[2]? A) [4] B) [2,3] C) 1 D) [1,2,3,4]
Consider the tuple A = ((11,12), [21,22]), which contains a tuple and a list. What is the result of the following: A[0][1]? A) 21 B) 11 C) 12 D) 22
Consider the following list: A = ["hard rock", 10, 1.2]. What will list A contain after the following command is run: del(A[1])? A) [10, 1.2] B) ["hard rock", 1.2] C) ["hard rock", 10] D) [10]
What is the result of this logic: True or False? A) True B) False C) Both D) None
Why do we use exception handlers? A) To write a file B) To read a file C) To catch errors within a program D) To terminate a program
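The Python questions above can all be verified in an interpreter. The short script below collects a few of them in one place; the variable names s and items are chosen for this sketch and do not appear in the questions.

```python
print(int(3.99))                  # int() drops the fractional part
print(11 // 2)                    # floor division returns an integer
print("hello Mike".find("Mike"))  # index where the match starts

A = (1, 2, 3, 4, 5)
print(A[1:4])                     # slice: start included, stop excluded
print(A[-1], len(A))              # last element and length

B = [1, 2, [3, "a"], [4, "b"]]
print(B[3][1])                    # index into the nested list

s = {"A", "A"}                    # duplicates collapse into one element
s.add("B")                        # sets use add(), not append()
print({"a", "b"} & {"a"})         # & is set intersection

items = ["hard rock", 10, 1.2]
del items[1]                      # removes the element at index 1
print(items)

# An exception handler catches an error instead of crashing the program.
try:
    int("not a number")
except ValueError as err:
    print("caught:", err)
```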
Which of the following skills is not part of Data Science skills? A) Storytelling B) Critical Thinking C) Philosophy D) Statistics
Which of these best describes what a data scientist does? A) Converts raw data into usable data B) Drives decisions that benefit the business C) Uses data to generate insights D) Uses insight to drive business decisions
Artificial Intelligence is a subset of Machine Learning? True or False
Which of the following statements is true? A) Data Science is the same as Artificial Intelligence B) Deep Learning is an umbrella of Data Science C) Data Science uses AI, Machine Learning and Deep Learning D) All of the mentioned
The following are ways of obtaining data except: A) Visualization B) Web scraping C) Experiments D) Surveys
Big Data involves the 3Vs: velocity, volume and variety. Which of the following statements is correct? A) Data Science relates to the 3Vs of big data B) Data Science does not relate to big data C) Data Science can only relate to the Volume of big data D) Data Science can only relate to the Variety of big data
Statistics and Data Science have ____ in common? A) Machine Learning B) Analysis C) Deep Learning D) Coding
What is the ultimate purpose of analytics? A) To evangelize data science B) To facilitate meetings between sales and marketing C) To communicate findings to those concerned D) To generate reports
Which of the following is performed by a Data Scientist? A) Define the question B) Create reproducible code C) Challenge results D) All of the mentioned
What is a training set in a Machine Learning model? A) It is used to test the accuracy B) It is 30% of the dataset C) It is labelled data used in the model D) It is used to verify the dataset
What is the output of the following lines of code?
a = 1
def do(x):
    return (x + a)
print(do(1))
A) 2 B) 3 C) 1 D) 0
What is the output of the following:
for x in ['A', 'B', 'C']:
    print(x + 'A')
A) xA
xB
xC
B) AA
BA
CA
C) A, B, C
D) AA, BA, CA
What is the output of the following few lines of code:
x = 0
while (x < 2):
    print(x)
    x = x + 1
A) 0
1
B) 0
1
2
C) 1, 2, 3
D) 0
I believe these 100 questions will help to empower your data science and data analysis career. Good luck on your career journey.
About the Author
Gospel Orok is a data specialist and an AI enthusiast with working experience in the e-commerce and engineering industries. He is also an advocate for specialised skills in data analysis, data science and data engineering for delivering high impact solutions. He is currently one of the Community leads of Data Scientists Network (DSN). He is open to collaborations with organisations and individuals to build an AI ecosystem that develops high human capacity impact.
Feel free to join the DSN community and also get in touch with me on LinkedIn and Twitter: Orok Gospel