01. In a research project, Professor Smith is analyzing a large corpus of scientific articles. He wants to remove common words like “the,” “is,” and “a,” which do not contribute much to the analytic value of the text. Which text preprocessing step should Professor Smith use?
a) Tokenization
b) Stemming
c) Lemmatization
d) Removing stop words
02. Which of these is one of the main differences between administrative and transactional data?
a) Transactional data is event-based and tends to change more frequently.
b) Administrative data is only about finances.
c) Transactional data is generated by internal operations.
d) Administrative data is always public.
03. For an imbalanced dataset, why can accuracy be considered a misleading metric?
a) It always underestimates model performance.
b) It may simply reflect the class distribution.
c) It overcomplicates the evaluation process.
d) It is computationally too demanding to calculate.
04. Xiaojing frequently watches romantic comedies. A movie recommender system uses this information to suggest other romantic comedies to her. Which of these approaches is the system using?
a) User-user collaborative filtering
b) Item-item collaborative filtering
c) Content-based filtering
d) Hybrid filtering
05. What does it mean for two vectors to be linearly independent?
a) One vector can be written as a linear combination of the other.
b) Neither vector is a scalar multiple of the other, so their linear combinations are not confined to a single line.
c) The vectors exist on the same line and have the same direction.
d) The dot product of the vectors is 0.
06. After building several predictive models to identify potential financial fraud, Juan needs to select the best model based on its performance. Which phase of the CRISP-DM framework is Juan most likely in?
a) Data understanding
b) Modeling
c) Evaluation
d) Deployment
07. You are provided with a 95% confidence interval for a population mean. What does the confidence level indicate?
a) The probability that the sample mean is equal to the population mean
b) The probability that the population mean lies within the interval
c) The percentage of the sample that lies within the interval
d) The range of values within which the population mean is expected to lie
08. Why is class imbalance in training data a problem for supervised machine learning algorithms?
a) It makes it difficult to learn the patterns that differentiate the minority class from the majority class.
b) It increases the computational time the algorithm needs to learn the difference between the minority and majority classes.
c) It forces the model to overfit to the minority class.
d) It automatically makes the model less accurate.
09. Your logistics company relies heavily on location data. How could geocoding be utilized to enhance your operational efficiency?
a) By importing geographical coordinates from public data sources
b) By importing address data from postal route data
c) By consolidating multiple datasets into a single database
d) By converting warehouse addresses into geographical coordinates
10. Karen is using a linear regression model for her research. During her analysis, she suspects that the error terms in her model might be correlated, which could violate an important assumption. Which of these tests should Karen use to check this assumption?
a) Shapiro–Wilk test
b) Durbin–Watson test
c) Pearson correlation test
d) Chi-square test