Q. If a dataset has 200 points and you apply K-means clustering with K=4, how many points will be assigned to each cluster on average?
A.50
B.40
C.60
D.30
Solution
If K=4 and there are 200 points, on average, each cluster will have 200/4 = 50 points assigned to it.
Correct Answer: A — 50
Q. If the distance between two clusters in hierarchical clustering is defined as the maximum distance between points in the clusters, what linkage method is being used?
A.Single linkage
B.Complete linkage
C.Average linkage
D.Centroid linkage
Solution
The method that defines the distance between two clusters as the maximum distance between points in the clusters is called complete linkage.
Correct Answer: B — Complete linkage
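The three pairwise linkage rules can be sketched in a few lines of Python. This is a toy 1-D example with illustrative function names (not from any particular library):

```python
from itertools import product

def single_linkage(a, b):
    # Minimum pairwise distance between points of the two clusters.
    return min(abs(x - y) for x, y in product(a, b))

def complete_linkage(a, b):
    # Maximum pairwise distance: the definition used in this question.
    return max(abs(x - y) for x, y in product(a, b))

def average_linkage(a, b):
    # Mean of all pairwise distances.
    return sum(abs(x - y) for x, y in product(a, b)) / (len(a) * len(b))

c1, c2 = [1.0, 2.0], [5.0, 9.0]
print(single_linkage(c1, c2))    # 3.0  (pair 2 and 5)
print(complete_linkage(c1, c2))  # 8.0  (pair 1 and 9)
print(average_linkage(c1, c2))   # 5.5  ((4 + 8 + 3 + 7) / 4)
```

Note how complete linkage reports the farthest pair, which tends to produce compact, evenly sized clusters.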
Q. In a K-means clustering algorithm, if you have 5 clusters and 100 data points, how many centroids will be initialized?
A.5
B.100
C.50
D.10
Solution
In K-means clustering, the number of centroids initialized is equal to the number of clusters. Therefore, if there are 5 clusters, 5 centroids will be initialized.
Correct Answer: A — 5
Q. In hierarchical clustering, what does 'agglomerative' mean?
A.Clusters are formed by splitting larger clusters
B.Clusters are formed by merging smaller clusters
C.Clusters are formed randomly
D.Clusters are formed based on a predefined distance
Solution
Agglomerative hierarchical clustering starts with each data point as its own cluster and merges them into larger clusters based on similarity.
Correct Answer: B — Clusters are formed by merging smaller clusters
Q. In hierarchical clustering, what does agglomerative clustering do?
A.Starts with all data points as individual clusters and merges them
B.Starts with one cluster and splits it into smaller clusters
C.Randomly assigns data points to clusters
D.Uses a predefined number of clusters
Solution
Agglomerative clustering begins with each data point as its own cluster and progressively merges them based on their similarities.
Correct Answer: A — Starts with all data points as individual clusters and merges them
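The merge process can be sketched as a toy agglomerative routine on 1-D points (single linkage here; the function name and data are illustrative assumptions, not from the source):

```python
def agglomerate(points):
    """Toy agglomerative clustering on 1-D points.

    Starts with every point as its own cluster and repeatedly merges
    the two closest clusters (single linkage) until one remains.
    Returns the sequence of merges, i.e. the dendrogram's merge order.
    """
    clusters = [[p] for p in points]
    merges = []
    while len(clusters) > 1:
        # Find the pair of clusters with the smallest linkage distance.
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = min(abs(x - y) for x in clusters[i] for y in clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        d, i, j = best
        merges.append((clusters[i][:], clusters[j][:], d))
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return merges

# The two tightest pairs merge first, then the two resulting clusters.
for a, b, d in agglomerate([1.0, 1.5, 5.0, 5.2]):
    print(a, b, d)
```

For n points this always performs exactly n - 1 merges, which is why the result is naturally drawn as a tree (dendrogram).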
Q. In hierarchical clustering, what does the term 'dendrogram' refer to?
A.A type of data point
B.A tree-like diagram that shows the arrangement of clusters
C.A method of calculating distances
D.A clustering algorithm
Solution
A dendrogram is a tree-like diagram that illustrates the arrangement of clusters formed during hierarchical clustering.
Correct Answer: B — A tree-like diagram that shows the arrangement of clusters
Q. In hierarchical clustering, what does the term 'linkage' refer to?
A.The method of assigning clusters to data points
B.The distance metric used to measure similarity
C.The strategy for merging clusters
D.The number of clusters to form
Solution
Linkage in hierarchical clustering refers to the strategy used to determine the distance between clusters, which affects how clusters are merged.
Correct Answer: C — The strategy for merging clusters
Q. In hierarchical clustering, what is the difference between agglomerative and divisive methods?
A.Agglomerative starts with individual points, divisive starts with one cluster
B.Agglomerative merges clusters, divisive splits clusters
C.Both A and B
D.None of the above
Solution
Agglomerative clustering starts with individual points and merges them into clusters, while divisive clustering starts with one cluster and splits it into smaller clusters, so both statements A and B are correct.
Correct Answer: C — Both A and B
Q. In hierarchical clustering, what is the result of the agglomerative approach?
A.Clusters are formed by splitting larger clusters
B.Clusters are formed by merging smaller clusters
C.Clusters are formed randomly
D.Clusters are formed based on a predefined number
Solution
The agglomerative approach in hierarchical clustering starts with individual data points and merges them into larger clusters based on similarity.
Correct Answer: B — Clusters are formed by merging smaller clusters
Q. In K-means clustering, what happens if K is set too high?
A.Clusters become too large
B.Overfitting occurs
C.Underfitting occurs
D.No effect
Solution
If K is set too high, the model may overfit the data, resulting in too many clusters that do not generalize well.
Correct Answer: B — Overfitting occurs
Q. In which scenario would hierarchical clustering be preferred over K-means?
A.When the number of clusters is known
B.When the dataset is very large
C.When a hierarchy of clusters is desired
D.When the data is strictly numerical
Solution
Hierarchical clustering is preferred when a hierarchy of clusters is desired, as it provides a tree-like structure of the data.
Correct Answer: C — When a hierarchy of clusters is desired
Q. What is a common application of clustering in real-world scenarios?
A.Spam detection in emails
B.Predicting stock prices
C.Image classification
D.Customer segmentation
Solution
Customer segmentation is a common application of clustering, where businesses group customers based on purchasing behavior or demographics.
Correct Answer: D — Customer segmentation
Q. What is the effect of outliers on K-means clustering?
A.They have no effect on the clustering results
B.They can significantly distort the cluster centroids
C.They improve the clustering accuracy
D.They help in determining the number of clusters
Solution
Outliers can significantly distort the cluster centroids in K-means clustering, leading to inaccurate clustering results.
Correct Answer: B — They can significantly distort the cluster centroids
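Because K-means centroids are arithmetic means, a single extreme value can pull a centroid far from the bulk of its points. A minimal illustration (toy numbers chosen for the example):

```python
points = [1.0, 2.0, 3.0]
with_outlier = points + [100.0]  # one extreme value added

centroid = sum(points) / len(points)               # mean of the clean data
distorted = sum(with_outlier) / len(with_outlier)  # mean with the outlier
print(centroid, distorted)  # 2.0 vs 26.5
```

The distorted centroid no longer represents any of the original points, which is why median-based variants such as K-medoids are more robust to outliers.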
Q. What is the main criterion for determining the optimal number of clusters in K-means?
A.Silhouette score
B.Elbow method
C.Both A and B
D.None of the above
Solution
Both the Silhouette score and the Elbow method are commonly used criteria for determining the optimal number of clusters in K-means clustering.
Correct Answer: C — Both A and B
Q. What is the main difference between K-means and hierarchical clustering?
A.K-means is a partitional method, while hierarchical is a divisive method
B.K-means requires the number of clusters to be defined, while hierarchical does not
C.K-means can only be used for numerical data, while hierarchical can handle categorical data
D.K-means is faster than hierarchical clustering for small datasets
Solution
K-means is a partitional clustering method that divides data into a fixed number of clusters, while hierarchical clustering builds a tree of clusters without needing to specify the number of clusters in advance.
Correct Answer: B — K-means requires the number of clusters to be defined, while hierarchical does not
Q. What is the primary goal of the K-means clustering algorithm?
A.Minimize the distance between points in the same cluster
B.Maximize the distance between different clusters
C.Both A and B
D.None of the above
Solution
The primary goal of K-means clustering is to minimize the distance between points in the same cluster while maximizing the distance between different clusters.
Correct Answer: C — Both A and B
Q. What is the purpose of the elbow method in K-means clustering?
A.To determine the optimal number of clusters
B.To visualize the clusters formed
C.To assess the performance of the algorithm
D.To preprocess the data before clustering
Solution
The elbow method is used to determine the optimal number of clusters by plotting the within-cluster sum of squares (inertia) against the number of clusters and identifying the 'elbow' point where the curve flattens.
Correct Answer: A — To determine the optimal number of clusters
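The elbow method can be sketched end to end with a tiny pure-Python K-means. Everything here (the seeding scheme, the toy data, the helper names) is an illustrative assumption, not a reference implementation:

```python
def lloyd(points, k, iters=20):
    # Tiny 1-D K-means. Initial centroids are spread evenly over the
    # data range (real libraries use smarter seeding, e.g. k-means++).
    lo, hi = min(points), max(points)
    centroids = [(lo + hi) / 2] if k == 1 else [lo + (hi - lo) * i / (k - 1) for i in range(k)]
    for _ in range(iters):
        # Assignment step: each point goes to its nearest centroid.
        groups = [[] for _ in range(k)]
        for p in points:
            j = min(range(k), key=lambda i: abs(p - centroids[i]))
            groups[j].append(p)
        # Update step: each centroid moves to the mean of its group.
        centroids = [sum(g) / len(g) if g else c for g, c in zip(groups, centroids)]
    return centroids, groups

def inertia(centroids, groups):
    # Within-cluster sum of squared distances: the quantity plotted
    # against K in an elbow chart.
    return sum((p - c) ** 2 for c, g in zip(centroids, groups) for p in g)

data = [1.0, 1.2, 0.8, 8.0, 8.3, 7.9, 15.0, 15.2]  # three visible groups
for k in (1, 2, 3, 4):
    print(k, round(inertia(*lloyd(data, k)), 3))
```

On this data the inertia drops sharply up to K=3 and flattens afterwards; that bend in the curve is the "elbow" that suggests the cluster count.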
Q. What type of data is K-means clustering best suited for?
A.Categorical data
B.Numerical data
C.Text data
D.Time series data
Solution
K-means clustering is best suited for numerical data, as it relies on calculating distances between data points.
Correct Answer: B — Numerical data
Q. Which clustering method is more suitable for discovering nested clusters?
A.K-means clustering
B.Hierarchical clustering
C.DBSCAN
D.Gaussian Mixture Models
Solution
Hierarchical clustering is more suitable for discovering nested clusters, as it creates a tree structure that can reveal relationships at various levels of granularity.
Correct Answer: B — Hierarchical clustering
Q. Which clustering method is more suitable for discovering non-globular shapes in data?
A.K-means clustering
B.Hierarchical clustering
C.DBSCAN
D.Gaussian Mixture Models
Solution
DBSCAN is particularly effective for discovering clusters of varying shapes and sizes, making it suitable for non-globular data distributions.
Correct Answer: C — DBSCAN
Q. Which evaluation metric is commonly used to assess the quality of clustering results?
A.Accuracy
B.Silhouette score
C.F1 score
D.Mean squared error
Solution
The Silhouette score is a popular metric for evaluating clustering quality, measuring how similar an object is to its own cluster compared to other clusters.
Correct Answer: B — Silhouette score
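The silhouette computation itself is simple enough to sketch directly. This is a toy 1-D version under stated assumptions (absolute-difference distances, hypothetical function name):

```python
def silhouette(points, labels):
    """Mean silhouette coefficient for labeled 1-D points.

    For each point: a = mean distance to the other members of its own
    cluster, b = mean distance to the nearest other cluster, and
    s = (b - a) / max(a, b). Values near 1 mean tight, well separated
    clusters; negative values suggest misassigned points.
    """
    clusters = sorted(set(labels))
    scores = []
    for i, (p, lab) in enumerate(zip(points, labels)):
        own = [points[j] for j in range(len(points)) if labels[j] == lab and j != i]
        if not own:
            scores.append(0.0)  # common convention for singleton clusters
            continue
        a = sum(abs(p - q) for q in own) / len(own)
        b = min(
            sum(abs(p - points[j]) for j in range(len(points)) if labels[j] == other)
            / labels.count(other)
            for other in clusters if other != lab
        )
        scores.append((b - a) / max(a, b))
    return sum(scores) / len(scores)

good = silhouette([1.0, 1.1, 9.0, 9.1], [0, 0, 1, 1])  # labels match the groups
bad = silhouette([1.0, 9.0, 1.1, 9.1], [0, 0, 1, 1])   # labels cut across them
print(good, bad)  # the well separated labeling scores far higher
```

Libraries such as scikit-learn expose the same idea as `silhouette_score` for arbitrary-dimensional data.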
Q. Which of the following is a characteristic of K-means clustering?
A.It can produce overlapping clusters
B.It is deterministic and produces the same result every time
C.It can handle noise and outliers effectively
D.It partitions data into non-overlapping clusters
Solution
K-means clustering partitions data into non-overlapping clusters, assigning each data point to the nearest centroid.
Correct Answer: D — It partitions data into non-overlapping clusters
Q. Which of the following is a disadvantage of K-means clustering?
A.It is sensitive to outliers
B.It requires the number of clusters to be specified in advance
C.It can converge to local minima
D.All of the above
Solution
All of the listed options are disadvantages of K-means clustering, making it sensitive to outliers, requiring prior knowledge of the number of clusters, and potentially converging to local minima.
Correct Answer: D — All of the above
Q. Which of the following is a disadvantage of the K-means algorithm?
A.It can handle large datasets efficiently
B.It requires the number of clusters to be specified in advance
C.It is sensitive to outliers
D.It can be used for both supervised and unsupervised learning
Solution
A key disadvantage of K-means is that it requires the user to specify the number of clusters beforehand, which may not always be known.
Correct Answer: B — It requires the number of clusters to be specified in advance
Q. Which of the following is a limitation of the K-means algorithm?
A.It can handle non-spherical clusters
B.It requires the number of clusters to be specified in advance
C.It is computationally efficient for large datasets
D.It can be used for both supervised and unsupervised learning
Solution
A key limitation of K-means is that it requires the number of clusters to be specified beforehand, which can be challenging in practice.
Correct Answer: B — It requires the number of clusters to be specified in advance
Q. Which of the following is NOT a common distance metric used in clustering?
A.Euclidean distance
B.Manhattan distance
C.Cosine similarity
D.Logistic distance
Solution
Logistic distance is not a standard distance metric used in clustering; common metrics include Euclidean, Manhattan, and Cosine similarity.
Correct Answer: D — Logistic distance
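The three legitimate options can be sketched as plain Python functions (toy 2-D vectors; note that cosine similarity is a similarity, from which a "cosine distance" of 1 - similarity is usually derived):

```python
import math

def euclidean(u, v):
    # Straight-line distance: sqrt of summed squared differences.
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def manhattan(u, v):
    # City-block distance: sum of absolute differences.
    return sum(abs(a - b) for a, b in zip(u, v))

def cosine_similarity(u, v):
    # Angle-based similarity; 1 means same direction, 0 means orthogonal.
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

u, v = [3.0, 0.0], [0.0, 4.0]
print(euclidean(u, v))          # 5.0
print(manhattan(u, v))          # 7.0
print(cosine_similarity(u, v))  # 0.0 (orthogonal vectors)
```

There is no standard "logistic distance" in clustering, which is what makes option D the odd one out.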
Q. Which of the following is NOT a method of linkage in hierarchical clustering?
A.Single linkage
B.Complete linkage
C.Average linkage
D.Random linkage
Solution
Random linkage is not a recognized method of linkage in hierarchical clustering; the common methods include single, complete, and average linkage.
Correct Answer: D — Random linkage
Q. Which of the following is NOT a step in the K-means clustering algorithm?
A.Assigning data points to the nearest centroid
B.Updating the centroid positions
C.Calculating the silhouette score
D.Choosing the initial centroids
Solution
Calculating the silhouette score is not a step in the K-means algorithm; it is an evaluation metric used after clustering.
Correct Answer: C — Calculating the silhouette score
Q. Which of the following methods can be used to evaluate the quality of clusters formed by K-means?
A.Silhouette score
B.Davies-Bouldin index
C.Both A and B
D.None of the above
Solution
Both the Silhouette score and the Davies-Bouldin index are methods used to evaluate the quality of clusters formed by K-means.
Correct Answer: C — Both A and B
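The Davies-Bouldin index is also easy to sketch. This toy 1-D version (hypothetical function name, absolute-difference distances) averages, over clusters, the worst ratio of within-cluster scatter to between-centroid separation; lower is better:

```python
def davies_bouldin(clusters):
    """Davies-Bouldin index for 1-D clusters given as lists of points.

    For each pair (i, j): R_ij = (S_i + S_j) / M_ij, where S is a
    cluster's mean distance to its centroid and M is the distance
    between the two centroids. The index averages max_j R_ij over i.
    """
    cents = [sum(c) / len(c) for c in clusters]
    scatter = [sum(abs(p - m) for p in c) / len(c) for c, m in zip(clusters, cents)]
    k = len(clusters)
    worst = []
    for i in range(k):
        worst.append(max(
            (scatter[i] + scatter[j]) / abs(cents[i] - cents[j])
            for j in range(k) if j != i
        ))
    return sum(worst) / k

tight = davies_bouldin([[1.0, 1.2], [9.0, 9.2]])  # compact, far apart
loose = davies_bouldin([[1.0, 5.0], [6.0, 9.0]])  # spread out, close together
print(tight, loose)  # the well separated clustering gets the lower index
```

Unlike the silhouette score (higher is better), a smaller Davies-Bouldin value indicates better clustering, so the two are read in opposite directions.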