QUESTION ONE
[25 Marks]
a) Why are the 5 V's, rather than the 3 V's, the standard characteristics of big data technologies?
Explain.
10 marks
b) Write out how to represent the binary class data below using a NumPy array in Python.
Index      1        2        3        4        5        6        7        8        9        10
Actual     Dog      Not dog  Dog      Dog      Not dog  Dog      Dog      Dog      Not dog  Dog
Predicted  Dog      Dog      Not dog  Dog      Not dog  Not dog  Dog      Dog      Not dog  Not dog
5 marks
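A minimal sketch of one possible representation for b). It assumes the table reads, for indices 1 to 10, Actual = Dog, Not dog, Dog, Dog, Not dog, Dog, Dog, Dog, Not dog, Dog and Predicted = Dog, Dog, Not dog, Dog, Not dog, Not dog, Dog, Dog, Not dog, Not dog. The encoding 1 = "Dog" and 0 = "Not dog" is an assumption, as are the variable names.

```python
import numpy as np

# Assumed encoding: 1 = "Dog" (positive class), 0 = "Not dog" (negative class).
# Values follow one reading of the table, for indices 1 to 10.
actual = np.array([1, 0, 1, 1, 0, 1, 1, 1, 0, 1])
predicted = np.array([1, 1, 0, 1, 0, 0, 1, 1, 0, 0])

# Element-wise comparisons give the four confusion-matrix counts.
tp = int(np.sum((actual == 1) & (predicted == 1)))  # Dog predicted as Dog
tn = int(np.sum((actual == 0) & (predicted == 0)))  # Not dog predicted as Not dog
fp = int(np.sum((actual == 0) & (predicted == 1)))  # Not dog predicted as Dog
fn = int(np.sum((actual == 1) & (predicted == 0)))  # Dog predicted as Not dog

print(tp, tn, fp, fn)
```

Storing the labels as 0/1 integers keeps the arrays compatible with boolean masks and with most metric functions.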
c) Write short notes on how you would apply each of the following, using your data science knowledge:
i) Normalization
2 marks
ii) Discretization
2 marks
iii) Feature selection
2 marks
iv) Feature importance
2 marks
v) Standardization
2 marks
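As a rough illustration of three of the items in c), the sketch below applies min-max normalization, z-score standardization, and discretization to a single feature with NumPy. The feature values and the bin edges are hypothetical, chosen only to make the arithmetic easy to follow. Feature selection and feature importance are not covered here.

```python
import numpy as np

# Hypothetical feature values, for illustration only.
x = np.array([2.0, 4.0, 6.0, 8.0, 10.0])

# Normalization (min-max): rescale the values to the range [0, 1].
x_norm = (x - x.min()) / (x.max() - x.min())

# Standardization (z-score): shift to zero mean and scale to unit
# standard deviation (population standard deviation, NumPy's default).
x_std = (x - x.mean()) / x.std()

# Discretization: map each continuous value to the index of its bin.
bin_edges = np.array([4.0, 8.0])  # hypothetical cut points
x_bins = np.digitize(x, bin_edges)

print(x_norm)
print(x_std)
print(x_bins)
```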
QUESTION TWO
[25 Marks]
a) Differentiate between binary and multiclass classification in supervised learning. 2 marks
b) A set of 1100 pens contains 700 pens of the Parker brand, and the remaining pens are of
other brands. A binary classifier correctly identified the 700 Parker pens and incorrectly
identified 100 non-Parker pens as Parker.
(i) How many non-Parker pens were correctly identified?
2 marks
(ii) Construct the confusion matrix of the classifier.
2 marks
(iii) Calculate the following based on the confusion matrix in Question 2b(ii):
1) Accuracy
2 marks
2) Recall
2 marks
3) Precision
2 marks
4) F1-score
2 marks
5) Specificity
2 marks
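The arithmetic for the pen classifier in b) can be sketched as below. It assumes "Parker" is the positive class, so the 700 correctly identified Parker pens are true positives and the 100 misidentified non-Parker pens are false positives. The variable names are illustrative.

```python
# Counts taken from the question; "Parker" is assumed to be the positive class.
total = 1100
parker = 700

tp = 700                      # Parker pens identified as Parker
fn = parker - tp              # Parker pens missed
fp = 100                      # non-Parker pens identified as Parker
tn = (total - parker) - fp    # non-Parker pens correctly identified

accuracy = (tp + tn) / total
recall = tp / (tp + fn)
precision = tp / (tp + fp)
f1 = 2 * precision * recall / (precision + recall)
specificity = tn / (tn + fp)

print(tn, accuracy, recall, precision, f1, specificity)
```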
c) Write short notes on the key components of reinforcement learning, supported by a
diagram.
9 marks
QUESTION THREE
[25 Marks]
a) Write out the algorithm for implementing k-means clustering. 5 marks
b) List and explain five reasons why data quality is important in big data technologies.
10 marks
c) Given the data points in the table below, initialize the k-means clustering algorithm with
two cluster centers c1 = (2, 10) and c2 = (8, 4), using squared Euclidean distance. What
are the values of c1 and c2 after one iteration of k-means clustering? What are the
values of c1 and c2 after the second iteration of k-means clustering?
10 marks
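One k-means iteration (assign each point to the nearest center by squared Euclidean distance, then move each center to the mean of its points) could be sketched as follows. The initial centers come from the question. The four data points are hypothetical placeholders and are not the data from the question's table, which is not reproduced here. The function name `kmeans_step` is also an assumption.

```python
import numpy as np

def kmeans_step(points, centers):
    """Run one k-means iteration and return (labels, updated centers)."""
    # dists[i, j] = squared Euclidean distance from point i to center j
    dists = ((points[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    labels = dists.argmin(axis=1)

    new_centers = centers.astype(float).copy()
    for j in range(len(centers)):
        members = points[labels == j]
        if len(members) > 0:  # keep the old center if a cluster is empty
            new_centers[j] = members.mean(axis=0)
    return labels, new_centers

# Hypothetical placeholder points, NOT the data from the question's table.
points = np.array([[2.0, 10.0], [2.0, 8.0], [8.0, 4.0], [6.0, 4.0]])

# Initial centers c1 and c2 as given in the question.
centers = np.array([[2.0, 10.0], [8.0, 4.0]])

labels, centers_after_1 = kmeans_step(points, centers)
_, centers_after_2 = kmeans_step(points, centers_after_1)

print(centers_after_1)
print(centers_after_2)
```

Calling the function twice mirrors the two iterations asked for in c). With the real data points substituted, the two returned arrays give c1 and c2 after each iteration.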