DSA821S - DATA SCIENCE AND ANALYTICS - 2ND OPP - JAN 2025 :: NUST past examination papers between 2021 and 2025

Expand document

Collapse document

DSA821S - DATA SCIENCE AND ANALYTICS - 2ND OPP - JAN 2025

1 Page 1

2 Page 2

3 Page 3

4 Page 4

5 Page 5

DSA821S - DATA SCIENCE AND ANALYTICS - 2ND OPP - JAN 2025

1 Page 1

▲back to top

nAmI BI AunIVER s ITY

OF SCIEnCE Ano TECHnOLOGY

FACULTY OF COMPUTING AND INFORMATICS

DEPARTMENT OF INFORMATICS

QUALIFICATION: Bachelor of Informatics Honours

QUALIFICATION CODE: 08BIHW, 08BIFB

COURSE:Data Science and Analytics

DATE: 2024/2025

DURATION: 2 Hours

COURSELEVEL: NQF LEVEL8

COURSECODE: DSA821S

SESSION: 1

MARKS: 60

EXAMINER(S):

SUPPLEMENTARY/SECOND OPPORTUNITY

EXAMINATION QUESTION PAPER

MR. SEBASTIAN MUKUMBIRA

MODERATOR (S):

MS. EMILIA SHIKEENGA

THIS EXAMINATION PAPERCONSISTSOF 5 PAGES

(INCLUDING THIS FRONT PAGE)

Instructions for the students

1. There are four Sections in this paper Section 1, Section 2, Section 3 and Section 4.

2. Answer ALL the questions in ALL Sections.

3. Write clearly and neatly.

4. Number the answers clearly.

5. Non-programmable calculators may be used.

2 Page 2

▲back to top

Question 1: Exploratory Data Analysis (EDA) [15 marks]

A dataset contains infom1ation on the sales of a retail store for a year, including columns for Date,

Product, Quantity Sold, and Revenue. The first few rows of the dataset are as follows:

Date

2023-01-01

2023-01-0 I

2023-01-02

2023-01-03

Product

Quantity Sold

Revenue

600

400

500

300

700

(a) Calculate the total revenue generated from each product for the first three days of sales. (3 marks)

(b) Determine the average quantity sold per day for each product. (4 marks)

(d) Discuss two potential insights you can draw from the EDA of this dataset. (2 marks)

(e) Propose one data cleaning step that could improve the quality of this dataset. (2 marks)

Question 2: Hypothesis Testing [15 marks]

A retail company wants to know if there is a significant association between the type of advertisement

used and the purchase decision made by customers. The data collected is summarised in the following

contingency table:

Advertisement Type

Online

Purchase (Yes)

Purchase (No)

(a) State the null and alternative hypotheses for the above mentioned market research. (2 marks)

(b) Calculate the expected frequencies for each cell in the contingency table. Show your calculations.

(5 marks)

can you conclude? (5 marks)

Use the following formula for chi-square statistic calculation:

Where O = observed frequencies, E = expected frequencies.

(d) Discuss one assumption of the Chi-square test tha~ should be considered. (2 marks)

(e) Suggest a follow-up analysis if a significant association is found. (1 mark)

3 Page 3

▲back to top

Question 3: Clustering Analysis (15 marks]

Consider the following dataset representing customers based on their annua1 income and spending score:

Customer ID

Annual Income

15,000

20,000

25,000

30,000

35,000

Spending Score

(a) Calculate the Euclidean distance between Customer I and Customer 2 based on the provided

features. (3 marks)

(b) Explain the concept of K-means clustering and its application in customer segmentation. (4

marks)

clusters? Discuss. (4 marks)

(d) Demonstrate one iteration of the K-means clustering algorithm for this dataset, assuming three

clusters. Your answer should include:

• Selection of initia1 centroids .

• Assignment of data points to clusters

• Recalculation of cluster centroids

Clearly explain your process and show all necessary calculations. (3 marks)

(e) Discuss one potential drawback of using K-means clustering. (1 mark)

Question 4: Decision Trees [15 marks]

Consider a dataset that includes features like Age, Income, and Loan Status (Approved or Rejected) for

loan applications:

Age Income Loan Status

25 30000 Approved

30 40000 Approved

35 50000 Rejected

40 60000 Approved

45 70000 Rejected

(a) Explain the concept of a decision tree and how it is used for classification tasks. (3 marks)

(b) Calculate the Gini impurity for the Loan Status variable in the dataset. Show your calculations. (5

marks)

Use the following formula for Gini impmity ca1cu1ation:

G = 1 - L;P7

i=l

where p; is the probability of class i.

one method. (3 marks)

4 Page 4

▲back to top

(d) Describe how overfitting can occur in decision trees and propose one technique to mitigate it. (3

marks)

(e) Provide an example of a real-world application where decision trees could be effectively used. (1

mark)

5 Page 5

▲back to top

Chi-Square Distribution Table

The shaded area is equal to a: for x2 = x~-

X.995

1 0.000

2 0.010

3 0.072

4 0.207

5 0.412

6 0.676

7 0.989

8 1.344

9 1.735

10 2.156

11 2.603

12 3.074

13 3.565

14 4.075

15 4.601

16 5.142

17 5.697

18 6.265

19 6.844

20 7.434

21 8.034

22 8.643

23 9.260

24 9.886

25 10.520

26 11.160

27 11.808

28 12.461

29 13.121

30 13.787

40 20.707

50 27.991

60 35.534

70 43.275

80 51.172

90 59.196

100 67.328

X~ggo

0.000

0.020

0.115

0.297

0.554

0.872

1.239

1.646

2.088

2.558

3.053

3.571

4.107

4.660

5.229

5.812

6.408

7.015

7.633

8.260

8.897

9.542

10.196

10.856

11.524

12.198

12.879

13.565

14.256

14.953

22.164

29.707

37.485

45.442

53.540

61.754

70.065

X~975

0.001

0.051

0.216

0.484

0.831

1.237

1.690

2.180

2.700

3.247

3.816

4.404

5.009

5.629

6.262

6.908

7.564

8.231

8.907

9.591

10.283

10.982

11.689

12.401

13.120

13.844

14.573

15.308

16.047

16.791

24.433

32.357

40.482

48.758

57.153

65.647

74.222

X~950

0.004

0.103

0.352

0.711

1.145

1.635

2.167

2.733

3.325

3.940

4.575

5.226

5.892

6.571

7.261

7.962

8.672

9.390

10.117

10.851

11.591

12.338

13.091

13.848

14.611

15.379

16.151

16.928

17.708

18.493

26.509

34.764

43.188

51.739

60.391

69.126

77.929

X~goo

0.016

0.211

0.584

1.064

1.610

2.204

2.833

3.490

4.168

4.865

5.578

6.304

7.042

7.790

8.547

9.312

10.085

10.865

11.651

12.443

13.240

14.041

14.848

15.659

16.473

17.292

18.114

18.939

19.768

20.599

29.051

37.689

46.459

55.329

64.278

73.291

82.358

X~100

2.706

4.605

6.251

7.779

9.236

10.645

12.017

13.362

14.684

15.987

17.275

18.549

19.812

21.064

22.307

23.542

24.769

25.989

27.204

28.412

29.615

30.813

32.007

33.196

34.382

35.563

36.741

37.916

39.087

40.256

51.805

63.167

74.397

85.527

96.578

107.565

118.498

X~oso

3.841

5.991

7.815

9.488

11.070

12.592

14.067

15.507

16.919

18.307

19.675

21.026

22.362

23.685

24.996

26.296

27.587

28.869

30.14•1

31.410

32.671

33.924

35.172

36.415

37.652

38.885

40.113

41.337

42.557

43.773

55.758

67.505

79.082

90.531

101.879

113.145

124.342

X~o2s

5.024

7.378

9.348

11.143

12.833

14.449

16.013

17.535

19.023

20.483

21.920

23.337

24.736

26.119

27.488

28.845

30.191

31.526

32.852

34.170

35.479

36.781

38.076

39.364

40.646

41.923

43.195

44.461

45.722

46.979

59.342

71.420

83.298

95.023

106.629

118.136

129.561

X 010

6.635

9.210

11.345

13.277

15.086

16.812

18.475

20.090

21.666

23.209

24.725

26.217

27.688

29.141

30.578

32.000

33.409

34.805

36.191

37.566

38.932

40.289

41.638

42.980

44.314

45.642

46.963

48.278

49.588

50.892

63.691

76.154

88.379

100.425

112.329

124.116

135.807

I X~oos

7.879

10.597

12.838

14.860

16.750

18.548

20.278

21.955

23.589

25.188

26.757

28.300

29.819

31.319

32.801

34.267

35.718

37.156

38.582

39.997

41.401

42.796

44.181

45.559

46.928

48.290

49.645

50.993

52.336

53.672

66.766

79.490

91.952

104.215

116.321

128.299

140.169