## STAT0030 Assessment 3

统计Assessment代写 For this assesment you should submit online – on the course Moodle page using the link “ICA3: Click here to submit your assignment”.

**Instructions**

For this assesment you should submit online – on the course Moodle page using the link “ICA3: Click here to submit your assignment”. Make sure none of the fifiles contains your surname, as the marking must be anonymous. You must submit two fifiles:

- An electronic copy of your StudentNumber.rmd fifile, containing your R markdown code. For example, if your student number is 18239004, your R markdown script should be saved in the fifile 18239004.rmd.
- A single PDF fifile named StudentNumber.pdf containing the knitted output of the Rmarkdown fifile. This should correspond
**exactly**to what is produced when knitting the submitted .rmd fifile.

Any output within your pdf should be clearly presented and structured according to the question parts.

**Marking guidelines 统计Assessment代写**

The assessment is marked out of 100, with Questions 1, 2 and 3 having a total of 30, 20 and 50 marks, respectively. The marks for each question are subdivided into the following components.

1.For Question 1, items (a), (b) and (c) have a breakdown of 15, 10 and 5 marks, respectively.

2.For Question 2, items (a), (b) and (c) have a breakdown of 2, 8 and 10 marks, respectively.

3.For Question 3, items (a), (b), (c), and (d) have a breakdown of 25, 5, 5, 15 marks, respectively.

For each coding part, graphical presentation (appropriate choice of graphs and formatting), quality of printed output (appropriate messages printed) and quality of the code (your code should be clean, readable – with suffiffifficient commenting for the user – and effiffifficient) are factors to be considered.

**Questions 统计Assessment代写**

**Background. **

Clustering and dimensionality reduction are often seen as complementary tasks in unsupervised learning, but they can also be combined. One approach is the *mixture of factor **analysers¹* . The idea is to start with the usual mixture of Gaussians (as seen in Section 2.3 of Lab 8), but where each Gaussian is a probabilistic variant of PCA. What does it mean? We will start fifirst by describing the basic factor model.

The likelihood function is therefore a function of *µ*, *B *and the diagonal of *D*. The dataset is *{***X**_{1}*, . . . , ***X _{N}**

*}*, since no

**Y**

_{i}*is ever observed. The relationship between the Gaussian factor model and PCA is sometimes explained in terms of the conditional expectation E[*

**Y**

_{i}

*|*

**x**]. Under some conditions on

_{i}*D*, this conditional expectation is the same as the PCA projection². The use of factor models in statistical inference is sometimes called

*factor analysis*, with

*factor*

*analyser*being sometimes used a synonym for factor model.

**Questions. **

The following questions will use the mixture of factor analysers package EMMIXmfa, which can be installed in the usual way by install.packages(“EMMIXmfa”). All exercises will be based on the pre-loaded dataset iris. You can check its documentation using ?iris. In all that follows, we will use function mfa from EMMIXmfa. Read the documentation using ?mfa, where an example with iris is already shown. *It is your job to understand the documentation, **including some differences in notation. *Setting nkmeans = 3 and nrandom = 0 in all calls to mfa will speed things up and be enough for iris, and for this exam you MUST use this setting to make sure the submitted code is reproducible.

**1. 统计Assessment代写**

This question concerns fifitting and evaluating a mfa model.

(a) Fit a mixture of factor analysers with *M *= 2 and *K *= 3 for the fifirst four columns of iris. Don’t scale the data. Make sure that the model uses a difffferent *B *matrix for each mixture component (which is *not *the default of mfa). From the fifitted model, extract all the corresponding parameters *θ *to simulate data based on the model. You need to program the simulation explicitly, i.e., your code should be based on the sampling procedure of the previous page.

You are allowed to use functions sample and rnorm. You must provide your code in terms of a function simulate mfa that takes as arguments N, the number of samples; and mfa model, the output of a call to mfa. The function must output a *N **× **J *matrix of your samples. Apply this function to the fifitted model to simulate a sample of the same sample size as iris.

(b) Visually compare the result of this simulation against the real data to provide an informal goodness-of-fifit assessment. Plots of interest include quantile-quantile plots and scatterplots. Use your judgement and write one short paragraph (recommended maximum of 200 words) with your conclusions.

(c) Without providing any code, explain brieflfly how you would do model selection to decide on a choice of *K *and *M *(recommended maximum of 200 words).

**2. 统计Assessment代写**

Let us now use the model for clustering (under the hood, the logic is the same as in a mixture of Gaussians: for each **x_{i}**

*: the cluster assignment is decided by the value*

*z*

_{i}*that maximises*

*P*(

*Z*

_{i}*=*

*z*

_{i}

*|*

**x**)). As in the previous question, keep

_{i}*K*= 3 and

*M*= 2.

(a) Use function predict to generate cluster assignments from the output of mfa as applied to the fifirst four columns of iris. Run kmeans on the same, unscaled, columns of iris, again using 3 clusters. We want to evaluate the agreement between the two clusterings. However, given that this is unsupervised learning, the cluster *labels *for both kmeans and mfa are arbitrary. Function ari can be used to compare classififications with arbitrary labels, so use it to report the agreement between the two clusterings, and each clustering against the labels reported in column Species.

(b) Discuss the results obtained, and how you explain the difffferences. Describe one way by which you could improve the agreement between the kmeans selection and the true labels, and why it would work (recommended maximum of 200 words).

(c) Generate a new dataset iris noisy, made of the fifirst four columns of iris and ten columns of independent standard Gaussians. Use the randomForest function from the respective package seen in Lab 7 to learn and evaluate the ability of this model to predict iris$Species from iris noisy. To simplify, there is no need to fifiddle with hyperparameters or test sets: just look at the out-of-bag error evaluation. Report on the difffference between the random forest evaluation and what the ari function reports for mfa also applied to this modifified dataset. Provide a brief explanation for the difffferences obtained (recommended maximum of 200 words).

**3. 统计Assessment代写**

One interesting property of having a model (as in the mixture of factor analysers) as opposed to just the solution of a projection problem (as in PCA) is that the model has a natural way of dealing with missing data. In particular, clustering is still applicable even if some variables in **x***i *are not recorded. Furthermore, we can use the model to estimate those missing records, a procedure sometimes known as *data imputation*.

(a) Using mathematical results provided at the Technical Hints section at the end of this document, write a function mfa impute that takes two arguments: dat, a data frame which may have any possible combination of NA entries; and mfa model, the output of a call to mfa.

(b) Randomly partition iris into two datasets, iris1 with 100 data points and iris2 with the remaining 50 data points. Fit mfa to the data corresponding to the fifirst four 4columns of iris1 using *K *= 3 and *M *= 2, storing the output in a variable called mfa model. Modify iris2 so that, for each row of iris2, you choose uniformly at random exactly one of its four columns to be set to NA, calling the resulting dataset iris3. Call mfa impute with (the fifirst four columns of) iris3 and mfa model as arguments, storing the result in a variable called iris3 imputed.

**HOW TO PREPARE YOUR SUBMISSION. **

You must provide a single .rmd fifile including all of your code and answers to explanatory questions. **The fifile MUST contain ****exactly one chunk of code for each of the three questions, which implements ****and runs all the required steps in that question in a self-contained way.** Each **chunk MUST start with the line **set.seed(12345) **to make it reproducible. **

The clarity of the printed output will be taken into consideration (e.g., in Question 3(d), you may want to print messages specifying which numbers correspond to which mean squared errors). It is up to your judgement to decide how the printed output should be organised. Answers to explanatory points within each question should appear in a single block of text, after the code and output for that respective question. Explanatory points should be presented at a level that can be understood easily by somebody with a MSc in Statistics.

Code comments should clarify which parts of each question are being solved. When submitting, knit the corresponding .rmd into a .pdf, including all code and output. Both the .rmd and the .pdf will need to be submitted.

For instance, for Question 1, prepare a single chunk of code which i) fifits iris using mfa; ii) defifines function simulate mfa; iii) generates a simulated dataset; iv) provides the visual comparison between simulated and real data by your own judgement of what should be plotted. The code will be followed by a block of text with the explanatory aspects of Questions 1(b) and 1(c). For Question 1, the .pdf fifile will show all of that question’s code, followed by its output, followed by its textual explanations. This will then be followed by the respective steps for Questions 2 and 3.

**General hints 统计Assessment代写**

1.

In general, there is not a single “right” answer to each question. To obtain a good mark you should approach the questions sensibly and justify what you’re doing. Credit will be given for code that is clear and readable, while code that is inadequately commented will be penalised. You might like to use scripts cosapprox.r (Lab 1) and tablet.r (Lab 3) as models.

2.

This assessment is designed to test your ability to understand data analysis algorithms in an applied context, their limitations and ways to combine then. This will be assessed not only on your computing skills, but also on your ability to carry out a critical assessment of what these models and algorithms can accomplish, using a simple dataset as an example. To earn high marks for this question, you need to take a structured and critical approach to the analysis and to demonstrate appropriate judgement in your choice of material to present.

**3. 统计Assessment代写**

Marks will be deducted if your .pdf fifile does not correspond *exactly *to the results we obtain when we knit the .rmd. You should *not *call install.packages anywhere in your fifile, but do all necessary calls to library. Marks will be deducted otherwise.

4.

More credit will usually be given for code that is more generally applicable, rather than tailored to a particular situation or set of data. For example, if you were asked to print out the mean age of a group of people, you could do either of the following:

Calculate the mean before you write your fifinal script, and then insert a line cat(“Mean age is 25.3\n”)

(or whatever the mean happens to be) into your script.

- In your script, create an object (say xbar) that holds the mean age, and then insert the line

cat(paste(“Mean age is”,xbar,”\n”))

into your script.

The second approach is clearly more general and will earn more credit, since it will work for other similar data also.

**5. **

All graphs should be clearly and appropriately labelled (giving units of quantitative variables), titled and formatted. By ‘appropriately formatted’ we mean, for example, that axis scales should be well chosen.

6.

Your program should be **well commented**. If you have defifined functions, these should consist of a header section summarising the logical structure, followed by the main body of the script. The main body should itself contain comments.

7.

Refer to the feedback you received/will receive on in-course assessments 1 and 2.

**Technical hints 统计Assessment代写**

where *g*(**x ***| **µ, *Σ) is the multivariate normal pdf with mean *µ *and covariance Σ evaluated at **x**.

2.You are not expected to write very complicated algorithms for this assessment. Instead, we want you to be able to demonstrate your understanding of the logic and limitations of difffferent methods and the contrast between supervised and unsupervised learning.

3.You are allowed to use R libraries for this assessment. However, you need to make sure that you fully understand what each function is doing and brieflfly explain it either in your writeup or in your code. Use of *any *libraries beyond what we have used in STAT0030 (or ones mentioned in these instructions) will result in a lower mark.

³The Wikipedia page on multivariate normals is particularly good with explanations, see https://en.wikipedia.org/wiki/Multivariate_normal_distribution.

更多代写：Css代写 统计北美代考 英国PS论文代写 网课essay代写论文 Dissertation 高分代写 文书PS框架代写