Validating multi-photon quantum interference with finite data

Multi-particle interference is a key resource for quantum information processing, as exemplified by Boson Sampling. Hence, given its fragile nature, an essential desideratum is a solid and reliable framework for its validation. However, while several protocols have been introduced to this end, the approach remains fragmented and does not yet form a coherent picture to guide future developments. In this work, we propose an operational approach to validation that encompasses and strengthens the state of the art for these protocols. To this end, we consider Bayesian hypothesis testing and the statistical benchmark as the most favorable protocols for small- and large-scale applications, respectively. We numerically investigate their operation with finite sample size, extending previous tests to larger dimensions, and against two adversarial algorithms for classical simulation: the Mean-Field sampler and the Metropolized Independent Sampler. To demonstrate the actual need for refined validation techniques, we show how the assessment of numerically simulated data depends on the available sample size, as well as on the internal hyper-parameters and other practically relevant constraints. Our analyses provide general insights into the challenge of validation, and can inspire the design of algorithms with a measurable quantum advantage.

While the sampling task itself has been thoroughly analyzed in computational complexity theory, we still lack a comparable understanding of its validation. It is clear from a practical perspective, however, that any computational problem designed to demonstrate quantum advantage needs to be formulated together with a set of validation protocols that account for the physical ramifications and resources required for its implementation. For instance, while small-scale examples can be validated by direct solution of the Schrödinger equation and by statistical measures such as the cross-entropy [6], this approach is prohibitively expensive for debugging a faulty Boson Sampler. Moreover, for Boson Sampling a deterministic certification is impossible [24] by the very definition of the problem [20]. Hence, it is crucial to develop debugging tools, as well as tests to exclude undesired hypotheses on the system producing the output, that are computationally affordable and experimentally feasible. Furthermore, due to random fluctuations inherent to any finite-size problem, a validation cannot be considered reliable until sufficient physical resources are spent to obtain reasonable experimental uncertainties. Ultimately, no computational problem can provide evidence of quantum advantage unless quantitative validation criteria can be stated.
In this work, we investigate the problem of validating multi-photon quantum interference in realistic scenarios with finite data. The paper is structured as follows: first, we discuss possible ambiguities in the validation of Boson Sampling, which play a crucial role in large-size experiments. Then, building upon state-of-the-art validation protocols, we address the above considerations with a more quantitative analysis. We describe a practical approach to validation that makes the most of the limited physical resources available. Specifically, we study the use of the statistical benchmark [30] and of Bayesian hypothesis testing [31] to validate n-photon interference for large and small n, respectively. We numerically investigate their operation against classical algorithms that simulate quantum interference, with a particular focus on the number of measurements. The reported analysis underscores the need for a well-defined approach to validation, both to demonstrate quantum advantage and to assist applications that involve multi-photon states.

II. VALIDATION OF BOSON SAMPLING: FRAMEWORK
Our aim, in the context of Boson Sampling, is the unambiguous identification of a quantum advantage in a realistic scenario. We focus on the task of validation, or verification, whose aim is to check whether measured experimental data are compatible with what can be expected from a given physical model. Validation generally requires fewer resources and is thus more appropriate for practical applications than full certification, which is exponentially hard in n for Boson Sampling [20,50]. In both cases, these claims must follow a well-defined protocol to distill experimental evidence that is accepted by the community under jointly agreed criteria [51] (Fig. 1). As we discuss below and in Sec. III, we propose an application-oriented approach to validation that takes into consideration the limited physical resources, be they related to the evaluation of permanents [52] or to finite sample size [50]. In fact, without such well-defined approaches, obstacles or ambiguities may arise in large-scale experiments, as we highlight in the following. For instance, not all validation protocols are computationally efficient, which is a strong limitation for future multi-photon applications or high-rate real-time monitoring. Also, a theoretically scalable validation protocol may still be experimentally impractical due to large instrumental overheads or large prefactors that enter the scaling law.
Given two validation protocols V_1 and V_2 that rule out the same physical hypothesis or model, what conclusion can be drawn if they agree for a data set of given size and unexpectedly disagree when we add more data? In principle we can accept or reject a data set once we reach a certain level of confidence, but which action is to be taken if this threshold is not reached after a large number of measurement events (which hereafter we refer to as the "sample size")? Shall we proceed until we pass that level, shall we reject the data set, or shall we make a guess on the available data? Finally, what if the classical algorithm becomes more effective at simulating Boson Sampling for larger data sets, as for Markov chains [47], or for longer processing times, as for adversarial machine learning algorithms [54] that could exploit specific vulnerabilities of validation protocols?
However artificial some of the above questions may seem, such a skeptical approach was indeed already adopted [25] and addressed [26][27][28][29][30][35][36][37] with the Mean-Field sampler (see Sec. V A): all these considerations are necessary to strengthen the claim of quantum advantage. Under the above premise, we therefore identify the following crucial features to be assessed in any decision on acceptance or rejection: 1. Sample size S. The strength of a validation protocol is affected by the limited number S of collected events, as compared to the total number of distinct n-photon output events. While this limitation is not relevant for small-scale implementations, due to (i) the low dimension of the Hilbert space, (ii) a high level of control and (iii) reduced losses, it represents one of the main bottlenecks for the actually targeted large-scale instances [55]. It is thus desirable to assess the robustness and the resilience of a protocol under such incomplete-sampling effects, to quantify the impact of the always strictly finite experimental resources on the protocol's actual range of applicability. We therefore propose to define a (minimal) threshold sample size S which must be available for validation. Given a set of S events, a validation protocol must be capable of giving a reliable answer within a certain confidence level.
2. Available sampling time T. While the sampling rate is nearly constant for current quantum and classical approaches [48], de facto making the time T irrelevant, it cannot be excluded that future algorithms may process data and output all events at once. The very quality of the simulation, i.e. the similarity to quantum Boson Sampling in a given metric, could also improve with processing time [47,54]. Ultimately, T must be treated as a parameter independent of S, while at the same time it should be adapted to the sample size required for a reliable validation.
3. Unitary U. Unitary evolutions should be drawn Haar-randomly by a third agent at the start of the competition, to avoid any preprocessing. This agent, the validator (V), uses specific validation protocols to decide whether a sample is compatible with quantum operation.
In the setting thus defined, a data set is said to be validated according to the following rule (Fig. 1a): Boson Sampling is validated if, collecting S events in time T from some random unitary U, it is accepted by all selected validators V.
Given a unitary and a set of validation protocols, we are then left with the choice of S and T, which need to be plausible by current technological standards. Demanding to sample S events in time T, these thresholds in fact limit the size of the problem (n, m) for an experimental implementation. As for the time T, one possibility, feasible for quantum experiments, could be for instance one hour. Within this time, a quantum device will probably output events at a nearly constant rate, while a classical computer can output them at any rate allowed by its clock cycle time. The choice of the sample size S is more intricate, since too high a value collides with the limited T, while too low a value implies an unreliable validation V. With these or further considerations [56], classical and quantum samplers should agree upon a combination of (n, m, S, T) that allows them to validate their operation.

III. VALIDATION WITH FINITE SAMPLE SIZE
In this section, we investigate a convenient approach to validation that distinguishes between two regimes: up to n ∼ 30 (Sec. III A) and beyond n ∼ 30 (Sec. III B). In each case, we first summarize the main ideas behind the protocol's operation. Then, we discuss its performance for various (n, m), highlighting strengths and limitations, by numerically simulating experiments with finite sample size and distinguishable or indistinguishable photons.

A. Bayesian tests for small-scale experiments
The Bayesian approach to Boson Sampling validation (V_B), introduced in Ref. [31], aims to identify the most likely of two alternative hypotheses that model the multi-photon states under consideration. In particular, V_B tests the Boson Sampling hypothesis (H_Q), which assumes fully indistinguishable n-photon states, against an alternative hypothesis (H_A) for the source that produces the measurement outcomes {x}. Equal probabilities are assigned to the two hypotheses prior to the experiment. Let us denote with p_Q(x_k) (p_A(x_k)) the scattering probability associated with the output state x_k under H_Q (H_A). The intuition is that, if H_Q is the most suitable model for the experiment, it is more likely to collect events for which p_Q(x_k) > p_A(x_k). The idea is made quantitative by considering the confidence P({x}|H_hypo) = ∏_{k=1}^S p_hypo(x_k) that we assign to each hypothesis. Applying Bayes' theorem, after S events we have the likelihood ratio χ_S = ∏_{k=1}^S [p_Q(x_k)/p_A(x_k)], and our confidence in the hypothesis H_Q becomes P(H_Q|{x}) = χ_S/(1 + χ_S). This test requires the evaluation of permanents of n × n scattering matrices for p_Q(x_k) [49,52], which sets an upper limit to the number of photons that can be studied in practical applications [40][41][42][43][44][45][46][47]. Indeed, it is foreseeable that real-time monitoring or feedback-loop stabilization of quantum optics experiments will only have access to portable platforms with limited computational power. However, an interesting advantage of this validation protocol is its broad versatility, due to the absence of assumptions on the alternative distributions. Importantly, when applied to validate Boson Sampling with distinguishable photons, it requires very few measurements (S ∼ 20) for a reliable assessment. In Fig.
2, for instance, we numerically investigate its application as a function of sample size, extending previous simulations from n = 3 [31] to n = (3, 6, 9, 12) and m = n^2. Data for distinguishable (H_C) and indistinguishable (H_Q) photons were generated using the exact algorithms by Aaronson and Arkhipov [20] and by Clifford and Clifford [48], respectively. The analysis shows that the validation protocol becomes even more effective with increasing n, as it is able to output a reliable verdict after only ∼ 20 events. However, as mentioned, this power comes at the cost of being computationally inefficient in n. Also, it is not possible to preprocess V_B and store information for later re-use, since its confidence depends on the specific U and the sampled events, through p_Q(x_k). Hence, in the regime n ∼ 25-35 [46,47] it rapidly becomes harder to perform a validation in real time. Eventually, since classical supercomputers cannot assist quantum experiments in everyday applications, V_B becomes prohibitive from n ∼ 35.
[Fig. 2 caption: Confidence P(H_Q|{x}) of the Bayesian test to accept, as a correct Boson Sampling experiment, events sampled using distinguishable (C, green) [20] and indistinguishable (Q, red) [48] n-photon states from m-mode interferometers. Note how the curves become steeper for increasing n (n = 3, 6, 9, 12 and m = n^2), making the test progressively more sample-efficient. Inset: Bayesian protocol applied to test Q against the Mean-Field sampler (MF, orange) [25] for n = 4 photons and m = n^2. Curves are obtained by numerically sampling 10^4 n-photon events, averaging over 50 random reshufflings of these events and over 100 different Haar-random unitary transformations (shaded regions: one standard deviation).]
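As a minimal illustration of the Bayesian update rule, the sketch below tracks the confidence P(H_Q|{x}) event by event; the six-outcome distributions are hypothetical toy data, not Boson Sampling probabilities, and the likelihood ratio is accumulated in the log domain for numerical stability.

```python
import numpy as np

rng = np.random.default_rng(0)

def bayesian_confidence(events, p_q, p_a):
    """Running confidence P(H_Q|{x}) after each event, from the likelihood
    ratio chi_S = prod_k p_Q(x_k)/p_A(x_k), accumulated as a log-sum."""
    log_chi = np.cumsum(np.log(p_q[events]) - np.log(p_a[events]))
    return 1.0 / (1.0 + np.exp(-log_chi))  # equals chi_S / (1 + chi_S)

# Toy six-outcome distributions (hypothetical, for illustration only).
p_q = np.array([0.30, 0.25, 0.20, 0.10, 0.10, 0.05])  # "indistinguishable" model
p_a = np.full(6, 1 / 6)                               # uniform alternative
events = rng.choice(6, size=200, p=p_q)               # data actually drawn from H_Q
confidence = bayesian_confidence(events, p_q, p_a)
```

Since the events are drawn from the H_Q model, the confidence curve drifts toward 1, mirroring the behaviour of the curves in Fig. 2.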

B. Statistical benchmark for large-scale experiments
In the previous section we described how the Bayesian test is effective in validating small- and mid-scale experiments with very few measurement events. However, the evaluation of permanents hinders its application for large n, be it due to too-large scattering matrices or to the need for speed in real-time evaluations. To overcome this limitation, further validation protocols have been proposed in the last few years, seeking a convenient compromise between predictive power and physical resources. All these approaches have their own strengths and limitations, and tackle the problem from different angles [16], e.g. using suppression laws [24][25][26][27][28], machine learning [29,33] or statistical properties related to multi-particle interference [30]. In this section we focus on the latter protocol, which arguably represents the most promising solution for the reasons we outline below.
Statistical benchmark with finite sample size. Validation based on the statistical benchmark (V_S) looks at statistical features of the C-dataset, the set of two-mode correlators C_ij = ⟨n̂_i n̂_j⟩ − ⟨n̂_i⟩⟨n̂_j⟩, where (i, j) are distinct output ports and n̂_i is the bosonic number operator of mode i. Two statistical features that are effective at discriminating states with indistinguishable and distinguishable photons are its normalized mean NM (the mean divided by n/m^2) and its coefficient of variation CV (the standard deviation divided by the mean). For any unitary transformation and input state we can retrieve a point in the plane (NM, CV), where alternative models tend to cluster in separate clouds located via random matrix theory (Fig. 3a) [30]. Validation based on V_S then consists in (i) collecting a suitable number S of events, (ii) evaluating the experimental point (NM, CV) associated with the C_ij and (iii) identifying the cluster that the point is assigned to. For S sufficiently large, the point will be attributable with high confidence to only one of the models, thus ruling out the others (Fig. 3b).
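A direct estimate of the C-dataset and of the (NM, CV) point from measured occupation numbers can be sketched as follows; the multinomial sample is a toy stand-in for real Boson Sampling data, and the NM normalization follows the n/m^2 convention described above.

```python
import numpy as np
from itertools import combinations

def nm_cv(samples, n, m):
    """(NM, CV) point of the C-dataset of two-mode correlators
    C_ij = <n_i n_j> - <n_i><n_j>, estimated from an (S, m) array of
    photon-number outcomes."""
    mean_n = samples.mean(axis=0)                                      # <n_i>
    second = (samples[:, :, None] * samples[:, None, :]).mean(axis=0)  # <n_i n_j>
    C = second - np.outer(mean_n, mean_n)
    # keep only distinct output-port pairs (i, j), i < j
    c_data = np.array([C[i, j] for i, j in combinations(range(m), 2)])
    nm = c_data.mean() / (n / m**2)    # normalized mean
    cv = c_data.std() / c_data.mean()  # coefficient of variation
    return nm, cv

# Toy stand-in data: n = 2 photons thrown uniformly into m = 4 modes.
rng = np.random.default_rng(1)
samples = rng.multinomial(2, [0.25] * 4, size=500)
nm, cv = nm_cv(samples, n=2, m=4)
```

For this toy multinomial source the fixed total photon number induces anticorrelations between modes, so NM comes out negative; the clouds of Fig. 3a arise from repeating this evaluation over many Haar-random unitaries.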
V_S represents the state of the art for validation protocols that do not require the evaluation of permanents. Indeed, this approach has several advantages [39]: (a) it is computationally efficient (one only needs to compute two-point correlators), (b) it can reveal deviations from the expected behaviour (manifest in the NM-CV plane), (c) it makes more reliable predictions for larger n (the clouds become more separate), (d) it is sample-efficient (the clouds separate relatively early, after few measurement events). However, despite points (c, d) above, in actual conditions the experimental point is not always easy to validate. In fact, as mentioned in point (b), hardware imperfections and partial distinguishability make the point move away from the average route shown in Fig. 3a. These issues can be addressed and mitigated by numerically generating, for a fixed sample size S, clouds from unitary transformations that take these aspects into account. As suggested in Ref. [39], and more closely investigated in Fig. 3b,c, a convenient approach is to employ machine learning to assign experimental points to one of the two clouds, with a certain confidence level. Specifically, one can train a classifier on numerically generated data [20,48] for a certain (n, m, S), possibly including error models, and then deploy it for all applications in that regime. In this sense, S can be seen as a label of the model that can classify (validate) data for a given (n, m). This intuition can be extended to a classifier that is trained on data from multiple S (see Fig. 3c), which is likely more practical. For a fixed S, the computational resources needed to sample events from a distribution given by n distinguishable (indistinguishable) photons scale polynomially [20] (exponentially [48]) in n.
However, once trained, this classifier can be considered as an off-the-shelf tool that is readily applicable to validate multi-photon interference with no additional computational overhead, which is ideal for large-size experiments. In Sec. V B, we also discuss how such a classifier can even be combined with other protocols, which search the data for different distinctive structures, to boost its accuracy.
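The cluster-assignment step can be mimicked with a deliberately simplified, off-the-shelf-style classifier: a nearest-centroid rule on hypothetical Gaussian (NM, CV) clouds. The centres, spreads and labels below are illustrative stand-ins, not random-matrix predictions or the neural-network classifiers discussed in the text.

```python
import numpy as np

rng = np.random.default_rng(5)

# Hypothetical (NM, CV) clouds for one fixed (n, m, S); the centres and
# spreads are purely illustrative.
q_cloud = rng.normal(loc=[1.0, 0.6], scale=0.05, size=(500, 2))  # "indistinguishable"
c_cloud = rng.normal(loc=[0.8, 0.4], scale=0.05, size=(500, 2))  # "distinguishable"

centres = {"Q": q_cloud.mean(axis=0), "C": c_cloud.mean(axis=0)}

def validate_point(point):
    """Assign an experimental (NM, CV) point to the nearest cloud centre."""
    return min(centres, key=lambda k: np.linalg.norm(np.asarray(point) - centres[k]))
```

Once the clouds (or a trained classifier) are stored, validating a new experiment reduces to a single lookup, which is what makes the approach attractive for real-time monitoring.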
Finite-size effects in validation protocols. So far, we qualitatively discussed the role of a limited sample size for the validation of multi-photon quantum interference. To provide a more quantitative analysis of finite-size effects for the task of validation, and in particular for V S , in the following we study the scaling of the parameters involved in the above validation protocol with S. The goal of this section is to elaborate on a standard test which should be implemented in all validation protocols, to guarantee their experimental feasibility.
Let us start by considering a fixed unitary circuit U, for which we calculate the correlators C_ij from Eq. (2). Such an evaluation in principle assumes the possibility to collect an arbitrary number of measurement events. In practical applications, however, sample sizes will always be limited. Hence, finite-size effects play a role in the estimation of the above correlators. According to the central limit theorem, the correlator retrieved from the experimental data can be represented as C̃_ij = C_ij + X_ij, where X_ij is a random number normally distributed with zero mean and variance σ²_ij/S. The σ²_ij depend on the unitary evolution U and should either be evaluated from the data or be estimated using random matrix theory. Now, to infer, from noisy C-datasets [30], the centre of the cloud of points in the NM-CV plane, we need to average not only over the Haar measure, but also over the X_ij.
Consequently, we have to assess the impact of finite-size effects on the estimate of the moments (NM, CV). First, since the noise induced by the finite sample size averages out, namely E_X(C̃_ij) = C_ij, the estimated NM coincides, on average, with its ideal value. The estimation of CV is a bit more subtle, because we need to evaluate the mean of C̃²_ij; hence, E_U[E_X(C̃²_ij)] and E_X[E_U(C̃²_ij)] cannot be easily compared, since the latter involves averaging the distribution of the X_ij over the unitary group. However, using the properties of the normal distribution under convex combinations, we can deduce that both orders of averaging yield approximately the same result (and the same scaling in S), in particular once S is large and the distribution is concentrated close to its mean. Numerical simulations for 3 ≤ n ≤ 15 and m = n² indeed confirm this argument (Fig. 4). Specifically, we observe that, upon averaging over different Haar-random unitaries with S events per realization, the deviation of the experimentally measured C̃²_ij from the analytically predicted values decreases as fast as 1/S. Hence, their estimation from finite-size data sets shows no exponential overhead that would hinder a practical application of the validation protocol.
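The 1/S suppression of the finite-sample error can be illustrated with a toy Monte Carlo check, in which two correlated binary "detectors" stand in for a pair of output modes; all parameters below are illustrative.

```python
import numpy as np

rng = np.random.default_rng(2)

def mse_of_correlator(S, trials=2000):
    """Mean squared error of the estimated correlator C = <ab> - <a><b>
    for two correlated binary variables (a toy stand-in for C_ij),
    as a function of the sample size S."""
    c_true = 0.7 * 0.25  # exact covariance of this toy model
    errs = np.empty(trials)
    for t in range(trials):
        a = rng.random(S) < 0.5
        copy = rng.random(S) < 0.7                  # b copies a 70% of the time...
        b = np.where(copy, a, rng.random(S) < 0.5)  # ...else it is independent
        c_est = (a & b).mean() - a.mean() * b.mean()
        errs[t] = (c_est - c_true) ** 2
    return errs.mean()

mse_small, mse_large = mse_of_correlator(100), mse_of_correlator(1600)
```

Increasing the sample size by a factor of 16 shrinks the mean squared error by roughly the same factor, consistent with the Var(X_ij) = σ²_ij/S model above.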

IV. DISCUSSION
Validation of multi-photon quantum interference is expected to play an increasing role as the dimensionality of photonic applications increases, both in the number of photons and modes. To this end, and as notably emphasized by the race towards quantum advantage via Boson Sampling, it is necessary to define a set of requirements for a validation protocol to be meaningful. Ultimately, these requirements should allow to establish strong experimental evidence of quantum advantage that is accepted by the community within a jointly agreed framework.
In the present work, we implement such a program and describe a set of critical points that experimenters will need to agree upon in order to validate the operation of a quantum device. With the goal of building a solid framework for validation, we then discuss a practical approach that applies the most suitable state-of-the-art protocols in realistic scenarios. We report numerical analyses of the application of two key validation protocols, Bayesian hypothesis testing and the statistical benchmark, to finite-size data, providing compelling evidence in support of this approach. A clear and illustrative example for the above considerations is provided in Section V A, where we numerically study the competition between a recent classical simulation algorithm and the statistical benchmark, employed respectively to counterfeit and to validate Boson Sampling, while they process an increasing number of measured output events. The analysis quantifies the general intuition that there must be a trade-off between speed and quality in approximate simulations of Boson Sampling. We also provide a formal analysis of the performance of the validation protocol with finite-size samples, showing that the estimation of the relevant quantities converges quickly to the predicted values. We expect that similar features will be crucial for larger-scale demonstrations and, as such, a key prerequisite to be investigated in all validation protocols.
Finally, in Section V B we introduce a novel approach to validation that can bring together the strengths of multiple protocols. This approach uses a meta-algorithm (AdaBoost) to combine protocols based on machine learning into a single validator with boosted accuracy. This strategy becomes more advantageous for a larger number of such protocols with comparable performance, as well as with very noisy data. To shed some light on the critical aspects of validation, and as a benchmark of the state of the art in this context, we now provide a qualitative analysis inspired by Metropolized independent sampling (M), a recent algorithm to classically simulate Boson Sampling [47]. The idea behind M is reminiscent of the Mean-Field sampler (MF) [25], an adversarial classical algorithm that was capable of hacking one of the first validation protocols [32] with limited classical resources. In the race towards quantum computational supremacy, the introduction of MF has prompted the development of more sophisticated techniques to tackle classical simulations. For instance, besides the Bayesian test (see inset in Fig. 2), the statistical benchmark is also highly effective at validating Boson Sampling against MF (see Fig. 5a). For our scope, the key difference between the two algorithms is that, while for MF the quality of the simulation does not really change over time, M samples from a distribution that gets closer to Q the more events are evaluated (i.e. for a larger S).
The goal of M is to generate a sequence of n-photon events {e_i} from a Markov chain that mimics the statistics of an ideal Boson Sampling experiment. Given a sampled event e_i, a new candidate event e' is efficiently drawn according to the probability distribution of distinguishable photons p_D, and accepted as e_{i+1} with probability A(e'|e_i) = min{1, [p_I(e') p_D(e_i)] / [p_I(e_i) p_D(e')]}, where p_I(e) is the output probability corresponding to event e for indistinguishable photons. While the approach remains computationally hard, since it requires the evaluation of permanents [52,57], the advantage is that only a limited number of them needs to be evaluated to output a new event, rather than the full distribution as in a brute-force approach. Ultimately, after a certain number of steps in the chain, M is guaranteed to sample close to the ideal Boson Sampling distribution p_I [58]. Hence, not only does the sample size S play a key role in improving the reliability of validation protocols, as shown in Sec. III, but it can also be crucial to increase the quality of the outcome of a classical simulation. This is a relevant point to keep in mind, even though M has since been surpassed by an algorithm that is both provably faster and exact [48]. In fact, in the future, novel classical algorithms might be developed [53] that depend on S more efficiently. The aim of our present analysis is to investigate the role of the sample size in a validation of the samples generated by M, via V_S. Indeed, a crucial issue in a hypothetical competition between M and V_S concerns the number of events S available to accept or reject a data set. While larger sets provide deeper information for V_S to identify fingerprints of quantum interference, M, on the other hand, approaches the target distribution p_I as more steps are made along the chain. However, in order to output a large number of events in time T, M requires physical and computational resources that set a limit to the tractable dimension of the problem. We are then interested in the intermediate regime, the one relevant for experiments, to determine whether convergence is reached fast enough to mislead V_S. In the specific case of M, we then need to look at the scaling in n of its hyper-parameters: burn-in (the number B_n of events discarded at the beginning of the chain) and thinning (the number T_n of steps to skip to reduce correlations between successive events). Eventually, the time required to classically simulate Boson Sampling will scale as T = τ_p (B_n + S T_n), where τ_p is the time to evaluate a single scattering amplitude according to Eq. (4). Considering the estimate provided by the supercomputer Tianhe-2 [49], and for fixed (T, S), we find the constraint B_n = α n^{-2} 2^{-n} T − S T_n, where α ∼ 0.8782 × 10^11 c and c is the number of processing nodes. If we assume T_n = 100 [47] for all n and V, we get an estimate of the maximum B_n allowed by (T, S). The key issue is that this estimate does not guarantee that M achieves the target distribution fast enough, since B_n decreases (exponentially) in n. Moreover, the minimum B_n is expected to increase with n, since on average the Markov chain needs to explore more states before picking a good one.
[Fig. 5 caption (fragment): a) Validation against the Mean-Field sampler [25] in an m = 16-mode Haar-random transformation, for S = 10^4 events. Pyramids identify the random matrix prediction for S → ∞ [30]. b) Boson Sampling with indistinguishable or distinguishable photons and Metropolized independent sampling (MIS), with collision events (M) or without (M^MIS_CF, extracted from data in Ref. [47]; M_CF, a data subset of M) for n = 20 photons, m = 400 modes and up to S = 2 × 10^4 events. Curves without collision events (which can be resolved under stronger zoom) have a smoother evolution due to reduced fluctuations in the C-dataset. Note that the statistical benchmark captures the presence of collision events (in Q, C, M), which have an impact on the statistics since the protocol probes two-particle processes.]
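The acceptance rule of M is ordinary Metropolis-Hastings with an independence proposal. The sketch below runs it on a toy four-outcome space: the distributions p_i and p_d are hypothetical stand-ins for p_I and p_D, and the burn-in and thinning values are arbitrary, not the B_n, T_n of the text.

```python
import numpy as np

rng = np.random.default_rng(3)

def metropolized_independence_sampler(p_target, p_prop, S, burn_in=500, thinning=5):
    """Metropolized independence sampler on a toy discrete space: candidates
    are drawn i.i.d. from p_prop (the role of the distinguishable-photon
    distribution p_D) and accepted with probability
    min(1, p_target(e') p_prop(e) / (p_target(e) p_prop(e'))),
    so the chain converges to p_target (the role of p_I)."""
    n_states = len(p_target)
    state = rng.choice(n_states, p=p_prop)
    out = []
    for step in range(burn_in + S * thinning):
        cand = rng.choice(n_states, p=p_prop)
        accept = min(1.0, (p_target[cand] * p_prop[state])
                          / (p_target[state] * p_prop[cand]))
        if rng.random() < accept:
            state = cand
        if step >= burn_in and (step - burn_in) % thinning == 0:
            out.append(state)
    return np.array(out)

p_i = np.array([0.4, 0.3, 0.2, 0.1])  # toy target ("indistinguishable")
p_d = np.full(4, 0.25)                # toy proposal ("distinguishable")
sample = metropolized_independence_sampler(p_i, p_d, S=4000)
```

After burn-in and thinning, the empirical frequencies approach the target; in the real algorithm each acceptance test costs one permanent evaluation, which is exactly where the speed-versus-quality trade-off enters.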
To better clarify the above considerations, we simulate a competition between M and V_S for n = 10 photons in m = 100 modes in Fig. 6. Data for distinguishable and indistinguishable photons were generated with the exact algorithms by Aaronson and Arkhipov [20] and by Clifford and Clifford [48], respectively. The analysis proceeds through five main steps: 1) randomly pick a unitary transformation U according to the Haar measure; 2) simulate the generation of S n-particle output events; 3) extract the C-dataset from these S events; 4) evaluate the corresponding (NM, CV) point and plot it in Fig. 6a; 5) repeat steps 1-4 200 times, to simulate as many different experiments. Upon completion, evaluate the average and variance of P_M and plot them in Fig. 6b. With this analysis, we get a quantitative intuition of how the confidence of a validation changes with S, as does the quality of the classical simulation. Similar behaviour is found for other choices of n and m. In particular, we observe how a stronger thinning (up to T_10 = 100, as in Ref. [47]) is reflected in the quality of the simulation, where M behaves very similarly to the ideal Boson Sampler for small as well as for large sample sizes. Conversely, a faster M that trades quality for speed by computing fewer permanents (T_10 = 10, 30) is more easily detectable by V_S. Constraints due to a speed vs. quality compromise (Fig. 3b,c,d) define a generic scenario for a classical simulation which is run with a specific choice of T and S.
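Step 1) of this pipeline, drawing U from the Haar measure, can be sketched with the standard QR-based recipe (a complex Ginibre matrix followed by a phase correction, as in Mezzadri's construction); the dimension m = 8 below is arbitrary.

```python
import numpy as np

rng = np.random.default_rng(6)

def haar_unitary(m):
    """Draw an m x m unitary from the Haar measure: QR-decompose a complex
    Gaussian (Ginibre) matrix and fix the phases of R's diagonal so that
    the resulting Q is distributed exactly according to the Haar measure."""
    z = (rng.normal(size=(m, m)) + 1j * rng.normal(size=(m, m))) / np.sqrt(2)
    q, r = np.linalg.qr(z)
    phases = np.diag(r) / np.abs(np.diag(r))
    return q * phases  # multiplies column j of q by phases[j]

U = haar_unitary(8)
```

Without the phase correction, plain `np.linalg.qr` output is not Haar-distributed, which would bias the clouds averaged over unitaries.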

B. Combining and boosting validation protocols
So far, validation protocols have always been applied separately and independently. Certainly, this fact shows the multifaceted nature of this line of research, where effective solutions have been developed using very different strategies. Yet, it also reflects its somewhat fragmented state, since each protocol does not benefit from the insights potentially provided by the others. This limitation becomes relevant in realistic scenarios with noise and finite data sets, since each validation protocol suits some tasks better than others, with different degrees of sample efficiency and resilience.
In this section, we present a novel, synergistic approach to validation, which aims at combining the strengths of these protocols to form a joint, enhanced validator. Specifically, we focus on validation protocols that make use of machine learning, and propose to combine them with a meta-algorithm (AdaBoost [59]) that attempts an adaptive boosting of their individual performance. The output of AdaBoost is a weighted sum of the predictions of these learning algorithms ('weak learners'), which are asked, sequentially, to pay more attention to the instances that were incorrectly classified by the previous learners. As long as the performance of each learner is slightly better than chance, the classifier resulting from AdaBoost provably converges to a better validation protocol.
[Fig. 7 caption: Haar-random unitary transformations (a) are simulated using exact algorithms to numerically sample S events from quantum [48] (Q) and classical [20] (C) Boson Sampling (b). c) N_U^Train sets of S events for both Q and C are pre-processed by a collection of validation protocols. Here, we considered the statistical benchmark V_S [30] and the visual assessment V_V [29], which produced N_U^Train input data in the form of, respectively, pairs of moments (NM, CV) and images (using t-SNE [60] for dimensionality reduction). d) Each protocol has its own classifier, in this case a neural network (NN) for V_S and a convolutional neural network (CNN) for V_V. These classifiers are then applied sequentially on the same input data, iteratively adjusting their weights (AdaBoost) to focus on misclassified data. e) The resulting, joint protocol has higher accuracy on test data than each individual classifier.]
We numerically test this approach by combining two validation protocols that employ machine learning: the statistical benchmark V_S [30] (equipped with a simple neural network classifier trained on numerically generated data, as in Fig. 3b,c) and the visual assessment V_V [29], which uses dimensionality reduction algorithms and convolutional neural networks. Here we do not consider the Bayesian approach, since, in its current formulation, it does not fit the framework of machine learning. A schematic description of our proof-of-concept analysis, which we carry out for n = 10 and m = 100, is shown in Fig. 7. Since V_S requires fewer events than V_V to validate ideal, noiseless experiments [20,48], to perform this test we trained V_S on data sets with a tunable amount of noise, purposely assembled to be hard to validate. To this end, samples (S = 2 × 10^3) for 500 Haar-random unitary transformations were constructed by sampling with a certain probability p (or 1 − p) from a Boson Sampler with fully indistinguishable (or distinguishable) photons. This probability p was then varied in time, to simulate, for instance, a periodic drift in the synchronization of the input photons. As expected with these settings, we find that AdaBoost maintains the original accuracy of V_S and V_V when applied to batches of, respectively, V_S and V_V classifiers that are already highly accurate. This is mainly due to the complexity of these classifiers, which are already strong learners and, hence, hard to enhance via AdaBoost. Analogous results are found with mixed batches of V_S and V_V, for which AdaBoost returns a joint classifier that in practice focuses on the most accurate one in the set. A different result is obtained, instead, by combining several weak V_V, for which we purposely spoil the training of the convolutional neural network (accuracy A ∼ 51% instead of A ∼ 98%) by reducing the number of training epochs. In this case, AdaBoost does in fact enhance the accuracy of V_V up to A ∼ 57%.
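The boosting loop itself is compact enough to sketch in full. The version below is discrete AdaBoost over axis-aligned decision stumps on toy 2D data; the data set and the choice of weak learner are illustrative stand-ins, not the NN/CNN classifiers used in the text.

```python
import numpy as np

rng = np.random.default_rng(4)

def fit_stump(X, y, w):
    """Best axis-aligned decision stump (feature, threshold, polarity)
    under sample weights w; labels are +/-1."""
    best = (0, 0.0, 1, np.inf)
    for f in range(X.shape[1]):
        for thr in np.unique(X[:, f]):
            for pol in (1, -1):
                pred = np.where(X[:, f] < thr, -pol, pol)
                err = w[pred != y].sum()
                if err < best[3]:
                    best = (f, thr, pol, err)
    return best

def adaboost(X, y, rounds=30):
    """Discrete AdaBoost: reweight the data toward previously misclassified
    points, then combine the stumps by an alpha-weighted majority vote."""
    w = np.full(len(y), 1.0 / len(y))
    ensemble = []
    for _ in range(rounds):
        f, thr, pol, err = fit_stump(X, y, w)
        err = max(err, 1e-10)
        alpha = 0.5 * np.log((1 - err) / err)
        pred = np.where(X[:, f] < thr, -pol, pol)
        w *= np.exp(-alpha * y * pred)
        w /= w.sum()
        ensemble.append((alpha, f, thr, pol))
    return ensemble

def predict(ensemble, X):
    score = sum(a * np.where(X[:, f] < thr, -pol, pol)
                for a, f, thr, pol in ensemble)
    return np.sign(score)

# Toy data: the true (diagonal) boundary cannot be represented by any
# single stump, so boosting several stumps improves the fit.
X = rng.uniform(-1, 1, size=(400, 2))
y = np.sign(X[:, 0] + X[:, 1])
weak_acc = (predict(adaboost(X, y, rounds=1), X) == y).mean()
boosted_acc = (predict(adaboost(X, y, rounds=30), X) == y).mean()
```

Each weak learner alone is only modestly better than chance, yet the weighted vote is markedly more accurate, which is the mechanism exploited when several imperfect validation protocols are combined.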
In the future, we expect that this approach will prove useful in non-ideal conditions with experimental noise, where validation protocols do not operate in the ideal settings for which they were conceived. Furthermore, the above analyses can show larger boosts if applied to actual experiments that involve structured (non-Haar-random) interferometers, for which protocols such as V_S and V_V can have lower accuracies and different behaviors. Finally, still in non-ideal settings, more favorable boosts can be obtained if new validation protocols are developed that are as sample-efficient as V_S.