Matrix Product State (MPS) Circuit Simulation in Rust

posted on 2 March, 2025 by Dan Vonk

[Image: IBM Quantum Computer] Bells and whistles: unless you have one of these in your cupboard, you might have to settle for classical simulation.

I was recently playing around with a research artefact from the paper Grafeyn which implemented some interesting techniques for circuit simulation. Also, it was written in Rust, so that doubly piqued my interest! Here’s what the paper has to say about itself…

In this paper, we present a hybrid Schrödinger-Feynman technique which takes advantage of sparsity by selectively synchronizing Feynman paths. Our hybrid technique partitions the circuit into kernels (groups of gates) and uses Feynman simulation within each kernel. It then synchronizes across the kernels by using Schrödinger-style simulation.

Background

So there are two main ways to simulate quantum circuits on classical machines: Schrödinger-style and Feynman-style simulation. Both of these have trade-offs in terms of time and space requirements. Schrödinger simulation is the simpler scheme to understand: the unitary transformation for each gate in the circuit is applied directly to the full state vector. Note that in practice, simulators apply each gate locally to the qubit it acts on, without ever forming a global 2^n × 2^n unitary over all qubits. Nevertheless, since the amplitude of every basis state must be stored, the space complexity is 𝒪(2^n) for n qubits regardless of the number of gates, whilst the time complexity is 𝒪(m ⋅ 2^n) for m gates.
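To make this concrete, here is a minimal Schrödinger-style sketch (real-valued amplitudes only, for brevity; a complete simulator would use complex numbers): a single-qubit gate is applied in place to a full state vector of 2^n amplitudes, updating every pair of amplitudes that differ only in the target qubit.

```rust
// Apply a real-valued single-qubit gate to a full state vector of 2^n
// amplitudes, in place. Qubit 0 is the least significant bit of the index.
fn apply_1q_gate(state: &mut [f64], gate: [[f64; 2]; 2], qubit: usize) {
    let stride = 1 << qubit; // distance between paired amplitudes
    let mut i = 0;
    while i < state.len() {
        for j in i..i + stride {
            let a0 = state[j];
            let a1 = state[j + stride];
            state[j] = gate[0][0] * a0 + gate[0][1] * a1;
            state[j + stride] = gate[1][0] * a0 + gate[1][1] * a1;
        }
        i += 2 * stride;
    }
}

fn main() {
    let s = 1.0 / 2.0_f64.sqrt();
    let h = [[s, s], [s, -s]]; // Hadamard gate
    let mut state = vec![1.0, 0.0, 0.0, 0.0]; // |00> on two qubits
    apply_1q_gate(&mut state, h, 0);
    println!("{state:?}");
}
```

The space cost is the length of `state` (2^n), paid up front regardless of how many gates follow.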

The other approach to quantum circuit simulation is Feynman-style simulation, which is based on the path integral formulation of quantum mechanics: a path is a sequence of intermediate basis states that the system might traverse as the gates are applied. Once all possible paths have been enumerated, their contributions are summed, and this sum gives the amplitude of the chosen output state. The advantage of this method is that it does not require storing the entire (large) state vector. Memory is only needed to track the intermediate values along each path, which takes 𝒪(n) space for a single path over n qubits. However, by contrast, the time complexity is much higher, as all 2^n possible paths must be looped over.
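A minimal Feynman-style sketch for a single qubit: the amplitude ⟨out| G_m ⋯ G_1 |in⟩ is computed by branching over every intermediate basis state and summing the product of matrix elements along each path, so no state vector is ever stored (real-valued gates only, for brevity).

```rust
// Feynman path sum for one qubit: recurse over the remaining gates,
// branching over both basis states at each step. Only the current position
// along one path is held in memory.
fn path_amplitude(gates: &[[[f64; 2]; 2]], input: usize, output: usize) -> f64 {
    fn go(gates: &[[[f64; 2]; 2]], cur: usize, output: usize) -> f64 {
        match gates {
            // no gates left: the path contributes 1 iff it ends at `output`
            [] => if cur == output { 1.0 } else { 0.0 },
            // branch over both basis states the qubit can take after this gate
            [g, rest @ ..] => (0..2).map(|next| g[next][cur] * go(rest, next, output)).sum(),
        }
    }
    go(gates, input, output)
}

fn main() {
    let s = 1.0 / 2.0_f64.sqrt();
    let h = [[s, s], [s, -s]];
    // H . H = I, so the paths from |0> interfere back to |0>
    println!("{}", path_amplitude(&[h, h], 0, 0));
}
```

The recursion tree makes the exponential blow-up visible: each branching gate doubles the number of paths to sum.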

Therefore, neither of these methods dominates the other, which leads to the idea that they could be used in conjunction. The Grafeyn paper does just this, with the additional realisation that Feynman simulation becomes inefficient when the computation branches significantly and then returns to a single point (“combinatorial explosion”), whereas Schrödinger simulation is more efficient in these cases as only the state vector must be stored. However, many parts of circuits have no branching at all (such as those made up of CX, i.e. CNOT, gates). Therefore it is possible to group these non-interfering sub-circuits into “kernels”, which can be simulated using the Feynman technique, whilst leaving the interfering parts to be simulated using the Schrödinger technique.

Matrix Product States

An interesting middle ground between these two methods is tensor networks, a collection of methods for representing wave functions in a more compact manner. The simplest is the matrix product state (MPS), a one-dimensional chain of tensors, each connected to its neighbours by bond indices. I thought it would be interesting to compare MPS with the hybrid method from the paper, so I implemented it alongside the existing code in Rust.

Back to MPS: instead of storing the 2^N amplitudes c_{i1, …, iN} (one per basis state) explicitly, it decomposes them into a product of N tensors

c_{i1, i2, …, iN} = A^{i1} A^{i2} ⋯ A^{iN}

Each of these corresponds to one of the qubits (called a site) and is a rank-3 tensor A^[k](α_{k−1}, i_k, α_k). The index α_{k−1} is the left bond index and represents the entanglement between the current site and all sites to its left. Meanwhile, i_k ∈ {0, 1} is the physical index and corresponds to the state 0 or 1 at this site. Finally, α_k is the right bond index and carries the same information as the subsequent site's left bond index.
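As a sketch of this decomposition (plain f64 matrices rather than a tensor library), a single amplitude can be reconstructed by picking, at each site, the matrix selected by its physical index i_k and multiplying them together left to right:

```rust
// Reconstruct one MPS amplitude as a product of per-site matrices.
// Plain Vec<Vec<f64>> matrices for brevity.
type Mat = Vec<Vec<f64>>;

fn matmul(a: &Mat, b: &Mat) -> Mat {
    let (n, k, m) = (a.len(), b.len(), b[0].len());
    let mut c = vec![vec![0.0; m]; n];
    for i in 0..n {
        for p in 0..k {
            for j in 0..m {
                c[i][j] += a[i][p] * b[p][j];
            }
        }
    }
    c
}

// sites[k] holds (A^[k] for i_k = 0, A^[k] for i_k = 1)
fn amplitude(sites: &[(Mat, Mat)], bits: &[usize]) -> f64 {
    let mut acc: Mat = vec![vec![1.0]]; // boundary bonds have dimension 1
    for (site, &b) in sites.iter().zip(bits) {
        acc = matmul(&acc, if b == 0 { &site.0 } else { &site.1 });
    }
    acc[0][0]
}

fn main() {
    let s = 1.0 / 2.0_f64.sqrt();
    // product state |+>|0>: no entanglement, so every matrix is 1x1
    let sites = vec![
        (vec![vec![s]], vec![vec![s]]),     // site 0: |+>
        (vec![vec![1.0]], vec![vec![0.0]]), // site 1: |0>
    ];
    println!("{}", amplitude(&sites, &[0, 0]));
}
```

For a product state every bond has dimension 1; entanglement between sites is what forces the bond dimensions, and hence the matrices, to grow.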

In order to model the MPS form in code, a numerics library is needed so that tensors can be represented and a singular-value decomposition (SVD) can be applied to them to bring the tensors back into MPS form once a gate has been applied. As Grafeyn is written in Rust, this choice was more complicated than initially expected. The most common crate in the Rust ecosystem for representing tensors is ndarray. However, this crate does not contain any linear algebra operations such as the SVD, so a companion crate, ndarray-linalg, must be used to provide these features. This sounds acceptable in theory, but some problems were encountered while actually trying to use these crates! Firstly, ndarray-linalg is not compatible with the latest version of ndarray (quite strange, I must say), so an older version must be used, which means important library features are missing. For example, ndarray stores its tensors in row-major format, but when one wants to apply the SVD to a tensor, ndarray-linalg calls its bindings into the OpenBLAS C library, which uses column-major (“Fortran”) format. No automatic conversion between the formats occurs, so it must be done manually. Unfortunately, this older version of ndarray does not support the into_col_major functions, so the conversion has to be done by hand by transposing the tensors. This is also not completely trivial, as the tensor objects in this library are kept as “views” into the underlying data, so it is difficult to know when the actual operations will be executed. This was all pretty disappointing for a Rust fan, as I know it would have been trivial in C++ with a library like Eigen.

Due to all of these difficulties, I decided to avoid using any tensor libraries and decompose the rank-3 tensors into collections of matrices, which allows the use of the nalgebra crate, which supports SVD natively and does not require any conversions from row-major format:

pub struct MPSState {
    // The |0> and |1> physical components of each site tensor, respectively.
    pub tensors: Vec<(DMatrix<Complex>, DMatrix<Complex>)>,
    // Bond dimensions between sites: (dL, dR)
    pub bond_dims: Vec<(usize, usize)>,
    // Number of sites (qubits)
    pub n_sites: usize,
}

Given the MPSState representation, it is now possible to apply gates to the matrix product state. The simplest gates to apply are ones which operate only on one qubit, such as the T, H or Sdg gates. Here one simply needs to identify the site which the gate operates on and apply the gate’s unitary matrix to the physical dimensions of the tensors:

let (tensor_0, tensor_1) = &mut self.tensors[site];
let mat = &gate.mat;

// Apply the gate to the |0> and |1> components of the site
let new_tensor_0 = tensor_0.clone() * mat[(0, 0)]
                 + tensor_1.clone() * mat[(0, 1)];
let new_tensor_1 = tensor_0.clone() * mat[(1, 0)]
                 + tensor_1.clone() * mat[(1, 1)];

*tensor_0 = new_tensor_0;
*tensor_1 = new_tensor_1;

However, applying a two-qubit gate is significantly more complex and is a bit too long to be shown as a code snippet. The first case to consider is when a gate is applied to two adjacent qubits and goes something like this:

  1. Input: site indices l and r and the gate g. Check l < r, else swap the indices.
  2. Retrieve the tensor pairs for both sites from the MPS state.
  3. Build the joint tensor A of shape (dL ⋅ dR, 4) by iterating over the site tensors and copying their values.
  4. Apply the gate: A = A ⋅ g.
  5. Reshape A to (dL ⋅ 2, dR ⋅ 2) by manual iteration, otherwise the sites will become permanently fused after the SVD is applied.
  6. Perform the SVD A = UΣVᵀ and truncate the singular values Σ to the maximum number allowed by the threshold defined in the configuration.
  7. Scale the columns of U and the rows of Vᵀ by the square roots of the singular values (a balanced splitting of the entanglement information between the two sites).
  8. U now has shape (dL ⋅ 2, |Σ|); split it into two (dL, |Σ|) matrices, one per physical dimension, and store them back in the left site.
  9. Vᵀ has shape (|Σ|, dR ⋅ 2); likewise split it into a slice per physical dimension and store them back in the right site.

This process only works for applying gates to adjacent qubits. If the sites are not adjacent, they must be swapped towards each other until they become adjacent, which is done by inserting Swap gates into the gate stream. Once the gate has been successfully applied, the swaps are undone in reverse order until the sites are back in their original positions.
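A sketch of this routing logic, assuming the convention of moving the right site leftwards until it sits next to the left one (the returned pairs are applied in order before the gate, then undone in reverse afterwards):

```rust
// Produce the sequence of adjacent Swap positions (k, k+1) that moves
// site r leftwards until it is adjacent to site l.
fn routing_swaps(l: usize, r: usize) -> Vec<(usize, usize)> {
    assert!(l < r);
    (l + 1..r).rev().map(|k| (k, k + 1)).collect()
}

fn main() {
    // bring site 3 next to site 0: swap (2,3), then (1,2)
    println!("{:?}", routing_swaps(0, 3));
}
```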

Grafeyn also supports gates that affect three sites, such as CSwap or CCX. Although there are advanced methods that are able to apply these gates directly to the MPS state, I took the easier path and used algebraic rewriting patterns to decompose them into one- and two-qubit gates, which greatly simplified the process.
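One such rewriting pattern is the textbook decomposition of CCX (Toffoli) on qubits (a, b, c) into H, T, T† and CX gates, sketched here as a plain gate list (the `Gate` enum is illustrative only, not Grafeyn's actual gate type):

```rust
// The standard 15-gate Toffoli decomposition into single-qubit gates
// (H, T, T-dagger) and CX, as found in quantum computing textbooks.
#[derive(Debug, PartialEq)]
enum Gate {
    H(usize),
    T(usize),
    Tdg(usize),
    Cx(usize, usize),
}

fn ccx(a: usize, b: usize, c: usize) -> Vec<Gate> {
    use Gate::*;
    vec![
        H(c),
        Cx(b, c), Tdg(c),
        Cx(a, c), T(c),
        Cx(b, c), Tdg(c),
        Cx(a, c), T(b), T(c), H(c),
        Cx(a, b), T(a), Tdg(b),
        Cx(a, b),
    ]
}

fn main() {
    println!("CCX expands into {} gates", ccx(0, 1, 2).len());
}
```

Every gate in the expansion touches at most two sites, so the adjacent-site machinery above is enough to apply it.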

Performance

The performance of the existing simulators was compared to the new MPS simulator on some example circuits. The first circuit, cascading entanglement, has 10 qubits and starts with an H gate on site 0; it then entangles this qubit with the rest of the qubits using CX gates. The second experiment, grover_n2, is taken from QASMBench and is the well-known quantum search algorithm, able to search a list in 𝒪(√N) time rather than the 𝒪(N) required on a classical computer. The third experiment, rotations, uses several rotation gates but does not create entanglement between distant qubits. Finally, VQE, another standard circuit taken from QASMBench, approximates the lowest eigenvalue of a Hamiltonian, a common operation in quantum computing. As seen in the chart, the performance of the MPS simulator usually lies somewhere between the sparse and dense simulators. This was expected, since many of these experiments have moderate entanglement: the performance of the MPS simulator is directly related to the dimensions of its tensors, which are governed by the maximum bond dimension χ, itself determined by the amount of entanglement in the system.

MPS simulation compared with the existing Grafeyn simulators.
Runtime performance on selected experiments

One of the limitations of the current MPS setup is that it is hard to achieve good parallelism. It is relatively trivial to apply 1Q gates in parallel, as long as they touch different sites. However, as soon as 2Q gates need to be applied, things get more complicated, in part because the whole state might need to be locked if the gate operation must swap two sites from opposite ends of the MPS until they become adjacent, preventing the application of any other gates. By contrast, both the dense and sparse simulators make use of parallelism, and the dense one can even use CUDA through the array language Futhark. Nevertheless, switching to and from MPS might still make sense in some scenarios, e.g. dense states with low entanglement.