A membrane parallel rapidly-exploring random tree algorithm for robotic motion planning

In recent years, incremental sampling-based motion planning algorithms have been widely used to solve robot motion planning problems in high-dimensional configuration spaces. In particular, the Rapidly-exploring Random Tree (RRT) algorithm and its asymptotically-optimal counterpart, RRT*, are popular algorithms used in real-life applications due to their desirable properties. Such algorithms are inherently iterative, but certain modules, such as the collision-checking procedure, can be parallelized, providing significant speedups with respect to sequential implementations. In this paper, the RRT and RRT* algorithms are adapted to a bio-inspired computational framework called Membrane Computing, whose models of computation, a.k.a. P systems, run in a non-deterministic and massively parallel way. A large number of robotic applications currently use a variant of P systems called Enzymatic Numerical P systems (ENPS) for reactive control, but there is a lack of solutions for motion planning in this framework. The novel models in this work have been designed using the ENPS framework. In order to test and validate the ENPS models for RRT and RRT*, we present two ad-hoc implementations able to emulate the computation of the models using OpenMP and CUDA. Finally, we show the speedup of our solutions with respect to sequential baseline implementations. The results show a speedup of up to 6x using OpenMP with 8 cores against the sequential implementation, and of up to 24x using CUDA against the best multi-threading configuration.


Introduction
The development of modern autonomous robots poses difficult tasks associated with the interaction between the robot and the external world, see [25,64]. For example, a shape reconstruction system for 3D objects is proposed in [68], and a vision-based navigation system for drones is proposed in [5]. Several studies on the modelling of artificial visual systems have been performed, e.g. a neural model [40] and methods on the functioning of the artificial compound eye [74,73]. The task of detecting and interpreting a moving object, that is, motion tracking, is fundamental for the correct functioning of a robot [8,76].
This article studies a crucially important problem related to motion tracking, namely motion planning. The motion planning problem consists of finding a trajectory to move an agent through a complex environment from a starting point to a desired area, avoiding any obstacle while considering constraints related to the agent such as its shape, kinematics and others. This problem is critical in almost all robot applications since, by definition, a robot is a machine performing tasks in the real world. Furthermore, the problem is also relevant in many other research areas such as verification, computational biology, and computer animation [34]. The first computational complexity results for robot motion planning were due to J. Reif [63]. Specifically, he established that the cited problem is PSPACE-hard when the final positions of the obstacles are specified. Moreover, this problem belongs to the class PSPACE when it is expressed using polynomials with rational coefficients [12]; that is, this variant is a PSPACE-complete problem. Nevertheless, the robot motion planning problem where the final positions of the obstacles are unspecified has been shown to be NP-hard [22].
Several approximate solutions to the motion planning problem have been proposed. For example, an incremental search algorithm called D* [70] has been used for path planning in real-time environments. In [15], an extension of the classical A* algorithm is given for grids with blocked and unblocked cells. In [48], the authors study the motion planning problem in the challenging context of navigation through the real world, where traditional on-site sign posts could be limited or not available.
A special mention should be given to the category of algorithms that build Rapidly-exploring Random Trees (RRTs) [35]. They are based on the randomized exploration of the configuration space by building a tree where nodes represent reachable states and edges represent transitions. In particular, the RRT* algorithm [29] is able to build an RRT whose paths asymptotically converge, over computation time, to optimal motion planning solutions with respect to a cost function. Several variants of the RRT and RRT* algorithms have been designed. For example, RRT*-Smart [44] is a method for accelerating the convergence rate of RRT*. A*-RRT and A*-RRT* [10] are two-phase motion planning methods that use a graph search algorithm to find an initial feasible path in a low-dimensional space in a first phase, and then focus the RRT* search in the second phase. RRT*-FN [4] is an RRT* with a fixed number of nodes, which randomly removes a leaf node in every iteration. Informed RRT* [21] improves the convergence speed of RRT* by introducing a heuristic. While the base RRT algorithms are inherently sequential, some modules can be parallelized, such as the obstacle collision detection [9,20]. An experimental comparison is proposed in [46].
Membrane Computing [58,24,61,62] is a computing paradigm inspired by living cells, and it provides distributed, massively parallel devices, see [1,66,67]. Models in Membrane Computing are generically called P systems and they have been used in different contexts.
Among others, we stress the following: (a) showing the ability of some models to give polynomial time solutions to computationally hard problems, by trading space for time; (b) providing a new methodology to tackle the P versus NP problem [57,50,37,69]; (c) being a framework for modelling biomolecular processes as well as real ecosystems [60,23,7,14]; (d) incorporating fuzzy reasoning in models that mimic the way that neurons communicate with each other by means of short electrical impulses, and applying them to many different industrial applications related to fault diagnosis [72].
In this paper, Membrane Computing is used to design bio-inspired parallel RRT models that can be efficiently simulated by means of parallel software/hardware architectures such as OpenMP [49] and CUDA [47]. A variant of P systems called Enzymatic Numerical P systems (ENPS, for short) has been used to model and simulate robot controllers [54,55,77,19]. An extension of ENPS, called RENPSM, was used for the first time in [56] to simulate basic RRT algorithms.
The main contribution of this work is the use of the ENPS framework for modelling the RRT and RRT* algorithms. It is worth pointing out that the major advantage of using P systems is the inherent parallelism of the theoretical computing devices that they provide. Moreover, no ingredients beyond the ENPS framework have been included, unlike in [56], where features such as proteins and shared memory were used. In consequence, the presented models are compatible with existing ENPS robot controllers, given that they belong to exactly the same framework. In order to validate and test these models, two ad-hoc simulators have been implemented in OpenMP and CUDA, showing experimental results in four scenarios. Hence, the parallel implementations show that the introduced ENPS-based models can be easily parallelized for multicore and manycore high-end servers as well as on-board platforms.
The rest of this paper is structured as follows. The next section summarizes some preliminary concepts. Section 3 is devoted to presenting two specific ENPS models, for RRT and RRT*. The simulators implemented in OpenMP and CUDA are described in Section 4. The experimental results for validation and testing are presented in Section 5. The paper ends with the main conclusions of the work.

Preliminaries
This section provides the reader with the basic concepts and notation used throughout this paper.

Motion planning
In general terms, the problem of motion planning can be defined in the configuration space of a mobile agent as follows. Given:
- An initial configuration state.
- A set of valid final configuration states.
- A map of obstacles in the environment.
- A description of the agent shape.
- A description of the agent kinematics.
Find a sequence of configuration states through the configuration space, a.k.a. trajectory or plan, from the initial state to one of the final states, which does not touch any obstacle in the environment considering the agent shape and kinematics.
There are two variants of the problem:
- The feasibility problem is to find a feasible trajectory, if one exists, and report failure otherwise.
- The optimality problem is to find a feasible trajectory with minimal cost, where the cost of a trajectory is given by a computable function.
In a robot navigation software architecture, the module in charge of solving the motion planning problem is called the Global Planner. It computes trajectories by means of an anytime algorithm, i.e., an algorithm that can return a valid solution to the problem even if it is interrupted before it ends. After that, the Local Planner module generates motion commands to follow the trajectory in a safe manner, considering the information given by the sensors in real-time. Finally, the Controller module manages the power of the actuators to fit each motion command and maintain a constant velocity until the next command, see [11].

Rapidly-exploring random trees
An RRT [35] is a randomized tree structure for rapidly exploring the obstacle-free configuration space. It has successfully been used to solve nonholonomic and kinodynamic motion planning problems [36]. Nodes in an RRT represent possible reachable states; edges represent transitions between states. The root of an RRT is the initial state. Each state in an RRT can be reached by following the trajectory from the root to the corresponding node, as can be seen in Figure 1. Algorithms used in robotics to generate RRTs are known to be anytime and any-angle, i.e., the produced trajectories may contain turns at any valid angle considering the robot constraints and kinodynamics. The first algorithm to generate RRTs, called the RRT algorithm [35], was able to solve the feasibility problem for motion planning in robotics.

The RRT* algorithm
The RRT* algorithm [29] is a variant of the previous algorithm which is able to build an RRT whose branches asymptotically converge, over computation time, to optimal motion planning solutions with respect to a cost function, i.e., it is able to solve the optimality problem given infinite computation time. Nevertheless, the algorithm solves the feasibility problem and provides good solutions with respect to the cost function in an anytime and any-angle fashion [29].

Membrane Computing
P systems are theoretical computing devices biologically inspired by the structure and processes taking place in living cells [58]. As mentioned above, Membrane Computing is the field that studies these devices. The syntactic ingredients of P systems are an abstraction of the chemical components and compartments of cells: a set of membranes delimiting regions where multisets of objects reside and evolve according to a set of rewriting rules. The type of membrane structure and rules depends on the class of P system, which can be cell-like (rooted tree), tissue-like (undirected graph) or neural-like (directed graph). So far, a wide variety of variants have been defined within each class, each answering to different semantics. In general, a computation of a P system is a sequence of configurations produced by transition steps in which rules are executed and modify the state of the system. P systems are inherently non-deterministic and massively parallel devices, given that different sets of rules might be applicable at the same time, and each such set may include a large number of rules that are executed in parallel. A global clock controls each transition step.
Rules can modify only the objects in the region where they are defined, or exchange objects with a neighbor region, or modify the membrane structure by dividing or dissolving compartments, etc. Membranes can also be enriched by associating electrical charges to them. The objects that can appear in the membranes are defined in an alphabet. Several different semantics have been defined for P systems. Usually, the rules of a P system are executed in a maximally parallel way; that is, any rule that can be executed in a transition step must be executed and cannot remain unselected. On the other hand, it is possible to use a minimally parallel policy of execution; that is, at least one rule within each region must be executed. Some types of rules might be incompatible with the execution of others (e.g. division and send-out rules in active membranes), and this is given by a derivation mode.
Main research concerning P systems focuses on their computational power and efficiency, see [27,6,39]. In this sense, they have been used as a tool to attack the P vs NP problem by sharpening the frontiers of tractability [57,50,37]. Moreover, their flexibility and massive parallelism have been employed to define modelling frameworks for real-life phenomena, such as biological systems at both micro (e.g. signalling pathways, bacteria colonies) [60,23] and macro (e.g. ecosystems, population dynamics) [7,14] levels, physical systems, image processing, robot control, economic systems, etc. In general, implementations of P system solutions should exploit parallelism, given that this component is natural and inherent to the model. These applications have normally been accompanied by software simulators in order to ease validation and virtual experimentation processes [16,71]. A recent trend is to accelerate these simulations by using parallel devices like GPUs [41,42,13].

Numerical P systems
Numerical P systems (NPS), see [52,75], were introduced in [59] to model economic and business processes. In NPS, the concept of multisets of objects is replaced by numerical variables that evolve from initial values by means of production functions and repartition protocols. A numerical P system is formally expressed by

Π = (m, H, µ, (Var_1, Pr_1, Var_1(0)), . . . , (Var_m, Pr_m, Var_m(0)))

where:
- m is the number of membranes, m ≥ 1;
- H is an alphabet of labels, containing m symbols;
- µ is the membrane structure;
- Var_i is the set of variables for compartment i, being Var_i(0) their initial values;
- Pr_i is the set of programs for compartment i.

A program has the following syntax:

F(x_1, . . . , x_k) → c_1|v_1 + . . . + c_n|v_n

where:
- The left-hand side of the program is called the production function and the right-hand side is called the repartition protocol.
- v_1, . . . , v_n are output variables. The output value of F will be distributed among the output variables according to the repartition protocol given by c_1, . . . , c_n.
- c_1, . . . , c_n are numeric values representing the portion of the output which is going to be assigned to the corresponding variable.

For the sake of simplicity, in the rest of this paper we will write F(x_1, . . . , x_k) → v when there is one and only one output variable v in the repartition protocol.

Enzymatic numerical P systems
Enzymatic Numerical P systems (ENPS) extend NPS with conditions controlling the application of programs: each program can have an associated condition function, and the program is applied in a computation step only if its condition evaluates to true. For the sake of simplicity, it is not necessary to write the condition function when it is the constant function Cond() = true. Figure 2 shows an ENPS model generating the first 100 terms of the Fibonacci sequence, as a toy example illustrating how ENPS models are written in this paper: a box defines a membrane with a label, followed by lines defining the variables with their initial values and lines defining the programs of the compartment; inner boxes define inner membranes.
The membrane structure is designed to organize programs and variables in modules. All the variables are considered global, i.e., any program can read or write a variable regardless of the compartment where the variable is defined. The definition of a variable together with its initial value is given by the syntax x[v], where x is a variable and v is a numeric value. The reserved word input can be written instead of a numeric value for v when the initial value should be read from external inputs, e.g., robot sensors. In this paper, we will use the syntax a[b] in order to refer to the variable a_i, where i is the value stored in variable b.
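As an illustration of how such a model evolves, the following C sketch emulates step-by-step the kind of computation performed by the Fibonacci example of Figure 2: at every step a production function is evaluated and its value is repartitioned into a variable. The variable names a and b are illustrative, not necessarily those used in the figure:

```c
#include <assert.h>

/* Emulation (not the ENPS model itself) of a Fibonacci ENPS:
 * each step evaluates the production function F(a, b) = a + b and
 * repartitions values so that a receives the old b and b receives
 * the produced sum. */
static long enps_fibonacci(int steps) {
    long a = 0, b = 1;        /* initial values of the two variables */
    for (int t = 0; t < steps; t++) {
        long prod = a + b;    /* production function F(a, b) = a + b */
        a = b;                /* repartition: b contributes to a     */
        b = prod;             /* repartition: the product goes to b  */
    }
    return b;
}
```

After k steps, b holds the (k+1)-th Fibonacci number, matching the synchronous one-step-per-transition semantics of the model.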

Parallel Computing: OpenMP and CUDA
Since the advent of the transistor, computer processors have been increasing their amount of resources in order to achieve higher efficiency and, thus, computing power [2,65,32]. The recent trend is to incorporate more computing cores within the same chip, so today we can easily find multicore (up to dozens of cores) and manycore (up to thousands of cores) processors on the market [3,53]. The former concerns current CPUs, and the latter involves mainly GPUs. The main standard to harness the parallelism of multicore processors is OpenMP [49], which is managed by the OpenMP ARB consortium. First introduced in the late 90's, it provides an API for common languages such as Fortran, C and C++, and it is supported on the majority of systems. The main advantage of OpenMP is its ease of use, given that the programming is made through directives on top of the code. The execution model is fork-join, where a master thread takes control of the children and synchronizes the created threads. This model is also based on shared memory (all threads have access to the same memory system), although there are variants for distributed memory as well. OpenMP has also worked effectively with standards for distributed computing such as MPI [28], and in fact this combination is widely used in today's clusters and supercomputers [30].
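As a minimal illustration of the fork-join model described above (generic, not code from the simulators), the following C fragment parallelizes a loop with an OpenMP directive; the reduction clause merges the per-thread partial sums at the join point:

```c
#include <assert.h>

/* Fork-join illustration: the master thread forks a team at the
 * parallel directive, the loop iterations are split among threads,
 * and the reduction clause merges per-thread partial sums at the
 * join. Compiled without OpenMP support, the pragma is ignored and
 * the loop runs sequentially with the same result. */
static long parallel_sum(const int *v, int n) {
    long total = 0;
    #pragma omp parallel for reduction(+:total)
    for (int i = 0; i < n; i++)
        total += v[i];
    return total;
}
```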
Nowadays, GPUs can be used as manycore processors to accelerate scientific computation [31,18]. CUDA [47] is the most widely used programming model, given that it was first introduced in 2007 by NVIDIA and has been strongly supported by this company. Although OpenCL is the standard introduced by Khronos for GPU computing, CUDA is still better supported on NVIDIA GPUs. In any case, similar concepts can be found in OpenCL and CUDA. It is a heterogeneous programming model where the CPU (host) and the GPU (device) are separated and thus have different memory spaces. GPUs contain several multiprocessors including computing units and a fast shared memory system, and all multiprocessors are interconnected through a global memory that can also be accessed by the CPU, especially to copy and retrieve data. CUDA provides an abstraction of the GPU in the form of threads that execute the same code, a function called a kernel [45]. Threads are grouped in blocks, so each block is assigned to a multiprocessor, being able to efficiently cooperate locally inside the block. CUDA programmers have to manage these aspects manually: the thread grid and the memory layout. Moreover, threads are executed completely in parallel when they are fully synchronized in the code; that is, they follow a SIMD fashion (single instruction, multiple data). Furthermore, accesses to memory get optimized when threads query contiguous positions of data [45,51]. Therefore, parallel algorithms have to be carefully adapted to this model bearing in mind several efficiency aspects. Some algorithms fit very well to the GPU architecture and so they are employed as primitives and building blocks, such as map (application of a function to all elements of a vector), reduction [33] (calculation of a function over all elements of a vector), scan (accumulative application of a function to the elements of a vector), etc. [31,51].
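The semantics of these building blocks can be fixed with short sequential C reference implementations; the GPU versions differ only in how the work is distributed among threads:

```c
#include <assert.h>

/* Reference (sequential) semantics of the three GPU building blocks
 * mentioned above, instantiated for int arrays. */

/* map: apply a function (here, squaring) to every element. */
static void map_square(const int *in, int *out, int n) {
    for (int i = 0; i < n; i++)
        out[i] = in[i] * in[i];
}

/* reduction: combine all elements with an operation (here, min). */
static int reduce_min(const int *in, int n) {
    int best = in[0];
    for (int i = 1; i < n; i++)
        if (in[i] < best) best = in[i];
    return best;
}

/* scan (exclusive): running accumulation with an operation (here, +). */
static void scan_sum(const int *in, int *out, int n) {
    int acc = 0;
    for (int i = 0; i < n; i++) { out[i] = acc; acc += in[i]; }
}
```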

ENPS models for RRT algorithms
In this section, four ENPS models are presented in order to simulate the behavior of the RRT and RRT* algorithms, taking advantage of the inherent parallelism of the membrane computing framework. We have divided the whole problem into the following subproblems:
• Find the nearest point to a given point according to the Euclidean distance.
• Determine if a given trajectory is obstacle free.
• Simulate the RRT algorithm.
• Simulate the RRT* algorithm.
The main ideas of the P system design are the following:
• To use enzymes in order to synchronize the processes.
• To compute Euclidean distances in parallel.
• To use a parallel reduction process in order to compute minimum distances.
• To add a node to the RRT in each iteration.

Finding the nearest point
Given a set of points X = {(x_i, y_i) : 1 ≤ i ≤ 2^n} and a target point (x_t, y_t), find the nearest point (x, y) in X to the target point according to the Euclidean distance.

Solution for n=3
The solution to the nearest-point problem for n = 3 is shown in Figure 3. The P system computes in one step of computation the squared Euclidean distance from all the points in X to the target point. After that, a reduction operation is conducted in 3 steps to compute the minimum distance as well as the nearest point in X. The computation stops after 4 steps; then halt is set to 1 and the nearest point is stored in (x_nearest, y_nearest). Variable α is used as a step counter.
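The same two phases (one parallel distance step followed by a logarithmic reduction) can be sketched in C as follows; this is an emulation of the model's behaviour, not the ENPS model itself, and the function and array names are illustrative:

```c
#include <assert.h>

/* Emulation of the two phases of the nearest-point model for 2^n_exp
 * points: one step computing all squared distances (parallel in the
 * model), then n_exp halving rounds of a min-reduction that also
 * tracks the index of the winning point. */
static int nearest_by_reduction(const double *x, const double *y, int n_exp,
                                double tx, double ty) {
    enum { MAX = 1 << 12 };
    double d[MAX];
    int idx[MAX];
    int size = 1 << n_exp;

    /* Phase 1: all squared distances at once. */
    #pragma omp parallel for
    for (int i = 0; i < size; i++) {
        double dx = x[i] - tx, dy = y[i] - ty;
        d[i] = dx * dx + dy * dy;
        idx[i] = i;
    }

    /* Phase 2: tournament reduction, halving the active range each
     * round; after n_exp rounds the minimum sits at position 0. */
    for (int half = size / 2; half >= 1; half /= 2) {
        #pragma omp parallel for
        for (int i = 0; i < half; i++) {
            if (d[i + half] < d[i]) {
                d[i] = d[i + half];
                idx[i] = idx[i + half];
            }
        }
    }
    return idx[0];  /* index of the nearest point */
}
```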

General solution
The general solution, for 2^n points, is given in Figure 4. After n + 1 steps, the nearest point is stored in (x_nearest, y_nearest).

Determining whether a trajectory is obstacle free

Given a set of obstacle points and two points (x_0, y_0) and (x_1, y_1), determine if the trajectory following a straight line from (x_0, y_0) to (x_1, y_1) is obstacle free. A trajectory is obstacle free if the distance from the nearest obstacle to the trajectory is greater than or equal to a given parameter ξ, see Figure 5.
With reference to Figure 5:
- The trajectory is represented by the segment [(a_x, a_y), (b_x, b_y)].
- After m + 1 steps, the variable collision contains a value less than or equal to zero if the trajectory is obstacle free.
Algorithm 1 shows the pseudocode of the pDist function.
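Although Algorithm 1 is not reproduced here, a common way to compute such a point-to-segment squared distance, consistent with the description above, is the following sketch (the name pdist_sq and the clamping formulation are assumptions, not necessarily the paper's exact pseudocode):

```c
#include <assert.h>
#include <math.h>

/* A plausible sketch of a pDist-style function: squared Euclidean
 * distance from obstacle point (px, py) to the segment
 * [(x0, y0), (x1, y1)]. Clamping the projection parameter t to
 * [0, 1] keeps the closest point on the segment itself. */
static double pdist_sq(double px, double py,
                       double x0, double y0, double x1, double y1) {
    double vx = x1 - x0, vy = y1 - y0;      /* segment direction     */
    double wx = px - x0, wy = py - y0;      /* point relative to x0  */
    double len2 = vx * vx + vy * vy;
    double t = (len2 > 0.0) ? (wx * vx + wy * vy) / len2 : 0.0;
    if (t < 0.0) t = 0.0;                   /* clamp to the segment  */
    if (t > 1.0) t = 1.0;
    double cx = x0 + t * vx - px;           /* closest point - point */
    double cy = y0 + t * vy - py;
    return cx * cx + cy * cy;
}
```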
The P system computes in one step of computation the squared Euclidean distance from all the obstacles to the segment given by [(x_0, y_0), (x_1, y_1)]. After that, a reduction operation is conducted in m steps of computation, obtaining the minimum distance. In the last step, the variables collision and halt are set. The ENPS model for the RRT algorithm is illustrated in Figure 6.

The RRT algorithm
The inner modules are defined as shown in Figures 7, 8, and 9, where:
- random() returns a random number, independent and identically distributed (i.i.d.), in [0, 1] ∩ R;
- rm(x, y) returns the remainder of the integer division between x and y;
- the coordinates of the RRT nodes will be {(x_i, y_i)} : 1 ≤ i ≤ 2^n;
- the coordinates of the RRT parent nodes will be {(px_i, py_i)} : 2 ≤ i ≤ 2^n;
- the RRT is completely generated and the computation stops when halt = 1.
Figure 6 depicts the main module RRT(n, m, p, q, δ, ξ) together with its inner modules ObstacleFree_RRT(m, ξ) and Extend_RRT(n, m).
The P system generates the point (x_rand, y_rand) in one step of computation. After that, the module Nearest_RRT(n) computes the point (x_nearest, y_nearest) in n + 1 steps of computation, as explained in subsection 3.1. Then, the point (x_new, y_new) is computed according to the δ parameter. The next module to be executed is ObstacleFree_RRT(m, ξ), see subsection 3.2, which computes the squared Euclidean distance from the nearest obstacle point to the segment [(x_nearest, y_nearest), (x_new, y_new)]; if this value is less than the ξ parameter, then the variable collision will contain a value greater than 0 after m + 1 steps of computation. Finally, if the segment is obstacle free, the module Extend_RRT(n, m) updates the RRT. In this way, each node is added to the RRT in m + n + 6 computation steps if the corresponding edge is obstacle free.

The RRT* algorithm
The asymptotically-optimal version of the RRT algorithm, i.e. the RRT* algorithm, is outlined in Figure 10. The first m + n + 5 steps of computation of the ENPS RRT* algorithm are the same as those of the ENPS RRT algorithm, see subsection 3.3. In step m + n + 6, the module Extend_RRT*(n, m, p, q, ξ) generates in one step of computation the squared Euclidean distances from all the obstacles to all the possible segments [(x_j, y_j), (x_new, y_new)], being (x_j, y_j) all the nodes in the RRT, i.e., parent candidates for the new edge. After that, a reduction process computes the minimum distance for each parent candidate in m steps of computation. Then, the costs of all the possible trajectories to (x_new, y_new) are computed, where trajectories passing through obstacles will have a larger cost value than obstacle-free trajectories. After that, the minimum cost as well as the best parent candidate are computed in n steps of computation by applying a reduction process. The RRT is then updated with the new node as well as a new edge from the best parent candidate to the new node. The cost of the new node is also stored. Finally, each node (x_i, y_i) in the RRT computes in one step of computation an alternative cost considering (x_new, y_new) as its parent node. If the alternative cost is less than the current cost and the segment [(x_i, y_i), (x_new, y_new)] is obstacle free, then the parent of (x_i, y_i) is updated to (x_new, y_new). In this way, each node is added to the RRT in 2·m + 2·n + 10 steps of computation if the corresponding edge is obstacle free. The Extend_RRT* module is shown in Figure 11, where c_i is the cost associated with node (x_i, y_i).

Developed software
We have implemented two specific (ad-hoc) simulators using OpenMP and CUDA; each one is able to simulate the ENPS-RRT and ENPS-RRT* models described in Section 3. Let us recall that a specific simulator aims at simulating a certain P system model (as in this case), instead of simulating a whole P system variant (i.e. a generic simulator) [42]. In a specific simulator, the developer encodes the P system directly in the source code, and can implement assumptions for simplicity and efficiency. In any case, the simulator must simulate the model; that is, there should be a way to know the state of the P system at any time. The software can be downloaded at https://github.com/Ignacio-Perez/enps rrt, where the master branch is the OpenMP simulator and the CUDA branch is the GPU version. The conventional RRT and RRT* algorithms have also been implemented as baseline software for testing and validation using C++ and OpenMP. It is available at https://github.com/Ignacio-Perez/openmprrt.

An OpenMP simulator
The OpenMP simulator is able to manage image files in PGM format defining the obstacle maps. Four pre-defined scenarios have been included with the software, as will be explained in Section 5. For each one, a PGM file is included along with a fixed initial position for the robot. The resolution of all maps is 5 cm²/pixel, which is the resolution of the LIDAR sensor used to generate the map1 and ccia h maps.
The software provides a command-line interface in order to specify the scenario to be used, as well as the model (ENPS-RRT or ENPS-RRT*) and the random seed. The output is a new PGM file where an RRT has been drawn over the original map. The number of CPU threads to be used by the OpenMP simulator is configured by the environment variable OMP_NUM_THREADS.
The software allows setting all the parameter values. For the experiments in this paper, the following values have been used:
- n = 12, i.e., 2^12 RRT nodes will be generated. This produces a large enough tree to stress the parallel simulators.
- δ = 15 cm, i.e., the RRT edges will have a length of 15 cm.
- ε = 20 cm, the robot radius.
- m, p, q depend on the specific PGM file.
The selected values for δ and ε are common values for indoor platforms such as the Pioneer DX-2 robot.
The simulator uses data structures to store the values of the ENPS variables at run-time, and C sentences with OpenMP in order to simulate the behaviour of the ENPS programs. That is, the simulator allocates arrays storing the objects for:
• obstacle positions (a and b).
• RRT points and their parents, (x, y, px and py).
• auxiliary objects for reduction at nearest (xp, yp and d) and at obstacleFree modules (d ).
Moreover, the simulator is written in a modular way, where it is easy to identify the different modules of the model:
• Computing (x_rand, y_rand) on the CPU.
• nearest, in order to compute (x_nearest, y_nearest). This is done by:
1. computing squared distances from all points in RRT to (x_rand, y_rand);
2. computing the minimum distance and the nearest point in RRT.
• obstacleFree, in order to compute obstacle collision.
• The RRT* extend module. This is done by:
1. computing squared distances from all obstacles to the segments [(x, y), (x_new, y_new)], where (x, y) are points in RRT;
2. for each point (x, y) in RRT, computing the minimum distance to all obstacles;
3. computing the new cost for all points in RRT;
4. computing the minimum cost and the variable holding that value (that is, an argmin operation);
5. fixing the edges in RRT.
As an example, the next C/OpenMP code computes the squared distances from all the points in the RRT to the (x_nearest, y_nearest) point.
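A minimal sketch of such a loop is the following, where the array and variable names are illustrative rather than the ones used in the actual simulator:

```c
#include <assert.h>

/* One parallel loop filling d with the squared distance from every
 * RRT point (x[i], y[i]) to a given reference point (xr, yr). With
 * OpenMP enabled, the iterations are split among the threads. */
static void squared_distances(const double *x, const double *y, double *d,
                              int n, double xr, double yr) {
    #pragma omp parallel for
    for (int i = 0; i < n; i++) {
        double dx = x[i] - xr, dy = y[i] - yr;
        d[i] = dx * dx + dy * dy;
    }
}
```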

A CUDA simulator
Additionally, we have ported the ad-hoc simulator to the GPU. This enables its parallelization on both high-end and low-end GPUs by using CUDA. Let us recall that this technology provides automatic scalability and portability of the code. Thus, the most computing-intensive part of the simulation is executed on the device, which allows its execution to be accelerated significantly, as we will discuss in the results. First of all, we have replicated on the GPU the data structures employed in the OpenMP simulator.
Simple map functions applied to objects in parallel are implemented by CUDA kernels, optimized by launching 256 threads per block (we have tuned this number experimentally), a number of thread blocks multiple of the amount of streaming multiprocessors on the specific device, and looping by tiles over the objects (contiguous threads work with contiguous positions of arrays, and iterate in this manner until covering all positions). The following parts are implemented in this way:
• At the nearest module, computing squared distances from all points in RRT to (x_rand, y_rand).
• At the RRT* extend module:
* computing squared distances from all obstacles to all candidate segments; this kernel specifically employs a 2-dimensional grid of thread blocks to cover the two nested loops in parallel;
* computing the new cost for all points in RRT, while checking that the minimum distance is greater than epsilon;
* fixing edges.
As for the OpenMP implementation, global minimum calculations can be efficiently implemented by the reduction operation. In CUDA, reduction is a parallel primitive, and there is a plethora of optimized implementations. We have focused only on reduction with the minimum operation over floating-point numbers. We have tested three different libraries in order to select the fastest one: (1) the reduction example from the CUDA Toolkit SDK 10.1 [47], (2) Thrust (as redistributed in CUDA 10.1) [26], and (3) CUB 1.8 [43]. Implementation (1) is based on the optimization ideas using per-warp shuffle operations [38], first available in the Kepler architecture. We have modified the implementation in the SDK by replacing the sum (+) operation with the built-in fminf function in CUDA. Concerning implementation (2), we have used the STL-like library Thrust, which offers an optimized minimum reduction function; we have ensured that the function is launched on the device. Finally, we have also tested the CUB primitives library by NVIDIA labs in implementation (3). This is a set of header files that does not require separate compilation; it suffices to include them in our code. Both Thrust and CUB also have a variant for the argmin operation, required in order to calculate in which position of the array the minimum takes place. For the reduction with minimum operation, CUB outperformed the other two: (3) was three times faster than (1), and slightly faster than (2).
We have therefore used CUB to compute the following reduction operations:
• At the nearest module: compute the minimum distance and the nearest point by using argmin reduction.
• At the obstacleFree module: compute the minimum distance by using min reduction.
• At the RRT* extend module:
* Compute the minimum distance from each point (x, y) in RRT to all obstacles. This requires constructing a matrix with a row per RRT point and as many columns as obstacles. We have used the segmented minimum reduction variant to calculate all the per-row minimums in parallel. This variant performs reduction globally, but it only takes effect inside local segments that are previously defined.
• Compute the minimum cost by argmin.
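The effect of the segmented minimum reduction can be described by the following CPU analogue in C (CUB performs the same computation on the GPU); the matrix is stored row-major, with one row per RRT point and one column per obstacle:

```c
#include <assert.h>
#include <float.h>

/* CPU analogue of a segmented minimum reduction: each row of the
 * row-major matrix m (rows x cols) is one segment, and out[r]
 * receives the minimum of row r. On the GPU, all segments are
 * reduced in parallel. */
static void segmented_min(const double *m, int rows, int cols, double *out) {
    #pragma omp parallel for
    for (int r = 0; r < rows; r++) {
        double best = DBL_MAX;
        for (int c = 0; c < cols; c++) {
            double v = m[r * cols + c];
            if (v < best) best = v;
        }
        out[r] = best;
    }
}
```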
Finally, in order to keep all the computation on the GPU, and hence avoid transferring data from the GPU to the CPU at every iteration, we also store the scalar values x_nearest, y_nearest, x_new and y_new on the device. These are updated by single-thread kernels. The objective is to reduce the amount of memory transfers as much as possible, even if just one thread is in charge of updating single values.

Experimental results
In this section, we show the experiments conducted to test the performance of the ad-hoc simulators for both OpenMP and CUDA. We first analyze the results for the multicore version, selecting the best configuration. We then show the runtimes obtained with the GPU version, and conclude with a comparison of configurations.
The source code of the simulators is written in the C/C++ programming language, and binaries are compiled with GCC 7.4 and NVCC 10.1 using the -O3 flag. The random numbers are always generated on the CPU using seed 1. Execution times are measured with the bash time command, so the reported times include reading and writing the maps as well as the simulation itself with all its overheads. We have used the hardware setup shown below, where we selected a low-end and a high-end GPU, and employed two versions of Intel CPUs corresponding to the processors the GPUs are plugged into. In this way, we validate our simulators and their acceleration by emulating two different situations: (a) the RRT is pre-computed on a server with a high-end GPU and transmitted to the robot before it starts to move, and (b) the RRT is pre-computed using a low-end GPU on board the robot, such as an NVIDIA Jetson.
We have used four different maps for the experiments, shown in Figure 12. They are sorted by increasing complexity, so we expect RRT and RRT* to require more iterations for maze than for map1, for example. For all experiments, we provide two fixed, distant points in the map. The baseline software has been executed on the i7-9700 hardware with configurations of 4 and 8 cores. The speedup for each map, each algorithm and each CPU configuration is shown in Table 1. An average speedup of 1.39x was obtained using the baseline RRT algorithm with 8 cores with respect to the sequential implementation; the speedup was 2.57x in the case of the RRT* algorithm. In Figure 13, the runtimes of the multicore simulator for RRT are displayed. We increase the number of threads in steps of 2, starting at 1 and going up to the total number of cores in the corresponding CPU. We can see that it scales well and the runtime drops proportionally to the number of threads, especially when going from a single thread to half the cores in each CPU. Runtimes from half to all cores remain almost constant; only in maze do we observe some further improvement. The low-complexity map1 is handled quickly on all CPUs, so running several threads is not worthwhile. Finally, runtimes are similar on both CPUs and always below 2 seconds when using all cores. For the i7-8700, a speedup from 1.8x (map1) to 4.5x (maze) is obtained when using all cores against just one core (the sequential version). For the i7-9700, a speedup from 2.2x (map1) to 4.9x (maze) is achieved.
We can see some differences when running the multicore simulator with RRT*, given its higher complexity. The runtimes are plotted in Figure 14. For the i7-8700, half of the cores (6) are enough to saturate the processor; for the i7-9700, the best runtime is achieved with full occupancy of the cores (8). Again, the runtime decreases proportionally to the number of cores. The runtime for the i7-9700 is slightly lower than for the i7-8700, staying below 200 seconds versus 300 seconds for the latter. The map maze requires the highest computation time. For the i7-8700, a speedup from 4x (maze) to 5.2x (map1) can be obtained when using all cores against the sequential version. For the i7-9700, a speedup of around 6.7x is reported for all maps.
In Figure 15, we compare the runtimes obtained with the GPU against the CPU. For this experiment, we consider the best configuration for both CPUs, that is, using all the cores in each processor. For RRT*, the GPU outperforms the CPU in all maps, with consistently very low runtimes. The maze example stands out: the RTX2080 scales very well and requires just 13.21 seconds to construct the RRT* tree, whereas the CPUs require 2 to 4 minutes to complete. In this specific map, the GTX1050 still behaves better than the CPUs. Note that we do not report results for RRT, for which the GPU is always slower than the CPU, mainly because the overhead of performing the reduction on the device is not worthwhile for such a small number of items. For RRT*, a much larger reduction must be executed, so the benefit of using the GPU becomes apparent.
Figure 13. Runtimes of the OpenMP simulator for RRT on the i7-8700 (a) and i7-9700 (b) CPUs. The X-axis shows the number of cores used during execution (1 to 12 for the i7-8700 and 1 to 8 for the i7-9700, both in steps of 2). The Y-axis shows the time in seconds.
Finally, let us consider the best and worst devices: the fastest CPU is the i7-9700 with 8 cores, the slowest CPU is the i7-8700 with 12 cores, the fastest GPU is the RTX2080 and the slowest GPU is the GTX1050. In Figure 16, we show the speedups obtained by comparing the tested devices one by one for RRT*. We can see that the slowest GPU still outperforms the fastest CPU by 1.5x to 2.3x. By contrast, the fastest GPU against the slowest CPU leads to very high speedups, of around 20x. Finally, comparing the fastest CPU with the fastest GPU gives speedups of around 14x for the GPU. It is interesting to note that the GPU always achieves higher speedups for the office-type maps (ccia h and office).
Figure 14. Runtimes of the OpenMP simulator for RRT* on the i7-8700 (a) and i7-9700 (b) CPUs. The X-axis shows the number of cores used during execution (1 to 12 for the i7-8700 and 1 to 8 for the i7-9700, both in steps of 2). The Y-axis shows the time in seconds.
In light of the results, we can draw the following conclusions: the CPU is the best candidate to run the ad-hoc simulator for RRT, while the GPU is always the best candidate to run the ad-hoc simulator for RRT*; and on the CPU, it is always worthwhile to use at least half of the cores of the processor.

Conclusions
In this work we present two ENPS models emulating the computation of the RRT and RRT* algorithms in order to solve the motion planning problem in robotics. The ENPS framework was successfully used in previous works to design and simulate robot controllers, but there is a lack of solutions for global path planning in this framework. The RRT and RRT* algorithms are inherently iterative, but they contain modules that can run in parallel, such as obstacle collision detection. By using a massively parallel model of computation, we provide a solution able to add nodes to an RRT in constant time. Two simulators have been implemented using OpenMP and CUDA, i.e., two parallel computing frameworks in silico. Several processes in the models, such as the computation of the nearest point to a given point by using reduction, have been translated to the simulators in a natural way. Finally, we have validated and tested the models and simulators using four scenarios. The experimental results show a speedup of up to 6x using OpenMP with 8 cores against the sequential implementation, and of up to 24x using CUDA against the best multi-threading configuration. The best speedup of a baseline OpenMP implementation with 8 cores against the corresponding baseline sequential implementation was 2.57x. These results show better parallel scalability when using the membrane computing paradigm, which is a maximally parallel model of computation.
Figure 15. Runtime of the CUDA simulator for RRT* on the GTX1050 and RTX2080 compared to the best multi-threading configuration of the OpenMP simulator on the two CPUs.