========== onnx ==========
========== command ==========
trtexec --onnx=/tmp/datadir/jetson_orin_profiler/fVq2Mr.onnx --best
========== profiling ==========
&&&& RUNNING TensorRT.trtexec [TensorRT v8400] # trtexec --onnx=/tmp/datadir/jetson_orin_profiler/fVq2Mr.onnx --best
[10/09/2023-14:43:51] [I] === Model Options ===
[10/09/2023-14:43:51] [I] Format: ONNX
[10/09/2023-14:43:51] [I] Model: /tmp/datadir/jetson_orin_profiler/fVq2Mr.onnx
[10/09/2023-14:43:51] [I] Output:
[10/09/2023-14:43:51] [I] === Build Options ===
[10/09/2023-14:43:51] [I] Max batch: explicit batch
[10/09/2023-14:43:51] [I] Memory Pools: workspace: default, dlaSRAM: default, dlaLocalDRAM: default, dlaGlobalDRAM: default
[10/09/2023-14:43:51] [I] minTiming: 1
[10/09/2023-14:43:51] [I] avgTiming: 8
[10/09/2023-14:43:51] [I] Precision: FP32+FP16+INT8
[10/09/2023-14:43:51] [I] LayerPrecisions:
[10/09/2023-14:43:51] [I] Calibration: Dynamic
[10/09/2023-14:43:51] [I] Refit: Disabled
[10/09/2023-14:43:51] [I] Sparsity: Disabled
[10/09/2023-14:43:51] [I] Safe mode: Disabled
[10/09/2023-14:43:51] [I] DirectIO mode: Disabled
[10/09/2023-14:43:51] [I] Restricted mode: Disabled
[10/09/2023-14:43:51] [I] Build only: Disabled
[10/09/2023-14:43:51] [I] Save engine:
[10/09/2023-14:43:51] [I] Load engine:
[10/09/2023-14:43:51] [I] Profiling verbosity: 0
[10/09/2023-14:43:51] [I] Tactic sources: Using default tactic sources
[10/09/2023-14:43:51] [I] timingCacheMode: local
[10/09/2023-14:43:51] [I] timingCacheFile:
[10/09/2023-14:43:51] [I] Input(s)s format: fp32:CHW
[10/09/2023-14:43:51] [I] Output(s)s format: fp32:CHW
[10/09/2023-14:43:51] [I] Input build shapes: model
[10/09/2023-14:43:51] [I] Input calibration shapes: model
[10/09/2023-14:43:51] [I] === System Options ===
[10/09/2023-14:43:51] [I] Device: 0
[10/09/2023-14:43:51] [I] DLACore:
[10/09/2023-14:43:51] [I] Plugins:
[10/09/2023-14:43:51] [I] === Inference Options ===
[10/09/2023-14:43:51] [I] Batch: Explicit
[10/09/2023-14:43:51] [I] Input inference shapes: model
[10/09/2023-14:43:51] [I] Iterations: 10
[10/09/2023-14:43:51] [I] Duration: 3s (+ 200ms warm up)
[10/09/2023-14:43:51] [I] Sleep time: 0ms
[10/09/2023-14:43:51] [I] Idle time: 0ms
[10/09/2023-14:43:51] [I] Streams: 1
[10/09/2023-14:43:51] [I] ExposeDMA: Disabled
[10/09/2023-14:43:51] [I] Data transfers: Enabled
[10/09/2023-14:43:51] [I] Spin-wait: Disabled
[10/09/2023-14:43:51] [I] Multithreading: Disabled
[10/09/2023-14:43:51] [I] CUDA Graph: Disabled
[10/09/2023-14:43:51] [I] Separate profiling: Disabled
[10/09/2023-14:43:51] [I] Time Deserialize: Disabled
[10/09/2023-14:43:51] [I] Time Refit: Disabled
[10/09/2023-14:43:51] [I] Inputs:
[10/09/2023-14:43:51] [I] === Reporting Options ===
[10/09/2023-14:43:51] [I] Verbose: Disabled
[10/09/2023-14:43:51] [I] Averages: 10 inferences
[10/09/2023-14:43:51] [I] Percentile: 99
[10/09/2023-14:43:51] [I] Dump refittable layers:Disabled
[10/09/2023-14:43:51] [I] Dump output: Disabled
[10/09/2023-14:43:51] [I] Profile: Disabled
[10/09/2023-14:43:51] [I] Export timing to JSON file:
[10/09/2023-14:43:51] [I] Export output to JSON file:
[10/09/2023-14:43:51] [I] Export profile to JSON file:
[10/09/2023-14:43:51] [I]
[10/09/2023-14:43:51] [I] === Device Information ===
[10/09/2023-14:43:51] [I] Selected Device: Orin
[10/09/2023-14:43:51] [I] Compute Capability: 8.7
[10/09/2023-14:43:51] [I] SMs: 16
[10/09/2023-14:43:51] [I] Compute Clock Rate: 1.3 GHz
[10/09/2023-14:43:51] [I] Device Global Memory: 30622 MiB
[10/09/2023-14:43:51] [I] Shared Memory per SM: 164 KiB
[10/09/2023-14:43:51] [I] Memory Bus Width: 128 bits (ECC disabled)
[10/09/2023-14:43:51] [I] Memory Clock Rate: 1.3 GHz
[10/09/2023-14:43:51] [I]
[10/09/2023-14:43:51] [I] TensorRT version: 8.4.0
[10/09/2023-14:43:51] [I] [TRT] [MemUsageChange] Init CUDA: CPU +302, GPU +0, now: CPU 327, GPU 7995 (MiB)
[10/09/2023-14:43:53] [I] [TRT] [MemUsageChange] Init builder kernel library: CPU +403, GPU +534, now: CPU 749, GPU 8546 (MiB)
[10/09/2023-14:43:53] [I] Start parsing network model
[10/09/2023-14:43:53] [I] [TRT] ----------------------------------------------------------------
[10/09/2023-14:43:53] [I] [TRT] Input filename: /tmp/datadir/jetson_orin_profiler/fVq2Mr.onnx
[10/09/2023-14:43:53] [I] [TRT] ONNX IR version: 0.0.6
[10/09/2023-14:43:53] [I] [TRT] Opset version: 11
[10/09/2023-14:43:53] [I] [TRT] Producer name: pytorch
[10/09/2023-14:43:53] [I] [TRT] Producer version: 1.12.1
[10/09/2023-14:43:53] [I] [TRT] Domain:
[10/09/2023-14:43:53] [I] [TRT] Model version: 0
[10/09/2023-14:43:53] [I] [TRT] Doc string:
[10/09/2023-14:43:53] [I] [TRT] ----------------------------------------------------------------
[10/09/2023-14:43:53] [I] Finish parsing network model
[10/09/2023-14:43:53] [I] [TRT] ---------- Layers Running on DLA ----------
[10/09/2023-14:43:53] [I] [TRT] ---------- Layers Running on GPU ----------
[10/09/2023-14:43:53] [I] [TRT] [GpuLayer] CONVOLUTION: Conv_0 + Relu_1
[10/09/2023-14:43:53] [I] [TRT] [GpuLayer] CONVOLUTION: Conv_2 + Relu_3
[10/09/2023-14:43:53] [I] [TRT] [GpuLayer] CONVOLUTION: Conv_4 + Relu_5
[10/09/2023-14:43:53] [I] [TRT] [GpuLayer] POOLING: MaxPool_6
[10/09/2023-14:43:53] [I] [TRT] [GpuLayer] CONVOLUTION: Conv_7 + Relu_8
[10/09/2023-14:43:53] [I] [TRT] [GpuLayer] CONVOLUTION: Conv_9 + Add_10 + Relu_11
[10/09/2023-14:43:53] [I] [TRT] [GpuLayer] CONVOLUTION: Conv_12 + Relu_13
[10/09/2023-14:43:53] [I] [TRT] [GpuLayer] CONVOLUTION: Conv_14 + Add_15 + Relu_16
[10/09/2023-14:43:53] [I] [TRT] [GpuLayer] CONVOLUTION: Conv_17 + Relu_18
[10/09/2023-14:43:53] [I] [TRT] [GpuLayer] CONVOLUTION: Conv_77 + Relu_78
[10/09/2023-14:43:53] [I] [TRT] [GpuLayer] CONVOLUTION: Conv_19
[10/09/2023-14:43:53] [I] [TRT] [GpuLayer] CONVOLUTION: Conv_20 + Add_21 + Relu_22
[10/09/2023-14:43:53] [I] [TRT] [GpuLayer] CONVOLUTION: Conv_23 + Relu_24
[10/09/2023-14:43:53] [I] [TRT] [GpuLayer] CONVOLUTION: Conv_25 + Add_26 + Relu_27
[10/09/2023-14:43:53] [I] [TRT] [GpuLayer] CONVOLUTION: Conv_28 + Relu_29
[10/09/2023-14:43:53] [I] [TRT] [GpuLayer] CONVOLUTION: Conv_30
[10/09/2023-14:43:53] [I] [TRT] [GpuLayer] CONVOLUTION: Conv_31 + Add_32 + Relu_33
[10/09/2023-14:43:53] [I] [TRT] [GpuLayer] CONVOLUTION: Conv_34 + Relu_35
[10/09/2023-14:43:53] [I] [TRT] [GpuLayer] CONVOLUTION: Conv_36 + Add_37 + Relu_38
[10/09/2023-14:43:53] [I] [TRT] [GpuLayer] CONVOLUTION: Conv_39 + Relu_40
[10/09/2023-14:43:53] [I] [TRT] [GpuLayer] CONVOLUTION: Conv_41
[10/09/2023-14:43:53] [I] [TRT] [GpuLayer] CONVOLUTION: Conv_42 + Add_43 + Relu_44
[10/09/2023-14:43:53] [I] [TRT] [GpuLayer] CONVOLUTION: Conv_45 + Relu_46
[10/09/2023-14:43:53] [I] [TRT] [GpuLayer] CONVOLUTION: Conv_47 + Add_48 + Relu_49
[10/09/2023-14:43:53] [I] [TRT] [GpuLayer] POOLING: GlobalAveragePool_50
[10/09/2023-14:43:53] [I] [TRT] [GpuLayer] CONVOLUTION: Conv_60 + Relu_61
[10/09/2023-14:43:53] [I] [TRT] [GpuLayer] CONVOLUTION: Conv_62 + Relu_63
[10/09/2023-14:43:53] [I] [TRT] [GpuLayer] CONVOLUTION: Conv_66 + Relu_67
[10/09/2023-14:43:53] [I] [TRT] [GpuLayer] CONVOLUTION: Conv_70 + Relu_71
[10/09/2023-14:43:53] [I] [TRT] [GpuLayer] CONVOLUTION: Conv_51 + Relu_52
[10/09/2023-14:43:53] [I] [TRT] [GpuLayer] CONVOLUTION: Conv_64 + Relu_65
[10/09/2023-14:43:53] [I] [TRT] [GpuLayer] CONVOLUTION: Conv_68 + Relu_69
[10/09/2023-14:43:53] [I] [TRT] [GpuLayer] CONVOLUTION: Conv_72 + Relu_73
[10/09/2023-14:43:53] [I] [TRT] [GpuLayer] RESIZE: Resize_59
[10/09/2023-14:43:53] [I] [TRT] [GpuLayer] COPY: onnx::Concat_317 copy
[10/09/2023-14:43:53] [I] [TRT] [GpuLayer] CONVOLUTION: Conv_75 + Relu_76
[10/09/2023-14:43:53] [I] [TRT] [GpuLayer] RESIZE: Resize_82
[10/09/2023-14:43:53] [I] [TRT] [GpuLayer] COPY: onnx::Concat_360 copy
[10/09/2023-14:43:53] [I] [TRT] [GpuLayer] CONVOLUTION: Conv_84 + Relu_85
[10/09/2023-14:43:53] [I] [TRT] [GpuLayer] CONVOLUTION: Conv_86 + Relu_87
[10/09/2023-14:43:53] [I] [TRT] [GpuLayer] CONV_DEP_SEP: Conv_88 + Relu_89 + Conv_90 + Relu_91
[10/09/2023-14:43:53] [I] [TRT] [GpuLayer] CONVOLUTION: Conv_92
[10/09/2023-14:43:53] [I] [TRT] [GpuLayer] RESIZE: Resize_96
[10/09/2023-14:43:53] [I] [TRT] [GpuLayer] TOPK: ArgMax_97
[10/09/2023-14:43:54] [I] [TRT] [MemUsageChange] Init cuBLAS/cuBLASLt: CPU +534, GPU +779, now: CPU 1330, GPU 9415 (MiB)
[10/09/2023-14:43:54] [I] [TRT] [MemUsageChange] Init cuDNN: CPU +85, GPU +135, now: CPU 1415, GPU 9550 (MiB)
[10/09/2023-14:43:54] [I] [TRT] Local timing cache in use. Profiling results in this builder pass will not be stored.
[10/09/2023-14:49:53] [I] [TRT] Detected 1 inputs and 1 output network tensors.
[10/09/2023-14:49:53] [I] [TRT] Total Host Persistent Memory: 68768
[10/09/2023-14:49:53] [I] [TRT] Total Device Persistent Memory: 0
[10/09/2023-14:49:53] [I] [TRT] Total Scratch Memory: 8388608
[10/09/2023-14:49:53] [I] [TRT] [MemUsageStats] Peak memory usage of TRT CPU/GPU memory allocators: CPU 15 MiB, GPU 10504 MiB
[10/09/2023-14:49:53] [I] [TRT] [BlockAssignment] Algorithm ShiftNTopDown took 1.44971ms to assign 7 blocks to 41 nodes requiring 23069184 bytes.
[10/09/2023-14:49:53] [I] [TRT] Total Activation Memory: 23069184
[10/09/2023-14:49:53] [I] [TRT] [MemUsageChange] TensorRT-managed allocation in building engine: CPU +12, GPU +16, now: CPU 12, GPU 16 (MiB)
[10/09/2023-14:49:53] [I] Engine built in 362.216 sec.
[10/09/2023-14:49:53] [I] [TRT] [MemUsageChange] Init CUDA: CPU +0, GPU +0, now: CPU 1622, GPU 10512 (MiB)
[10/09/2023-14:49:53] [I] [TRT] Loaded engine size: 13 MiB
[10/09/2023-14:49:53] [I] [TRT] [MemUsageChange] TensorRT-managed allocation in engine deserialization: CPU +0, GPU +11, now: CPU 0, GPU 11 (MiB)
[10/09/2023-14:49:53] [I] Engine deserialized in 0.0166877 sec.
[10/09/2023-14:49:53] [I] [TRT] [MemUsageChange] TensorRT-managed allocation in IExecutionContext creation: CPU +0, GPU +22, now: CPU 0, GPU 33 (MiB)
[10/09/2023-14:49:53] [I] Using random values for input input
[10/09/2023-14:49:53] [I] Created input binding for input with dimensions 1x3x512x512
[10/09/2023-14:49:53] [I] Using random values for output output
[10/09/2023-14:49:53] [I] Created output binding for output with dimensions 1x1x512x512
[10/09/2023-14:49:53] [I] Starting inference
[10/09/2023-14:49:56] [I] Warmup completed 62 queries over 200 ms
[10/09/2023-14:49:56] [I] Timing trace has 1090 queries over 3.00721 s
[10/09/2023-14:49:56] [I]
[10/09/2023-14:49:56] [I] === Trace details ===
[10/09/2023-14:49:56] [I] Trace averages of 10 runs:
[10/09/2023-14:49:56] [I] Average on 10 runs - GPU latency: 2.90507 ms - Host latency: 3.1623 ms (enqueue 0.416463 ms)
[10/09/2023-14:49:56] [I] Average on 10 runs - GPU latency: 2.90359 ms - Host latency: 3.1557 ms (enqueue 0.434381 ms)
[10/09/2023-14:49:56] [I] Average on 10 runs - GPU latency: 2.9075 ms - Host latency: 3.16463 ms (enqueue 0.418427 ms)
[10/09/2023-14:49:56] [I] Average on 10 runs - GPU latency: 2.90257 ms - Host latency: 3.15648 ms (enqueue 0.438312 ms)
[10/09/2023-14:49:56] [I] Average on 10 runs - GPU latency: 2.90617 ms - Host latency: 3.1637 ms (enqueue 0.417685 ms)
[10/09/2023-14:49:56] [I] Average on 10 runs - GPU latency: 2.90267 ms - Host latency: 3.15533 ms (enqueue 0.450476 ms)
[10/09/2023-14:49:56] [I] Average on 10 runs - GPU latency: 2.91083 ms - Host latency: 3.16675 ms (enqueue 0.424933 ms)
[10/09/2023-14:49:56] [I] Average on 10 runs - GPU latency: 2.90709 ms - Host latency: 3.15871 ms (enqueue 0.425381 ms)
[10/09/2023-14:49:56] [I] Average on 10 runs - GPU latency: 2.90696 ms - Host latency: 3.16137 ms (enqueue 0.425742 ms)
[10/09/2023-14:49:56] [I] Average on 10 runs - GPU latency: 2.91012 ms - Host latency: 3.16513 ms (enqueue 0.415768 ms)
[10/09/2023-14:49:56] [I] Average on 10 runs - GPU latency: 2.90859 ms - Host latency: 3.16026 ms (enqueue 0.416815 ms)
[10/09/2023-14:49:56] [I] Average on 10 runs - GPU latency: 2.91745 ms - Host latency: 3.17538 ms (enqueue 0.416156 ms)
[10/09/2023-14:49:56] [I] Average on 10 runs - GPU latency: 2.91884 ms - Host latency: 3.17151 ms (enqueue 0.405414 ms)
[10/09/2023-14:49:56] [I] Average on 10 runs - GPU latency: 2.91745 ms - Host latency: 3.17446 ms (enqueue 0.373779 ms)
[10/09/2023-14:49:56] [I] Average on 10 runs - GPU latency: 2.91771 ms - Host latency: 3.1718 ms (enqueue 0.353784 ms)
[10/09/2023-14:49:56] [I] Average on 10 runs - GPU latency: 2.91638 ms - Host latency: 3.17329 ms (enqueue 0.402789 ms)
[10/09/2023-14:49:56] [I] Average on 10 runs - GPU latency: 2.91426 ms - Host latency: 3.16848 ms (enqueue 0.411945 ms)
[10/09/2023-14:49:56] [I] Average on 10 runs - GPU latency: 2.87631 ms - Host latency: 3.12305 ms (enqueue 0.385657 ms)
[10/09/2023-14:49:56] [I] Average on 10 runs - GPU latency: 2.72517 ms - Host latency: 2.90593 ms (enqueue 0.413159 ms)
[10/09/2023-14:49:56] [I] Average on 10 runs - GPU latency: 2.7231 ms - Host latency: 2.90192 ms (enqueue 0.399103 ms)
[10/09/2023-14:49:56] [I] Average on 10 runs - GPU latency: 2.72532 ms - Host latency: 2.90799 ms (enqueue 0.408411 ms)
[10/09/2023-14:49:56] [I] Average on 10 runs - GPU latency: 2.72289 ms - Host latency: 2.90276 ms (enqueue 0.404736 ms)
[10/09/2023-14:49:56] [I] Average on 10 runs - GPU latency: 2.7274 ms - Host latency: 2.90708 ms (enqueue 0.400537 ms)
[10/09/2023-14:49:56] [I] Average on 10 runs - GPU latency: 2.72813 ms - Host latency: 2.90961 ms (enqueue 0.364954 ms)
[10/09/2023-14:49:56] [I] Average on 10 runs - GPU latency: 2.72476 ms - Host latency: 2.90514 ms (enqueue 0.324939 ms)
[10/09/2023-14:49:56] [I] Average on 10 runs - GPU latency: 2.7298 ms - Host latency: 2.91313 ms (enqueue 0.357587 ms)
[10/09/2023-14:49:56] [I] Average on 10 runs - GPU latency: 2.7288 ms - Host latency: 2.90847 ms (enqueue 0.322028 ms)
[10/09/2023-14:49:56] [I] Average on 10 runs - GPU latency: 2.72578 ms - Host latency: 2.91083 ms (enqueue 0.364374 ms)
[10/09/2023-14:49:56] [I] Average on 10 runs - GPU latency: 2.72618 ms - Host latency: 2.91013 ms (enqueue 0.377563 ms)
[10/09/2023-14:49:56] [I] Average on 10 runs - GPU latency: 2.72821 ms - Host latency: 2.90884 ms (enqueue 0.414325 ms)
[10/09/2023-14:49:56] [I] Average on 10 runs - GPU latency: 2.72363 ms - Host latency: 2.90326 ms (enqueue 0.399463 ms)
[10/09/2023-14:49:56] [I] Average on 10 runs - GPU latency: 2.72731 ms - Host latency: 2.90835 ms (enqueue 0.401697 ms)
[10/09/2023-14:49:56] [I] Average on 10 runs - GPU latency: 2.7262 ms - Host latency: 2.90316 ms (enqueue 0.401685 ms)
[10/09/2023-14:49:56] [I] Average on 10 runs - GPU latency: 2.7264 ms - Host latency: 2.90647 ms (enqueue 0.414526 ms)
[10/09/2023-14:49:56] [I] Average on 10 runs - GPU latency: 2.72576 ms - Host latency: 2.90641 ms (enqueue 0.401624 ms)
[10/09/2023-14:49:56] [I] Average on 10 runs - GPU latency: 2.72671 ms - Host latency: 2.90752 ms (enqueue 0.3901 ms)
[10/09/2023-14:49:56] [I] Average on 10 runs - GPU latency: 2.72703 ms - Host latency: 2.90739 ms (enqueue 0.414172 ms)
[10/09/2023-14:49:56] [I] Average on 10 runs - GPU latency: 2.72511 ms - Host latency: 2.90651 ms (enqueue 0.414685 ms)
[10/09/2023-14:49:56] [I] Average on 10 runs - GPU latency: 2.72607 ms - Host latency: 2.90415 ms (enqueue 0.405457 ms)
[10/09/2023-14:49:56] [I] Average on 10 runs - GPU latency: 2.72124 ms - Host latency: 2.90184 ms (enqueue 0.411145 ms)
[10/09/2023-14:49:56] [I] Average on 10 runs - GPU latency: 2.72335 ms - Host latency: 2.90562 ms (enqueue 0.403052 ms)
[10/09/2023-14:49:56] [I] Average on 10 runs - GPU latency: 2.72317 ms - Host latency: 2.90028 ms (enqueue 0.40033 ms)
[10/09/2023-14:49:56] [I] Average on 10 runs - GPU latency: 2.72258 ms - Host latency: 2.90234 ms (enqueue 0.414868 ms)
[10/09/2023-14:49:56] [I] Average on 10 runs - GPU latency: 2.72322 ms - Host latency: 2.90375 ms (enqueue 0.400281 ms)
[10/09/2023-14:49:56] [I] Average on 10 runs - GPU latency: 2.72747 ms - Host latency: 2.90956 ms (enqueue 0.409656 ms)
[10/09/2023-14:49:56] [I] Average on 10 runs - GPU latency: 2.72385 ms - Host latency: 2.90184 ms (enqueue 0.398645 ms)
[10/09/2023-14:49:56] [I] Average on 10 runs - GPU latency: 2.72313 ms - Host latency: 2.90262 ms (enqueue 0.399622 ms)
[10/09/2023-14:49:56] [I] Average on 10 runs - GPU latency: 2.72268 ms - Host latency: 2.90144 ms (enqueue 0.416138 ms)
[10/09/2023-14:49:56] [I] Average on 10 runs - GPU latency: 2.72307 ms - Host latency: 2.90114 ms (enqueue 0.393555 ms)
[10/09/2023-14:49:56] [I] Average on 10 runs - GPU latency: 2.72283 ms - Host latency: 2.90148 ms (enqueue 0.411365 ms)
[10/09/2023-14:49:56] [I] Average on 10 runs - GPU latency: 2.72465 ms - Host latency: 2.9046 ms (enqueue 0.403308 ms)
[10/09/2023-14:49:56] [I] Average on 10 runs - GPU latency: 2.72552 ms - Host latency: 2.90345 ms (enqueue 0.400049 ms)
[10/09/2023-14:49:56] [I] Average on 10 runs - GPU latency: 2.72345 ms - Host latency: 2.90293 ms (enqueue 0.398767 ms)
[10/09/2023-14:49:56] [I] Average on 10 runs - GPU latency: 2.72509 ms - Host latency: 2.90634 ms (enqueue 0.410291 ms)
[10/09/2023-14:49:56] [I] Average on 10 runs - GPU latency: 2.72419 ms - Host latency: 2.90126 ms (enqueue 0.39884 ms)
[10/09/2023-14:49:56] [I] Average on 10 runs - GPU latency: 2.72383 ms - Host latency: 2.90428 ms (enqueue 0.394775 ms)
[10/09/2023-14:49:56] [I] Average on 10 runs - GPU latency: 2.72299 ms - Host latency: 2.90341 ms (enqueue 0.412415 ms)
[10/09/2023-14:49:56] [I] Average on 10 runs - GPU latency: 2.72114 ms - Host latency: 2.90051 ms (enqueue 0.393494 ms)
[10/09/2023-14:49:56] [I] Average on 10 runs - GPU latency: 2.7235 ms - Host latency: 2.9048 ms (enqueue 0.412061 ms)
[10/09/2023-14:49:56] [I] Average on 10 runs - GPU latency: 2.72556 ms - Host latency: 2.90575 ms (enqueue 0.401501 ms)
[10/09/2023-14:49:56] [I] Average on 10 runs - GPU latency: 2.72245 ms - Host latency: 2.9038 ms (enqueue 0.413293 ms)
[10/09/2023-14:49:56] [I] Average on 10 runs - GPU latency: 2.72371 ms - Host latency: 2.90236 ms (enqueue 0.402747 ms)
[10/09/2023-14:49:56] [I] Average on 10 runs - GPU latency: 2.72251 ms - Host latency: 2.90408 ms (enqueue 0.407153 ms)
[10/09/2023-14:49:56] [I] Average on 10 runs - GPU latency: 2.72361 ms - Host latency: 2.90463 ms (enqueue 0.423059 ms)
[10/09/2023-14:49:56] [I] Average on 10 runs - GPU latency: 2.72444 ms - Host latency: 2.90293 ms (enqueue 0.395801 ms)
[10/09/2023-14:49:56] [I] Average on 10 runs - GPU latency: 2.72235 ms - Host latency: 2.90092 ms (enqueue 0.41676 ms)
[10/09/2023-14:49:56] [I] Average on 10 runs - GPU latency: 2.72723 ms - Host latency: 2.90797 ms (enqueue 0.396533 ms)
[10/09/2023-14:49:56] [I] Average on 10 runs - GPU latency: 2.72432 ms - Host latency: 2.90239 ms (enqueue 0.398633 ms)
[10/09/2023-14:49:56] [I] Average on 10 runs - GPU latency: 2.72578 ms - Host latency: 2.90315 ms (enqueue 0.400977 ms)
[10/09/2023-14:49:56] [I] Average on 10 runs - GPU latency: 2.72141 ms - Host latency: 2.90327 ms (enqueue 0.414038 ms)
[10/09/2023-14:49:56] [I] Average on 10 runs - GPU latency: 2.72839 ms - Host latency: 2.90642 ms (enqueue 0.400439 ms)
[10/09/2023-14:49:56] [I] Average on 10 runs - GPU latency: 2.72566 ms - Host latency: 2.90493 ms (enqueue 0.395142 ms)
[10/09/2023-14:49:56] [I] Average on 10 runs - GPU latency: 2.72695 ms - Host latency: 2.90674 ms (enqueue 0.401538 ms)
[10/09/2023-14:49:56] [I] Average on 10 runs - GPU latency: 2.72703 ms - Host latency: 2.90393 ms (enqueue 0.398169 ms)
[10/09/2023-14:49:56] [I] Average on 10 runs - GPU latency: 2.72717 ms - Host latency: 2.90852 ms (enqueue 0.409937 ms)
[10/09/2023-14:49:56] [I] Average on 10 runs - GPU latency: 2.7249 ms - Host latency: 2.9041 ms (enqueue 0.39729 ms)
[10/09/2023-14:49:56] [I] Average on 10 runs - GPU latency: 2.72429 ms - Host latency: 2.90579 ms (enqueue 0.412842 ms)
[10/09/2023-14:49:56] [I] Average on 10 runs - GPU latency: 2.72502 ms - Host latency: 2.90471 ms (enqueue 0.399365 ms)
[10/09/2023-14:49:56] [I] Average on 10 runs - GPU latency: 2.72319 ms - Host latency: 2.90342 ms (enqueue 0.399414 ms)
[10/09/2023-14:49:56] [I] Average on 10 runs - GPU latency: 2.72314 ms - Host latency: 2.90562 ms (enqueue 0.418555 ms)
[10/09/2023-14:49:56] [I] Average on 10 runs - GPU latency: 2.72461 ms - Host latency: 2.90591 ms (enqueue 0.33313 ms)
[10/09/2023-14:49:56] [I] Average on 10 runs - GPU latency: 2.72988 ms - Host latency: 2.9114 ms (enqueue 0.361914 ms)
[10/09/2023-14:49:56] [I] Average on 10 runs - GPU latency: 2.72549 ms - Host latency: 2.9095 ms (enqueue 0.360474 ms)
[10/09/2023-14:49:56] [I] Average on 10 runs - GPU latency: 2.72043 ms - Host latency: 2.90259 ms (enqueue 0.421533 ms)
[10/09/2023-14:49:56] [I] Average on 10 runs - GPU latency: 2.72314 ms - Host latency: 2.90029 ms (enqueue 0.400269 ms)
[10/09/2023-14:49:56] [I] Average on 10 runs - GPU latency: 2.72419 ms - Host latency: 2.90427 ms (enqueue 0.412036 ms)
[10/09/2023-14:49:56] [I] Average on 10 runs - GPU latency: 2.72695 ms - Host latency: 2.90618 ms (enqueue 0.396558 ms)
[10/09/2023-14:49:56] [I] Average on 10 runs - GPU latency: 2.72493 ms - Host latency: 2.90625 ms (enqueue 0.413403 ms)
[10/09/2023-14:49:56] [I] Average on 10 runs - GPU latency: 2.72527 ms - Host latency: 2.90703 ms (enqueue 0.40354 ms)
[10/09/2023-14:49:56] [I] Average on 10 runs - GPU latency: 2.72546 ms - Host latency: 2.90757 ms (enqueue 0.401294 ms)
[10/09/2023-14:49:56] [I] Average on 10 runs - GPU latency: 2.72673 ms - Host latency: 2.90671 ms (enqueue 0.413403 ms)
[10/09/2023-14:49:56] [I] Average on 10 runs - GPU latency: 2.72415 ms - Host latency: 2.90442 ms (enqueue 0.398706 ms)
[10/09/2023-14:49:56] [I] Average on 10 runs - GPU latency: 2.72507 ms - Host latency: 2.90488 ms (enqueue 0.415356 ms)
[10/09/2023-14:49:56] [I] Average on 10 runs - GPU latency: 2.72952 ms - Host latency: 2.90859 ms (enqueue 0.394995 ms)
[10/09/2023-14:49:56] [I] Average on 10 runs - GPU latency: 2.72417 ms - Host latency: 2.90459 ms (enqueue 0.398315 ms)
[10/09/2023-14:49:56] [I] Average on 10 runs - GPU latency: 2.72478 ms - Host latency: 2.90437 ms (enqueue 0.39978 ms)
[10/09/2023-14:49:56] [I] Average on 10 runs - GPU latency: 2.72388 ms - Host latency: 2.90344 ms (enqueue 0.406934 ms)
[10/09/2023-14:49:56] [I] Average on 10 runs - GPU latency: 2.72383 ms - Host latency: 2.90427 ms (enqueue 0.405688 ms)
[10/09/2023-14:49:56] [I] Average on 10 runs - GPU latency: 2.72639 ms - Host latency: 2.90825 ms (enqueue 0.409619 ms)
[10/09/2023-14:49:56] [I] Average on 10 runs - GPU latency: 2.7217 ms - Host latency: 2.90139 ms (enqueue 0.402881 ms)
[10/09/2023-14:49:56] [I] Average on 10 runs - GPU latency: 2.72532 ms - Host latency: 2.90596 ms (enqueue 0.397681 ms)
[10/09/2023-14:49:56] [I] Average on 10 runs - GPU latency: 2.72151 ms - Host latency: 2.90239 ms (enqueue 0.403076 ms)
[10/09/2023-14:49:56] [I] Average on 10 runs - GPU latency: 2.72319 ms - Host latency: 2.90249 ms (enqueue 0.401807 ms)
[10/09/2023-14:49:56] [I] Average on 10 runs - GPU latency: 2.72466 ms - Host latency: 2.90664 ms (enqueue 0.417407 ms)
[10/09/2023-14:49:56] [I] Average on 10 runs - GPU latency: 2.72554 ms - Host latency: 2.9032 ms (enqueue 0.399731 ms)
[10/09/2023-14:49:56] [I] Average on 10 runs - GPU latency: 2.72524 ms - Host latency: 2.90249 ms (enqueue 0.402832 ms)
[10/09/2023-14:49:56] [I] Average on 10 runs - GPU latency: 2.72507 ms - Host latency: 2.90859 ms (enqueue 0.41062 ms)
[10/09/2023-14:49:56] [I] Average on 10 runs - GPU latency: 2.72581 ms - Host latency: 2.90664 ms (enqueue 0.396826 ms)
[10/09/2023-14:49:56] [I] Average on 10 runs - GPU latency: 2.72693 ms - Host latency: 2.90413 ms (enqueue 0.40542 ms)
[10/09/2023-14:49:56] [I]
[10/09/2023-14:49:56] [I] === Performance summary ===
[10/09/2023-14:49:56] [I] Throughput: 362.462 qps
[10/09/2023-14:49:56] [I] Latency: min = 2.88501 ms, max = 3.21155 ms, mean = 2.94758 ms, median = 2.90625 ms, percentile(99%) = 3.18567 ms
[10/09/2023-14:49:56] [I] Enqueue Time: min = 0.203857 ms, max = 0.647705 ms, mean = 0.401836 ms, median = 0.349258 ms, percentile(99%) = 0.624023 ms
[10/09/2023-14:49:56] [I] H2D Latency: min = 0.107422 ms, max = 0.196655 ms, mean = 0.127499 ms, median = 0.122742 ms, percentile(99%) = 0.17981 ms
[10/09/2023-14:49:56] [I] GPU Compute Time: min = 2.70557 ms, max = 2.9538 ms, mean = 2.75517 ms, median = 2.72681 ms, percentile(99%) = 2.92682 ms
[10/09/2023-14:49:56] [I] D2H Latency: min = 0.0327148 ms, max = 0.0891724 ms, mean = 0.0649182 ms, median = 0.0617676 ms, percentile(99%) = 0.0858765 ms
[10/09/2023-14:49:56] [I] Total Host Walltime: 3.00721 s
[10/09/2023-14:49:56] [I] Total GPU Compute Time: 3.00313 s
[10/09/2023-14:49:56] [I] Explanations of the performance metrics are printed in the verbose logs.
[10/09/2023-14:49:56] [I]
&&&& PASSED TensorRT.trtexec [TensorRT v8400] # trtexec --onnx=/tmp/datadir/jetson_orin_profiler/fVq2Mr.onnx --best
[10/09/2023-14:43:53] [W] [TRT] onnx2trt_utils.cpp:363: Your ONNX model has been generated with INT64 weights, while TensorRT does not natively support INT64. Attempting to cast down to INT32.
[10/09/2023-14:43:53] [W] [TRT] Tensor DataType is determined at build time for tensors not marked as input or output.
[10/09/2023-14:43:53] [W] [TRT] Calibrator is not being used. Users must provide dynamic range for all tensors that are not Int32 or Bool.
[10/09/2023-14:49:56] [W] * GPU compute time is unstable, with coefficient of variance = 2.50917%.
[10/09/2023-14:49:56] [W]   If not already in use, locking GPU clock frequency or adding --useSpinWait may improve the stability.
========== megpeak CPU perf ==========
there are 12 cores, currently use core id :0
Vendor is: ARM, uArch: unknown, frequency: 0Hz
bandwidth: 18.162656 Gbps
nop throughput: 0.051038 ns 19.593052 GFlops latency: 0.063943 ns :
ldd throughput: 0.156822 ns 12.753348 GFlops latency: 0.155758 ns :
ldq throughput: 0.158044 ns 25.309477 GFlops latency: 0.164016 ns :
stq throughput: 0.234704 ns 17.042721 GFlops latency: 0.232976 ns :
ldpq throughput: 0.311298 ns 25.698837 GFlops latency: 0.310349 ns :
lddx2 throughput: 0.233646 ns 17.119894 GFlops latency: 0.233240 ns :
ld1q throughput: 0.154976 ns 25.810520 GFlops latency: 0.158148 ns :
eor throughput: 0.233010 ns 17.166622 GFlops latency: 0.932019 ns :
fmla throughput: 0.233660 ns 34.237728 GFlops latency: 1.862553 ns :
fmlad throughput: 0.233286 ns 17.146313 GFlops latency: 1.866461 ns :
fmla_x2 throughput: 0.461863 ns 34.642342 GFlops latency: 3.736033 ns :
mla throughput: 0.466095 ns 17.163897 GFlops latency: 1.864779 ns :
fmul throughput: 0.232068 ns 17.236309 GFlops latency: 1.400778 ns :
mul throughput: 0.467611 ns 8.554125 GFlops latency: 1.867239 ns :
addp throughput: 0.233756 ns 17.111837 GFlops latency: 0.931935 ns :
sadalp throughput: 0.467289 ns 8.560019 GFlops latency: 1.908405 ns :
add throughput: 0.234036 ns 17.091366 GFlops latency: 0.934213 ns :
fadd throughput: 0.232060 ns 17.236898 GFlops latency: 0.931553 ns :
smull throughput: 0.465611 ns 8.590869 GFlops latency: 1.863507 ns :
smlal_4b throughput: 0.462315 ns 17.304235 GFlops latency: 1.862949 ns :
smlal_8b throughput: 0.465971 ns 34.336929 GFlops latency: 1.864434 ns :
dupd_lane_s8 throughput: 0.249114 ns 32.113747 GFlops latency: 0.944063 ns :
mlaq_lane_s16 throughput: 0.466453 ns 34.301445 GFlops latency: 1.868227 ns :
sshll throughput: 0.465749 ns 17.176647 GFlops latency: 0.931611 ns :
tbl throughput: 0.235838 ns 67.843071 GFlops latency: 0.931875 ns :
ins throughput: 0.465695 ns 4.294660 GFlops latency: 1.133675 ns :
sqrdmulh throughput: 0.466977 ns 8.565740 GFlops latency: 1.865243 ns :
usubl throughput: 0.232038 ns 17.238537 GFlops latency: 0.945231 ns :
abs throughput: 0.234244 ns 17.076187 GFlops latency: 0.932531 ns :
fcvtzs throughput: 0.931721 ns 4.293129 GFlops latency: 1.867801 ns :
scvtf throughput: 0.931455 ns 4.294355 GFlops latency: 1.857579 ns :
fcvtns throughput: 0.932677 ns 4.288729 GFlops latency: 1.867835 ns :
fcvtms throughput: 0.958486 ns 4.173250 GFlops latency: 1.864045 ns :
fcvtps throughput: 0.928809 ns 4.306590 GFlops latency: 1.864524 ns :
fcvtas throughput: 0.931749 ns 4.293001 GFlops latency: 1.870157 ns :
fcvtn throughput: 0.936997 ns 4.268956 GFlops latency: 1.866197 ns :
fcvtl throughput: 0.933835 ns 4.283411 GFlops latency: 1.864444 ns :
prefetch_very_long throughput: 13.518499 ns 0.295891 GFlops latency: 0.156454 ns :
ins_ldd throughput: 0.470285 ns 4.252743 GFlops latency: 0.467163 ns :Test ldd ins dual issue
ldd_ldx_ins throughput: 1.126971 ns 3.549336 GFlops latency: 0.468377 ns :
ldqstq throughput: 2.956711 ns 1.352854 GFlops latency: 2.997606 ns :Test ldq stq dual issue
ldq_fmlaq throughput: 0.232492 ns 34.409740 GFlops latency: 0.233272 ns :
stq_fmlaq_lane throughput: 0.310987 ns 25.724541 GFlops latency: 2.331987 ns :Test stq fmlaq_lane dual issue
ldd_fmlad throughput: 0.233794 ns 17.109056 GFlops latency: 0.232268 ns :Test ldd fmlad dual issue
ldq_fmlaq_sep throughput: 0.232920 ns 34.346581 GFlops latency: 1.863248 ns :Test throughput ldq + 2 x fmlaq
ldq_fmlaq_lane_sep throughput: 0.233372 ns 34.279984 GFlops latency: 2.344309 ns :Test compute throughput ldq + 2 x fmlaq_lane
ldd_fmlaq_sep throughput: 0.232732 ns 34.374252 GFlops latency: 1.862347 ns :Test compute throughput ldq + fmlaq
lds_fmlaq_lane_sep throughput: 0.231064 ns 34.622395 GFlops latency: 2.328885 ns :
ldd_fmlaq_lane_sep throughput: 0.233898 ns 34.202896 GFlops latency: 2.328241 ns :Test compute throughput ldd + fmlaq_lane
ldx_fmlaq_lane_sep throughput: 0.232600 ns 34.393761 GFlops latency: 2.330561 ns :
ldd_ldx_ins_fmlaq_lane_sep throughput: 0.394026 ns 20.303207 GFlops latency: 2.328663 ns :Test compute throughput ldd+fmlaq+ldx+fmlaq+ins+fmlaq
ldd_nop_ldx_ins_fmlaq_lane_sep throughput: 0.354759 ns 22.550552 GFlops latency: 2.331833 ns :
ins_fmlaq_lane_1_4_sep throughput: 0.409954 ns 19.514381 GFlops latency: 3.771058 ns :Test compute throughput ins + 4 x fmlaq_lane
ldd_fmlaq_lane_1_4_sep throughput: 0.233190 ns 34.306808 GFlops latency: 2.337683 ns :Test compute throughput ldd + 4 x fmlaq_lane
ldq_fmlaq_lane_1_4_sep throughput: 0.233305 ns 34.289913 GFlops latency: 0.229988 ns :Test compute throughput ldq + 4 x fmlaq_lane
ins_fmlaq_lane_1_3_sep throughput: 0.427551 ns 18.711222 GFlops latency: 3.767305 ns :Test compute throughput ins + 3 x fmlaq_lane
ldd_fmlaq_lane_1_3_sep throughput: 0.393252 ns 20.343193 GFlops latency: 3.820154 ns :
ldq_fmlaq_lane_1_3_sep throughput: 0.233680 ns 34.234901 GFlops latency: 0.232836 ns :Test compute throughput ldq + 3 x fmlaq_lane
ldq_fmlaq_lane_1_2_sep throughput: 0.232827 ns 34.360226 GFlops latency: 0.232666 ns :Test compute throughput ldq + 2 x fmlaq_lane
ins_fmlaq_lane_sep throughput: 1.190536 ns 6.719664 GFlops latency: 2.328707 ns :
dupd_fmlaq_lane_sep throughput: 0.698771 ns 11.448673 GFlops latency: 2.327169 ns :
smlal_8b_addp throughput: 0.468809 ns 34.129063 GFlops latency: 3.263544 ns :
smlal_8b_dupd throughput: 0.466561 ns 34.293507 GFlops latency: 1.866995 ns :
ldd_smlalq_sep_8b throughput: 0.462229 ns 34.614906 GFlops latency: 0.461599 ns :Test ldd smlalq dual issue
ldq_smlalq_sep throughput: 0.458745 ns 34.877800 GFlops latency: 0.456417 ns :Test ldq smlalq dual issue
lddx2_smlalq_sep throughput: 0.456300 ns 35.064613 GFlops latency: 0.456667 ns :
smlal_sadalp throughput: 0.456355 ns 35.060459 GFlops latency: 3.670070 ns :
smull_smlal_sadalp throughput: 0.937691 ns 34.126366 GFlops latency: 5.503775 ns :Test smull smlal dual issue
smull_smlal_sadalp_sep throughput: 0.456601 ns 35.041515 GFlops latency: 5.516257 ns :
ins_smlalq_sep_1_2 throughput: 0.586703 ns 27.271025 GFlops latency: 3.406196 ns :
ldx_ins_smlalq_sep throughput: 0.456225 ns 35.070450 GFlops latency: 3.439508 ns :
dupd_lane_smlal_s8 throughput: 0.456293 ns 35.065220 GFlops latency: 3.236840 ns :
ldd_mla_s16_lane_1_4_sep throughput: 0.458767 ns 34.876091 GFlops latency: 0.456275 ns :
ldrd_sshll throughput: 0.456251 ns 17.534225 GFlops latency: 0.456243 ns :
sshll_ins_sep throughput: 0.746300 ns 10.719557 GFlops latency: 2.080399 ns :
========== task ==========
taskid: 1143
receiver: jetson orin profiler