========== onnx ==========
========== command ==========
trtexec --onnx=/tmp/datadir/jetson_orin_profiler/fVq2Mr.onnx --best
========== profiling ==========
&&&& RUNNING TensorRT.trtexec [TensorRT v8400] # trtexec --onnx=/tmp/datadir/jetson_orin_profiler/fVq2Mr.onnx --best
[10/09/2023-14:43:51] [I] === Model Options ===
[10/09/2023-14:43:51] [I] Format: ONNX
[10/09/2023-14:43:51] [I] Model: /tmp/datadir/jetson_orin_profiler/fVq2Mr.onnx
[10/09/2023-14:43:51] [I] Output:
[10/09/2023-14:43:51] [I] === Build Options ===
[10/09/2023-14:43:51] [I] Max batch: explicit batch
[10/09/2023-14:43:51] [I] Memory Pools: workspace: default, dlaSRAM: default, dlaLocalDRAM: default, dlaGlobalDRAM: default
[10/09/2023-14:43:51] [I] minTiming: 1
[10/09/2023-14:43:51] [I] avgTiming: 8
[10/09/2023-14:43:51] [I] Precision: FP32+FP16+INT8
[10/09/2023-14:43:51] [I] LayerPrecisions: 
[10/09/2023-14:43:51] [I] Calibration: Dynamic
[10/09/2023-14:43:51] [I] Refit: Disabled
[10/09/2023-14:43:51] [I] Sparsity: Disabled
[10/09/2023-14:43:51] [I] Safe mode: Disabled
[10/09/2023-14:43:51] [I] DirectIO mode: Disabled
[10/09/2023-14:43:51] [I] Restricted mode: Disabled
[10/09/2023-14:43:51] [I] Build only: Disabled
[10/09/2023-14:43:51] [I] Save engine: 
[10/09/2023-14:43:51] [I] Load engine: 
[10/09/2023-14:43:51] [I] Profiling verbosity: 0
[10/09/2023-14:43:51] [I] Tactic sources: Using default tactic sources
[10/09/2023-14:43:51] [I] timingCacheMode: local
[10/09/2023-14:43:51] [I] timingCacheFile: 
[10/09/2023-14:43:51] [I] Input(s)s format: fp32:CHW
[10/09/2023-14:43:51] [I] Output(s)s format: fp32:CHW
[10/09/2023-14:43:51] [I] Input build shapes: model
[10/09/2023-14:43:51] [I] Input calibration shapes: model
[10/09/2023-14:43:51] [I] === System Options ===
[10/09/2023-14:43:51] [I] Device: 0
[10/09/2023-14:43:51] [I] DLACore: 
[10/09/2023-14:43:51] [I] Plugins:
[10/09/2023-14:43:51] [I] === Inference Options ===
[10/09/2023-14:43:51] [I] Batch: Explicit
[10/09/2023-14:43:51] [I] Input inference shapes: model
[10/09/2023-14:43:51] [I] Iterations: 10
[10/09/2023-14:43:51] [I] Duration: 3s (+ 200ms warm up)
[10/09/2023-14:43:51] [I] Sleep time: 0ms
[10/09/2023-14:43:51] [I] Idle time: 0ms
[10/09/2023-14:43:51] [I] Streams: 1
[10/09/2023-14:43:51] [I] ExposeDMA: Disabled
[10/09/2023-14:43:51] [I] Data transfers: Enabled
[10/09/2023-14:43:51] [I] Spin-wait: Disabled
[10/09/2023-14:43:51] [I] Multithreading: Disabled
[10/09/2023-14:43:51] [I] CUDA Graph: Disabled
[10/09/2023-14:43:51] [I] Separate profiling: Disabled
[10/09/2023-14:43:51] [I] Time Deserialize: Disabled
[10/09/2023-14:43:51] [I] Time Refit: Disabled
[10/09/2023-14:43:51] [I] Inputs:
[10/09/2023-14:43:51] [I] === Reporting Options ===
[10/09/2023-14:43:51] [I] Verbose: Disabled
[10/09/2023-14:43:51] [I] Averages: 10 inferences
[10/09/2023-14:43:51] [I] Percentile: 99
[10/09/2023-14:43:51] [I] Dump refittable layers:Disabled
[10/09/2023-14:43:51] [I] Dump output: Disabled
[10/09/2023-14:43:51] [I] Profile: Disabled
[10/09/2023-14:43:51] [I] Export timing to JSON file: 
[10/09/2023-14:43:51] [I] Export output to JSON file: 
[10/09/2023-14:43:51] [I] Export profile to JSON file: 
[10/09/2023-14:43:51] [I] 
[10/09/2023-14:43:51] [I] === Device Information ===
[10/09/2023-14:43:51] [I] Selected Device: Orin
[10/09/2023-14:43:51] [I] Compute Capability: 8.7
[10/09/2023-14:43:51] [I] SMs: 16
[10/09/2023-14:43:51] [I] Compute Clock Rate: 1.3 GHz
[10/09/2023-14:43:51] [I] Device Global Memory: 30622 MiB
[10/09/2023-14:43:51] [I] Shared Memory per SM: 164 KiB
[10/09/2023-14:43:51] [I] Memory Bus Width: 128 bits (ECC disabled)
[10/09/2023-14:43:51] [I] Memory Clock Rate: 1.3 GHz
[10/09/2023-14:43:51] [I] 
[10/09/2023-14:43:51] [I] TensorRT version: 8.4.0
[10/09/2023-14:43:51] [I] [TRT] [MemUsageChange] Init CUDA: CPU +302, GPU +0, now: CPU 327, GPU 7995 (MiB)
[10/09/2023-14:43:53] [I] [TRT] [MemUsageChange] Init builder kernel library: CPU +403, GPU +534, now: CPU 749, GPU 8546 (MiB)
[10/09/2023-14:43:53] [I] Start parsing network model
[10/09/2023-14:43:53] [I] [TRT] ----------------------------------------------------------------
[10/09/2023-14:43:53] [I] [TRT] Input filename:   /tmp/datadir/jetson_orin_profiler/fVq2Mr.onnx
[10/09/2023-14:43:53] [I] [TRT] ONNX IR version:  0.0.6
[10/09/2023-14:43:53] [I] [TRT] Opset version:    11
[10/09/2023-14:43:53] [I] [TRT] Producer name:    pytorch
[10/09/2023-14:43:53] [I] [TRT] Producer version: 1.12.1
[10/09/2023-14:43:53] [I] [TRT] Domain:           
[10/09/2023-14:43:53] [I] [TRT] Model version:    0
[10/09/2023-14:43:53] [I] [TRT] Doc string:       
[10/09/2023-14:43:53] [I] [TRT] ----------------------------------------------------------------
[10/09/2023-14:43:53] [I] Finish parsing network model
[10/09/2023-14:43:53] [I] [TRT] ---------- Layers Running on DLA ----------
[10/09/2023-14:43:53] [I] [TRT] ---------- Layers Running on GPU ----------
[10/09/2023-14:43:53] [I] [TRT] [GpuLayer] CONVOLUTION: Conv_0 + Relu_1
[10/09/2023-14:43:53] [I] [TRT] [GpuLayer] CONVOLUTION: Conv_2 + Relu_3
[10/09/2023-14:43:53] [I] [TRT] [GpuLayer] CONVOLUTION: Conv_4 + Relu_5
[10/09/2023-14:43:53] [I] [TRT] [GpuLayer] POOLING: MaxPool_6
[10/09/2023-14:43:53] [I] [TRT] [GpuLayer] CONVOLUTION: Conv_7 + Relu_8
[10/09/2023-14:43:53] [I] [TRT] [GpuLayer] CONVOLUTION: Conv_9 + Add_10 + Relu_11
[10/09/2023-14:43:53] [I] [TRT] [GpuLayer] CONVOLUTION: Conv_12 + Relu_13
[10/09/2023-14:43:53] [I] [TRT] [GpuLayer] CONVOLUTION: Conv_14 + Add_15 + Relu_16
[10/09/2023-14:43:53] [I] [TRT] [GpuLayer] CONVOLUTION: Conv_17 + Relu_18
[10/09/2023-14:43:53] [I] [TRT] [GpuLayer] CONVOLUTION: Conv_77 + Relu_78
[10/09/2023-14:43:53] [I] [TRT] [GpuLayer] CONVOLUTION: Conv_19
[10/09/2023-14:43:53] [I] [TRT] [GpuLayer] CONVOLUTION: Conv_20 + Add_21 + Relu_22
[10/09/2023-14:43:53] [I] [TRT] [GpuLayer] CONVOLUTION: Conv_23 + Relu_24
[10/09/2023-14:43:53] [I] [TRT] [GpuLayer] CONVOLUTION: Conv_25 + Add_26 + Relu_27
[10/09/2023-14:43:53] [I] [TRT] [GpuLayer] CONVOLUTION: Conv_28 + Relu_29
[10/09/2023-14:43:53] [I] [TRT] [GpuLayer] CONVOLUTION: Conv_30
[10/09/2023-14:43:53] [I] [TRT] [GpuLayer] CONVOLUTION: Conv_31 + Add_32 + Relu_33
[10/09/2023-14:43:53] [I] [TRT] [GpuLayer] CONVOLUTION: Conv_34 + Relu_35
[10/09/2023-14:43:53] [I] [TRT] [GpuLayer] CONVOLUTION: Conv_36 + Add_37 + Relu_38
[10/09/2023-14:43:53] [I] [TRT] [GpuLayer] CONVOLUTION: Conv_39 + Relu_40
[10/09/2023-14:43:53] [I] [TRT] [GpuLayer] CONVOLUTION: Conv_41
[10/09/2023-14:43:53] [I] [TRT] [GpuLayer] CONVOLUTION: Conv_42 + Add_43 + Relu_44
[10/09/2023-14:43:53] [I] [TRT] [GpuLayer] CONVOLUTION: Conv_45 + Relu_46
[10/09/2023-14:43:53] [I] [TRT] [GpuLayer] CONVOLUTION: Conv_47 + Add_48 + Relu_49
[10/09/2023-14:43:53] [I] [TRT] [GpuLayer] POOLING: GlobalAveragePool_50
[10/09/2023-14:43:53] [I] [TRT] [GpuLayer] CONVOLUTION: Conv_60 + Relu_61
[10/09/2023-14:43:53] [I] [TRT] [GpuLayer] CONVOLUTION: Conv_62 + Relu_63
[10/09/2023-14:43:53] [I] [TRT] [GpuLayer] CONVOLUTION: Conv_66 + Relu_67
[10/09/2023-14:43:53] [I] [TRT] [GpuLayer] CONVOLUTION: Conv_70 + Relu_71
[10/09/2023-14:43:53] [I] [TRT] [GpuLayer] CONVOLUTION: Conv_51 + Relu_52
[10/09/2023-14:43:53] [I] [TRT] [GpuLayer] CONVOLUTION: Conv_64 + Relu_65
[10/09/2023-14:43:53] [I] [TRT] [GpuLayer] CONVOLUTION: Conv_68 + Relu_69
[10/09/2023-14:43:53] [I] [TRT] [GpuLayer] CONVOLUTION: Conv_72 + Relu_73
[10/09/2023-14:43:53] [I] [TRT] [GpuLayer] RESIZE: Resize_59
[10/09/2023-14:43:53] [I] [TRT] [GpuLayer] COPY: onnx::Concat_317 copy
[10/09/2023-14:43:53] [I] [TRT] [GpuLayer] CONVOLUTION: Conv_75 + Relu_76
[10/09/2023-14:43:53] [I] [TRT] [GpuLayer] RESIZE: Resize_82
[10/09/2023-14:43:53] [I] [TRT] [GpuLayer] COPY: onnx::Concat_360 copy
[10/09/2023-14:43:53] [I] [TRT] [GpuLayer] CONVOLUTION: Conv_84 + Relu_85
[10/09/2023-14:43:53] [I] [TRT] [GpuLayer] CONVOLUTION: Conv_86 + Relu_87
[10/09/2023-14:43:53] [I] [TRT] [GpuLayer] CONV_DEP_SEP: Conv_88 + Relu_89 + Conv_90 + Relu_91
[10/09/2023-14:43:53] [I] [TRT] [GpuLayer] CONVOLUTION: Conv_92
[10/09/2023-14:43:53] [I] [TRT] [GpuLayer] RESIZE: Resize_96
[10/09/2023-14:43:53] [I] [TRT] [GpuLayer] TOPK: ArgMax_97
[10/09/2023-14:43:54] [I] [TRT] [MemUsageChange] Init cuBLAS/cuBLASLt: CPU +534, GPU +779, now: CPU 1330, GPU 9415 (MiB)
[10/09/2023-14:43:54] [I] [TRT] [MemUsageChange] Init cuDNN: CPU +85, GPU +135, now: CPU 1415, GPU 9550 (MiB)
[10/09/2023-14:43:54] [I] [TRT] Local timing cache in use. Profiling results in this builder pass will not be stored.
[10/09/2023-14:49:53] [I] [TRT] Detected 1 inputs and 1 output network tensors.
[10/09/2023-14:49:53] [I] [TRT] Total Host Persistent Memory: 68768
[10/09/2023-14:49:53] [I] [TRT] Total Device Persistent Memory: 0
[10/09/2023-14:49:53] [I] [TRT] Total Scratch Memory: 8388608
[10/09/2023-14:49:53] [I] [TRT] [MemUsageStats] Peak memory usage of TRT CPU/GPU memory allocators: CPU 15 MiB, GPU 10504 MiB
[10/09/2023-14:49:53] [I] [TRT] [BlockAssignment] Algorithm ShiftNTopDown took 1.44971ms to assign 7 blocks to 41 nodes requiring 23069184 bytes.
[10/09/2023-14:49:53] [I] [TRT] Total Activation Memory: 23069184
[10/09/2023-14:49:53] [I] [TRT] [MemUsageChange] TensorRT-managed allocation in building engine: CPU +12, GPU +16, now: CPU 12, GPU 16 (MiB)
[10/09/2023-14:49:53] [I] Engine built in 362.216 sec.
[10/09/2023-14:49:53] [I] [TRT] [MemUsageChange] Init CUDA: CPU +0, GPU +0, now: CPU 1622, GPU 10512 (MiB)
[10/09/2023-14:49:53] [I] [TRT] Loaded engine size: 13 MiB
[10/09/2023-14:49:53] [I] [TRT] [MemUsageChange] TensorRT-managed allocation in engine deserialization: CPU +0, GPU +11, now: CPU 0, GPU 11 (MiB)
[10/09/2023-14:49:53] [I] Engine deserialized in 0.0166877 sec.
[10/09/2023-14:49:53] [I] [TRT] [MemUsageChange] TensorRT-managed allocation in IExecutionContext creation: CPU +0, GPU +22, now: CPU 0, GPU 33 (MiB)
[10/09/2023-14:49:53] [I] Using random values for input input
[10/09/2023-14:49:53] [I] Created input binding for input with dimensions 1x3x512x512
[10/09/2023-14:49:53] [I] Using random values for output output
[10/09/2023-14:49:53] [I] Created output binding for output with dimensions 1x1x512x512
[10/09/2023-14:49:53] [I] Starting inference
[10/09/2023-14:49:56] [I] Warmup completed 62 queries over 200 ms
[10/09/2023-14:49:56] [I] Timing trace has 1090 queries over 3.00721 s
[10/09/2023-14:49:56] [I] 
[10/09/2023-14:49:56] [I] === Trace details ===
[10/09/2023-14:49:56] [I] Trace averages of 10 runs:
[10/09/2023-14:49:56] [I] Average on 10 runs - GPU latency: 2.90507 ms - Host latency: 3.1623 ms (enqueue 0.416463 ms)
[10/09/2023-14:49:56] [I] Average on 10 runs - GPU latency: 2.90359 ms - Host latency: 3.1557 ms (enqueue 0.434381 ms)
[10/09/2023-14:49:56] [I] Average on 10 runs - GPU latency: 2.9075 ms - Host latency: 3.16463 ms (enqueue 0.418427 ms)
[10/09/2023-14:49:56] [I] Average on 10 runs - GPU latency: 2.90257 ms - Host latency: 3.15648 ms (enqueue 0.438312 ms)
[10/09/2023-14:49:56] [I] Average on 10 runs - GPU latency: 2.90617 ms - Host latency: 3.1637 ms (enqueue 0.417685 ms)
[10/09/2023-14:49:56] [I] Average on 10 runs - GPU latency: 2.90267 ms - Host latency: 3.15533 ms (enqueue 0.450476 ms)
[10/09/2023-14:49:56] [I] Average on 10 runs - GPU latency: 2.91083 ms - Host latency: 3.16675 ms (enqueue 0.424933 ms)
[10/09/2023-14:49:56] [I] Average on 10 runs - GPU latency: 2.90709 ms - Host latency: 3.15871 ms (enqueue 0.425381 ms)
[10/09/2023-14:49:56] [I] Average on 10 runs - GPU latency: 2.90696 ms - Host latency: 3.16137 ms (enqueue 0.425742 ms)
[10/09/2023-14:49:56] [I] Average on 10 runs - GPU latency: 2.91012 ms - Host latency: 3.16513 ms (enqueue 0.415768 ms)
[10/09/2023-14:49:56] [I] Average on 10 runs - GPU latency: 2.90859 ms - Host latency: 3.16026 ms (enqueue 0.416815 ms)
[10/09/2023-14:49:56] [I] Average on 10 runs - GPU latency: 2.91745 ms - Host latency: 3.17538 ms (enqueue 0.416156 ms)
[10/09/2023-14:49:56] [I] Average on 10 runs - GPU latency: 2.91884 ms - Host latency: 3.17151 ms (enqueue 0.405414 ms)
[10/09/2023-14:49:56] [I] Average on 10 runs - GPU latency: 2.91745 ms - Host latency: 3.17446 ms (enqueue 0.373779 ms)
[10/09/2023-14:49:56] [I] Average on 10 runs - GPU latency: 2.91771 ms - Host latency: 3.1718 ms (enqueue 0.353784 ms)
[10/09/2023-14:49:56] [I] Average on 10 runs - GPU latency: 2.91638 ms - Host latency: 3.17329 ms (enqueue 0.402789 ms)
[10/09/2023-14:49:56] [I] Average on 10 runs - GPU latency: 2.91426 ms - Host latency: 3.16848 ms (enqueue 0.411945 ms)
[10/09/2023-14:49:56] [I] Average on 10 runs - GPU latency: 2.87631 ms - Host latency: 3.12305 ms (enqueue 0.385657 ms)
[10/09/2023-14:49:56] [I] Average on 10 runs - GPU latency: 2.72517 ms - Host latency: 2.90593 ms (enqueue 0.413159 ms)
[10/09/2023-14:49:56] [I] Average on 10 runs - GPU latency: 2.7231 ms - Host latency: 2.90192 ms (enqueue 0.399103 ms)
[10/09/2023-14:49:56] [I] Average on 10 runs - GPU latency: 2.72532 ms - Host latency: 2.90799 ms (enqueue 0.408411 ms)
[10/09/2023-14:49:56] [I] Average on 10 runs - GPU latency: 2.72289 ms - Host latency: 2.90276 ms (enqueue 0.404736 ms)
[10/09/2023-14:49:56] [I] Average on 10 runs - GPU latency: 2.7274 ms - Host latency: 2.90708 ms (enqueue 0.400537 ms)
[10/09/2023-14:49:56] [I] Average on 10 runs - GPU latency: 2.72813 ms - Host latency: 2.90961 ms (enqueue 0.364954 ms)
[10/09/2023-14:49:56] [I] Average on 10 runs - GPU latency: 2.72476 ms - Host latency: 2.90514 ms (enqueue 0.324939 ms)
[10/09/2023-14:49:56] [I] Average on 10 runs - GPU latency: 2.7298 ms - Host latency: 2.91313 ms (enqueue 0.357587 ms)
[10/09/2023-14:49:56] [I] Average on 10 runs - GPU latency: 2.7288 ms - Host latency: 2.90847 ms (enqueue 0.322028 ms)
[10/09/2023-14:49:56] [I] Average on 10 runs - GPU latency: 2.72578 ms - Host latency: 2.91083 ms (enqueue 0.364374 ms)
[10/09/2023-14:49:56] [I] Average on 10 runs - GPU latency: 2.72618 ms - Host latency: 2.91013 ms (enqueue 0.377563 ms)
[10/09/2023-14:49:56] [I] Average on 10 runs - GPU latency: 2.72821 ms - Host latency: 2.90884 ms (enqueue 0.414325 ms)
[10/09/2023-14:49:56] [I] Average on 10 runs - GPU latency: 2.72363 ms - Host latency: 2.90326 ms (enqueue 0.399463 ms)
[10/09/2023-14:49:56] [I] Average on 10 runs - GPU latency: 2.72731 ms - Host latency: 2.90835 ms (enqueue 0.401697 ms)
[10/09/2023-14:49:56] [I] Average on 10 runs - GPU latency: 2.7262 ms - Host latency: 2.90316 ms (enqueue 0.401685 ms)
[10/09/2023-14:49:56] [I] Average on 10 runs - GPU latency: 2.7264 ms - Host latency: 2.90647 ms (enqueue 0.414526 ms)
[10/09/2023-14:49:56] [I] Average on 10 runs - GPU latency: 2.72576 ms - Host latency: 2.90641 ms (enqueue 0.401624 ms)
[10/09/2023-14:49:56] [I] Average on 10 runs - GPU latency: 2.72671 ms - Host latency: 2.90752 ms (enqueue 0.3901 ms)
[10/09/2023-14:49:56] [I] Average on 10 runs - GPU latency: 2.72703 ms - Host latency: 2.90739 ms (enqueue 0.414172 ms)
[10/09/2023-14:49:56] [I] Average on 10 runs - GPU latency: 2.72511 ms - Host latency: 2.90651 ms (enqueue 0.414685 ms)
[10/09/2023-14:49:56] [I] Average on 10 runs - GPU latency: 2.72607 ms - Host latency: 2.90415 ms (enqueue 0.405457 ms)
[10/09/2023-14:49:56] [I] Average on 10 runs - GPU latency: 2.72124 ms - Host latency: 2.90184 ms (enqueue 0.411145 ms)
[10/09/2023-14:49:56] [I] Average on 10 runs - GPU latency: 2.72335 ms - Host latency: 2.90562 ms (enqueue 0.403052 ms)
[10/09/2023-14:49:56] [I] Average on 10 runs - GPU latency: 2.72317 ms - Host latency: 2.90028 ms (enqueue 0.40033 ms)
[10/09/2023-14:49:56] [I] Average on 10 runs - GPU latency: 2.72258 ms - Host latency: 2.90234 ms (enqueue 0.414868 ms)
[10/09/2023-14:49:56] [I] Average on 10 runs - GPU latency: 2.72322 ms - Host latency: 2.90375 ms (enqueue 0.400281 ms)
[10/09/2023-14:49:56] [I] Average on 10 runs - GPU latency: 2.72747 ms - Host latency: 2.90956 ms (enqueue 0.409656 ms)
[10/09/2023-14:49:56] [I] Average on 10 runs - GPU latency: 2.72385 ms - Host latency: 2.90184 ms (enqueue 0.398645 ms)
[10/09/2023-14:49:56] [I] Average on 10 runs - GPU latency: 2.72313 ms - Host latency: 2.90262 ms (enqueue 0.399622 ms)
[10/09/2023-14:49:56] [I] Average on 10 runs - GPU latency: 2.72268 ms - Host latency: 2.90144 ms (enqueue 0.416138 ms)
[10/09/2023-14:49:56] [I] Average on 10 runs - GPU latency: 2.72307 ms - Host latency: 2.90114 ms (enqueue 0.393555 ms)
[10/09/2023-14:49:56] [I] Average on 10 runs - GPU latency: 2.72283 ms - Host latency: 2.90148 ms (enqueue 0.411365 ms)
[10/09/2023-14:49:56] [I] Average on 10 runs - GPU latency: 2.72465 ms - Host latency: 2.9046 ms (enqueue 0.403308 ms)
[10/09/2023-14:49:56] [I] Average on 10 runs - GPU latency: 2.72552 ms - Host latency: 2.90345 ms (enqueue 0.400049 ms)
[10/09/2023-14:49:56] [I] Average on 10 runs - GPU latency: 2.72345 ms - Host latency: 2.90293 ms (enqueue 0.398767 ms)
[10/09/2023-14:49:56] [I] Average on 10 runs - GPU latency: 2.72509 ms - Host latency: 2.90634 ms (enqueue 0.410291 ms)
[10/09/2023-14:49:56] [I] Average on 10 runs - GPU latency: 2.72419 ms - Host latency: 2.90126 ms (enqueue 0.39884 ms)
[10/09/2023-14:49:56] [I] Average on 10 runs - GPU latency: 2.72383 ms - Host latency: 2.90428 ms (enqueue 0.394775 ms)
[10/09/2023-14:49:56] [I] Average on 10 runs - GPU latency: 2.72299 ms - Host latency: 2.90341 ms (enqueue 0.412415 ms)
[10/09/2023-14:49:56] [I] Average on 10 runs - GPU latency: 2.72114 ms - Host latency: 2.90051 ms (enqueue 0.393494 ms)
[10/09/2023-14:49:56] [I] Average on 10 runs - GPU latency: 2.7235 ms - Host latency: 2.9048 ms (enqueue 0.412061 ms)
[10/09/2023-14:49:56] [I] Average on 10 runs - GPU latency: 2.72556 ms - Host latency: 2.90575 ms (enqueue 0.401501 ms)
[10/09/2023-14:49:56] [I] Average on 10 runs - GPU latency: 2.72245 ms - Host latency: 2.9038 ms (enqueue 0.413293 ms)
[10/09/2023-14:49:56] [I] Average on 10 runs - GPU latency: 2.72371 ms - Host latency: 2.90236 ms (enqueue 0.402747 ms)
[10/09/2023-14:49:56] [I] Average on 10 runs - GPU latency: 2.72251 ms - Host latency: 2.90408 ms (enqueue 0.407153 ms)
[10/09/2023-14:49:56] [I] Average on 10 runs - GPU latency: 2.72361 ms - Host latency: 2.90463 ms (enqueue 0.423059 ms)
[10/09/2023-14:49:56] [I] Average on 10 runs - GPU latency: 2.72444 ms - Host latency: 2.90293 ms (enqueue 0.395801 ms)
[10/09/2023-14:49:56] [I] Average on 10 runs - GPU latency: 2.72235 ms - Host latency: 2.90092 ms (enqueue 0.41676 ms)
[10/09/2023-14:49:56] [I] Average on 10 runs - GPU latency: 2.72723 ms - Host latency: 2.90797 ms (enqueue 0.396533 ms)
[10/09/2023-14:49:56] [I] Average on 10 runs - GPU latency: 2.72432 ms - Host latency: 2.90239 ms (enqueue 0.398633 ms)
[10/09/2023-14:49:56] [I] Average on 10 runs - GPU latency: 2.72578 ms - Host latency: 2.90315 ms (enqueue 0.400977 ms)
[10/09/2023-14:49:56] [I] Average on 10 runs - GPU latency: 2.72141 ms - Host latency: 2.90327 ms (enqueue 0.414038 ms)
[10/09/2023-14:49:56] [I] Average on 10 runs - GPU latency: 2.72839 ms - Host latency: 2.90642 ms (enqueue 0.400439 ms)
[10/09/2023-14:49:56] [I] Average on 10 runs - GPU latency: 2.72566 ms - Host latency: 2.90493 ms (enqueue 0.395142 ms)
[10/09/2023-14:49:56] [I] Average on 10 runs - GPU latency: 2.72695 ms - Host latency: 2.90674 ms (enqueue 0.401538 ms)
[10/09/2023-14:49:56] [I] Average on 10 runs - GPU latency: 2.72703 ms - Host latency: 2.90393 ms (enqueue 0.398169 ms)
[10/09/2023-14:49:56] [I] Average on 10 runs - GPU latency: 2.72717 ms - Host latency: 2.90852 ms (enqueue 0.409937 ms)
[10/09/2023-14:49:56] [I] Average on 10 runs - GPU latency: 2.7249 ms - Host latency: 2.9041 ms (enqueue 0.39729 ms)
[10/09/2023-14:49:56] [I] Average on 10 runs - GPU latency: 2.72429 ms - Host latency: 2.90579 ms (enqueue 0.412842 ms)
[10/09/2023-14:49:56] [I] Average on 10 runs - GPU latency: 2.72502 ms - Host latency: 2.90471 ms (enqueue 0.399365 ms)
[10/09/2023-14:49:56] [I] Average on 10 runs - GPU latency: 2.72319 ms - Host latency: 2.90342 ms (enqueue 0.399414 ms)
[10/09/2023-14:49:56] [I] Average on 10 runs - GPU latency: 2.72314 ms - Host latency: 2.90562 ms (enqueue 0.418555 ms)
[10/09/2023-14:49:56] [I] Average on 10 runs - GPU latency: 2.72461 ms - Host latency: 2.90591 ms (enqueue 0.33313 ms)
[10/09/2023-14:49:56] [I] Average on 10 runs - GPU latency: 2.72988 ms - Host latency: 2.9114 ms (enqueue 0.361914 ms)
[10/09/2023-14:49:56] [I] Average on 10 runs - GPU latency: 2.72549 ms - Host latency: 2.9095 ms (enqueue 0.360474 ms)
[10/09/2023-14:49:56] [I] Average on 10 runs - GPU latency: 2.72043 ms - Host latency: 2.90259 ms (enqueue 0.421533 ms)
[10/09/2023-14:49:56] [I] Average on 10 runs - GPU latency: 2.72314 ms - Host latency: 2.90029 ms (enqueue 0.400269 ms)
[10/09/2023-14:49:56] [I] Average on 10 runs - GPU latency: 2.72419 ms - Host latency: 2.90427 ms (enqueue 0.412036 ms)
[10/09/2023-14:49:56] [I] Average on 10 runs - GPU latency: 2.72695 ms - Host latency: 2.90618 ms (enqueue 0.396558 ms)
[10/09/2023-14:49:56] [I] Average on 10 runs - GPU latency: 2.72493 ms - Host latency: 2.90625 ms (enqueue 0.413403 ms)
[10/09/2023-14:49:56] [I] Average on 10 runs - GPU latency: 2.72527 ms - Host latency: 2.90703 ms (enqueue 0.40354 ms)
[10/09/2023-14:49:56] [I] Average on 10 runs - GPU latency: 2.72546 ms - Host latency: 2.90757 ms (enqueue 0.401294 ms)
[10/09/2023-14:49:56] [I] Average on 10 runs - GPU latency: 2.72673 ms - Host latency: 2.90671 ms (enqueue 0.413403 ms)
[10/09/2023-14:49:56] [I] Average on 10 runs - GPU latency: 2.72415 ms - Host latency: 2.90442 ms (enqueue 0.398706 ms)
[10/09/2023-14:49:56] [I] Average on 10 runs - GPU latency: 2.72507 ms - Host latency: 2.90488 ms (enqueue 0.415356 ms)
[10/09/2023-14:49:56] [I] Average on 10 runs - GPU latency: 2.72952 ms - Host latency: 2.90859 ms (enqueue 0.394995 ms)
[10/09/2023-14:49:56] [I] Average on 10 runs - GPU latency: 2.72417 ms - Host latency: 2.90459 ms (enqueue 0.398315 ms)
[10/09/2023-14:49:56] [I] Average on 10 runs - GPU latency: 2.72478 ms - Host latency: 2.90437 ms (enqueue 0.39978 ms)
[10/09/2023-14:49:56] [I] Average on 10 runs - GPU latency: 2.72388 ms - Host latency: 2.90344 ms (enqueue 0.406934 ms)
[10/09/2023-14:49:56] [I] Average on 10 runs - GPU latency: 2.72383 ms - Host latency: 2.90427 ms (enqueue 0.405688 ms)
[10/09/2023-14:49:56] [I] Average on 10 runs - GPU latency: 2.72639 ms - Host latency: 2.90825 ms (enqueue 0.409619 ms)
[10/09/2023-14:49:56] [I] Average on 10 runs - GPU latency: 2.7217 ms - Host latency: 2.90139 ms (enqueue 0.402881 ms)
[10/09/2023-14:49:56] [I] Average on 10 runs - GPU latency: 2.72532 ms - Host latency: 2.90596 ms (enqueue 0.397681 ms)
[10/09/2023-14:49:56] [I] Average on 10 runs - GPU latency: 2.72151 ms - Host latency: 2.90239 ms (enqueue 0.403076 ms)
[10/09/2023-14:49:56] [I] Average on 10 runs - GPU latency: 2.72319 ms - Host latency: 2.90249 ms (enqueue 0.401807 ms)
[10/09/2023-14:49:56] [I] Average on 10 runs - GPU latency: 2.72466 ms - Host latency: 2.90664 ms (enqueue 0.417407 ms)
[10/09/2023-14:49:56] [I] Average on 10 runs - GPU latency: 2.72554 ms - Host latency: 2.9032 ms (enqueue 0.399731 ms)
[10/09/2023-14:49:56] [I] Average on 10 runs - GPU latency: 2.72524 ms - Host latency: 2.90249 ms (enqueue 0.402832 ms)
[10/09/2023-14:49:56] [I] Average on 10 runs - GPU latency: 2.72507 ms - Host latency: 2.90859 ms (enqueue 0.41062 ms)
[10/09/2023-14:49:56] [I] Average on 10 runs - GPU latency: 2.72581 ms - Host latency: 2.90664 ms (enqueue 0.396826 ms)
[10/09/2023-14:49:56] [I] Average on 10 runs - GPU latency: 2.72693 ms - Host latency: 2.90413 ms (enqueue 0.40542 ms)
[10/09/2023-14:49:56] [I] 
[10/09/2023-14:49:56] [I] === Performance summary ===
[10/09/2023-14:49:56] [I] Throughput: 362.462 qps
[10/09/2023-14:49:56] [I] Latency: min = 2.88501 ms, max = 3.21155 ms, mean = 2.94758 ms, median = 2.90625 ms, percentile(99%) = 3.18567 ms
[10/09/2023-14:49:56] [I] Enqueue Time: min = 0.203857 ms, max = 0.647705 ms, mean = 0.401836 ms, median = 0.349258 ms, percentile(99%) = 0.624023 ms
[10/09/2023-14:49:56] [I] H2D Latency: min = 0.107422 ms, max = 0.196655 ms, mean = 0.127499 ms, median = 0.122742 ms, percentile(99%) = 0.17981 ms
[10/09/2023-14:49:56] [I] GPU Compute Time: min = 2.70557 ms, max = 2.9538 ms, mean = 2.75517 ms, median = 2.72681 ms, percentile(99%) = 2.92682 ms
[10/09/2023-14:49:56] [I] D2H Latency: min = 0.0327148 ms, max = 0.0891724 ms, mean = 0.0649182 ms, median = 0.0617676 ms, percentile(99%) = 0.0858765 ms
[10/09/2023-14:49:56] [I] Total Host Walltime: 3.00721 s
[10/09/2023-14:49:56] [I] Total GPU Compute Time: 3.00313 s
[10/09/2023-14:49:56] [I] Explanations of the performance metrics are printed in the verbose logs.
[10/09/2023-14:49:56] [I] 
&&&& PASSED TensorRT.trtexec [TensorRT v8400] # trtexec --onnx=/tmp/datadir/jetson_orin_profiler/fVq2Mr.onnx --best
[10/09/2023-14:43:53] [W] [TRT] onnx2trt_utils.cpp:363: Your ONNX model has been generated with INT64 weights, while TensorRT does not natively support INT64. Attempting to cast down to INT32.
[10/09/2023-14:43:53] [W] [TRT] Tensor DataType is determined at build time for tensors not marked as input or output.
[10/09/2023-14:43:53] [W] [TRT] Calibrator is not being used. Users must provide dynamic range for all tensors that are not Int32 or Bool.
[10/09/2023-14:49:56] [W] * GPU compute time is unstable, with coefficient of variance = 2.50917%.
[10/09/2023-14:49:56] [W]   If not already in use, locking GPU clock frequency or adding --useSpinWait may improve the stability.
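
The summary figures above can be cross-checked against each other. The sketch below is a hypothetical helper, not part of trtexec: it parses three lines copied from this log and recomputes throughput as queries divided by host walltime, and total GPU compute time as queries times mean GPU compute time. The name `parse_summary` and the regular expressions are assumptions made for illustration.

```python
import re

# Lines copied verbatim from the log above.
LOG = """\
[10/09/2023-14:49:56] [I] Timing trace has 1090 queries over 3.00721 s
[10/09/2023-14:49:56] [I] Throughput: 362.462 qps
[10/09/2023-14:49:56] [I] GPU Compute Time: min = 2.70557 ms, max = 2.9538 ms, mean = 2.75517 ms, median = 2.72681 ms, percentile(99%) = 2.92682 ms
"""


def parse_summary(log: str) -> dict:
    """Extract query count, walltime, throughput and mean GPU time (hypothetical helper)."""
    queries, seconds = re.search(
        r"Timing trace has (\d+) queries over ([\d.]+) s", log
    ).groups()
    throughput = re.search(r"Throughput: ([\d.]+) qps", log).group(1)
    gpu_mean = re.search(r"GPU Compute Time:.*?mean = ([\d.]+) ms", log).group(1)
    return {
        "queries": int(queries),
        "seconds": float(seconds),
        "throughput_qps": float(throughput),
        "gpu_mean_ms": float(gpu_mean),
    }


summary = parse_summary(LOG)

# Throughput should be close to queries / host walltime.
derived_qps = summary["queries"] / summary["seconds"]

# Total GPU compute time should be close to queries * mean GPU compute time.
derived_gpu_total_s = summary["queries"] * summary["gpu_mean_ms"] / 1000.0

print(f"derived throughput: {derived_qps:.3f} qps")
print(f"derived total GPU compute time: {derived_gpu_total_s:.5f} s")
```

Both derived values should agree with the reported `Throughput` and `Total GPU Compute Time` lines to within rounding.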

========== megpeak CPU perf ==========
there are 12 cores, currently use core id :0
Vendor is: ARM, uArch: unknown, frequency: 0Hz

bandwidth: 18.162656 Gbps
nop throughput: 0.051038 ns 19.593052 GFlops latency: 0.063943 ns :
ldd throughput: 0.156822 ns 12.753348 GFlops latency: 0.155758 ns :
ldq throughput: 0.158044 ns 25.309477 GFlops latency: 0.164016 ns :
stq throughput: 0.234704 ns 17.042721 GFlops latency: 0.232976 ns :
ldpq throughput: 0.311298 ns 25.698837 GFlops latency: 0.310349 ns :
lddx2 throughput: 0.233646 ns 17.119894 GFlops latency: 0.233240 ns :
ld1q throughput: 0.154976 ns 25.810520 GFlops latency: 0.158148 ns :
eor throughput: 0.233010 ns 17.166622 GFlops latency: 0.932019 ns :
fmla throughput: 0.233660 ns 34.237728 GFlops latency: 1.862553 ns :
fmlad throughput: 0.233286 ns 17.146313 GFlops latency: 1.866461 ns :
fmla_x2 throughput: 0.461863 ns 34.642342 GFlops latency: 3.736033 ns :
mla throughput: 0.466095 ns 17.163897 GFlops latency: 1.864779 ns :
fmul throughput: 0.232068 ns 17.236309 GFlops latency: 1.400778 ns :
mul throughput: 0.467611 ns 8.554125 GFlops latency: 1.867239 ns :
addp throughput: 0.233756 ns 17.111837 GFlops latency: 0.931935 ns :
sadalp throughput: 0.467289 ns 8.560019 GFlops latency: 1.908405 ns :
add throughput: 0.234036 ns 17.091366 GFlops latency: 0.934213 ns :
fadd throughput: 0.232060 ns 17.236898 GFlops latency: 0.931553 ns :
smull throughput: 0.465611 ns 8.590869 GFlops latency: 1.863507 ns :
smlal_4b throughput: 0.462315 ns 17.304235 GFlops latency: 1.862949 ns :
smlal_8b throughput: 0.465971 ns 34.336929 GFlops latency: 1.864434 ns :
dupd_lane_s8 throughput: 0.249114 ns 32.113747 GFlops latency: 0.944063 ns :
mlaq_lane_s16 throughput: 0.466453 ns 34.301445 GFlops latency: 1.868227 ns :
sshll throughput: 0.465749 ns 17.176647 GFlops latency: 0.931611 ns :
tbl throughput: 0.235838 ns 67.843071 GFlops latency: 0.931875 ns :
ins throughput: 0.465695 ns 4.294660 GFlops latency: 1.133675 ns :
sqrdmulh throughput: 0.466977 ns 8.565740 GFlops latency: 1.865243 ns :
usubl throughput: 0.232038 ns 17.238537 GFlops latency: 0.945231 ns :
abs throughput: 0.234244 ns 17.076187 GFlops latency: 0.932531 ns :
fcvtzs throughput: 0.931721 ns 4.293129 GFlops latency: 1.867801 ns :
scvtf throughput: 0.931455 ns 4.294355 GFlops latency: 1.857579 ns :
fcvtns throughput: 0.932677 ns 4.288729 GFlops latency: 1.867835 ns :
fcvtms throughput: 0.958486 ns 4.173250 GFlops latency: 1.864045 ns :
fcvtps throughput: 0.928809 ns 4.306590 GFlops latency: 1.864524 ns :
fcvtas throughput: 0.931749 ns 4.293001 GFlops latency: 1.870157 ns :
fcvtn throughput: 0.936997 ns 4.268956 GFlops latency: 1.866197 ns :
fcvtl throughput: 0.933835 ns 4.283411 GFlops latency: 1.864444 ns :
prefetch_very_long throughput: 13.518499 ns 0.295891 GFlops latency: 0.156454 ns :
ins_ldd throughput: 0.470285 ns 4.252743 GFlops latency: 0.467163 ns :Test ldd ins dual issue
ldd_ldx_ins throughput: 1.126971 ns 3.549336 GFlops latency: 0.468377 ns :
ldqstq throughput: 2.956711 ns 1.352854 GFlops latency: 2.997606 ns :Test ldq stq dual issue
ldq_fmlaq throughput: 0.232492 ns 34.409740 GFlops latency: 0.233272 ns :
stq_fmlaq_lane throughput: 0.310987 ns 25.724541 GFlops latency: 2.331987 ns :Test stq fmlaq_lane dual issue
ldd_fmlad throughput: 0.233794 ns 17.109056 GFlops latency: 0.232268 ns :Test ldd fmlad dual issue
ldq_fmlaq_sep throughput: 0.232920 ns 34.346581 GFlops latency: 1.863248 ns :Test throughput ldq + 2 x fmlaq
ldq_fmlaq_lane_sep throughput: 0.233372 ns 34.279984 GFlops latency: 2.344309 ns :Test compute throughput ldq + 2 x fmlaq_lane
ldd_fmlaq_sep throughput: 0.232732 ns 34.374252 GFlops latency: 1.862347 ns :Test compute throughput ldq + fmlaq
lds_fmlaq_lane_sep throughput: 0.231064 ns 34.622395 GFlops latency: 2.328885 ns :
ldd_fmlaq_lane_sep throughput: 0.233898 ns 34.202896 GFlops latency: 2.328241 ns :Test compute throughput ldd + fmlaq_lane
ldx_fmlaq_lane_sep throughput: 0.232600 ns 34.393761 GFlops latency: 2.330561 ns :
ldd_ldx_ins_fmlaq_lane_sep throughput: 0.394026 ns 20.303207 GFlops latency: 2.328663 ns :Test compute throughput ldd+fmlaq+ldx+fmlaq+ins+fmlaq
ldd_nop_ldx_ins_fmlaq_lane_sep throughput: 0.354759 ns 22.550552 GFlops latency: 2.331833 ns :
ins_fmlaq_lane_1_4_sep throughput: 0.409954 ns 19.514381 GFlops latency: 3.771058 ns :Test compute throughput ins + 4 x fmlaq_lane
ldd_fmlaq_lane_1_4_sep throughput: 0.233190 ns 34.306808 GFlops latency: 2.337683 ns :Test compute throughput ldd + 4 x fmlaq_lane
ldq_fmlaq_lane_1_4_sep throughput: 0.233305 ns 34.289913 GFlops latency: 0.229988 ns :Test compute throughput ldq + 4 x fmlaq_lane
ins_fmlaq_lane_1_3_sep throughput: 0.427551 ns 18.711222 GFlops latency: 3.767305 ns :Test compute throughput ins + 3 x fmlaq_lane
ldd_fmlaq_lane_1_3_sep throughput: 0.393252 ns 20.343193 GFlops latency: 3.820154 ns :
ldq_fmlaq_lane_1_3_sep throughput: 0.233680 ns 34.234901 GFlops latency: 0.232836 ns :Test compute throughput ldq + 3 x fmlaq_lane
ldq_fmlaq_lane_1_2_sep throughput: 0.232827 ns 34.360226 GFlops latency: 0.232666 ns :Test compute throughput ldq + 2 x fmlaq_lane
ins_fmlaq_lane_sep throughput: 1.190536 ns 6.719664 GFlops latency: 2.328707 ns :
dupd_fmlaq_lane_sep throughput: 0.698771 ns 11.448673 GFlops latency: 2.327169 ns :
smlal_8b_addp throughput: 0.468809 ns 34.129063 GFlops latency: 3.263544 ns :
smlal_8b_dupd throughput: 0.466561 ns 34.293507 GFlops latency: 1.866995 ns :
ldd_smlalq_sep_8b throughput: 0.462229 ns 34.614906 GFlops latency: 0.461599 ns :Test ldd smlalq dual issue
ldq_smlalq_sep throughput: 0.458745 ns 34.877800 GFlops latency: 0.456417 ns :Test ldq smlalq dual issue
lddx2_smlalq_sep throughput: 0.456300 ns 35.064613 GFlops latency: 0.456667 ns :
smlal_sadalp throughput: 0.456355 ns 35.060459 GFlops latency: 3.670070 ns :
smull_smlal_sadalp throughput: 0.937691 ns 34.126366 GFlops latency: 5.503775 ns :Test smull smlal dual issue
smull_smlal_sadalp_sep throughput: 0.456601 ns 35.041515 GFlops latency: 5.516257 ns :
ins_smlalq_sep_1_2 throughput: 0.586703 ns 27.271025 GFlops latency: 3.406196 ns :
ldx_ins_smlalq_sep throughput: 0.456225 ns 35.070450 GFlops latency: 3.439508 ns :
dupd_lane_smlal_s8 throughput: 0.456293 ns 35.065220 GFlops latency: 3.236840 ns :
ldd_mla_s16_lane_1_4_sep throughput: 0.458767 ns 34.876091 GFlops latency: 0.456275 ns :
ldrd_sshll throughput: 0.456251 ns 17.534225 GFlops latency: 0.456243 ns :
sshll_ins_sep throughput: 0.746300 ns 10.719557 GFlops latency: 2.080399 ns :
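
The GFlops column in the table above appears to be the number of operations credited to one instruction divided by its throughput in nanoseconds. The sketch below checks that reading on three rows. The per-instruction operation counts (1 for `nop`, 8 for `fmla`, 4 for `fmlad`) are assumptions inferred from the numbers, not taken from megpeak documentation.

```python
# (instruction, throughput in ns, reported GFlops, ASSUMED ops per instruction)
ROWS = [
    ("nop", 0.051038, 19.593052, 1),
    ("fmla", 0.233660, 34.237728, 8),   # assumed: 4 float lanes x (mul + add)
    ("fmlad", 0.233286, 17.146313, 4),  # assumed: 2 double lanes x (mul + add)
]


def gflops(ops_per_instruction: int, throughput_ns: float) -> float:
    """Operations per nanosecond equals giga-operations per second."""
    return ops_per_instruction / throughput_ns


# Relative error between the recomputed and the reported value for each row.
rel_errors = {
    name: abs(gflops(ops, ns) - reported) / reported
    for name, ns, reported, ops in ROWS
}

for name, err in rel_errors.items():
    print(f"{name}: relative error {err:.2e}")
```

Small residual differences are expected because the throughput values in the log are rounded to six decimal places.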

========== task ==========
taskid: 1143
receiver: jetson orin profiler