rel:: [[Software Engineering]]

# Performance

## Techniques

- [[Profiling]]
- [[Stochastic Profiling]]
- [[Method R]]
- [IPC Monitoring vs CPU%](x-devonthink-item://12C1EADD-88D5-4CCE-85AA-8F0870EA8DA6) ^c983f8
    - On [[Linux]], CPU% is computed as 100% minus idle%.
        - *The kernel keeps tabs of time devoted to the idle thread, and from there derives CPU% usage (100 – idle%).*
        - *If it stalls waiting on memory, the kernel still reports it as non-idle CPU% usage for as long as it is “running.” [...] an instruction could already be in the CPU pipeline but stalled waiting on memory for its operands. Or the CPU could stall waiting on memory to feed the instruction into the pipeline.*
    - Pipelined, superscalar CPUs can execute multiple instructions per cycle (IPC). A majority of x86 CPUs can hit 4 IPC, with the [newest models hitting 6](https://www.agner.org/optimize/microarchitecture.pdf).
    - Cache misses and branch mispredictions can leave execution capacity unused even though reported CPU% is high. E.g. on an x86 CPU with a peak of 4 IPC, an IPC of 2 means only 50% of the CPU's full capacity is utilized, even if CPU% usage is 95%.
- [which metrics to use](https://tratt.net/laurie/blog/2022/what_metric_to_use_when_benchmarking.html) ([devonthink](x-devonthink-item://0D5F2EB8-4144-4AAB-BF95-728B0F0D81D8))
- [5 performance myths](https://readwise.io/reader/shared/01gg1kry4ne1b9v57mver2y570)
- [[SIMD]]

## Resources

### Experience Reports

- [[Extreme HTTP Performance Tuning - 1.2M API req_s]]
- [Linux Kernel vs DPDK: HTTP Performance Showdown](x-devonthink-item://A50E8B5C-2C61-4565-A5E0-7F3EBFB86DBD)
    - [DPDK](https://en.wikipedia.org/wiki/Data_Plane_Development_Kit) is a kernel-bypass toolkit: data plane libraries and network interface controller drivers that process TCP packets in user space.
      This can be significantly faster than the kernel network stack because of:
        - aggressive busy polling
        - no copying of packet data between kernel and user space
        - thread-per-core architecture
    - DPDK downsides include:
        - the application takes over an entire interface, so a second interface is needed for control traffic
        - observability must be provided by the application itself
        - poll-mode processing keeps the CPU at 100% all the time
- [[Command-line Tools can be 235x Faster than your Hadoop Cluster]]
- [Memory Access Patterns and New Year’s Resolutions](x-devonthink-item://D79489FB-5FC2-4AEB-AA32-9992FF18A472)
- [How Fast Are Linux Pipes Anyway](x-devonthink-item://76FA6ED3-6018-4B08-B6A4-88A99089E44E)

## Historical Context

- [[IO Advancements Are Breaking Intuition on Performance]]
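
To make the IPC arithmetic from the "IPC Monitoring vs CPU%" note under Techniques concrete, here is a minimal sketch. It assumes a peak of 4 IPC (the figure quoted there for most x86 CPUs), and the function name `ipc_utilization` and the counter values are hypothetical. On Linux the instruction and cycle counts could come from something like `perf stat`, but nothing here depends on that.

```python
def ipc_utilization(instructions: int, cycles: int, peak_ipc: float = 4.0) -> float:
    """Return achieved IPC as a fraction of the assumed peak IPC.

    peak_ipc is an assumption about the microarchitecture (4 here),
    not something this function can measure.
    """
    if cycles <= 0 or peak_ipc <= 0:
        raise ValueError("cycles and peak_ipc must be positive")
    ipc = instructions / cycles
    return ipc / peak_ipc


# Hypothetical counter readings: 2 instructions retired per cycle.
instructions = 2_000_000_000
cycles = 1_000_000_000

util = ipc_utilization(instructions, cycles)
print(f"IPC = {instructions / cycles:.1f}, utilization = {util:.0%}")
# prints "IPC = 2.0, utilization = 50%"
```

The point of the sketch is that this 50% figure is independent of what the kernel reports: the same process could show 95% CPU usage while half of the core's issue capacity sits idle waiting on memory or mispredicted branches.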