We’re excited to introduce Unified Profiling for PySpark User-Defined Functions (UDFs) as part of Databricks Runtime 17.0 (release notes). Unified Profiling for PySpark UDFs lets developers profile the performance and memory usage of their PySpark UDFs, including tracking function calls, execution time, memory usage, and other metrics. This allows PySpark developers to easily identify and address bottlenecks, leading to faster and more resource-efficient UDFs.
The unified profilers can be enabled by setting the runtime SQL configuration “spark.sql.pyspark.udf.profiler” to “perf” or “memory” to enable the performance or memory profiler, respectively, as shown below.
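For example (assuming the notebook’s preconfigured spark session):

```python
# Enable the performance profiler; set the value to "memory" for the memory profiler instead.
spark.conf.set("spark.sql.pyspark.udf.profiler", "perf")
```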
Replacement for Legacy Profiling
Legacy profiling (1, 2) was implemented at the SparkContext level and, thus, did not work with Spark Connect. The new profiling is SparkSession-based, applies to Spark Connect, and can be enabled or disabled at runtime. It maximizes API parity with legacy profiling by providing “show” and “dump” commands to visualize profile results and save them to a workspace folder. Additionally, it offers convenience APIs to help manage and reset profile results on demand. Finally, it supports registered UDFs, which were not supported by the legacy profiling.
PySpark Performance Profiler
The PySpark performance profiler leverages Python’s built-in profilers to extend profiling capabilities to the driver and to UDFs executed on executors in a distributed manner.
Let’s dive into an example to see the PySpark performance profiler in action. We run the following code on Databricks Runtime 17.0 notebooks.
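The original listing is not reproduced here; the following minimal sketch, which assumes a simple pandas UDF named add1 (matching the plan node referenced later) and the notebook’s preconfigured spark session, illustrates the workflow:

```python
from pyspark.sql.functions import pandas_udf

df = spark.range(10)

# A simple pandas UDF used as the profiling target.
@pandas_udf("long")
def add1(x):
    return x + 1

added = df.select(add1("id"))

# Enable the performance profiler, then run an action so the UDF executes
# and profiling data is collected.
spark.conf.set("spark.sql.pyspark.udf.profiler", "perf")
added.show()
```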
Running added.show() executes the UDF with profiling enabled; the accumulated performance profiling results can then be displayed as shown below.
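Assuming the SparkSession-based profiling API introduced above, the results are rendered from the session’s profile accessor:

```python
# Show the collected performance profiles for all profiled UDFs.
spark.profile.show(type="perf")
```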
The output includes information such as the number of function calls, the total time spent in the given function, and the filename, along with the line number to aid navigation. This information is essential for identifying tight loops in your PySpark programs and enabling you to make decisions to improve performance.
It is important to note that the UDF ID in these results correlates directly with the one found in the Spark plan, visible as “ArrowEvalPython (add1(…)#50L)” when calling the explain method on the DataFrame.
Finally, we can dump the profiling results to a folder and clear the result profiles as shown below.
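A minimal sketch under the same assumption; the destination below is a hypothetical workspace folder:

```python
# Save all accumulated profile results to a folder, then reset them.
spark.profile.dump("/Workspace/Users/<user>/udf_profiles")  # hypothetical path
spark.profile.clear()
```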
PySpark Memory Profiler
It is based on memory-profiler, which can profile the driver, as seen here. PySpark has expanded its usage to include profiling UDFs, which are executed on executors in a distributed manner.
To enable memory profiling on a cluster, we should install memory-profiler on the cluster as shown below.
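One way to do this from a notebook (the package could also be installed as a cluster library) is:

```python
# Install the memory-profiler package into the cluster's Python environment.
%pip install memory-profiler
```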
For memory profiling, the last two lines of the above example are modified as shown below.
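A sketch under the same assumptions as the performance example:

```python
# Switch the profiler to memory mode and rerun the query that invokes the UDF.
spark.conf.set("spark.sql.pyspark.udf.profiler", "memory")
added.show()
```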
We then obtain memory profiling results as shown below.
The output includes several columns that give you a comprehensive view of how your code behaves in terms of memory usage. “Mem usage” shows the memory usage after executing that line. “Increment” details the change in memory usage from the previous line, helping you spot where memory usage spikes. “Occurrences” indicates how many times each line was executed.
The UDF ID in these results also correlates directly with the one found in the Spark plan, just as for the performance profiling results, visible as “ArrowEvalPython (add1(…)#4L)” when calling the explain method on the DataFrame as shown below.
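For example (the exact plan text will vary by query and run):

```python
# Print the physical plan; the ArrowEvalPython node names the UDF and its result ID,
# which matches the UDF ID reported in the profiling output.
added.explain()
```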
Please note that for this functionality to work, the memory-profiler package must be installed on your cluster.
Conclusion
PySpark Unified Profiling, which includes performance and memory profiling for UDFs, is available in Databricks Runtime 17.0. Unified Profiling provides a streamlined way to observe critical aspects such as function call frequency, execution durations, and memory consumption. It simplifies the process of pinpointing and resolving bottlenecks, paving the way for the development of faster and more resource-efficient UDFs.
Ready to explore more? Check out the PySpark API documentation for detailed guides and examples.