This is tricky, I think. It will depend a lot on the particular allocation of nodes that the job scheduler is giving you.
PETSc does give a number of options for timing (e.g. -log_view, or -log_summary on older versions) that do help identify code that is unbalanced; see
https://www.mcs.anl.gov/petsc/petsc-current/docs/manual.pdf#chapter.13 .
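If you want that machinery to report on a particular section of your code, you can bracket it with a logging stage and then run with -log_view. Here is a minimal C sketch of the idea (the stage name is made up, and the exact error-checking idiom depends on your PETSc version):

    #include <petscsys.h>

    int main(int argc, char **argv)
    {
      PetscErrorCode ierr;
      PetscLogStage  stage;

      ierr = PetscInitialize(&argc, &argv, NULL, NULL);if (ierr) return ierr;

      /* Bracket the region you suspect is unbalanced; it then shows up
         as a separate stage in the -log_view output. */
      ierr = PetscLogStageRegister("MySolve", &stage);CHKERRQ(ierr);
      ierr = PetscLogStagePush(stage);CHKERRQ(ierr);
      /* ... the code you want to profile ... */
      ierr = PetscLogStagePop();CHKERRQ(ierr);

      ierr = PetscFinalize();
      return ierr;
    }

In the -log_view output, look at the max/min time ratios for that stage; large ratios usually mean load imbalance.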
With regard to particular nodes, you could try placing calls to PetscTime() in your code; it returns the current time in seconds from some reference (typically the epoch). If you print this value along with the rank, you will see which ranks reached the print statement at what times, and hence which nodes are slower than others.
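Something like this is what I have in mind; a C sketch only (the checkpoint message is made up, and the exact calls may differ a little between PETSc versions):

    #include <petscsys.h>
    #include <petsctime.h>

    int main(int argc, char **argv)
    {
      PetscErrorCode ierr;
      PetscLogDouble t;
      PetscMPIInt    rank;

      ierr = PetscInitialize(&argc, &argv, NULL, NULL);if (ierr) return ierr;
      ierr = MPI_Comm_rank(PETSC_COMM_WORLD, &rank);CHKERRQ(ierr);

      /* ... the part of the run you want to check ... */

      ierr = PetscTime(&t);CHKERRQ(ierr);
      /* Print in rank order so the output is not scrambled. */
      ierr = PetscSynchronizedPrintf(PETSC_COMM_WORLD,
               "[rank %d] reached checkpoint at t = %f s\n", rank, (double)t);CHKERRQ(ierr);
      ierr = PetscSynchronizedFlush(PETSC_COMM_WORLD, PETSC_STDOUT);CHKERRQ(ierr);

      ierr = PetscFinalize();
      return ierr;
    }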
I don't know whether there is a Fortran wrapper for PetscTime, so you will just have to try; I also do not know whether it is synchronized across processes. If PetscTime does not work out for you, you can call the MPI_Wtime() function directly, which returns a real*8 time in seconds since some fixed point in the past, and MPI_Wtick() will give you its resolution. Note that the MPI clocks are also not guaranteed to be synchronized across processes; you can check the MPI_WTIME_IS_GLOBAL attribute (on MPI_COMM_WORLD) to find out.
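For the MPI-only route, a minimal C sketch of what I mean (just illustrative; the same routines are available from Fortran):

    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv)
    {
      int    rank, flag;
      int   *is_global;
      double t;

      MPI_Init(&argc, &argv);
      MPI_Comm_rank(MPI_COMM_WORLD, &rank);

      /* Is MPI_Wtime synchronized across processes?  Usually it is not. */
      MPI_Comm_get_attr(MPI_COMM_WORLD, MPI_WTIME_IS_GLOBAL, &is_global, &flag);
      if (rank == 0 && flag)
        printf("MPI_WTIME_IS_GLOBAL = %d\n", *is_global);

      /* ... the part of the run you want to check ... */

      t = MPI_Wtime();
      printf("[rank %d] checkpoint at t = %.6f s (resolution %.3e s)\n",
             rank, t, MPI_Wtick());

      MPI_Finalize();
      return 0;
    }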