Thank you, Professor Govindjee. If I understand correctly, I have the following three options to achieve the same result:
1) If the problem is linear, use PETSc to compute the inner product of rhs and sol to obtain the energy (since the solution should converge in one iteration).
2) If the problem is nonlinear, I could accumulate this incremental energy after each iteration. I'm not sure about this yet because of the nonlinearity (i.e., perhaps it isn't valid to perform the sum).
3) If the problem is nonlinear, compute the inner product using U (hr(np(40))) and F (hr(np(27))) on each processor and call MPI_Reduce to accumulate this quantity across processors. I think this is the way I will likely go, so thank you for your suggestion. It seems as though this would avoid the ghost node complexity, since numpn does not seem to include the ghost nodes (at least according to my interpretation of the statement "The sum of the nodes in a partition (numpn) and its ghost nodes defines the total number of nodes in each partitioned data file (i.e., the total number of nodes, numnp, in each mesh partition)." in the parallel manual).
Thank you again,
Jonathan