Revision 1 - 2018-03-06 - seuster


Atlas Job Monitoring

We run ATLAS payloads on many clouds. We have little influence over what type of jobs are submitted to our worker nodes: we can select either single-core or multi-core jobs, and we can select the fraction of simulation, general and analysis jobs. Typically we announce standard job parameters in terms of available disk and memory.

We already closely monitor CPU and memory usage on our clouds. What was missing was the application status: whether the jobs running on our clouds were successful and, if not, what problem they faced. ATLAS publishes extensive information in a job database called bigpanda, which is accessible via a REST interface and returns the information as a JSON data structure.

The job monitoring for ATLAS is (currently) on this webpage

Three sources of information are used and combined in a python script:

  • a cron job on condor.heprc.uvic.ca collects information from condor(1) and cloudscheduler(2) one minute past each hour and stores it in a file. This file is transferred to belletunsrv.heprc.uvic.ca, processed and stored in elasticsearch on the above instance
  • belletunsrv.heprc.uvic.ca sends a curl request to the panda database 9 minutes past each hour
    1. condor :
      condor_status -long -attributes GlobalJobId,DetectedCpus,Machine,JobStart,VMType,Name -xml
    2. cloud_scheduler:
      cloud_status -m
    3. panda db:
       curl -q -H "Accept: application/json" -H "Content-Type: application/json"  -k "https://bigpanda.cern.ch/jobs/?computingsite=" + site + "&json&hours=1"
      with site set to "IAAS" and "IAAS_MCORE" in turn.
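
The panda query above can be sketched in Python as follows. This is only an illustration and not the script that is in production: the helper names build_panda_url and build_panda_request are hypothetical, and the sketch only builds the URL and the request headers shown in the curl call, it does not send anything.

```python
from urllib.parse import urlencode
from urllib.request import Request

PANDA_JOBS_URL = "https://bigpanda.cern.ch/jobs/"
SITES = ("IAAS", "IAAS_MCORE")  # the two panda queues named above


def build_panda_url(site, hours=1):
    """Return the job-listing URL for one panda queue (hypothetical helper)."""
    # "json" is a bare flag in the original query, so it is appended by hand
    # instead of going through urlencode.
    return f"{PANDA_JOBS_URL}?{urlencode({'computingsite': site})}&json&hours={hours}"


def build_panda_request(site, hours=1):
    """Return a urllib Request with the same headers as the curl call."""
    headers = {"Accept": "application/json", "Content-Type": "application/json"}
    return Request(build_panda_url(site, hours), headers=headers)
```

Calling build_panda_url(site) for each entry of SITES gives the two URLs that are queried every hour.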

The panda DB curl query returns all jobs that were scheduled to run in our two panda queues in the last hour. The information from cloud_scheduler adds the cloud on which each job ran, matched by the hostname of the VM. Condor then adds further information; here the matching is done via the GlobalJobId in condor and the batchid in panda. The data is then bulk-uploaded into elasticsearch. The python script is here in github.
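
A minimal sketch of this matching step, under stated assumptions: the three inputs are already parsed into lists of dicts, the cloud_scheduler records carry hypothetical "hostname" and "cloud" keys, and the VM hostname is taken from condor's Machine attribute. The function name merge_job_records is also made up for this example; the real script may organise this differently.

```python
def merge_job_records(panda_jobs, condor_slots, cloud_vms):
    """Attach condor and cloud information to each panda job record."""
    # Index condor records by GlobalJobId and cloud records by VM hostname.
    condor_by_id = {slot["GlobalJobId"]: slot for slot in condor_slots}
    cloud_by_host = {vm["hostname"]: vm["cloud"] for vm in cloud_vms}

    merged = []
    for job in panda_jobs:
        record = dict(job)  # copy, so the panda input is left untouched
        # panda's batchid is matched against condor's GlobalJobId
        slot = condor_by_id.get(job.get("batchid"))
        if slot is not None:
            record["VMType"] = slot.get("VMType")
            record["Machine"] = slot.get("Machine")
            # assumption: condor's Machine is the hostname known to cloud_scheduler
            record["cloud"] = cloud_by_host.get(slot.get("Machine"))
        merged.append(record)
    return merged
```

Jobs without a matching condor record are kept as they are, so they still reach elasticsearch and can be completed by a later update.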

Daily updates/cleanups of elasticsearch

Occasionally, the collection of information might be interrupted, for example by DB downtimes or unexpected input to the python scripts. Once a day, a cleanup script tries to clean up stale information in elasticsearch. The script retrieves all pandaids for jobs that did not report within the last 3 hours and that are not declared "finished", "failed", "cancelled" or "closed". The panda DB is then queried for updates for these pandaids and all information is updated. The python script is here in github.
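
The selection of stale jobs can be sketched as below. This is an illustration only: it works on records already read from elasticsearch into dicts, and the key names "pandaid" and "last_update" as well as the function name stale_panda_ids are assumptions, not taken from the actual script.

```python
from datetime import datetime, timedelta

# Final job states as listed above; jobs in these states are never refreshed.
FINAL_STATES = {"finished", "failed", "cancelled", "closed"}
STALE_AFTER = timedelta(hours=3)


def stale_panda_ids(records, now):
    """Return pandaids of non-final jobs that have not reported for 3 hours."""
    cutoff = now - STALE_AFTER
    return [
        rec["pandaid"]
        for rec in records
        if rec["jobStatus"] not in FINAL_STATES and rec["last_update"] < cutoff
    ]
```

The returned pandaids are the ones for which the panda DB would be queried again.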

Plot description

Note: this might change over time!
The information is currently uploaded into the index "test2", which is why many plots and dashboards contain this string.
  • top row: "cloud" and "VMType" are from cloudscheduler / condor
  • 2nd row: "jobStatus" from Panda, the job transform (a.k.a. type of job), and in- and output
  • 3rd row: "job duration in seconds" and "MaxPSS", which is the total memory consumption of the job, with the sharing of code in athenaMP taken fully into account
  • 4th and further rows might be removed in future
-- seuster - 2018-03-06
 