
HPCC-27439 Ensure job done is sent to slaves after part failure

If a job fails during initialization (e.g. out of disk space while
saving the query dll), we still need to make sure the job done call
is made to all slaves, to ensure they are cleaned up.
Without it, the CJobSlave instance was leaked.

Signed-off-by: Jake Smith <jake.smith@lexisnexisrisk.com>
Jake Smith 3 years ago
parent
commit
baac4ab5cf
2 changed files with 3 additions and 1 deletion
  1. thorlcr/graph/thgraphmaster.cpp (+1 -1)
  2. thorlcr/slave/slavmain.cpp (+2 -0)

+ 1 - 1
thorlcr/graph/thgraphmaster.cpp

@@ -1707,9 +1707,9 @@ void CJobMaster::sendQuery()
     compressToBuffer(msg, tmp.length(), tmp.toByteArray());
 
     CTimeMon queryToSlavesTimer;
+    querySent = true;
     broadcast(queryNodeComm(), msg, masterSlaveMpTag, LONGTIMEOUT, "sendQuery");
     PROGLOG("Serialization of query init info (%d bytes) to slaves took %d ms", msg.length(), queryToSlavesTimer.elapsed());
-    querySent = true;
 }
 
 void CJobMaster::jobDone()
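
The reordering in this hunk can be illustrated with a small standalone sketch. It is not Thor code: `ToyJobMaster`, its `broadcast` helper, and the log are hypothetical stand-ins, and it assumes that `jobDone()` consults `querySent` before notifying slaves, which the diff itself does not show. Under that assumption, setting the flag before the broadcast means a broadcast that throws part way through still leads to a job done notification.

```cpp
#include <stdexcept>
#include <string>
#include <vector>

// Hypothetical stand-in for CJobMaster; not the real Thor class.
class ToyJobMaster
{
public:
    explicit ToyJobMaster(bool failBroadcast) : failBroadcast(failBroadcast) {}

    void sendQuery()
    {
        // Set the flag before broadcasting, as in the patched code:
        // some slaves may have received the query even if broadcast throws.
        querySent = true;
        broadcast("query");
    }

    void jobDone()
    {
        // Assumption: slaves are only notified if the query may have reached them.
        if (querySent)
            log.push_back("jobDone");
    }

    const std::vector<std::string> &queryLog() const { return log; }

private:
    void broadcast(const std::string &what)
    {
        if (failBroadcast)
            throw std::runtime_error("broadcast failed part way: " + what);
        log.push_back(what);
    }

    bool failBroadcast;
    bool querySent = false;
    std::vector<std::string> log;
};

// Runs a job whose broadcast fails and reports whether jobDone was still sent.
bool jobDoneSentAfterFailedBroadcast()
{
    ToyJobMaster master(true);
    try
    {
        master.sendQuery();
    }
    catch (const std::runtime_error &)
    {
        // Initialization failed; cleanup must still run.
    }
    master.jobDone();
    return master.queryLog().size() == 1 && master.queryLog()[0] == "jobDone";
}
```

With the flag set after the broadcast instead, the same failure would leave `querySent` false and `jobDone()` would skip the notification in this sketch.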

+ 2 - 0
thorlcr/slave/slavmain.cpp

@@ -1916,6 +1916,8 @@ public:
                         StringAttr key;
                         msg.read(key);
                         CJobSlave *job = jobs.find(key.get());
+                        if (!job)
+                            throw makeStringException(0, "QueryDone: job not found"); // can happen if job failed during initialization on some slaves
                         StringAttr wuid = job->queryWuid();
                         StringAttr graphName = job->queryGraphName();
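
The slave-side guard can be sketched in isolation as follows. `ToyJob`, `ToyJobTable`, and the use of `std::runtime_error` are hypothetical simplifications for illustration (the real code uses `CJobSlave`, the `jobs` table, and `makeStringException`). The point is only that a job done request for a key the slave never registered is reported as an error rather than dereferencing a null pointer.

```cpp
#include <cstddef>
#include <map>
#include <memory>
#include <stdexcept>
#include <string>
#include <utility>

// Hypothetical stand-in for CJobSlave.
struct ToyJob
{
    std::string wuid;
    std::string graphName;
};

// Hypothetical job table keyed by job key; not the real Thor container.
class ToyJobTable
{
public:
    void add(const std::string &key, ToyJob job)
    {
        jobs[key] = std::make_unique<ToyJob>(std::move(job));
    }

    // Returns nullptr when the key is unknown, mirroring the lookup in the diff.
    ToyJob *find(const std::string &key) const
    {
        auto it = jobs.find(key);
        return it == jobs.end() ? nullptr : it->second.get();
    }

    // Handles a job done request: validates the lookup, then releases the job.
    std::string queryDone(const std::string &key)
    {
        ToyJob *job = find(key);
        if (!job) // can happen if the job failed during initialization on this slave
            throw std::runtime_error("QueryDone: job not found");
        std::string summary = job->wuid + "/" + job->graphName;
        jobs.erase(key); // the job is freed here instead of being leaked
        return summary;
    }

    std::size_t count() const { return jobs.size(); }

private:
    std::map<std::string, std::unique_ptr<ToyJob>> jobs;
};
```

A caller would typically catch the exception and report it back to the master, so that one slave missing the job does not prevent the others from cleaning up.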