
HPCC-27439 Ensure job done is sent to slaves after part failure

If a job fails during initialization (e.g. out of disk space while
saving the query dll), we still need to make sure the job done call
is made to all slaves, to ensure they are cleaned up.
Without it, the CJobSlave instance was leaked.

Signed-off-by: Jake Smith <jake.smith@lexisnexisrisk.com>
Jake Smith, 3 years ago
parent
commit
baac4ab5cf
2 changed files with 3 additions and 1 deletion
  1. thorlcr/graph/thgraphmaster.cpp (+1 -1)
  2. thorlcr/slave/slavmain.cpp (+2 -0)

+ 1 - 1
thorlcr/graph/thgraphmaster.cpp

@@ -1707,9 +1707,9 @@ void CJobMaster::sendQuery()
     compressToBuffer(msg, tmp.length(), tmp.toByteArray());
 
     CTimeMon queryToSlavesTimer;
+    querySent = true;
     broadcast(queryNodeComm(), msg, masterSlaveMpTag, LONGTIMEOUT, "sendQuery");
     PROGLOG("Serialization of query init info (%d bytes) to slaves took %d ms", msg.length(), queryToSlavesTimer.elapsed());
-    querySent = true;
 }
 
 void CJobMaster::jobDone()
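
The reordering above can be illustrated with a minimal standalone sketch. It assumes, as the commit message suggests but the hunk does not show, that the job done path only notifies slaves when `querySent` is set. `MockJobMaster`, its members, and `jobDoneReachedSlaves` are hypothetical stand-ins, not the HPCC classes.

```cpp
#include <stdexcept>
#include <string>
#include <vector>

// Hypothetical stand-in for the master-side job object; not the real CJobMaster.
struct MockJobMaster
{
    bool querySent = false;
    bool failBroadcast = false;          // simulate a failure part-way through the broadcast
    bool setFlagBeforeBroadcast = true;  // true models the new ordering, false the old one
    std::vector<std::string> sentToSlaves;

    void broadcast(const std::string &what)
    {
        // Some slaves may already have received the message before the failure surfaces.
        sentToSlaves.push_back(what);
        if (failBroadcast)
            throw std::runtime_error("broadcast failed part-way");
    }

    void sendQuery()
    {
        if (setFlagBeforeBroadcast)
            querySent = true;
        broadcast("query");
        if (!setFlagBeforeBroadcast)
            querySent = true; // never reached if broadcast throws
    }

    void jobDone()
    {
        // Assumption: slaves are only notified if the query may have been sent.
        if (!querySent)
            return;
        sentToSlaves.push_back("jobDone");
    }
};

// Returns true if a job done message was recorded after a failed sendQuery().
bool jobDoneReachedSlaves(bool flagBeforeBroadcast)
{
    MockJobMaster master;
    master.failBroadcast = true;
    master.setFlagBeforeBroadcast = flagBeforeBroadcast;
    try
    {
        master.sendQuery();
    }
    catch (const std::exception &)
    {
        // The job fails during initialization; cleanup still runs below.
    }
    master.jobDone();
    return !master.sentToSlaves.empty() && master.sentToSlaves.back() == "jobDone";
}
```

With the flag set after the broadcast, a throwing broadcast leaves `querySent` false and the mock skips the job done message. With the flag set first, the message is still sent.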

+ 2 - 0
thorlcr/slave/slavmain.cpp

@@ -1916,6 +1916,8 @@ public:
                         StringAttr key;
                         msg.read(key);
                         CJobSlave *job = jobs.find(key.get());
+                        if (!job)
+                            throw makeStringException(0, "QueryDone: job not found"); // can happen if job failed during initialization on some slaves
                         StringAttr wuid = job->queryWuid();
                         StringAttr graphName = job->queryGraphName();
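
The slave-side guard can be sketched in isolation as follows. This is a minimal illustration, assuming a keyed job table whose lookup returns null for an unknown key. `MockJob`, `MockJobTable`, and `handleQueryDone` are hypothetical names, and `std::runtime_error` stands in for the exception created by `makeStringException`.

```cpp
#include <cstddef>
#include <map>
#include <memory>
#include <stdexcept>
#include <string>

// Hypothetical stand-in for a slave-side job; not the real CJobSlave.
struct MockJob
{
    std::string wuid;
    std::string graphName;
};

// Hypothetical job table keyed by job key, mirroring the jobs.find(key) lookup in the diff.
class MockJobTable
{
    std::map<std::string, std::unique_ptr<MockJob>> jobs;
public:
    void add(const std::string &key, const std::string &wuid, const std::string &graphName)
    {
        jobs[key] = std::make_unique<MockJob>(MockJob{wuid, graphName});
    }

    // Returns nullptr when the key is unknown, e.g. the job never initialized on this slave.
    MockJob *find(const std::string &key) const
    {
        auto it = jobs.find(key);
        return it == jobs.end() ? nullptr : it->second.get();
    }

    // Handles a "query done" request: validates the lookup before dereferencing,
    // then removes the job and returns its wuid.
    std::string handleQueryDone(const std::string &key)
    {
        MockJob *job = find(key);
        if (!job)
            throw std::runtime_error("QueryDone: job not found");
        std::string wuid = job->wuid; // copy before the job is destroyed
        jobs.erase(key);
        return wuid;
    }

    std::size_t size() const { return jobs.size(); }
};
```

Without the null check, a job done request for a job that failed to initialize on this slave would dereference a null pointer. With it, the request fails with a descriptive error.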