Wednesday, June 12, 2024

#1018 - Monitoring scheduled jobs with OCI Service Metrics



Many customers make use of OIC scheduled integrations, often with tight schedules. Sometimes these schedules cannot be adhered to e.g. I have a job running every 3 minutes, importing  sales orders to Netsuite. Job1 starts at 09:00 am, job2 is scheduled to start at 09:03. However, job2 cannot start until job1 has completed. Job1 could be delayed for a variety of reasons - 

  • volume of data to be processed
  • complexity of the integration's orchestration logic
  • capacity of the downstream app (e.g. Netsuite) to process the data
Normally, the bottleneck is not OIC itself. However, there are still OIC limits that are applied to scheduled jobs. First port of call is the OIC3 Service Limits doc. Here you will see the following under concurrency -

Asynchronous concurrent executions: 50.
Asynchronous concurrent execution includes scheduled + triggered + connectivity agent.

So there is a limit here, however, compared to OIC gen2, which had stricter limits on scheduled jobs, this should not be an issue. Maybe though, just have it in the back of your mind, when planning your schedules.

As I have already mentioned, delays in the execution of scheduled jobs, may be caused by tight schedules and resource contention.  

The goal of this post is to see how we can monitor our scheduled jobs and isolate such issues. We have a plethora of tools at our disposal. 

OIC3 Observability allows us to monitor scheduled jobs -


OIC Observability also gives the duration view of the activity stream - letting us see, at a glance, which actions are taking the most time -


However, OIC does not currently provide an aggregated, time-based view of this data. An aggregated view of resources being used e.g. multiple scheduled jobs using the same adapter, would also be very useful.

So let's see how OCI Metrics can assist us here.
 
Simple use case here - 2 scheduled jobs that invoke an Oracle DB via the DB adapter.
The processing logic is banal, but will suffice for demo purposes -


































AA-DB-Job
has an integration property defined -  waitTimeInSecs controls the Wait action. This property is also passed to a DB stored procedure, via the Invoke action - the procedure just does a sleep.

 CREATE OR REPLACE PROCEDURE wait_a_while
(
   waitPeriod IN NUMBER
)
AS
BEGIN
  sys.dbms_session.sleep(waitPeriod);
  EXCEPTION
   WHEN VALUE_ERROR THEN
      dbms_output.put_line('Something went wrong');
END;
/

I clone this integration and rename the clone to BB-DB-Job.
Simple stuff!

I do 4 runs -
  • AA-DB-Job with property set to 10 secs
  • AA-DB-Job with property set to 20 secs
  • BB-DB-Job with property set to 20 secs
  • BB-DB-Job with property set to 30 secs - this I execute, while the previous job is still running.
Now to OCI Service Metrics - here is the first widget, based on the MessagesReceivedCount metric. 


Note the difference, when I set the interval to 5 minutes -

Takeaway here? The interval may be important to you. 

The next widget is based on the MessagesSuccessfulCount metric.

I now add a new scheduled job, CC-NoDB-Job, that just contains a Wait action. I execute this once.

Note I've also added a new widget - Number of DB Adapter Invokes

This is configured as follows -

Note the use of the dimension - adapterIdentifier.
 
The next widget looks at the durations of the DB invokes -

The scale is in msecs - 
remember my test runs? 
AA-DB-Job 10 secs and 20 secs, then
BB-DB-Job 20 secs and 30 secs.

I now run AA-DB-Job and BB-DB-Job again with waits of 60 and 50 secs respectively.

Here we see that both jobs have been started around the same time.

We also see concurrency at DB adapter invoke level - 









 


Finally I add a widget, based on a saved search of the OIC activity stream for my scheduled integrations.

This widget is based on the following query - 

search "yourActivityStream log group" |data.integrationFlowIdentifier='AA_DB_JOB!01.00.0000' or data.integrationFlowIdentifier='BB_DB_JOB!01.00.0000' or data.integrationFlowIdentifier='CC_NODB_JOB!01.00.0000' | sort by datetime desc



Summa, summarum - this dashboard gives us an overview of integrations running and their use of specific resources; in this case, the DB.

The top 2 widgets, per default,  will show all integrations, ergo, it includes more than scheduled jobs. However, this may be very useful when considering resource contention issues.

You can, however, use OCI Logging Analytics to create widgets based on the OIC activity stream and limit this to your scheduled jobs -

I have discussed OCI Logging Analytics in other posts. Please see those here.

The widget above is driven by the following query - 


'Log Source' = 'OCI Integration Activity Stream Logs' and 'OCI Resource Name' = yourOICOCID and Integration in (cc_nodb_job, aa_db_job, bb_db_job) | timestats distinctcount(Instance) as logrecords by Integration | sort -logrecords'

The syntax may look complex, but I built it by selecting fields from the Fields palette. I'll recreate the above query, step by step.

Select the Log Source - here I select the OIC activity stream.

 
Note the query has changed - 
I have 3 OIC instances pushing activity stream data to the same OCI Logging Analytics log group, but here, I just want to select a specific instance - 





I now want to limit the query to my scheduled integrations. Here I need to filter on them. The field I use is Integration.



The query is now set to - 'Log Source' = 'OCI Integration Activity Stream Logs' and 'OCI Resource Name' = myOICOCID and Integration in (cc_nodb_job, aa_db_job, bb_db_job) | stats count as logrecords by 'Log Source' | sort -logrecords

The default visualisation is Pie - 

Naturally, this is not very useful, except for telling me that instances of my 3 scheduled integrations generated a total of 194 log messages.

What I want to know is how many instances of each scheduled job executed in the specified timeframe.

Each instance flow has its own integration id, so what I need is a distinct count of these. Here I edit the query manually - 
replacing stats count as logrecords with timestats distinctcount(Instance) as logrecords by Integration. We've already met the Integration field, Instance is the field containing the instance id. 

The visualisation immediately changes to Records with Histogram


I could also choose the Pie visualisation -






No comments: