When you work as DBA, many people will approach you with a complaint like "Application is taking ages to load the data on a page,could you please check something going wrong with database server?" There might be hundred of other reason for slowness of the page.It might be a Problem with application server,network issues,really a bad implementation or problem with database server due to generation of huge report /job running at that moment. What ever be the issue, database gets the blame first. Then it is our responsibility to cross check the server state.
Let us discuss how we can approach this issue. I use following script to diagnose the issues. The first script which I will run against server is given below:
parent_node_id AS Node_Id,
COUNT(*) AS [No.of CPU In the NUMA],
SUM(COUNT(*)) OVER()AS [Total No. of CPU],
SUM(runnable_tasks_count ) AS [Runnable Task Count],
SUM(pending_disk_io_count) AS [Pending disk I/O count],
SUM(work_queue_count) AS [Work queue count]
FROM sys.dm_os_schedulers WHERE status='VISIBLE ONLINE' GROUP BY parent_node_id
This will give following information.
- Number of records in the output will be equal to number of NUMA nodes (if it is fetching only one record , it is not a NUMA supported server)
- Node_id : NUMA node id . Can be mapped into the later scripts.
- No.of CPU in the NUMA : Total number of CPU assigned to the specific NUMA node or the number of schedulers.
- Total No. of CPU : Total No. of CPU available in the server.If you have set the affinity mask, total number of CPU assigned to this instance.
- Runnable Task Count: Number of workers, with tasks assigned to them, that are waiting to be scheduled on the runnable queue. Is not nullable. In short number of request in runnable queue.To understand more about Runnable queue , read my earlier post.
- Pending disk I/O count : Number of pending I/Os that are waiting to be completed. Each scheduler has a list of pending I/Os that are checked to determine whether they have been completed every time there is a context switch. The count is incremented when the request is inserted. This count is decremented when the request is completed.
- Work queue count: Number of tasks in the pending queue. These tasks are waiting for a worker to pick them up.
I have scheduled this scrip to store the output this query in a table for two days in the interval of 10 minutes. That will give baseline data about what is normal in your environment. In my environment people will start complaining once the Runnabable Task Count of most of the nodes goes beyond 10 consistently. In normal scenario, the value of Runnabable Task Count will be always below 10 on each node and never seen a value greater than 0 for work queue count field.This will give a picture of current state of the system.If the output of this step is normal, we are safe to an extent, the slow response might be issue which might be beyond our control or a blocking and slow response is only for a couple of screens(sessions) not for entire system.
This is the first step in my way of diagnosing. I will explain the further steps in my next post.
If you liked this post, do like my page on FaceBook