The Windows Assessment and Deployment Kit contains tools for system-level profiling on Windows; the underlying technology is called Event Tracing for Windows (ETW). In this post I'll describe how to use these tools to gather information about different aspects of the system while an application runs. I was using the tools to work out why an application was stalling for almost a second, so I'll use that scenario as an example when interpreting the data. It's important to note that these tools won't often just give you an answer to a question: you have to understand what's going on in the system to be able to interpret the data and make theories about how that data relates to system behaviour.
Windows 7 comes with logman and Windows Performance Analyzer (WPA) by default, but they aren't as feature rich as the Windows 8 versions. To install these and the other tools on Windows 7, download and run the installer from here: Windows Assessment and Deployment Kit (Windows ADK) for Windows 8.1 Update
logman is a basic event collector that produces event traces for WPA. It collects simple information such as CPU usage, DPC/ISR duration, IO activity/disk utilization, and hard faults, among other things. xperf is a more feature rich version of logman. It is capable of capturing more events than logman and can also do stack tracing, but the command line usage is messier. Windows Performance Analyzer is used to view the traces produced by logman and xperf. Other tools include wpr, which is a bit like logman and xperf but with some usage differences, wprui, which is a GUI on top of wpr, and xperfview, which is an uglier version of WPA but is capable of producing more detailed summary tables.
First let's look at a basic example where we trace during the lifetime of superMegaUltra.exe:
logman start "NT Kernel Logger" -o C:\benchmark.etl -p "Windows Kernel Trace" 0x237f -bs 1024 -ets & superMegaUltra.exe & logman stop "NT Kernel Logger" -ets
"NT Kernel Logger" is just the name of the logman session, which you use to end the capture later on.
-o C:\benchmark.etl specifies the file the trace is saved to.
-p "Windows Kernel Trace" 0x237f describes the provider to use. Providers in logman, xperf, and wpr are the collectors for the events. These collectors determine what data to collect. In this case we are using a hex value to describe the collection of CPU usage, DPC/ISR duration, IO activity/disk utilization, and hard faults. I don't know how the hex value is formatted or where it comes from but I use xperf if I need other providers. This page outlines the other available providers.
-bs 1024 sets the buffer size in kilobytes. Anything from 8 to 1024 is allowed. If this is too small then events will be missed, but logman will warn you if that happens.
-ets starts and creates the session. Without this logman can be used to build up what's called the collection query, which describes how the collection occurs. So for example we can add extra command line arguments using a separate call to logman, which could be useful for scripting.
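To sketch what building up a collection query in separate calls might look like from a script, here is a hedged Python illustration. The session name myTrace, the output path, and the helper names are all made up for this example; the real kernel session shown above uses -ets instead, and logman still needs an elevated console:

```python
# Sketch: defining a logman collection query in separate calls, then
# starting and stopping it, rather than creating and starting it in one
# go with -ets. Session name "myTrace" and argument values are made up.
import shutil
import subprocess

SESSION = "myTrace"

def create_cmd():
    # "logman create trace" defines the collection query without starting it.
    return ["logman", "create", "trace", SESSION,
            "-p", "Windows Kernel Trace", "0x237f", "-o", r"C:\benchmark.etl"]

def update_cmd():
    # Extra command line arguments can be added later in a separate call,
    # which is useful for scripting.
    return ["logman", "update", "trace", SESSION, "-bs", "1024"]

def start_cmd():
    return ["logman", "start", SESSION]

def stop_cmd():
    return ["logman", "stop", SESSION]

def run(cmd):
    # Only attempt the call on a machine that actually has logman
    # (Windows, elevated console); otherwise just return the command.
    if shutil.which("logman"):
        subprocess.run(cmd, check=True)
    return cmd
```

A script could then call run(create_cmd()), run(update_cmd()), run(start_cmd()), launch the application, and finish with run(stop_cmd()).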
So in this example we're just starting a basic collection, running the application, then as soon as the application ends the trace is ended. There's still some overhead in the starting and stopping of the trace but we can filter the results to just those within the lifetime of the application from within WPA. logman must be run in an administrator console, otherwise you get the following error:
Error: Access is denied. You're running with a restricted token, try running elevated.
xperf -on Base -buffersize 1024 -minbuffers 400 -stackwalk CSwitch+FileRead+HardFault+PagefaultHard+PageAccess -SetProfInt 1221 & superMegaUltra.exe & xperf -d C:\benchmark.etl
-on flags is a list of all of the providers to use for collection. A complete list of the default kernel providers can be obtained by running the following: xperf -providers kf. Predefined groups of flags are also available: xperf -providers kg. In the example we're using the 'Base' group to collect similar data to a run of logman.
-buffersize 1024 -minbuffers 400 configures 400 buffers of 1MB each (400MB in total). xperf will put the information it collects into these memory buffers during the trace. As with logman, buffers that are too few or too small result in the loss of some events, but you'll be warned if that happens. Too many buffers is a waste of memory. It's a good idea to run a trace with a sensible number of buffers to get an idea of how big the trace file is, then adjust the number to balance the size of the output against the amount of memory available.
-stackwalk flags turns on stack walking: the call stack is captured on each of the events listed. This has a fair overhead, but the information it provides can be incredibly useful. If the overhead matters to you, don't add more flags here than you need. A list of available events can be obtained by running xperf -help stackwalk.
-SetProfInt 1221 sets the sampling interval to the minimum of 1221 units of 100ns (0.1221ms, i.e. a sample rate of roughly 8kHz). The default interval is 1ms (a rate of 1kHz). Adjust this to the level of detail needed to decrease overheads and output size.
As with logman, xperf must also be run in an administrator console.
All of the available graphs are listed on the left hand side. Some can be expanded to show variations by clicking the arrow next to the description. To open the table relating to a graph, click the table button to the top right of an opened graph. If you select a row in the table then the related area of the graph gets highlighted, e.g. if you select a process in the CPU Usage table, the lifetime of the process is selected. You can then zoom into the selection which, depending on the data, will also filter the table.
You can also filter a graph by a single item of whatever it shows (e.g. a CPU usage graph can be filtered by process or thread). To filter, just click the coloured boxes in the Series column to the left of the graph. To filter to a single process/thread, select the item, right click it, then go to Disable -> All but selection. Filtering by threads is slightly different because they are children of the process items: right click on the column, go to Disable -> In entire graph -> All series, and then enable whichever items are needed.
You can save the tabs, graphs, tables and their configurations by going to Profiles -> Export. To reopen a layout go to Profiles -> Apply. The zoom level of graphs is not saved. You can apply a profile to a different trace from the one you saved it in.
The stack is visible only on CPU graphs. A sampled CPU graph can be used to traverse the stack in order of execution for a specific process or thread. A precise graph will only let you view the stack of the old and new thread during a context switch. In both cases you have to add the relevant column to the table. You'll also need to load the symbols for your application and for Windows, otherwise the stack shows up as question marks and exclamation marks, or something like superMegaUltra.exe!?. To load in symbols go to Trace -> Configure Symbol Paths. Already filled in is the information necessary to load in remote symbols for Windows. Add the path to the folder containing the PDB for your application before the existing information and separate the two with a semicolon. For example:
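As a hedged illustration (the local PDB folder path here is made up; the remote part points at the Microsoft public symbol server), the combined symbol path might look something like:

```
C:\dev\superMegaUltra\pdbs;srv*C:\Symbols*https://msdl.microsoft.com/download/symbols
```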
Then click OK and go to Trace -> Load Symbols. A bar should appear at the top that shows the symbols being loaded in the background. The symbols for your application should load quickly but the Windows symbols will take a long time, which is why it's important to put your application's symbols first; you can walk the stack with your application's symbols while the Windows symbols load. This won't happen if you didn't have the necessary flags set in xperf. You'll have to experiment to find which flags you need if this is the case.
Even if symbols are loaded, the stack may show "?!?" at the top level, then nothing else. I found that a "CPU Usage (Precise)" graph is required for good stack tracing. You'll need the NewThreadStack column.
Stack walking on its own is a difficult way to traverse the graph. It's easier to have two graphs open: one CPU usage graph without the stack column and one with it. That way you can sort the blocks of computation by SwitchInTime on the graph without the stack and scroll through them in order, with WPA highlighting the current function in the stack.
I've only used this to get a more detailed table about IO information so this is all the detail I will go into. Find the graph relating to the information you want, right click it, and open the detailed graph. Right click this graph and go to Summary Table.
xperf -on PROC_THREAD+LOADER+PROFILE+CSWITCH+COMPACT_CSWITCH+DISPATCHER+DPC+INTERRUPT+SYSCALL+PERF_COUNTER+DISK_IO+DISK_IO_INIT+FILE_IO+FILE_IO_INIT+HARD_FAULTS+FILENAME+SPLIT_IO+MEMINFO+ALL_FAULTS+FOOTPRINT+MEMORY+CONTMEMGEN -buffersize 1024 -minbuffers 400 -stackwalk CSwitch+FileRead+HardFault+PagefaultHard+PageAccess -SetProfInt 1221 & superMegaUltra.exe & xperf -d trace.etl
The application I was profiling consisted of two threads. One was getting data from numerous memory mapped files on a hard disk and performing computations on the data, and the other was prefetching those files into memory by opening native Windows file handles for those files and touching a single byte in each block that the other thread needed. The block size was 8 pages because the file was opened with the FILE_FLAG_SEQUENTIAL_SCAN hint. The problem was that the main thread was hanging constantly, for anywhere between a tenth of a second and a whole second, despite the prefetch thread being ahead of the main thread. Originally it looked like a scheduling issue.
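The prefetch pattern described above can be sketched with a memory-mapped file. This is a hedged, cross-platform illustration of the access pattern only (the real code used native Windows file handles and FILE_FLAG_SEQUENTIAL_SCAN; the function name here is made up), touching one byte per 8-page block so the OS faults each block in:

```python
# Hedged sketch of the prefetch thread's access pattern: touch a single
# byte in each 8-page block of a memory-mapped file so the OS pages the
# block in ahead of the consuming thread. Function name is made up.
import mmap
import os

BLOCK = 8 * mmap.PAGESIZE  # 8 pages per block, as in the post

def prefetch(path):
    """Touch the first byte of every 8-page block; return the bytes read."""
    touched = []
    with open(path, "rb") as f:
        size = os.fstat(f.fileno()).st_size
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
            for offset in range(0, size, BLOCK):
                touched.append(mm[offset])  # the read alone faults the block in
    return touched
```

In the real application this ran on a second thread, staying ahead of the consumer; reading a single byte is enough to trigger the hard fault that brings the surrounding block into memory.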
To see if this was the case I followed the "Investigation" part of the "Thread Interference" section of this page. For Step 3 of "Determine a Thread's Processor Restrictions" that means adding the table to the graph, clicking the cog at the top of the graph to open the View Editor, dragging "Cpu" from the left hand side to below "NewThreadId", then changing the aggregation dropdown to "Unique Count". The troublesome thread had a count of 8, but by expanding the thread on the table and selecting each Cpu item in turn, I could see from the highlights on the graph that it was running on core 2 most of the time, specifically around the area where it was pausing.
Continuing in the "Investigation" section, in the part titled "Determine what Preempted or Starved the Thread", for Step 3 I had to remove all of the existing entries before adding everything shown in the image. The thread's average priority was 9 and it was ready during the paused area. Although the thread was being preempted at the paused area, the new process was being swapped out for idle. So the other process was given CPU time, finished, but then the main thread wasn't swapped back in, despite being ready. Therefore I concluded that the pause was not caused by preemption or starvation.
Next I used the stack trace to determine where the main thread was stopping. I zoomed into the area where the pause was occurring, keeping some of the time around the pause that had high CPU usage from the thread in question. I opened two CPU graphs: one for "Utilization by CPU" without a stack column, and another by thread with a NewThreadStack column. I then filtered the CPU core graph to just core 2, sorted it by SwitchInTime, and selected each relevant item in the table in turn. This highlighted each block of CPU usage on the graph in order of the time at which it happened, and also highlighted the functions on the stack in the table of the other graph (after expanding the thread's stack out) so that I could see what was running during each block of CPU usage. In the block of the thread's CPU usage before the pause there was a tiny blip where another process got swapped in, but it finished straight away, allowing the main thread to swap back in again. Interestingly, the following was highlighted in the stack, just before the pause:
ntkrnlmp.exe!KiPageFault
ntkrnlmp.exe!MmAccessFault
ntkrnlmp.exe!MiIssueHardFault
ntkrnlmp.exe!IoPageRead
fltMgr.sys!FltpDispatch
fltMgr.sys!FltpLegacyProcessingAfterPreCallbacksCompleted
ntfs.sys!NtfsFsdRead
ntfs.sys!NtfsCommonRead
ntfs.sys!NtfsNonCachedIo
ntkrnlmp.exe!KeWaitForSingleObject
ntkrnlmp.exe!KiCommitThreadWait
ntkrnlmp.exe!KiSwapContext
ntkrnlmp.exe!SwapContext_PatchXRstor
This was interesting for two reasons. Firstly, the prefetch thread was supposed to have already got all of the information that the main thread needed, and this indicated it hadn't, because a hard fault had occurred and the main thread was having to get information from disk itself. Secondly, although I could now see that the swap out was caused by needing to do IO, it didn't explain why the thread never got swapped back in, because even a large IO request shouldn't take up to a second to complete. It also meant that the ready graph was deceptive, because the main thread was not in fact ready; it was waiting on IO.
The next step was to investigate why this item of data hadn't been prefetched, and why the thread wasn't being swapped back in. For this I used the more detailed table of IO requests in xperfview (explained in the Usage section), sorted by "Init Time". Finding the IO request in question turned out to be easy because its "IO Time" was far longer than any other. With the table in order I could see that many other requests were being made and completed by the IO thread between the time the main thread's request was made and its Complete Time. Interestingly, just before the point of completion, xperf had to flush its buffers to disk. So it looked like the OS was prioritising the prefetch thread's sequential IO above the main thread's request, and it was only at the point that xperf interrupted this pattern that the OS serviced the main thread's request. Also, by using the offsets in the table, I could see that the prefetch thread had missed out prefetching the page that the main thread needed, which was why it was having to make the request itself. Another interesting point was the IO queue (QD/I - Queue Depth at Init Time): it was empty just before the main thread made its request and it didn't drop back down to empty until the main thread's request finished, suggesting the main thread's request was queued up during all of that time.