
Ashley Whetter's Blog

[programming, linux] ++ [stuff]


Using imports in __init__.py to define your public API

The Problem

In many languages outside of Python, packages seem normal. They can also look like a great way to organise an API. Where am I going to find the Tiger class? In the tiger module, of course! But overuse of packages can make an API tiresome to use. The import that users have to write could look like:

import animals.tiger
rajah = animals.tiger.Tiger()
By overusing packages, we make users type "tiger" twice just to use our class! Furthermore, the import can get quite long. The more levels of packages we add, the more users have to type to use our API. It also leaves them less of the 79 characters per line for their own code. Fortunately, __init__.py can help us solve these problems while still keeping our code organised.

An Almost Perfect Solution

Currently our code looks like this:

|- animals
|    |- __init__.py
|    |- base.py
|    |- tiger.py
|    |- utils.py
# animals/base.py
class Animal(object):
# animals/tiger.py
import animals.base
import animals.utils

class Tiger(animals.base.Animal):
    TOOTH_SHAPE = animals.utils.ToothShape.SHARP
# animals/utils.py
import enum

class ToothShape(enum.Enum):
    SHARP = 0

By adding an import of Tiger into animals/__init__.py, our users no longer have to type "tiger" twice, and we've shortened our import.

# animals/__init__.py
from animals.base import Animal
from animals.tiger import Tiger
Now users can use Tiger with:
import animals
rajah = animals.Tiger()
We can do this with submodules as well. If we were to import utils into animals' __init__.py then users would only have to use import animals to be able to use animals.utils (similarly to how you can access os.path when only importing os).
# animals/__init__.py
from animals import utils
And users use utils with:
import animals
sharp = animals.utils.ToothShape.SHARP

A further benefit of this use of __init__.py is that you have the freedom to rearrange your codebase without breaking users' code. Let's say that animals/utils.py starts to get really big and we want to split it into multiple files. Because we can define our public API, we have the flexibility to turn utils.py into a package and import everything from the new split files into its __init__.py. Users will never see a change, as they still use import animals.utils as before.
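To make this concrete, here's a sketch of what such a split might look like. The file names teeth.py and claws.py (and the ClawShape class) are invented purely for illustration; the point is that utils/__init__.py re-exports everything the old utils.py defined:

```python
# Hypothetical layout after turning utils.py into a package:
#
#   animals/
#   |- __init__.py
#   |- base.py
#   |- tiger.py
#   |- utils/
#   |    |- __init__.py
#   |    |- teeth.py
#   |    |- claws.py

# animals/utils/__init__.py
from .teeth import ToothShape
from .claws import ClawShape
```

Because users only ever wrote animals.utils.ToothShape, the change from a module to a package is invisible to them.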

Some Traps

You should not use your public API inside the code of your library. Your code will become sensitive to changes in the ordering of imports inside your __init__.py files. For example, the following will fail:

# animals/__init__.py
from animals.tiger import Tiger
from animals.base import Animal
# animals/tiger.py
import animals
import animals.utils

class Tiger(animals.Animal):
    TOOTH_SHAPE = animals.utils.ToothShape.SHARP
with the exception:
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "animals/__init__.py", line 1, in <module>
    from animals.tiger import Tiger
  File "animals/tiger.py", line 4, in <module>
    class Tiger(animals.Animal):
AttributeError: module 'animals' has no attribute 'Animal'
Can you spot why this fails?

animals is trying to import tiger, and tiger imports animals.Animal, but the import of animals has not yet reached its own import of Animal. Therefore Animal does not yet exist on animals.

One way to avoid this problem is to use your "private" API inside your library's code (just as users could unwisely choose to). But this reintroduces the problem of having to use long imports. You would also lose the benefit of being able to (easily) change your code's layout, as your own code would now depend on those fixed package positions in its imports.

Instead you should use relative imports in your own library so that you have shorter imports, and layout flexibility.

The Final Solution

# animals/__init__.py
from .base import Animal
from .tiger import Tiger

from . import utils
# animals/base.py
class Animal(object):
# animals/tiger.py
from . import base
from . import utils

class Tiger(base.Animal):
    TOOTH_SHAPE = utils.ToothShape.SHARP
# animals/utils.py
import enum

class ToothShape(enum.Enum):
    SHARP = 0

Further Reading

The requests library's __init__.py is a great example of how to define a public API.

The standard library itself is also a great example of package layout. It abides by the rule that packages should be used only to resolve namespace conflicts, and for splitting up large modules. It therefore has a very flat hierarchy and only a few packages.

As the size of your code base increases, putting everything under a single module can make imports slow. The werkzeug library's __init__.py uses a custom lazy-loading system to put its large codebase under a single module that is still fast to import. Note that the authors did this because splitting the codebase into a package would not have made organisational sense; otherwise this would have been a legitimate use case for a package.


Categories: Programming Python

netctl Notifications

Using netctl's hooks it's actually quite easy to get notified when netctl-auto or netctl-ifplugd connects or disconnects from a network. Put the following in a file in /etc/netctl/hooks/, change the wireless interface and the username to your own, and make the file executable.


#!/bin/bash

function send_notification {
	su -c "DISPLAY=:0 notify-send \"$1\" --icon=network-$2" ashley
}

if [[ "$INTERFACE" == "wlp4s0" ]]; then
	case "$ACTION" in
		CONNECT) send_notification "Connected to $SSID" wireless ;;
		DISCONNECT) send_notification "Disconnected from $SSID" wireless ;;
	esac
else
	case "$ACTION" in
		CONNECT) send_notification "Connected to $1" wired ;;
		DISCONNECT) send_notification "Disconnected from $1" wired ;;
	esac
fi

Categories: None

Using logman, xperf, and Windows Performance Analyzer

The Windows Assessment and Deployment Kit contains tools for system-level profiling in Windows. The proper term for this is Event Tracing for Windows. In this post I'll describe how to use these tools to get information about different aspects of the system during the running of an application. I was using the tools to determine why an application was stopping for almost a second, so I'll be using this scenario as an example for the interpretation of data. It's important to note that these tools often won't just give you an answer to a question. You have to know what's going on in the system to be able to interpret the data and form theories about how that data relates to system behaviour.


Windows 7 comes with logman and Windows Performance Analyzer (WPA) by default, but they aren't as feature-rich as the Windows 8 versions. To install these and the other tools on Windows 7, just download and run the installer from here: Windows Assessment and Deployment Kit (Windows ADK) for Windows 8.1 Update

What is each tool used for?

logman is a basic event collector that produces event traces for WPA. It collects simple information such as CPU usage, DPC/ISR duration, IO activity/disk utilization, and hard faults, among other things. xperf is a more feature-rich version of logman. It is capable of capturing more events than logman and can also do stack tracing, but the command line usage is messier. Windows Performance Analyzer is used to view the traces produced by logman and xperf. Other tools include wpr, which is a bit like logman and xperf but with some usage differences; wprui, which is a GUI on top of wpr; and xperfview, which is an uglier version of WPA but is capable of producing more detailed summary tables.


logman

First let's look at a basic example where we trace during the lifetime of superMegaUltra.exe:

logman start "NT Kernel Logger" -o C:\benchmark.etl -p "Windows Kernel Trace" 0x237f -bs 1024 -ets & 
superMegaUltra.exe & 
logman stop "NT Kernel Logger" -ets 

"NT Kernel Logger" is just the name of the logman session, which you use to end the capture later on.
-o C:\benchmark.etl is what file the trace gets saved to.
-p "Windows Kernel Trace" 0x237f describes the provider to use. Providers in logman, xperf, and wpr are the collectors for the events. These collectors determine what data to collect. In this case we are using a hex value to describe the collection of CPU usage, DPC/ISR duration, IO activity/disk utilization, and hard faults. I don't know how the hex value is formatted or where it comes from but I use xperf if I need other providers. This page outlines the other available providers.
-bs 1024 Sets the buffer size in kilobytes. Anything from 8 to 1024 is allowed. If this is too small then events will be missed, but logman will warn you if this happens.
-ets starts and creates the session. Without this logman can be used to build up what's called the collection query, which describes how the collection occurs. So for example we can add extra command line arguments using a separate call to logman, which could be useful for scripting.

So in this example we're just starting a basic collection, running the application, then as soon as the application ends the trace is ended. There's still some overhead in the starting and stopping of the trace but we can filter the results to just those within the lifetime of the application from within WPA. logman must be run in an administrator console, otherwise you get the following error:

Access is denied. 
You're running with a restricted token, try running elevated. 

xperf

xperf -on Base -buffersize 1024 -minbuffers 400 -stackwalk CSwitch+FileRead+HardFault+PagefaultHard+PageAccess -SetProfInt 1221 & 
superMegaUltra.exe & 
xperf -d C:\benchmark.etl 

-on flags is a list of all of the providers to use for collection. A complete list of the default kernel providers can be obtained by running the following: xperf -providers kf. Predefined groups of flags are also available: xperf -providers kg. In the example we're using the 'Base' group to collect similar data to a run of logman.
-buffersize 1024 -minbuffers 400 configures 400 buffers of 1MB each in size. xperf will put the information it collects into these memory buffers during the trace. Like logman, too few and/or too small buffers results in the loss of some events but you'll be warned if that happens. Too many buffers is a waste of memory. It's a good idea to run a trace with a sensible number of buffers to get an idea of how big the trace file is, then adjusting the number to balance the size of the output and the amount of memory available.
-stackwalk flags turns on stack walking. The stack is collected on the events listed. This has a fair overhead but the information it provides can be incredibly useful. Unless you are bothered about the overheads, don't add too many flags here. A list of available events can be obtained by running xperf -help stackwalk.
-SetProfInt 1221 sets the sampling interval to the minimum of 1221 units (1221 * 100ns = 0.1221ms), which is the maximum sample rate of roughly 8kHz. The default is 1kHz. Adjust this to the level of detail needed to decrease overheads and output size.
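The interval passed to -SetProfInt is in 100 ns units, so the conversion to a sample rate can be double-checked with a few lines (the 10000-unit default below is an assumption based on the documented 1 ms default interval):

```python
# -SetProfInt takes the profiling sample interval in 100 ns units.
units = 1221
interval_seconds = units * 100e-9        # 0.0001221 s = 0.1221 ms
rate_hz = 1 / interval_seconds           # ~8190 samples/second, i.e. ~8 kHz

default_units = 10000                    # 1 ms, the default interval
default_rate_hz = 1 / (default_units * 100e-9)   # ~1000 Hz, i.e. 1 kHz
```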

As with logman, xperf must also be run in an administrator console.

Windows Performance Analyzer

Graphs and Tables

All of the available graphs are listed on the left hand side. Some can be expanded to show variations by clicking the arrow next to the description. To open the table relating to a graph, click the table button to the top right of an opened graph. If you select a row in the table then the related area of the graph gets highlighted. eg If you select a process in the CPU Usage table, the lifetime of the process is selected. You can then zoom into the selection which, depending on the data, will also filter the table.

You can also filter the graphs by a single item of whatever they show (eg if it's a CPU usage graph it can be filtered by process or thread). To filter, just click the coloured boxes in the Series column to the left of the graph. To filter by a single process/thread, select the item, right click it, then go to Disable -> All but selection. Filtering by threads is slightly different because they are children of the process items. Right click on the column, go to Disable -> In entire graph -> All series, and then enable whichever items are needed.

Saving the Layout

You can save the tabs, graphs, tables and their configurations by going to Profiles -> Export. To reopen a layout, go to Profiles -> Apply. The zoom level of graphs is not saved. You can apply a profile to a different trace than the one you saved it from.

Stack Walking

The stack is visible only on CPU graphs. A sampled CPU graph can be used to traverse the stack in order of execution for a specific process or thread. A precise graph will only let you view the stack of the old and new thread during a context switch. In both cases you have to add the relevant column to the table. You'll also need to load the symbols for your application and for Windows, otherwise the stack shows up as question marks and exclamation marks, or something like superMegaUltra.exe!?. To load in symbols go to Trace -> Configure Symbol Paths. Already filled in is the information necessary to load in remote symbols for Windows. Add the path to the folder containing the PDB for your application before the existing information and separate the two with a semicolon. For example:


Then click OK and go to Trace -> Load Symbols. A bar should appear at the top that shows the symbols being loaded in the background. The symbols for your application should load quickly but the Windows symbols will take a long time, which is why it's important to put your applications symbols first; you can walk the stack with the symbols loaded for your application while the Windows symbols load. This won't happen if you didn't have the necessary flags set in xperf. You'll have to experiment to find which flags you need if this is the case.

Even if symbols are loaded, the stack may show "?!?" at the top level and then nothing else. I found that a "CPU Usage (Precise)" graph is required for good stack tracing. You'll need the NewThreadStack column.

Stack walking on its own is a difficult way to traverse the graph. It's easier to have two graphs open: one CPU usage graph without the stack column and one with it. That way you can sort the blocks of computation in order of SwitchInTime on the graph without the stack and scroll through them in order, having WPA highlight the current function in the stack.


xperfview

I've only used xperfview to get a more detailed table of IO information, so this is all the detail I will go into. Find the graph relating to the information you want, right click it, and open the detailed graph. Right click this graph and go to Summary Table.

Interpretation of the Data

The xperf command I used was the following:

The application I was profiling consisted of two threads. One was getting data from numerous memory-mapped files on a hard disk and performing computations on the data, and the other was prefetching those files into memory by opening native Windows file handles for them and touching a single byte on each block that the other thread needed. The block size was 8 pages because the file was opened with the FILE_FLAG_SEQUENTIAL_SCAN hint. The problem was that the main thread was hanging constantly for anywhere between a tenth of a second and a whole second, despite the prefetch thread being ahead of the main thread. Originally it looked like a scheduling issue.

To see if this was the case I followed the "Investigation" part of the "Thread Interference" section of this page. For Step 3 of "Determine a Thread's Processor Restrictions", this means: add the table to the graph, click the cog at the top of the graph to open the View Editor, drag "Cpu" from the left hand side to below "NewThreadId", then change the aggregation dropdown to "Unique Count". The troublesome thread had a count of 8, but by expanding the thread in the table and selecting each Cpu item in turn, I could see from the highlights on the graph that it was running on core 2 most of the time, specifically around the area where it was pausing.

Continuing in the "Investigation" section, in the part titled "Determine what Preempted or Starved the Thread", for Step 3 I had to remove all of the existing entries before adding everything shown in the image. The thread's average priority was 9 and it was ready during the paused area. Although the thread was being preempted at the paused area, the new process was being swapped out for idle. So the other process was given CPU time, finished, but then the main thread wasn't swapped back in, despite being ready. Therefore I concluded that the pause was not caused by preemption or starvation.

Next I used the stack trace to determine where the main thread was stopping. I zoomed into the area where the pause was occurring, keeping some of the time around the pause that had high CPU usage from the thread in question. I opened two CPU graphs: one for "Utilization by CPU" without a stack column, and another by thread with a NewThreadStack column. I then filtered the CPU core graph to just core 2, sorted it by SwitchInTime, and selected each relevant item in the table in turn. This highlighted each block of CPU usage on the graph in order of the time at which it happened. It also highlighted the functions on the stack in the table of the other graph (after expanding the thread's stack out) so that I could see what was running during the block of CPU usage. In the block of the thread's CPU usage before the pause there was a tiny blip where another process got swapped in, but it finished straight away, allowing the main thread to swap back in again. Interestingly, the following was highlighted in the stack just before the pause:


This was interesting for two reasons. Firstly, the prefetch thread was supposed to have already got all of the information that the main thread needed, and this indicated it hadn't, because a hard fault had occurred and the main thread was having to get information from disk. The other reason it was interesting was that although I could now see that the swap-out was caused by needing to do IO, it didn't explain why the thread never got swapped back in, because even a large IO request shouldn't take up to a second to complete. It also meant that the ready graph was deceptive: the main thread was not in fact ready; it was waiting on IO.

The next step was to investigate why this item of data hadn't been prefetched, and why the thread wasn't being swapped back in. For this I used the more detailed table of IO requests in xperfview (explained earlier). I sorted the table by "Init Time". It turned out that finding the IO request in question was easy, because its "IO Time" was way longer than any other. By putting the table in order I could see that there were many other requests being made and completed by the IO thread between the time the main thread's request was made and its Complete Time. Interestingly, just before the point of completion, xperf had to empty its buffers onto disk. So it looked like the OS was prioritising the prefetch thread's sequential IO above the main thread's request, and it's only at the point that xperf interrupts this pattern that the OS services the main thread's request. Also, by using the offsets in the table I could see that the prefetch thread had missed out prefetching the page that the main thread needed, which was why it was having to make the request itself. Another interesting point was the IO queue (QD/I - Queue Depth at Init Time). It was empty just before the main thread made its request, and it didn't drop back down to empty until the main thread's request finished, so the main thread's request was probably queued up during all of that time.

Sources and Further Reading

Categories: None