Setting up an image to use dask

I am trying to understand how you folks have set up dask to run on GeoLab. Maybe this is buried in your documentation somewhere, but I couldn’t find it. Gemini helped me discover that you are using a package called “dask_gateway”.

It seems the cloud infrastructure allows you to launch an instance of dask for each user as they connect. The magic changes Sophia made for the MsPASS container for the upcoming course allowed me to do this in MsPASS:

dask_client = mspass_client.get_scheduler()
cluster = dask_client.cluster 
cluster.scale(8)

Noting:

  1. mspass_client is the global client used by mspass to access all the services the package requires. get_scheduler is a method to retrieve the dask client. A first question, for the benefit of others, is what, if any, magic a user needs to do to get the url for the scheduler. The mspass_client is resolving this because the url matches an internal default.
  2. cluster is an attribute of a dask client that is an abstract handle to manipulate dask configuration. My example just calls the scale method to change the number of workers from 4 (the number of cpus I selected on startup) to 8. Note the dask documentation shows that a cluster object has a number of other useful methods for changing configuration in the JupyterHub environment. A particularly useful thing there is to just type in a code box:
cluster

and you get an html display with links to dask diagnostics.

So, part of the reason for this post was to share this information with the community. The other part is a request to Earthscope folks to create a documentation page on using dask with GeoLab. After the MsPASS class is over in July I would even volunteer to help.

I thought this solved a problem we had earlier with the mspass container, but it just moved the problem somewhere else.

It appears that “dask_gateway” is not a stock component of dask. At least the way the mspass container pulls in packages, dask_gateway ends up being undefined. The reason that is an issue is that we need to fix the mspass client to handle GeoLab correctly. I am pretty sure it will need dask_gateway to be able to connect to the instance already running on GeoLab.

Part of the problem is discussed in an open Issues page on the mspass repository found here. The solution may be as simple as resolving the url to connect to the running scheduler that GeoLab appears to launch. The alternative is to change the code for the mspass client to use dask_gateway. If we can sort out the right url it would be far far easier and more in line with the abstraction dask uses to define what a “cluster” is. That is, the idea is supposed to be that if you get an instance of a dask client, it should have a common api for local clusters (desktop), HPC clusters, or a kubernetes implementation like this one.

Can someone tell me how to resolve the URL for the dask scheduler?

Gary,

The dask cluster is available via the dask-labextension which is installed in the mspass image. You can instantiate it with:

from dask.distributed import Client

client = Client()

client
print(client.scheduler_info())

It is not necessary, but you can find the dask configuration information:

env | grep "DASK"
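
For anyone who would rather run this check from a notebook cell than a terminal, here is a minimal Python sketch of the same filter. The helper name dask_env_vars is made up for illustration, and nothing here assumes which DASK variables (if any) are actually set on GeoLab.

```python
import os

def dask_env_vars(environ=None):
    """Return environment entries whose NAME=value text contains "DASK".

    This mirrors `env | grep "DASK"`, which matches on the whole line,
    so a variable is kept if either its name or its value contains "DASK".
    The result is sorted by variable name.
    """
    environ = os.environ if environ is None else environ
    return {
        name: value
        for name, value in sorted(environ.items())
        if "DASK" in f"{name}={value}"
    }

# Print the matches for the current kernel, one per line.
for name, value in dask_env_vars().items():
    print(f"{name}={value}")
```

Running it in the same kernel that will create the dask client shows what that kernel actually sees, which may differ from a separately opened terminal.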

Note that dask_labextension, which is the default in the mspass Dockerfile, does not include Dask Gateway, which is used to handle multi-tenant cluster provisioning.

sophia

Ah, Gemini misled me on that. Let me see if I can make that work

Well, we still have a disconnect here that is going to take some digging. At the moment I’m a bit confused by what is happening. I’m sure one reason is that there is a huge state dependency that is hard to avoid here. Some questions I have are:

  1. Am I mistaken about an instance of dask running automatically in GeoLab with this mspass image? There is evidence that there is, in fact, nothing running on initial startup. I restarted “server” from the hub control panel and issued the standard incantation to create a “mspass client” (which includes a dask scheduler) in one notebook. That worked fine. I then switched to a different notebook, issued the same commands, and got the error that is the subject of the github post: “cluster already running”. I think that what happened was the first notebook created a running dask instance. The second needed to connect to that one, not create a new one.
  2. Given 1, I still don’t see how to connect to the running instance. Your suggestion to use env | grep DASK is helpful, but there are multiple host names there as there seem to be multiple proxies in this JupyterHub.
  3. Note that when I create a dask.distributed.Client with defaults like above after I’ve launched dask in another notebook, I get the same error: a cluster is already running.

I think anyone using dask in any context with multiple notebooks open is going to face this issue. We need to figure out how to create a client for a notebook that can connect to an already running instance of the dask scheduler on JupyterHub. I think the problem is all those proxies make it hard to figure out the right incantation to tell the Client constructor where to look.
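
To make the question concrete, here is a rough sketch of the "connect if a scheduler address is known, otherwise start one" pattern I have in mind. The helper names pick_scheduler_address and connect are hypothetical. DASK_SCHEDULER_ADDRESS is the environment form of dask's scheduler-address configuration key, but whether GeoLab sets it, and whether the address is reachable through the proxies, are assumptions I have not verified.

```python
import os

def pick_scheduler_address(explicit=None, environ=None):
    """Choose the scheduler address to try, or None if none is known.

    Order of preference: an address passed in explicitly, then the
    DASK_SCHEDULER_ADDRESS environment variable (assumed, not confirmed,
    to be meaningful on GeoLab).
    """
    environ = os.environ if environ is None else environ
    return explicit or environ.get("DASK_SCHEDULER_ADDRESS") or None

def connect(explicit=None):
    """Return a dask Client, reusing a running scheduler when an address is known."""
    # Imported here so the address logic above can be used without dask installed.
    from dask.distributed import Client

    address = pick_scheduler_address(explicit)
    if address is None:
        # No address known: Client() with defaults starts a new local cluster.
        return Client()
    # Connect to the scheduler that is already running at this address.
    return Client(address, timeout=10)
```

In the notebook that started dask, client.scheduler_info()["address"] reports the address the scheduler is listening on; pasting that string into connect(...) in the second notebook is the part that may or may not survive the JupyterHub proxies.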