MsRaster Design
MsRaster is the first test of the approach described in
visibility plotting. It provides a Python interface, with an optional GUI,
for creating raster plots that are shown in a browser tab or
saved to file.
Note
This design page contains detailed notes aimed at developers.
Code organization
The vidavis source code is divided into the broad categories (with
corresponding directory names) of apps, data, plot, bokeh, and toolbox.
apps: MsRaster (future home of MsScatter as well)
data:
data/measurement_set: subdirectory for MeasurementSet data
data/measurement_set/processing_set: subdirectory for ProcessingSet data and support functions
data subdirectories allow addition of other types of data, e.g. calibration tables
measurement_set subdirectories allow addition of other formats of data, e.g. casatools interface
plot:
plot/ms_plot: subdirectory for MeasurementSet plots and support functions
bokeh: Color palette (common with cubevis)
toolbox: AppContext (common with cubevis)
MsRaster Data
Abbreviations:
ps : ProcessingSet
ms : MeasurementSet
xdt : Xarray DataTree
xds : Xarray Dataset
xda : Xarray DataArray
MsRaster only supports XRADIO ProcessingSet Zarr files. If an MSv2 path is provided, the MSv2 is converted to MSv4 in the same directory using the XRADIO conversion function with default partitioning. The Zarr file is lazily opened with only the metadata, returning an Xarray DataTree in the ProcessingSet schema (ps_xdt), which contains MSv4 Xarray DataTrees (ms_xdt).
These MSv4s are the partitions of the MSv2 that were created when the Zarr file was written. The default partitioning is by data description (spectral window and polarization setup) and obs mode (intent?).
On disk, the Zarr file is a directory, and each subdirectory is an MSv4 name. Each directory contains .zattrs and .zmetadata, where .zattrs contains the DataTree attributes and .zmetadata contains the consolidated attributes of subtrees. This is important to understand when writing to zarr (as in flagging); each ms_xdt must be written to zarr to write the new data group attributes, then the ps_xdt must be written to zarr to consolidate these new attributes at the ProcessingSet level. Besides these dot files, the MSv4 directories contain directories for each coordinate and data variable.
Accessors:
ps_xdt : Xarray DataTree returned from open_processing_set
ps_xdt.xr_ps : XRADIO ProcessingSetXdt
ps_xdt[ms_name] = ms_xdt : Xarray DataTree
ms_xdt.xr_ms : XRADIO MeasurementSetXdt
ms_xdt.ds : correlated data Xarray Dataset (a read-only DatasetView)
ms_xdt.name or ms_xdt[name] : Xarray DataArray for the coordinate or data variable name
xarray.DataArray.values = np.ndarray
xarray.DataArray.data = array of underlying type (e.g. dask.Array)
To understand the data schema, it is highly recommended to review the examples
in the XRADIO MeasurementSet tutorial.
Each ms_xdt contains dimensions, coordinates, data variables, and attributes.
Dimensions are named or labeled 1-dimensional coordinate arrays, so data
values are accessed using the dimension label rather than by an index (for
example, a polarization dimension would be accessed using “XX” rather than
index 0). Coordinates are Xarray DataArrays, which may or may not be dimensions
(dimension coordinates are marked with a * in the Xarray repr). Data variables
are multidimensional Xarray DataArrays.
These DataArrays have attribute dictionaries holding arbitrary metadata
describing the data (for example, units).
MeasurementSet DataTrees can contain multiple sets of VISIBILITY, FLAG, WEIGHT, and UVW data variables. To maintain the relationship between these variables, the ms_xdts have a “data_groups” attribute. Each data group is a dictionary of “correlated_data”, “flag”, “weight”, and “uvw”. The “base” data group has VISIBILITY correlated data which corresponds to the DATA column in an MSv2. There may also be VISIBILITY_CORRECTED and VISIBILITY_MODEL data variables in “corrected” and “model” data groups (same flag, weight, etc.). The default data group is “base”.
A data variable can belong to multiple data groups, for example when flags are manually set or weights are modified during calibration. All that is required is to create a new data group with the relevant key updated to the new data variable, select that data group, and plot it. In fact, this is how MsRaster flagging is done.
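As a minimal sketch of the idea above, the data_groups attribute can be modeled as a plain dictionary. The keys “correlated_data”, “flag”, “weight”, and “uvw” come from the text; the group name manual_flags, the variable name FLAG_MANUAL, and the helper add_data_group are hypothetical and are not part of XRADIO or MsRaster.

```python
import copy

# Hypothetical "data_groups" attribute of one ms_xdt, using the keys named above.
data_groups = {
    "base": {
        "correlated_data": "VISIBILITY",
        "flag": "FLAG",
        "weight": "WEIGHT",
        "uvw": "UVW",
    },
    "corrected": {
        "correlated_data": "VISIBILITY_CORRECTED",
        "flag": "FLAG",
        "weight": "WEIGHT",
        "uvw": "UVW",
    },
}


def add_data_group(groups, source, new_name, **updates):
    """Return a copy of groups with new_name derived from source.

    Only keys that already exist in the source group may be updated.
    """
    unknown = set(updates) - set(groups[source])
    if unknown:
        raise KeyError(f"unknown data group keys: {sorted(unknown)}")
    new_groups = copy.deepcopy(groups)
    new_groups[new_name] = {**groups[source], **updates}
    return new_groups


# A new group that reuses the base visibilities but points at a new flag variable.
flagged_groups = add_data_group(
    data_groups, "base", "manual_flags", flag="FLAG_MANUAL"
)
```

Here FLAG belongs to both “base” and “corrected”, while the new group differs from “base” only in its flag entry.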
Warning
When a data_group is selected using XRADIO ProcessingSetXdt.query(), the
data group is selected in the ms_xdts in the processing set not only in the
returned ps_xdt but also in the source ps_xdt. For example, after
selected_ps=ps_xdt.query(data_group_name='corrected') the selected_ps
and ps_xdt both contain only the corrected data group, and selecting another
data group, for example selected_ps2=ps_xdt.query(data_group_name='model'),
results in an error. To restore all data groups for selection, recreate the
ps_xdt using open_processing_set.
Visibility data has four dimensions: time, baseline_id, frequency, and polarization.
time (float64): MsRaster converts to datetime64 using the time DataArray attributes.
baseline_id (int64): each id is an index into baseline_antenna1_name and baseline_antenna2_name in MSv4 correlated data. The ids are not consistent across ms_xdts, and a specific id may identify different antenna pairs in different ms_xdts. Because of this, MsRaster creates a ‘baseline’ coordinate of antenna-name strings “ant1 & ant2”, builds a sorted list of all antenna pairs in all ms_xdts, and assigns each pair’s index in that list as a consistent baseline_id coordinate. This is necessary because all selected MSv4s are concatenated into a single Xarray Dataset for plotting, so the baseline id must be consistent.
frequency (float64): units are described in this DataArray’s attributes, usually “Hz”. If the frequencies are in the GHz range then the values are converted to GHz and the units are updated accordingly. For consistent data shapes, if the user has not selected a spectral window, the first spw (at the minimum time) is automatically selected.
polarization (unicode string): polarization labels rather than ids. For a plot axis, ids are assigned to the polarizations in casacore Stokes order.
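The consistent baseline_id assignment described in the list above can be sketched in plain Python. This is an illustration under assumptions, not the MsRaster implementation: the antenna names, the ms_baselines structure, and the function names are hypothetical, and real code would read baseline_antenna1_name and baseline_antenna2_name from each ms_xdt.

```python
# Hypothetical antenna-name lists per MSv4, indexed by that MSv4's own baseline_id.
ms_baselines = {
    "ms_0": {"ant1": ["DA01", "DA01", "DA02"], "ant2": ["DA02", "DA03", "DA03"]},
    "ms_1": {"ant1": ["DA02", "DA01"], "ant2": ["DA03", "DA02"]},
}


def baseline_labels(ant1_names, ant2_names):
    """Build 'ant1 & ant2' label strings for one MSv4."""
    return [f"{a1} & {a2}" for a1, a2 in zip(ant1_names, ant2_names)]


def consistent_baseline_ids(baselines):
    """Map every MSv4's baselines onto one sorted, shared list of labels."""
    labels_per_ms = {
        name: baseline_labels(b["ant1"], b["ant2"]) for name, b in baselines.items()
    }
    # The union of all labels, sorted, defines the shared id for each antenna pair.
    all_labels = sorted({lbl for lbls in labels_per_ms.values() for lbl in lbls})
    label_to_id = {lbl: i for i, lbl in enumerate(all_labels)}
    ids_per_ms = {
        name: [label_to_id[lbl] for lbl in lbls]
        for name, lbls in labels_per_ms.items()
    }
    return all_labels, ids_per_ms


all_labels, ids_per_ms = consistent_baseline_ids(ms_baselines)
```

In this example the local id 0 means “DA01 & DA02” in ms_0 but “DA02 & DA03” in ms_1; after remapping, each antenna pair has the same id in both.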
The user selects which two dimensions will be the x- and y-axes of the plot (default time vs. baseline). Numeric dimensions can easily be used as axes, but string dimensions must have an index assigned. The index DataArray becomes the dimension, and the string coordinate labels are used for the axis tick labels.
Since visibility data is 4D and raster plot data is 2D, the dimensions which are not plot axes must be selected or aggregated. MsRaster allows the user to specify this selection or aggregation, else the “first” values are selected automatically. For numeric dimensions, the “first” value is the minimum value in the plot data. Polarization labels are assigned the enum value listed in casacore Stokes.h, then the minimum enum value is used to select its corresponding label. For example, in a time vs. baseline plot, the first frequency and polarization are automatically selected.
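A rough sketch of the automatic “first” selection and of the index assignment for the polarization axis follows. The enum values are a subset of those in casacore Stokes.h; the function names are hypothetical and only illustrate the rule described above.

```python
# Subset of the casacore Stokes::StokesTypes enum values (Stokes.h).
STOKES_ENUM = {
    "I": 1, "Q": 2, "U": 3, "V": 4,
    "RR": 5, "RL": 6, "LR": 7, "LL": 8,
    "XX": 9, "XY": 10, "YX": 11, "YY": 12,
}


def first_polarization(labels):
    """Return the label with the minimum Stokes enum value."""
    return min(labels, key=STOKES_ENUM.__getitem__)


def polarization_axis(labels):
    """Return (index, tick label) pairs in Stokes order for a plot axis."""
    ordered = sorted(labels, key=STOKES_ENUM.__getitem__)
    return list(enumerate(ordered))


def first_numeric(values):
    """For numeric dimensions the 'first' value is simply the minimum."""
    return min(values)


first_pol = first_polarization(["YY", "XX"])
pol_axis = polarization_axis(["YY", "XY", "XX", "YX"])
first_freq = first_numeric([1.5e9, 1.4e9, 1.6e9])
```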
The 2D data in all MSv4s is concatenated into one Xarray Dataset, then the complex component for the selected “vis_axis” is applied to the visibility DataArray indicated by the data group’s “correlated_data” key. Note that the coordinates and data variables in the Dataset, which are stored in Dask arrays, are lazily evaluated and only computed when the plot is shown or saved. For this reason, creating a plot is very fast, but show() and save() are slower because that is when the selection and calculations in the Dask graph are performed.
However, the values computed for the plot are then discarded and must be recomputed when accessed again. To locate a point in the plot (e.g. the cursor position) the values shown in Cursor Location must be computed again. To avoid this, one could compute the Xarray DataArrays in the Dataset (xda.compute()), but this puts the calculated data in a numpy array in memory. This approach would not work where the data would not fit in memory.
Plot
Raster plots are created in a RasterPlot class using the supplied raster
Xarray Dataset and the stored style parameters (e.g. colormaps). First, a plot
is created for the visibility DataArray with the unflagged colormap, then a new
Xarray DataArray is created for flagged data (VISIBILITY where FLAG==1); values
are nan where FLAG==0. A second plot is created using the flagged colormap.
The data plot is overlaid by the flagged plot in a Holoviews Overlay
plot object which is returned to the application.
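The flagged-data array described above (values kept where FLAG==1, nan where FLAG==0) can be sketched with NumPy. This is an assumption-laden illustration that uses plain arrays instead of Xarray DataArrays; the function name flagged_only and the sample values are hypothetical.

```python
import numpy as np


def flagged_only(values, flags):
    """Keep values where the flag is set; use NaN everywhere else."""
    values = np.asarray(values, dtype=float)
    flags = np.asarray(flags, dtype=bool)
    return np.where(flags, values, np.nan)


# Hypothetical 2D raster (e.g. amplitude) with one flagged cell.
amp = np.array([[1.0, 2.0], [3.0, 4.0]])
flag = np.array([[0, 1], [0, 0]])

flagged = flagged_only(amp, flag)
```

Because NaN cells are typically not drawn, overlaying a plot of this array on the data plot lets the flagged colormap show only at the flagged cells.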
The Overlay plot is stored in a list; additional plots may be added to the list using iteration or clear_plots=False. The list of plots may then be combined in a Holoviews Layout according to the number of rows and columns.
Plot operators:
overlay = plot1 * plot2
layout = plot1 + plot2
GUI
The GUI is created in MsRaster using a Holoviews DynamicMap
placeholder for the plot and common Panel components arranged in Panel
tabs, rows and columns. The components are created with callbacks to MsRaster
and MsPlot methods by adding the callback to the layout where the user input
is located.
Locating the cursor, points, and boxes utilizes Holoviews streams. These streams are added to the DynamicMap plot so that the callbacks can locate the values in each stream. The values passed to the callback functions are the coordinate values in the plot Dataset.
For all callbacks, the GUI is not updated until the callback function returns (see Future Work refactoring below).
Future Work
Data
Support MSv2 without conversion. See issue in XRADIO where an MSv4 interface is added to MSv2 so that a ProcessingSet can still be used.
Documentation
MsRaster API requested
Refactoring
MsPlot: too much implementation code has been added here; features (locate, flagging) should be moved to another file or class. Some code will probably be raster plot-specific (which will be apparent when scatter plots are added) and should be moved from MsPlot to MsRaster or elsewhere.
Locate: To the user, it looks like nothing is happening in the GUI until all points are located, rather than the points filling in one by one. When flagging, locate must finish before the flag button is enabled. This is a poor user experience; locating should be done in a separate thread.
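One possible shape for the threaded locate proposed above, using only the standard library, is sketched here. All names are hypothetical and locate_point is a stand-in for the real lookup; in an actual Panel/Bokeh application, GUI updates made from a worker thread would also need to be scheduled safely on the server document, which this sketch does not cover.

```python
import threading


def locate_point(point):
    """Hypothetical stand-in for the lookup of one plot point."""
    x, y = point
    return {"x": x, "y": y}


def locate_in_background(points, on_located, on_done):
    """Locate points in a worker thread, reporting each result as it is ready."""

    def worker():
        for point in points:
            on_located(point, locate_point(point))
        on_done()

    thread = threading.Thread(target=worker, daemon=True)
    thread.start()
    return thread


located = []
done = threading.Event()

thread = locate_in_background(
    [(0, 1), (2, 3)],
    on_located=lambda point, result: located.append((point, result)),
    on_done=done.set,  # e.g. the point at which a flag button could be enabled
)
thread.join()
```

The per-point callback is what would let located points fill in one by one, while the completion callback marks when flagging can be enabled.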
Plotting
Rasterize: hvplot supports plotting with rasterize=True, but this caused an error in MsRaster (xda.hvplot(…, rasterize=True) in _raster_plot.py). Not sure why. This would be important for future datasets with large time, baseline, or frequency axes.
datashade=True returns RGB object instead of points (r, g, b in hover)
rasterize=True returns aggregate object instead of points (x, y, count in hover, maybe you can select another aggregator besides count)
Flagging
More testing needs to be done
Flags only written to Zarr file. Need manual flagging to be available to apply to MSv2 by writing selection strings to file. See casadocs flagdata documentation describing mode=’list’ with inpfile (?).
Weights
Not used in any aggregation, for example mean is the Xarray Dataset mean() of the visibility data along the specified data dimension
Performance
Setting up toolviper Dask client is key (perhaps beyond scope of vis plotting). Will be more critical for scatter plots which use much more data; a scatter plot could include all visibility data in the processing set e.g. amp vs time plot.
Dask diagnostic dashboard
Dask local diagnostics
Usage modes
So far, only done from Python session. Have not tested demo scripts lately.
Notebook:
Scatter plots
While raster plots use dimensions as plot axes, scatter plots can use any coordinate or data variable as axes. These axes can have different and even unrelated dimensions. The easiest way to create (x,y) points is to convert the Xarray Dataset to a Pandas DataFrame of tabular data. However, Pandas DataFrames are in-memory data structures and can raise MemoryError exceptions for large data. Possibly plot subsets of data and combine in overlay?
Use Datashader for these large plots – see how Datashader solves Plotting Pitfalls. Datashader is part of the Holoviz libraries including hvPlot, Holoviews, and Panel.