Zarr Encoding Specification
In implementing support for the Zarr storage format, Xarray developers made some ad hoc choices about how to store NetCDF data in Zarr. Future versions of the Zarr spec will likely include a more formal convention for storing the NetCDF data model in Zarr; see the Zarr spec repo for ongoing discussion.
First, Xarray can only read and write Zarr groups; there is currently no support for reading or writing individual Zarr arrays. Zarr groups are mapped to Xarray Dataset objects.
Second, from Xarray’s point of view, the key difference between NetCDF and Zarr is that all NetCDF arrays have dimension names while Zarr arrays do not. Therefore, in order to store NetCDF data in Zarr, Xarray must somehow encode and decode the name of each array’s dimensions.
To accomplish this, Xarray developers decided to define a special Zarr array attribute: _ARRAY_DIMENSIONS. The value of this attribute is a list of dimension names (strings), for example ["time", "lon", "lat"]. When writing data to Zarr, Xarray sets this attribute on all variables based on the variable dimensions. When reading a Zarr group, Xarray looks for this attribute on all arrays, raising an error if it can't be found. The attribute is used to define the variable dimension names and is then removed from the attributes dictionary returned to the user.
Because of these choices, Xarray cannot read arbitrary array data, but only Zarr data with valid _ARRAY_DIMENSIONS or NCZarr attributes on each array (NCZarr dimension names are defined in the .zarray file).
After decoding the _ARRAY_DIMENSIONS or NCZarr attribute and assigning the variable dimensions, Xarray proceeds to [optionally] decode each variable using its standard CF decoding machinery used for NetCDF data (see decode_cf()).
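As an illustrative sketch of that CF decoding step (the variable name and attribute values here are made up for the example):

```python
import numpy as np
import xarray as xr

# A variable stored as packed int16 with CF scale/offset attributes...
raw = xr.Dataset(
    {
        "temp": (
            "x",
            np.array([0, 10, 20], dtype="int16"),
            {"scale_factor": 0.5, "add_offset": 273.0, "units": "K"},
        )
    }
)

# ...is unpacked to floats by the same machinery Xarray applies after
# reading a Zarr (or NetCDF) store: decoded = raw * scale_factor + add_offset.
decoded = xr.decode_cf(raw)
```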
Finally, it's worth noting that Xarray writes (and attempts to read) "consolidated metadata" by default (the .zmetadata file), which is another non-standard Zarr extension, albeit one implemented upstream in Zarr-Python. You do not need to write consolidated metadata to make Zarr stores readable in Xarray, but because Xarray can open these stores much faster, users will see a warning about poor performance when reading non-consolidated stores unless they explicitly set consolidated=False. See Consolidated Metadata for more details.
As a concrete example, here we write a tutorial dataset to Zarr and then re-open it directly with Zarr:
In [1]: import os
In [2]: import xarray as xr
In [3]: import zarr
In [4]: ds = xr.tutorial.load_dataset("rasm")
In [5]: ds.to_zarr("rasm.zarr", mode="w")
Out[5]: <xarray.backends.zarr.ZarrStore at 0x7370f9c1dab0>
In [6]: zgroup = zarr.open("rasm.zarr")
In [7]: print(os.listdir("rasm.zarr"))
['time', 'zarr.json']
In [8]: print(zgroup.tree())  # note: zarr's Group.tree() requires the optional 'rich' package
In [9]: dict(zgroup["Tair"].attrs)
The returned dictionary includes the _ARRAY_DIMENSIONS key described above, listing the array's dimension names alongside its other attributes.