#1134126 file: identify Microsoft XPS (XML Paper Specification) and OpenXPS documents instead of ZIP/OOXML

Package:
file
Source:
file
Description:
Recognize the type of data in a file using "magic" numbers
Submitter:
Md Ayquassar
Date:
2026-05-15 21:15:02 UTC
Severity:
normal
#1134126#5
Date:
2026-04-16 18:49:01 UTC
From:
To:
The file(1) utility currently misclassifies valid legacy XPS (XML Paper Specification) and OpenXPS (ECMA-388) documents as generic ZIP archives or as “Microsoft OOXML”. This is incorrect: XPS packages have a well-defined internal structure, so file(1) could identify them in the same way it already distinguishes DOCX from a plain ZIP.

Why this matters: file(1) is used in digital preservation, forensic analysis, and automated document processing. Misclassifying XPS as a plain ZIP or as an Office Open XML file breaks workflows that rely on correct identification. XPS documents saw widespread use on Windows for over a decade and remain common in enterprise archives, government records, and legacy document collections.

Reproducible XPS public samples:

• http://github.com/HiraokaHyperTools/SampleXpsDocuments_1_0, which appears identical to the archived Microsoft distribution at http://web.archive.org/web/20100911214740/http://download.microsoft.com/download/1/6/a/16acc601-1b7a-42ad-8d4e-4f0aa156ec3e/SampleXpsDocuments_1_0.exe
• http://example-files.online-convert.com/document/xps/example.xps
• http://www.xpsdev.com/content/files/xps-document-example.xps
• http://podatki.gov.si/dataset/615a4e72-da2c-47d3-9246-62c60155ba1b/resource/8c895c90-313a-4610-beae-0aad6f6ea66b/download/evidencadolgoronihfinannihnalobjan24.xps
• http://huggingface.co/datasets/banned-historical-archives/banned-historical-archives/resolve/main/todo/%E5%90%84%E7%9C%81/%E9%BB%91%E9%BE%99%E6%B1%9F/%E5%93%88%E5%B8%82%E6%AF%9B%E7%BA%BA%E5%8E%82%E4%B8%93%E6%A1%88%E6%83%85%E5%86%B5%E6%8A%A5%E5%91%8A/1.xps
• http://huggingface.co/datasets/banned-historical-archives/banned-historical-archives/resolve/main/todo/%E5%90%84%E7%9C%81/%E9%BB%91%E9%BE%99%E6%B1%9F/%E5%93%88%E5%B8%82%E6%AF%9B%E7%BA%BA%E5%8E%82%E4%B8%93%E6%A1%88%E6%83%85%E5%86%B5%E6%8A%A5%E5%91%8A/2.xps
• http://huggingface.co/datasets/banned-historical-archives/banned-historical-archives/resolve/main/todo/%E5%90%84%E7%9C%81/%E9%BB%91%E9%BE%99%E6%B1%9F/%E5%93%88%E5%B8%82%E6%AF%9B%E7%BA%BA%E5%8E%82%E4%B8%93%E6%A1%88%E6%83%85%E5%86%B5%E6%8A%A5%E5%91%8A/3.xps
• further samples are available from http://www.qualitylogic.com/knowledge-center/free-sample-xps-files

Reproducible OpenXPS public sample:

• http://example-files.online-convert.com/document/oxps/example.oxps

Note: The Microsoft sample files in SampleXpsDocuments_1_0/ConformanceViolations are intentionally malformed and should not be used as a basis for positive classification heuristics.

Observed behavior:

Running `file` on .xps files produces inconsistent and misleading results, including (depending on file):

• Microsoft OOXML
• Zip archive data (various versions and compression methods)
• Unicode text, UTF-16, little-endian text, with no line terminators (for a malformed file)

Running `file -i` yields only generic types such as:

• application/octet-stream; charset=binary
• application/zip; charset=binary
• text/plain; charset=utf-16le (for a malformed file)

Running `file -z` produces:

• ERROR
• Unicode text, UTF-16, little-endian text, with no line terminators (for a malformed file)

Expected behavior:

Valid XPS documents should be recognized as such and not reduced to their container format (ZIP). Specifically, file(1) should distinguish XPS from Microsoft OOXML formats and report a meaningful type.

At minimum, the following MIME types should be emitted for valid documents:

• application/vnd.ms-xpsdocument (legacy Microsoft XPS)
• application/oxps (OpenXPS, ECMA-388)

Rationale and implementation direction:

XPS is an Open Packaging Conventions (OPC) container, similar to OOXML, but distinguishable via well-defined structural markers. Reliable detection can be implemented by inspecting:

• Presence and contents of [Content_Types].xml
• Relationship types in _rels/.rels
• FixedDocumentSequence or FixedDocument parts
• XML namespaces specific to XPS (e.g. http://schemas.microsoft.com/xps/2005/06)

This is analogous to how OOXML formats are already differentiated from generic ZIP archives in magic(5). The same approach can and should be extended to XPS.
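
To illustrate the structural markers listed above, here is a minimal sketch in Python. It is not magic(5) syntax and not a proposed patch. It inspects only the root relationships part, _rels/.rels, and maps the relationship type to a MIME type. The names `classify_opc` and `FIXEDREP_TO_MIME` are hypothetical. The legacy relationship URI is built on the namespace quoted above; the OpenXPS URI is my assumption from reading ECMA-388 and should be checked against the standard. Packages that store parts as interleaved pieces are not handled.

```python
import io
import xml.etree.ElementTree as ET
import zipfile

# Namespace of OPC relationship parts such as _rels/.rels.
RELS_NS = "http://schemas.openxmlformats.org/package/2006/relationships"

# Root relationship types that point at the FixedDocumentSequence part.
# Assumption: URIs as I understand them from the XPS 1.0 and ECMA-388 texts.
FIXEDREP_TO_MIME = {
    "http://schemas.microsoft.com/xps/2005/06/fixedrepresentation":
        "application/vnd.ms-xpsdocument",
    "http://schemas.openxps.org/oxps/v1.0/fixedrepresentation":
        "application/oxps",
}


def classify_opc(data: bytes) -> str:
    """Return a MIME type guess for `data` based on its structure only."""
    try:
        with zipfile.ZipFile(io.BytesIO(data)) as zf:
            # OPC part names are compared case-insensitively.
            names = {name.lower(): name for name in zf.namelist()}
            rels_name = names.get("_rels/.rels")
            if rels_name is None:
                return "application/zip"  # a ZIP, but not an OPC package
            root = ET.fromstring(zf.read(rels_name))
    except zipfile.BadZipFile:
        return "application/octet-stream"  # not a ZIP container at all
    except ET.ParseError:
        return "application/zip"  # corrupted relationships part

    for rel in root.iter(f"{{{RELS_NS}}}Relationship"):
        mime = FIXEDREP_TO_MIME.get(rel.get("Type", ""))
        if mime is not None:
            return mime
    return "application/zip"  # OPC package of another kind, e.g. OOXML
```

Calling `classify_opc(open(path, "rb").read())` on one of the samples above would be one way to try this. A real implementation would presumably also consult [Content_Types].xml and report OOXML siblings under their own types rather than as application/zip.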

Importantly, detection should:

• Prefer structural validation over filename extensions
• Avoid confusion with sibling OPC formats (OOXML: DOCX/XLSX/PPTX)
• Fall back to ZIP only for genuinely unidentifiable or corrupted containers
• Continue to report non-XPS types for files that are not ZIP containers at all

Impact of not fixing: file(1) will continue to report misleading types for a still‑encountered document format, forcing users to implement workarounds (e.g. checking extensions or invoking external tools such as xdg-mime, which have their own issues) and reducing the utility's reliability.

Please extend the OPC detection in file(1) / magic(5) to include XPS (legacy and ECMA-388), following the same logic already used for OOXML. A patch can be modeled after the existing OOXML detection block.

Gratefully,
Md