Extracting OLE objects from Microsoft Office documents

Why would I even write about such a topic? Because one very well respected organization decided to document its RESTful API in the form of a Microsoft Word document (Office Open XML, a format that is standardized on paper but notoriously hard to work with by hand). That would not be much of a problem, thanks to LibreOffice, but those guys decided that the best way to show sample request/response bodies would be to embed text files as binary OLE objects. LibreOffice cannot help there: it extracts something out of that object, but the result is not readable. But let's get to the action.

So you have a document that has a heading "Sample Validate Accession POST". Extract the .docx contents (it is just a zip file) and open word/document.xml. You had better have a pretty-print-XML extension in your text editor, because MS Office does not care for human readers. Then you can start looking for the section you need. That is not that easy: even though the entire text is a single heading, MS Word splits it across several runs and saves it in this lovely form:

<w:p w:rsidR="00F86F8E" w:rsidRDefault="006A0C7D" w:rsidP="007D5F6C">
  <w:pPr>
    <w:pStyle w:val="Heading4" />
  </w:pPr>
  <w:r>
    <w:t xml:space="preserve">Sample Validate Accession </w:t>
  </w:r>
  <w:r w:rsidR="00F42C5B">
    <w:t>P</w:t>
  </w:r>
  <w:r w:rsidR="000543EB">
    <w:t>OST</w:t>
  </w:r>
</w:p>
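
Because the heading is split across runs, a plain text search for the whole phrase fails. A minimal sketch of one way around that, assuming Python's standard library is available: join every w:t inside each w:p and compare the result. The function names paragraph_text and find_paragraphs are made up for this illustration.

```python
import xml.etree.ElementTree as ET

# WordprocessingML main namespace (the "w:" prefix in document.xml)
W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"


def paragraph_text(p):
    """Join the text of every w:t descendant of one w:p element."""
    return "".join(t.text or "" for t in p.iter(f"{{{W}}}t"))


def find_paragraphs(document_xml, wanted):
    """Return the w:p elements whose joined run text equals `wanted`.

    Leading and trailing whitespace of the paragraph is ignored.
    """
    root = ET.fromstring(document_xml)
    return [
        p for p in root.iter(f"{{{W}}}p")
        if paragraph_text(p).strip() == wanted
    ]
```

The document_xml argument can be the bytes read straight from the archive, for example with zipfile.ZipFile(path).read("word/document.xml"), where path is wherever your .docx lives.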

From there you should be able to follow the text and eventually find the object you need to extract:

  <w:r>
    <w:object w:dxaOrig="3780" w:dyaOrig="810">
      <v:shape id="_x0000_i1033" type="#_x0000_t75" style="width:187.95pt;height:40.2pt" o:ole="">
        <v:imagedata r:id="rId46" o:title="" />
      </v:shape>
      <o:OLEObject Type="Embed" ProgID="Package" ShapeID="_x0000_i1033" DrawAspect="Content" ObjectID="_1473167340" r:id="rId47" />
    </w:object>
  </w:r>

You are only interested in the relationship id of the o:OLEObject element, that is, this part: r:id="rId47". (The rId46 on v:imagedata only points to the preview image.)
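
If the document embeds many objects, listing all of their relationship ids at once may be quicker than scrolling. A rough sketch under the assumption that the objects are stored as o:OLEObject elements like the one above; ole_relationship_ids is a hypothetical helper name.

```python
import xml.etree.ElementTree as ET

# Namespaces behind the "o:" and "r:" prefixes used in document.xml
O = "urn:schemas-microsoft-com:office:office"
R = "http://schemas.openxmlformats.org/officeDocument/2006/relationships"


def ole_relationship_ids(document_xml):
    """Return (ProgID, r:id) pairs for every o:OLEObject, in document order."""
    root = ET.fromstring(document_xml)
    return [
        (el.get("ProgID"), el.get(f"{{{R}}}id"))
        for el in root.iter(f"{{{O}}}OLEObject")
    ]
```

Only o:OLEObject elements are inspected, so the r:id on v:imagedata is skipped.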

Next, open the file ./_rels/document.xml.rels, found in the word directory next to document.xml (it is an XML file, despite that idiosyncratic extension). In that file you can simply search for the id and you will find an element like this:

<Relationship Id="rId47"
  Type="http://schemas.openxmlformats.org/officeDocument/2006/relationships/oleObject"
  Target="embeddings/oleObject9.bin"
/>
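
The same lookup can be scripted. A minimal sketch, assuming the .rels content has already been read into a string or bytes; relationship_target and zip_member_for are illustrative names, and the default base of "word" assumes the relationships file belongs to word/document.xml.

```python
import posixpath
import xml.etree.ElementTree as ET

# Namespace of the <Relationships> / <Relationship> elements in .rels files
REL = "http://schemas.openxmlformats.org/package/2006/relationships"


def relationship_target(rels_xml, rel_id):
    """Return the Target of the relationship with the given Id, or None."""
    root = ET.fromstring(rels_xml)
    for rel in root.iter(f"{{{REL}}}Relationship"):
        if rel.get("Id") == rel_id:
            return rel.get("Target")
    return None


def zip_member_for(target, base="word"):
    """Turn a Target into a member name inside the .docx archive.

    Relative targets are resolved against `base`; a leading "/" (a
    package-absolute target) is dropped because zip member names have none.
    """
    return posixpath.normpath(posixpath.join(base, target)).lstrip("/")
```

For the relationship above this gives word/embeddings/oleObject9.bin, which is the name to pass to zipfile.ZipFile.read if you do not want to unpack the whole archive.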

The last thing you need to do is open your ./embeddings/oleObject9.bin file and strip it of Windows line breaks and useless metadata. Or maybe not that useless: I found a fairly unique user name in mine. Just a reminder, one of these GNU cat invocations may be a huge help for spotting non-printing characters:

cat -v oleObject9.bin
cat -T oleObject9.bin
cat -A oleObject9.bin
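
If eyeballing the cat output gets tedious, a crude alternative is to pull out the printable stretches, much like the strings utility does. This is only a sketch and does not parse the OLE container at all: it assumes the embedded text is plain ASCII, and printable_runs and its min_len parameter are invented for this example. File names, paths and user names stored in the metadata will show up as runs too.

```python
import re


def printable_runs(data, min_len=8):
    """Return runs of printable ASCII found in raw bytes.

    A run is at least `min_len` consecutive bytes that are printable ASCII,
    tab, CR or LF. Windows line breaks (CRLF) are converted to LF.
    """
    pattern = re.compile(rb"[\t\r\n\x20-\x7e]{%d,}" % min_len)
    return [
        m.group().decode("ascii").replace("\r\n", "\n")
        for m in pattern.finditer(data)
    ]
```

Feed it the bytes of the .bin file (for example via open(path, "rb").read()) and pick the run that looks like your request or response body.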

As a final thought: please, do not write RESTful service documentation in Microsoft Word format.