
Does not work with structured arrays #42

Open
jeffpeck10x opened this issue Jan 8, 2024 · 4 comments

jeffpeck10x commented Jan 8, 2024

.npy files that contain structured arrays, as described here, fail to open with the error:

Uncaught SyntaxError: Unexpected token ( in JSON at position 28

Example: in Python, create and save the structured array:

import numpy as np

np.save('test/out.npy', np.array([('Rex', 9, 81.0), ('Fido', 3, 27.0)],
                                  dtype=[('name', 'U10'), ('age', 'i4'), ('weight', 'f4')]))

Serve the out.npy file:

cd test
npx serve

Try to load that out.npy file:

> np.load("http://localhost:3000/out.npy")
Promise {
  <pending>,
  [Symbol(async_id_symbol)]: 4454,
  [Symbol(trigger_async_id_symbol)]: 5
}
> Uncaught SyntaxError: Unexpected token ( in JSON at position 28
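
For reference, the text portion of the .npy header for the array above looks roughly like this; the descr of a structured array is a Python list of tuples rather than a plain dtype string, which appears to be what the JSON-based header parsing trips over:

{'descr': [('name', '<U10'), ('age', '<i4'), ('weight', '<f4')], 'fortran_order': False, 'shape': (2,), }
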
jeffpeck10x commented

Here is a rough idea of how to parse a structured array. It departs a bit from how the library is structured, but this stand-alone example worked with a structured array that I was working with. Maybe it will be helpful for others.

// Map of numpy dtype codes to DataView accessors.
// Note: only fixed-width numeric dtypes are covered; string fields
// (e.g. the 'U10' name column from the example above) would need separate handling.
const dtypes = {
  u1: {
    bytesPerElement: Uint8Array.BYTES_PER_ELEMENT,
    dvFnName: 'getUint8'
  },
  u2: {
    bytesPerElement: Uint16Array.BYTES_PER_ELEMENT,
    dvFnName: 'getUint16'
  },
  i1: {
    bytesPerElement: Int8Array.BYTES_PER_ELEMENT,
    dvFnName: 'getInt8'
  },
  i2: {
    bytesPerElement: Int16Array.BYTES_PER_ELEMENT,
    dvFnName: 'getInt16'
  },
  u4: {
    bytesPerElement: Uint32Array.BYTES_PER_ELEMENT,
    dvFnName: 'getUint32'
  },
  i4: {
    bytesPerElement: Int32Array.BYTES_PER_ELEMENT,
    dvFnName: 'getInt32'
  },
  u8: {
    bytesPerElement: BigUint64Array.BYTES_PER_ELEMENT,
    dvFnName: 'getBigUint64'
  },
  i8: {
    bytesPerElement: BigInt64Array.BYTES_PER_ELEMENT,
    dvFnName: 'getBigInt64'
  },
  f4: {
    bytesPerElement: Float32Array.BYTES_PER_ELEMENT,
    dvFnName: 'getFloat32'
  },
  f8: {
    bytesPerElement: Float64Array.BYTES_PER_ELEMENT,
    dvFnName: 'getFloat64'
  },
};

function parse(arrayBufferContents) {
  const dv = new DataView(arrayBufferContents);

  // .npy v1.0 layout: 6-byte magic, 2-byte version, then a little-endian
  // uint16 header length at byte 8; the data starts right after the header.
  const headerLength = dv.getUint16(8, true);
  const offsetBytes = 10 + headerLength;

  const hcontents = new TextDecoder("utf-8").decode(
    new Uint8Array(arrayBufferContents.slice(10, 10 + headerLength))
  );

  const [, descr, fortranOrder, shape] = hcontents.match(
    /{'descr': (.*), 'fortran_order': (.*), 'shape': (.*), }/
  );

  const columns = [...descr.matchAll(/\('([^']+)', '([\|<>])([^']+)'\)/g)].map(
    ([, columnName, endianess, dtype]) => ({
      columnName,
      littleEndian: endianess === "<",
      bytesPerElement: dtypes[dtype].bytesPerElement,
      dvFn: (...args) => dv[dtypes[dtype].dvFnName](...args),
    })
  );

  const numRows = Number(shape.match(/\((\d+),\)/)[1]);

  // Total byte width of one row.
  const stride = columns
    .map((c) => c.bytesPerElement)
    .reduce((sum, numBytes) => sum + numBytes, 0);

  const data = [];
  let i, c, offset, dataIdx, row;
  const numColumns = columns.length;
  for (i = 0; i < numRows; i++) {
    offset = 0;
    row = {};
    dataIdx = offsetBytes + i * stride;
    for (c = 0; c < numColumns; c++) {
      row[columns[c].columnName] = columns[c].dvFn(dataIdx + offset, columns[c].littleEndian);
      offset += columns[c].bytesPerElement;
    }
    data.push(row);
  }

  return data;
}
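
A minimal usage sketch, under the assumption that the served file contains only numeric fields (the dtypes map above does not cover string fields such as the 'U10' name column from the original example); the file name here is illustrative:

// Hypothetical usage: fetch a numeric-only structured .npy file and
// print the parsed rows (an array of { columnName: value } objects).
fetch("http://localhost:3000/out_numeric.npy")
  .then((res) => res.arrayBuffer())
  .then((buf) => console.log(parse(buf)));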


j6k4m8 commented Feb 13, 2024

Wow thank you for the thoughtful comments @jeffpeck10x!! Would you be interested in turning this into a PR? Otherwise I'm happy to incorporate these suggestions in my next dev push :)

j6k4m8 self-assigned this Feb 13, 2024

jeffpeck10x commented Feb 13, 2024

Thanks @j6k4m8. I don't think I have the time to generalize this into a more complete approach. I did end up taking this a little further, although I ultimately switched to Apache Arrow for my use case. That said, here is some helpful code to get this started.

The following function should reliably extract columns and numRows from the header:

function parseHeaderContents(hcontents) {
  const [, descr, fortranOrder, shape] = hcontents.match(
    /{'descr': (.*), 'fortran_order': (.*), 'shape': (.*), }/
  );

  // Assumes the dtypes map also carries a TypedArray constructor per dtype
  // (see the sketch below).
  const columns = [...descr.matchAll(/\('([^']+)', '([\|<>])([^']+)'\)/g)].map(
    ([, columnName, endianess, dtype]) => ({
      columnName,
      littleEndian: endianess === "<",
      dvFnName: dtypes[dtype].dvFnName,
      TypedArray: dtypes[dtype].TypedArray,
    })
  );

  const numRows = Number(shape.match(/\((\d+),\)/)[1]);

  return { columns, numRows };
}
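
parseHeaderContents assumes the earlier dtypes map is extended so that each entry also names the TypedArray used to hold a whole column. A minimal sketch of that extension (only a few entries shown, the rest follow the same pattern):

// Hypothetical extension of the dtypes map above.
const dtypes = {
  i4: { bytesPerElement: 4, dvFnName: 'getInt32',   TypedArray: Int32Array },
  u4: { bytesPerElement: 4, dvFnName: 'getUint32',  TypedArray: Uint32Array },
  f4: { bytesPerElement: 4, dvFnName: 'getFloat32', TypedArray: Float32Array },
  f8: { bytesPerElement: 8, dvFnName: 'getFloat64', TypedArray: Float64Array },
  // ...remaining numeric dtypes follow the same pattern.
};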

I ended up writing the following parse function to get at the data when I was experimenting:

function parse(arrayBufferContents) {
  const dv = new DataView(arrayBufferContents);

  // .npy v1.0: 6-byte magic + 2-byte version, then a little-endian uint16
  // header length at byte 8; the data starts right after the header.
  const headerLength = dv.getUint16(8, true);
  const offsetBytes = 10 + headerLength;

  const hcontents = new TextDecoder("utf-8").decode(
    new Uint8Array(arrayBufferContents.slice(10, 10 + headerLength))
  );

  const { columns, numRows } = parseHeaderContents(hcontents);
  const columnNames = columns.map((c) => c.columnName);
  // Cumulative byte offsets of each column within a row; the popped last
  // element is the total row width (the stride).
  const offsets = columns
    .map((c) => c.TypedArray.BYTES_PER_ELEMENT)
    .reduce((arr, v, i) => [...arr, v + arr[i]], [0]);
  const stride = offsets.pop();
  const dvFnNames = columns.map((c) => c.dvFnName);
  // One DataView per column, each starting at that column's first value.
  const dataViews = offsets.map(
    (offset) => new DataView(arrayBufferContents, offsetBytes + offset)
  );
  const dataViewGetters = dataViews.map((view, i) => view[dvFnNames[i]].bind(view));
  // Column-oriented output: one TypedArray per field.
  const data = columnNames.reduce(
    (obj, columnName, i) => ({
      ...obj,
      [columnName]: new columns[i].TypedArray(numRows),
    }),
    {}
  );

  for (let j = 0; j < columnNames.length; j++) {
    const columnName = columnNames[j];
    const getter = dataViewGetters[j];
    const column = data[columnName];

    for (let i = 0; i < numRows; i++) {
      column[i] = getter(i * stride, columns[j].littleEndian);
    }
  }

  return data;
}
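
A hedged sketch of how the result looks for a hypothetical file with two numeric fields, ('age', '<i4') and ('weight', '<f4'):

// Hypothetical: `buffer` is the ArrayBuffer of such a structured .npy file.
const table = parse(buffer);
// `table` is column-oriented, roughly:
// { age: Int32Array [9, 3], weight: Float32Array [81, 27] }
console.log(table.age[0], table.weight[0]);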

I think there are ideas from that which can be used to generalize the approach.

jeffpeck10x commented

I know this all goes in a slightly different direction than your library, but using your library as the base gave me a really good head start, even just realizing that I could use DataViews and offsets, and where to find the metadata in the header. So if any of this helps improve the versatility of your library, great, and it comes full circle. Don't feel any pressure to do this, though. For my particular use case, as mentioned, I am all set with Apache Arrow IPC (I ultimately needed to send this data between two processes, so it was better suited anyway).
