
Does not work with structured arrays #42

Open
jeffpeck10x opened this issue Jan 8, 2024 · 4 comments

jeffpeck10x commented Jan 8, 2024

.npy files that contain structured arrays, as described here, fail to open with the error:

Uncaught SyntaxError: Unexpected token ( in JSON at position 28

Example: in Python, create and save the structured array:

import numpy as np

np.save('test/out.npy', np.array([('Rex', 9, 81.0), ('Fido', 3, 27.0)],
                                  dtype=[('name', 'U10'), ('age', 'i4'), ('weight', 'f4')]))

Serve the out.npy file:

cd test
npx serve

Try to load that out.npy file:

> np.load("http://localhost:3000/out.npy")
Promise {
  <pending>,
  [Symbol(async_id_symbol)]: 4454,
  [Symbol(trigger_async_id_symbol)]: 5
}
> Uncaught SyntaxError: Unexpected token ( in JSON at position 28
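
For reference, the text portion of the .npy header for the array above looks roughly like this; the descr of a structured array is a Python list of tuples rather than a plain dtype string, which appears to be what the JSON-based header parsing trips over:

{'descr': [('name', '<U10'), ('age', '<i4'), ('weight', '<f4')], 'fortran_order': False, 'shape': (2,), }
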
jeffpeck10x commented

Here is a rough idea of how to parse a structured array. It departs a bit from how the library is structured, but this stand-alone example worked with a structured array that I was working with. Maybe it will be helpful for others.

// Map of numpy dtype codes to DataView accessors.
// Note: only fixed-width numeric dtypes are covered; string fields
// (e.g. the 'U10' name column from the example above) would need separate handling.
const dtypes = {
  u1: {
    bytesPerElement: Uint8Array.BYTES_PER_ELEMENT,
    dvFnName: 'getUint8'
  },
  u2: {
    bytesPerElement: Uint16Array.BYTES_PER_ELEMENT,
    dvFnName: 'getUint16'
  },
  i1: {
    bytesPerElement: Int8Array.BYTES_PER_ELEMENT,
    dvFnName: 'getInt8'
  },
  i2: {
    bytesPerElement: Int16Array.BYTES_PER_ELEMENT,
    dvFnName: 'getInt16'
  },
  u4: {
    bytesPerElement: Uint32Array.BYTES_PER_ELEMENT,
    dvFnName: 'getUint32'
  },
  i4: {
    bytesPerElement: Int32Array.BYTES_PER_ELEMENT,
    dvFnName: 'getInt32'
  },
  u8: {
    bytesPerElement: BigUint64Array.BYTES_PER_ELEMENT,
    dvFnName: 'getBigUint64'
  },
  i8: {
    bytesPerElement: BigInt64Array.BYTES_PER_ELEMENT,
    dvFnName: 'getBigInt64'
  },
  f4: {
    bytesPerElement: Float32Array.BYTES_PER_ELEMENT,
    dvFnName: 'getFloat32'
  },
  f8: {
    bytesPerElement: Float64Array.BYTES_PER_ELEMENT,
    dvFnName: 'getFloat64'
  },
};

function parse(arrayBufferContents) {
  const dv = new DataView(arrayBufferContents);

  // .npy v1.0 layout: 6-byte magic, 2-byte version, then a little-endian
  // uint16 header length at byte 8; the data starts right after the header.
  const headerLength = dv.getUint16(8, true);
  const offsetBytes = 10 + headerLength;

  const hcontents = new TextDecoder("utf-8").decode(
    new Uint8Array(arrayBufferContents.slice(10, 10 + headerLength))
  );

  const [, descr, fortranOrder, shape] = hcontents.match(
    /{'descr': (.*), 'fortran_order': (.*), 'shape': (.*), }/
  );

  const columns = [...descr.matchAll(/\('([^']+)', '([\|<>])([^']+)'\)/g)].map(
    ([, columnName, endianess, dtype]) => ({
      columnName,
      littleEndian: endianess === "<",
      bytesPerElement: dtypes[dtype].bytesPerElement,
      dvFn: (...args) => dv[dtypes[dtype].dvFnName](...args),
    })
  );

  const numRows = Number(shape.match(/\((\d+),\)/)[1]);

  // Total byte width of one row.
  const stride = columns
    .map((c) => c.bytesPerElement)
    .reduce((sum, numBytes) => sum + numBytes, 0);

  const data = [];
  let i, c, offset, dataIdx, row;
  const numColumns = columns.length;
  for (i = 0; i < numRows; i++) {
    offset = 0;
    row = {};
    dataIdx = offsetBytes + i * stride;
    for (c = 0; c < numColumns; c++) {
      row[columns[c].columnName] = columns[c].dvFn(dataIdx + offset, columns[c].littleEndian);
      offset += columns[c].bytesPerElement;
    }
    data.push(row);
  }

  return data;
}
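
A minimal usage sketch, under the assumption that the served file contains only numeric fields (the dtypes map above does not cover string fields such as the 'U10' name column from the original example); the file name here is illustrative:

// Hypothetical usage: fetch a numeric-only structured .npy file and
// print the parsed rows (an array of { columnName: value } objects).
fetch("http://localhost:3000/out_numeric.npy")
  .then((res) => res.arrayBuffer())
  .then((buf) => console.log(parse(buf)));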


j6k4m8 commented Feb 13, 2024

Wow thank you for the thoughtful comments @jeffpeck10x!! Would you be interested in turning this into a PR? Otherwise I'm happy to incorporate these suggestions in my next dev push :)

j6k4m8 self-assigned this Feb 13, 2024

jeffpeck10x commented Feb 13, 2024

Thanks @j6k4m8. I don't think I have the time to generalize this into a more complete approach. I did end up taking this a little further, although I ultimately switched to Apache Arrow for my use case. That said, here is some helpful code to get this started.

The following function should reliably extract columns and numRows from the header:

function parseHeaderContents(hcontents) {
  const [, descr, fortranOrder, shape] = hcontents.match(
    /{'descr': (.*), 'fortran_order': (.*), 'shape': (.*), }/
  );

  // Assumes the dtypes map also carries a TypedArray constructor per dtype
  // (see the sketch below).
  const columns = [...descr.matchAll(/\('([^']+)', '([\|<>])([^']+)'\)/g)].map(
    ([, columnName, endianess, dtype]) => ({
      columnName,
      littleEndian: endianess === "<",
      dvFnName: dtypes[dtype].dvFnName,
      TypedArray: dtypes[dtype].TypedArray,
    })
  );

  const numRows = Number(shape.match(/\((\d+),\)/)[1]);

  return { columns, numRows };
}
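
parseHeaderContents assumes the earlier dtypes map is extended so that each entry also names the TypedArray used to hold a whole column. A minimal sketch of that extension (only a few entries shown, the rest follow the same pattern):

// Hypothetical extension of the dtypes map above.
const dtypes = {
  i4: { bytesPerElement: 4, dvFnName: 'getInt32',   TypedArray: Int32Array },
  u4: { bytesPerElement: 4, dvFnName: 'getUint32',  TypedArray: Uint32Array },
  f4: { bytesPerElement: 4, dvFnName: 'getFloat32', TypedArray: Float32Array },
  f8: { bytesPerElement: 8, dvFnName: 'getFloat64', TypedArray: Float64Array },
  // ...remaining numeric dtypes follow the same pattern.
};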

I ended up writing the following parse function to get at the data when I was experimenting:

function parse(arrayBufferContents) {
  const dv = new DataView(arrayBufferContents);

  // .npy v1.0: 6-byte magic + 2-byte version, then a little-endian uint16
  // header length at byte 8; the data starts right after the header.
  const headerLength = dv.getUint16(8, true);
  const offsetBytes = 10 + headerLength;

  const hcontents = new TextDecoder("utf-8").decode(
    new Uint8Array(arrayBufferContents.slice(10, 10 + headerLength))
  );

  const { columns, numRows } = parseHeaderContents(hcontents);
  const columnNames = columns.map((c) => c.columnName);
  // Cumulative byte offsets of each column within a row; the popped last
  // element is the total row width (the stride).
  const offsets = columns
    .map((c) => c.TypedArray.BYTES_PER_ELEMENT)
    .reduce((arr, v, i) => [...arr, v + arr[i]], [0]);
  const stride = offsets.pop();
  const dvFnNames = columns.map((c) => c.dvFnName);
  // One DataView per column, each starting at that column's first value.
  const dataViews = offsets.map(
    (offset) => new DataView(arrayBufferContents, offsetBytes + offset)
  );
  const dataViewGetters = dataViews.map((view, i) => view[dvFnNames[i]].bind(view));
  // Column-oriented output: one TypedArray per field.
  const data = columnNames.reduce(
    (obj, columnName, i) => ({
      ...obj,
      [columnName]: new columns[i].TypedArray(numRows),
    }),
    {}
  );

  for (let j = 0; j < columnNames.length; j++) {
    const columnName = columnNames[j];
    const getter = dataViewGetters[j];
    const column = data[columnName];

    for (let i = 0; i < numRows; i++) {
      column[i] = getter(i * stride, columns[j].littleEndian);
    }
  }

  return data;
}
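
A hedged sketch of how the result looks for a hypothetical file with two numeric fields, ('age', '<i4') and ('weight', '<f4'):

// Hypothetical: `buffer` is the ArrayBuffer of such a structured .npy file.
const table = parse(buffer);
// `table` is column-oriented, roughly:
// { age: Int32Array [9, 3], weight: Float32Array [81, 27] }
console.log(table.age[0], table.weight[0]);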

I think there are ideas from that which can be used to generalize the approach.

jeffpeck10x commented

I know this all goes in a slightly different direction than your library, but using your library as the base gave me a really good head start, even just realizing that I could use DataViews and offsets, and where to find the metadata in the header. So if any of this helps improve the versatility of your library, great, and it comes full circle. Don't feel any pressure to do this, though. For my particular use case, as mentioned, I am all set with Apache Arrow IPC (I ultimately needed to send this data between two processes, so it was better suited anyway).
