csv input data - sorting impossible with not used columns in mapping #50

peterborkuti · 2019-02-28T15:37:43Z

Dear Matt,

When I use a CSV file as input for a unit test, and the file contains two columns (for example "id" and "a") but only one of them is used in the mapping (for example "a"), choosing the other column ("id") for sorting causes an exception:

2019/02/28 15:07:40 - Spoon - Caused by: org.pentaho.di.core.exception.KettleException: 
2019/02/28 15:07:40 - Spoon - Unable to get all rows for database data set 'addnumbers as text'
2019/02/28 15:07:40 - Spoon - -1
2019/02/28 15:07:40 - Spoon - 
2019/02/28 15:07:40 - Spoon - 	at org.pentaho.di.dataset.DataSetCsvGroup.getAllRows(DataSetCsvGroup.java:226)
2019/02/28 15:07:40 - Spoon - 	at org.pentaho.di.dataset.DataSetGroup.getAllRows(DataSetGroup.java:133)
2019/02/28 15:07:40 - Spoon - 	at org.pentaho.di.dataset.DataSet.getAllRows(DataSet.java:140)
2019/02/28 15:07:40 - Spoon - 	at org.pentaho.di.dataset.spoon.xtpoint.InjectDataSetIntoTransExtensionPoint.injectDataSetIntoStep(InjectDataSetIntoTransExtensionPoint.java:198)
2019/02/28 15:07:40 - Spoon - 	at org.pentaho.di.dataset.spoon.xtpoint.InjectDataSetIntoTransExtensionPoint.callExtensionPoint(InjectDataSetIntoTransExtensionPoint.java:126)
2019/02/28 15:07:40 - Spoon - 	... 8 more
2019/02/28 15:07:40 - Spoon - Caused by: java.lang.ArrayIndexOutOfBoundsException: -1
2019/02/28 15:07:40 - Spoon - 	at org.pentaho.di.core.row.RowMeta.compare(RowMeta.java:915)
2019/02/28 15:07:40 - Spoon - 	at org.pentaho.di.dataset.DataSetCsvGroup$1.compare(DataSetCsvGroup.java:214)
2019/02/28 15:07:40 - Spoon - 	at org.pentaho.di.dataset.DataSetCsvGroup$1.compare(DataSetCsvGroup.java:211)
2019/02/28 15:07:40 - Spoon - 	at java.util.TimSort.countRunAndMakeAscending(TimSort.java:355)
2019/02/28 15:07:40 - Spoon - 	at java.util.TimSort.sort(TimSort.java:220)
2019/02/28 15:07:40 - Spoon - 	at java.util.Arrays.sort(Arrays.java:1512)
2019/02/28 15:07:40 - Spoon - 	at java.util.ArrayList.sort(ArrayList.java:1462)
2019/02/28 15:07:40 - Spoon - 	at java.util.Collections.sort(Collections.java:175)
2019/02/28 15:07:40 - Spoon - 	at org.pentaho.di.dataset.DataSetCsvGroup.getAllRows(DataSetCsvGroup.java:211)
2019/02/28 15:07:40 - Spoon - 	... 12 more

I debugged it, and I think this is the relevant spot in the code
(DataSetCsvGroup.java, from line 200):

      // Which fields are we sorting on (if any)
      //
      int[] sortIndexes = new int[ sortFields.size() ];
      for ( int i = 0; i < sortIndexes.length; i++ ) {
        sortIndexes[ i ] = outputRowMeta.indexOfValue( sortFields.get( i ) );
      }

      if ( !sortFields.isEmpty() ) {

        // Sort the rows...
        //
        Collections.sort( rows, new Comparator<Object[]>() {
          @Override public int compare( Object[] o1, Object[] o2 ) {
            try {
              return outputRowMeta.compare( o1, o2, sortIndexes );
            } catch ( KettleValueException e ) {
              throw new RuntimeException( "Unable to compare 2 rows", e );
            }
          }
        } );
      }

sortIndexes will not be empty, but sortIndexes[0] will be -1 because the unmapped sort field is not found in outputRowMeta, and this causes an ArrayIndexOutOfBoundsException in outputRowMeta.compare.
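To illustrate the failure mode without the Kettle classes, here is a minimal sketch in plain Java. It assumes the mapped field names are available as a simple list; the class SortIndexCheck and the method resolveSortIndexes are hypothetical names, not part of the plugin. It only shows how a missing sort field could be reported by name instead of surfacing later as index -1 inside the comparator.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper, not plugin code: resolves sort field positions
// and fails early when a sort field is absent from the mapped row.
public class SortIndexCheck {

  static int[] resolveSortIndexes(List<String> fieldNames, List<String> sortFields) {
    int[] sortIndexes = new int[sortFields.size()];
    List<String> missing = new ArrayList<>();
    for (int i = 0; i < sortIndexes.length; i++) {
      // List.indexOf returns -1 when the name is not found,
      // which mirrors the -1 seen in the stack trace.
      sortIndexes[i] = fieldNames.indexOf(sortFields.get(i));
      if (sortIndexes[i] < 0) {
        missing.add(sortFields.get(i));
      }
    }
    if (!missing.isEmpty()) {
      throw new IllegalArgumentException(
          "Sort fields not present in the mapped row: " + missing);
    }
    return sortIndexes;
  }

  public static void main(String[] args) {
    // Mapping contains only "a" and "b"; sorting on "b" resolves normally.
    System.out.println(Arrays.toString(
        resolveSortIndexes(Arrays.asList("a", "b"), Arrays.asList("b"))));

    // Sorting on the unmapped "id" is rejected with a readable message.
    try {
      resolveSortIndexes(Arrays.asList("a", "b"), Arrays.asList("id"));
    } catch (IllegalArgumentException e) {
      System.out.println(e.getMessage());
    }
  }
}
```

A check like this would only improve the error message. It would not make sorting on an unmapped column work.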

You may ask why I want to sort the CSV file on a field that is not in the mapping, but it seemed like
a normal use case to me. For example, I wanted to test a transformation that adds two numbers together:

id	a	b	c
1	0	0	0
2	1	0	1

The input mapping would be the columns "a" and "b", sorted by "id".
The golden mapping would be the columns "a", "b" and "c", sorted by "id".

I put all the files to reproduce this here:
https://github.com/peterborkuti/pentaho-pdi-dataset-bug-01

Thank you for your wonderful plugin
Péter


mattcasters · 2019-06-20T10:34:26Z

Hi Péter,

Thank you very much for the use case. It's true that I hadn't considered it yet.
I think we'll need to do something novel here, like adding the sort columns temporarily and removing them again after sorting, so that they don't end up in the test transformation.
Cheers,
Matt
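
The idea of carrying the sort columns only until sorting is done can be sketched in plain Java, outside of Kettle. This is an illustration under assumptions, not the plugin's implementation: rows are plain Object[] arrays, all values are non-null and mutually comparable, and the class SortThenProject with its methods is hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical sketch: sort on the full CSV row, then keep only mapped columns.
public class SortThenProject {

  // Looks up the position of each wanted field in the full field list.
  static int[] indexes(List<String> allFields, List<String> wanted) {
    int[] idx = new int[wanted.size()];
    for (int i = 0; i < idx.length; i++) {
      idx[i] = allFields.indexOf(wanted.get(i));
      if (idx[i] < 0) {
        throw new IllegalArgumentException("Unknown field: " + wanted.get(i));
      }
    }
    return idx;
  }

  static List<Object[]> sortThenProject(List<String> allFields, List<Object[]> rows,
                                        List<String> sortFields, List<String> mappedFields) {
    int[] sortIdx = indexes(allFields, sortFields);
    int[] keepIdx = indexes(allFields, mappedFields);

    // Sort a copy of the complete rows, so unmapped sort columns are still present.
    List<Object[]> sorted = new ArrayList<>(rows);
    sorted.sort((r1, r2) -> {
      for (int idx : sortIdx) {
        // Assumption: values are non-null and implement Comparable.
        @SuppressWarnings("unchecked")
        int cmp = ((Comparable<Object>) r1[idx]).compareTo(r2[idx]);
        if (cmp != 0) {
          return cmp;
        }
      }
      return 0;
    });

    // Drop everything except the mapped columns after sorting.
    List<Object[]> projected = new ArrayList<>();
    for (Object[] row : sorted) {
      Object[] out = new Object[keepIdx.length];
      for (int i = 0; i < keepIdx.length; i++) {
        out[i] = row[keepIdx[i]];
      }
      projected.add(out);
    }
    return projected;
  }

  public static void main(String[] args) {
    // The id/a/b/c example from the report, with the rows deliberately out of order.
    List<Object[]> rows = Arrays.asList(
        new Object[] { 2, 1, 0, 1 },
        new Object[] { 1, 0, 0, 0 });
    List<Object[]> result = sortThenProject(
        Arrays.asList("id", "a", "b", "c"), rows,
        Arrays.asList("id"), Arrays.asList("a", "b"));
    for (Object[] row : result) {
      System.out.println(Arrays.toString(row));
    }
  }
}
```

Sorting before projecting keeps "id" out of the rows handed to the test transformation while still letting it determine the row order.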

JenniferJohnson89 · 2020-08-26T11:24:22Z

I noticed that there is a similar problem at https://github.com/mattcasters/pentaho-pdi-dataset. Perhaps we can refer to this issue to find more context about the bug.

mattcasters self-assigned this Jun 20, 2019

mattcasters added bug enhancement labels Jun 20, 2019
