This repository has been archived by the owner on Dec 15, 2022. It is now read-only.

csv input data - sorting impossible with not used columns in mapping #50

Open
peterborkuti opened this issue Feb 28, 2019 · 2 comments

@peterborkuti
Collaborator

peterborkuti commented Feb 28, 2019

Dear Matt,

When I use a CSV file as input for a unit test, and the file contains two columns (for example "id" and "a"), but I use only one of them in the mapping (for example "a") and choose the other ("id") for sorting, an exception occurs:

2019/02/28 15:07:40 - Spoon - Caused by: org.pentaho.di.core.exception.KettleException: 
2019/02/28 15:07:40 - Spoon - Unable to get all rows for database data set 'addnumbers as text'
2019/02/28 15:07:40 - Spoon - -1
2019/02/28 15:07:40 - Spoon - 
2019/02/28 15:07:40 - Spoon - 	at org.pentaho.di.dataset.DataSetCsvGroup.getAllRows(DataSetCsvGroup.java:226)
2019/02/28 15:07:40 - Spoon - 	at org.pentaho.di.dataset.DataSetGroup.getAllRows(DataSetGroup.java:133)
2019/02/28 15:07:40 - Spoon - 	at org.pentaho.di.dataset.DataSet.getAllRows(DataSet.java:140)
2019/02/28 15:07:40 - Spoon - 	at org.pentaho.di.dataset.spoon.xtpoint.InjectDataSetIntoTransExtensionPoint.injectDataSetIntoStep(InjectDataSetIntoTransExtensionPoint.java:198)
2019/02/28 15:07:40 - Spoon - 	at org.pentaho.di.dataset.spoon.xtpoint.InjectDataSetIntoTransExtensionPoint.callExtensionPoint(InjectDataSetIntoTransExtensionPoint.java:126)
2019/02/28 15:07:40 - Spoon - 	... 8 more
2019/02/28 15:07:40 - Spoon - Caused by: java.lang.ArrayIndexOutOfBoundsException: -1
2019/02/28 15:07:40 - Spoon - 	at org.pentaho.di.core.row.RowMeta.compare(RowMeta.java:915)
2019/02/28 15:07:40 - Spoon - 	at org.pentaho.di.dataset.DataSetCsvGroup$1.compare(DataSetCsvGroup.java:214)
2019/02/28 15:07:40 - Spoon - 	at org.pentaho.di.dataset.DataSetCsvGroup$1.compare(DataSetCsvGroup.java:211)
2019/02/28 15:07:40 - Spoon - 	at java.util.TimSort.countRunAndMakeAscending(TimSort.java:355)
2019/02/28 15:07:40 - Spoon - 	at java.util.TimSort.sort(TimSort.java:220)
2019/02/28 15:07:40 - Spoon - 	at java.util.Arrays.sort(Arrays.java:1512)
2019/02/28 15:07:40 - Spoon - 	at java.util.ArrayList.sort(ArrayList.java:1462)
2019/02/28 15:07:40 - Spoon - 	at java.util.Collections.sort(Collections.java:175)
2019/02/28 15:07:40 - Spoon - 	at org.pentaho.di.dataset.DataSetCsvGroup.getAllRows(DataSetCsvGroup.java:211)
2019/02/28 15:07:40 - Spoon - 	... 12 more

I debugged it, and I think this is the spot in the code that causes it
(DataSetCsvGroup.java, from line 200):

      // Which fields are we sorting on (if any)
      //
      int[] sortIndexes = new int[ sortFields.size() ];
      for ( int i = 0; i < sortIndexes.length; i++ ) {
        sortIndexes[ i ] = outputRowMeta.indexOfValue( sortFields.get( i ) );
      }

      if ( !sortFields.isEmpty() ) {

        // Sort the rows...
        //
        Collections.sort( rows, new Comparator<Object[]>() {
          @Override public int compare( Object[] o1, Object[] o2 ) {
            try {
              return outputRowMeta.compare( o1, o2, sortIndexes );
            } catch ( KettleValueException e ) {
              throw new RuntimeException( "Unable to compare 2 rows", e );
            }
          }
        } );
      }

sortIndexes will not be empty, but because "id" is not part of outputRowMeta, indexOfValue returns -1 for it. So sortIndexes[0] will be -1, and this causes an ArrayIndexOutOfBoundsException in outputRowMeta.compare.
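
To illustrate the failure mode outside of Kettle, here is a minimal sketch in plain Java. It is not the plugin's code: the class `SortIndexCheck` and the method `resolveSortIndexes` are hypothetical names, and `List.indexOf` stands in for `outputRowMeta.indexOfValue` on the assumption that both return -1 for a missing field. The sketch only shows how the lookup could fail early with a readable message instead of passing -1 on to the comparator.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of the pentaho-pdi-dataset plugin.
class SortIndexCheck {

  // Resolves each sort field to its position in the mapped output fields.
  // List.indexOf is used here as a stand-in for outputRowMeta.indexOfValue:
  // it returns -1 when the field is not present.
  static int[] resolveSortIndexes( List<String> outputFields, List<String> sortFields ) {
    int[] sortIndexes = new int[ sortFields.size() ];
    List<String> missing = new ArrayList<>();
    for ( int i = 0; i < sortIndexes.length; i++ ) {
      sortIndexes[ i ] = outputFields.indexOf( sortFields.get( i ) );
      if ( sortIndexes[ i ] < 0 ) {
        missing.add( sortFields.get( i ) );
      }
    }
    // Fail with a descriptive message rather than letting -1 reach the comparator.
    if ( !missing.isEmpty() ) {
      throw new IllegalArgumentException(
        "Sort fields not present in the mapped output: " + missing );
    }
    return sortIndexes;
  }
}
```

With output fields "a" only and sort field "id", this throws an IllegalArgumentException that names "id", which is easier to diagnose than the bare -1 in the log above. It does not make the use case work; it only reports it clearly.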

You may ask why I want to sort the CSV file based on a field which is not in the mapping, but it seemed to
me a normal use case. For example, I wanted to test a transformation which adds two numbers together:

id a b c
1 0 0 0
2 1 0 1

The input mapping would be the columns "a" and "b", sorted by "id".
The golden mapping would be the columns "a", "b" and "c", sorted by "id".

I put all the files to reproduce this here:
https://github.com/peterborkuti/pentaho-pdi-dataset-bug-01

Thank you for your wonderful plugin
Péter

@mattcasters
Owner

Hi Péter,

Thank you very much for the use case. It's true that I hadn't considered it yet.
I think we'll need to do something novel here, like adding the sort columns temporarily and removing them again after sorting, just to make sure the columns don't end up in the test transformation.
Cheers,
Matt
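
The idea in the comment above (keep the sort columns while sorting, then drop them) could look roughly like the following sketch in plain Java. It is an illustration under assumptions, not the plugin's implementation: `SortThenProject`, `sortThenProject` and `indexesOf` are hypothetical names, rows are plain `Object[]` arrays instead of Kettle rows, and all sort values are assumed to be non-null and mutually comparable.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch, not part of the pentaho-pdi-dataset plugin.
class SortThenProject {

  // Sorts the full CSV rows on the sort fields, then keeps only the mapped fields.
  static List<Object[]> sortThenProject( List<String> allFields, List<Object[]> fullRows,
                                         List<String> mappedFields, List<String> sortFields ) {
    int[] sortIndexes = indexesOf( allFields, sortFields );
    int[] keepIndexes = indexesOf( allFields, mappedFields );

    // Sort a copy of the rows while the sort columns are still available.
    List<Object[]> sorted = new ArrayList<>( fullRows );
    sorted.sort( ( r1, r2 ) -> compare( r1, r2, sortIndexes ) );

    // Drop every column that is not in the mapping.
    List<Object[]> result = new ArrayList<>();
    for ( Object[] row : sorted ) {
      Object[] projected = new Object[ keepIndexes.length ];
      for ( int i = 0; i < keepIndexes.length; i++ ) {
        projected[ i ] = row[ keepIndexes[ i ] ];
      }
      result.add( projected );
    }
    return result;
  }

  // Looks up field positions in the full CSV layout; unknown names are an error.
  static int[] indexesOf( List<String> allFields, List<String> names ) {
    int[] indexes = new int[ names.size() ];
    for ( int i = 0; i < indexes.length; i++ ) {
      indexes[ i ] = allFields.indexOf( names.get( i ) );
      if ( indexes[ i ] < 0 ) {
        throw new IllegalArgumentException( "Unknown field: " + names.get( i ) );
      }
    }
    return indexes;
  }

  // Compares two rows field by field; assumes non-null, mutually comparable values.
  @SuppressWarnings( "unchecked" )
  static int compare( Object[] r1, Object[] r2, int[] sortIndexes ) {
    for ( int index : sortIndexes ) {
      int c = ( (Comparable<Object>) r1[ index ] ).compareTo( r2[ index ] );
      if ( c != 0 ) {
        return c;
      }
    }
    return 0;
  }
}
```

For the example from the report (fields id, a, b, c; mapping a and b; sorted by id), the rows come back ordered by id but contain only a and b, so the sort column never reaches the test transformation.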

@JenniferJohnson89

I noticed that there is a similar problem at https://github.com/mattcasters/pentaho-pdi-dataset. Perhaps we can refer to this issue to find more context about the bug.
