Run IIS experiments by relying on spark 3.4 version #1426

Open
marekhorst opened this issue Aug 10, 2023 · 0 comments

marekhorst commented Aug 10, 2023

This task covers running a subset of IIS modules, currently written for spark 2.4, on the newly available spark 3.4 version.

This may require:

  • altering oozie workflow definition by specifying a new sharelib name and defining other spark related configuration parameters such as app/query listeners, history server address, event log directory etc
  • changing dependencies
  • a new version of spark-utils: 2.11, compatible with scala 2.12, will have to be released (CeON/spark-utils#16, "Release artifact for scala binary version 2.12")
  • altering java code
  • dropping iis-common scala sources, which are difficult to maintain (and result in false positive errors reported by the Eclipse IDE), and replacing those simple utility classes with java equivalents
marekhorst self-assigned this Aug 10, 2023
marekhorst added a commit that referenced this issue Sep 22, 2023
Upgrading dependencies in pom.xml files, aligning with scala 2.12 and spark 3.4.1.
marekhorst added a commit that referenced this issue Sep 27, 2023
WIP.

Commenting out avro related methods from scala sources which relied on the avro deserializer class that was made private in spark3.
Aligning sources with those changes by changing the way dataframes are constructed from collections of avro records, plus other refactoring required to compile IIS sources successfully. This does not mean the code is already operational: some tests fail and still need to be fixed.

Upgrading logging system dependencies to match the sharelib log4j dependency versions.
Upgrading maven-plugin-plugin version to fix a build bug induced by the upgraded log4j version.
marekhorst added a commit that referenced this issue Sep 29, 2023
WIP.

Fixing a task serialization issue by upgrading the avro dependency from 1.8.10 to 1.11.1, which is already a part of sharelib342. This required aligning JsonConverter with the new code and moving it to a different package, because one of the crucial methods has limited visibility.

Further logging system dependency alignment to make unit test output visible on the console.
marekhorst added a commit that referenced this issue Oct 2, 2023
WIP.

Replacing scala source code in iis-common module with java-based counterpart. Simplifying the code, aligning other classes with changes in avro read/write code.
marekhorst added a commit that referenced this issue Oct 3, 2023
WIP.

Removing `provided` scope from the `spark-avro_2.12` dependency until it becomes part of sharelib342.
Introducing the fixes required for the `eu/dnetlib/iis/wf/export/actionmanager/relation/citation/default` integration test to run on spark3:
* setting `spark.extraListeners` and `spark.sql.queryExecutionListeners` explicitly to empty values to avoid relying on the incompatible, spark2-compliant cloudera listeners
* setting `spark.shuffle.useOldFetchProtocol=true` to meet the `2.4 to 3.0 migration guide` requirement on backward compatibility of the protocol for fetching shuffle blocks (and to avoid `IllegalArgumentException: Unexpected message type: <number>` errors)
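
A minimal sketch of how the three settings above could be rendered as `--conf` arguments, for example for the `<spark-opts>` element of an Oozie spark action. The class and method names (`Spark3Opts`, `compatibilityConf`, `toSparkOpts`) are hypothetical and not part of IIS; only the property names and values come from the commit message, and where exactly they are set in each workflow.xml is an assumption.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.StringJoiner;

// Hypothetical helper, not part of the IIS code base.
final class Spark3Opts {

    /** The settings named in the commit message, kept in insertion order. */
    static Map<String, String> compatibilityConf() {
        Map<String, String> conf = new LinkedHashMap<>();
        // Empty values switch off listeners inherited from cluster defaults.
        conf.put("spark.extraListeners", "");
        conf.put("spark.sql.queryExecutionListeners", "");
        // Lets spark3 jobs talk to an older external shuffle service.
        conf.put("spark.shuffle.useOldFetchProtocol", "true");
        return conf;
    }

    /** Renders each entry as a "--conf key=value" argument, space separated. */
    static String toSparkOpts(Map<String, String> conf) {
        StringJoiner opts = new StringJoiner(" ");
        conf.forEach((key, value) -> opts.add("--conf " + key + "=" + value));
        return opts.toString();
    }

    public static void main(String[] args) {
        System.out.println(toSparkOpts(compatibilityConf()));
    }
}
```

Keeping the settings in one ordered map makes it easier to apply the same set to every workflow that is moved to spark3.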
marekhorst added a commit that referenced this issue Oct 10, 2023
WIP.

Fixing the changed results order in patent and software entity exporter integration tests.
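
The commit does not say how the order issue was fixed. One common way to make such tests independent of the order in which rows are produced is to sort both sides before comparing, sketched below with a hypothetical `UnorderedResults` helper and made-up sample identifiers.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

// Hypothetical test helper, not part of the IIS code base.
final class UnorderedResults {

    /** Returns a sorted copy, leaving the input list untouched. */
    static <T> List<T> sortedCopy(List<T> rows, Comparator<? super T> order) {
        List<T> copy = new ArrayList<>(rows);
        copy.sort(order);
        return copy;
    }

    /** True when both lists hold the same rows, duplicates included, in any order. */
    static <T> boolean sameRows(List<T> expected, List<T> actual, Comparator<? super T> order) {
        return sortedCopy(expected, order).equals(sortedCopy(actual, order));
    }

    public static void main(String[] args) {
        // Sample identifiers are made up for illustration.
        List<String> expected = Arrays.asList("patent-1", "patent-2", "software-1");
        List<String> actual = Arrays.asList("software-1", "patent-1", "patent-2");
        System.out.println(sameRows(expected, actual, Comparator.<String>naturalOrder()));
    }
}
```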

Introducing the fixes required for the integration tests of various `iis-wf-export-actionmanager` exporters to succeed on spark3:
* setting `spark.extraListeners` and `spark.sql.queryExecutionListeners` explicitly to empty values to avoid relying on the incompatible, spark2-compliant cloudera listeners
* setting `spark.shuffle.useOldFetchProtocol=true` to meet the `2.4 to 3.0 migration guide` requirement on backward compatibility of the protocol for fetching shuffle blocks (and to avoid `IllegalArgumentException: Unexpected message type: <number>` errors)
The following modules received similar workflow.xml changes, but their spark3 compatibility has not been fully tested yet:
* `iis-wf-affmatching`
* `iis-wf-citationmatching-direct`
* `iis-wf-citationmatching`
* `iis-wf-documentsclassification`
* `iis-wf-import` (`content_url/core_parquet`, `infospace`, `patent`)
* `iis-wf-referenceextraction` (`community`, `concept`, `covid19`, `patent`, `project/funder_report`, `researchinitiative`, `softwareurl`)
* `iis-wf-transformers` (`avro2json`)
marekhorst added a commit that referenced this issue Oct 11, 2023
WIP.

Introducing the workflow.xml fixes required for the integration tests of various workflows to succeed on spark3:
* setting `spark.extraListeners` and `spark.sql.queryExecutionListeners` explicitly to empty values to avoid relying on the incompatible, spark2-compliant cloudera listeners
* setting `spark.shuffle.useOldFetchProtocol=true` to meet the `2.4 to 3.0 migration guide` requirement on backward compatibility of the protocol for fetching shuffle blocks (and to avoid `IllegalArgumentException: Unexpected message type: <number>` errors)
The following modules received workflow.xml changes that resulted in successful integration test execution:
* `iis-wf-affmatching`
* `iis-wf-citationmatching-direct`
* `iis-wf-documentsclassification`

This was introduced to avoid the following exception: java.lang.NoClassDefFoundError: org/apache/spark/internal/Logging$class

Adding `hadoop-mapreduce-client-core` and `hadoop-common` dependencies to the `iis-wf-affmatching` and `iis-wf-citationmatching-direct` modules to mirror the dependencies declared in `iis-wf-export-actionmanager` and to avoid the exception:

IncompatibleClassChangeError: Class org.apache.hadoop.fs.AvroFSInput does not implement the requested interface org.apache.avro.file.SeekableInput
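
Errors of this kind usually point to mismatched versions of a class and the interface it implements being present on the classpath. A small diagnostic, sketched below, prints the jar each class is loaded from; `ClassOrigin` is a hypothetical helper and the diagnostic itself is an assumption, not something the commit describes. The two class names are taken from the error message above.

```java
import java.security.CodeSource;

// Hypothetical diagnostic helper, not part of the IIS code base.
final class ClassOrigin {

    /** Returns the jar or directory a loaded class came from. */
    static String locate(Class<?> type) {
        CodeSource source = type.getProtectionDomain().getCodeSource();
        // Classes loaded by the bootstrap loader have no code source.
        return source == null || source.getLocation() == null
                ? "(bootstrap or unknown)"
                : source.getLocation().toString();
    }

    /** Loads a class by name without initializing it and reports its origin. */
    static String locate(String className) {
        try {
            return locate(Class.forName(className, false, ClassOrigin.class.getClassLoader()));
        } catch (ClassNotFoundException | LinkageError e) {
            return "(not loadable: " + e + ")";
        }
    }

    public static void main(String[] args) {
        String[] names = {
            "org.apache.hadoop.fs.AvroFSInput",
            "org.apache.avro.file.SeekableInput"
        };
        for (String name : names) {
            System.out.println(name + " <- " + locate(name));
        }
    }
}
```

Run with the same classpath as the failing test; if the two classes resolve to jars built against different avro versions, aligning the hadoop and avro dependencies is the likely fix.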
marekhorst added a commit that referenced this issue Oct 13, 2023
WIP.

Introducing the workflow.xml fixes required for the integration tests of various workflows to succeed on spark3:
* setting `spark.extraListeners` and `spark.sql.queryExecutionListeners` explicitly to empty values to avoid relying on the incompatible, spark2-compliant cloudera listeners
* setting `spark.shuffle.useOldFetchProtocol=true` to meet the `2.4 to 3.0 migration guide` requirement on backward compatibility of the protocol for fetching shuffle blocks (and to avoid `IllegalArgumentException: Unexpected message type: <number>` errors)
The following modules received workflow.xml changes that resulted in successful integration test execution:
* `iis-wf-referenceextraction`
marekhorst added a commit that referenced this issue Oct 16, 2023
WIP.

Introducing the workflow.xml fixes required for the integration tests of various workflows to succeed on spark3:
* setting `spark.extraListeners` and `spark.sql.queryExecutionListeners` explicitly to empty values to avoid relying on the incompatible, spark2-compliant cloudera listeners
* setting `spark.shuffle.useOldFetchProtocol=true` to meet the `2.4 to 3.0 migration guide` requirement on backward compatibility of the protocol for fetching shuffle blocks (and to avoid `IllegalArgumentException: Unexpected message type: <number>` errors)
The following modules received workflow.xml changes that resulted in successful integration test execution:
* `iis-wf-documentssimilarity` (the explicitly excluded `hadoop-mapreduce-client-app` is still among the spark342 sharelib dependencies, which causes test failures)
* `iis-wf-import` (the infospace importer still fails due to a spark3 regression; more details in #8941#note-35)
marekhorst added a commit that referenced this issue Jan 16, 2024
WIP.

Upgrading spark dependency version from 3.4.1 to 3.4.2.
marekhorst added 10 commits that referenced this issue May 8, 2024, repeating the commit messages listed above.