
Support for Python 3.11 for Google Provider (upgrading all dependencies) #27292

Closed
4 of 13 tasks
potiuk opened this issue Oct 26, 2022 · 21 comments · Fixed by #30067
Labels
good first issue kind:meta High-level information important to the community

Comments

@potiuk
Member

potiuk commented Oct 26, 2022

I know it is early (Python 3.11 was released just yesterday), but in Apache Airflow we are hoping for a much faster cycle of adding new Python releases, especially since Python 3.11 brings huge performance improvements (25% is the average claimed number) thanks to a very focused effort to increase single-threaded Python performance (the specializing interpreter being the core of it, but also many other improvements) without requiring any changes to Python code.

The Google provider will be a huge drag on Airflow's compatibility with Python 3.11, and we might even decide to release Airflow without Google provider support for 3.11, though it would be great to avoid that.

The Google provider (as originally mentioned in #12116) still has a number of old google-cloud libraries < 2.0.0 that will certainly not get Python 3.11 support. Support for Python 3.11 also has to be added to libraries such as bigquery (but this is external to the provider and tracked in googleapis/python-bigquery#1386).

A nice summary of Py3.11 support is here: https://pyreadiness.org/3.11/ - it's obviously not very green yet, but I hope it gets greener soon.

I just opened the PR to add 3.11 support yesterday and plan to keep it open until it gets green :)

#27264

I think it would be fantastic if we could work out all the problems and migrate all the old dependencies:

Looking forward to cooperation on that one :)

Committer

  • I acknowledge that I am a maintainer/committer of the Apache Airflow project.
@potiuk potiuk added kind:meta High-level information important to the community good first issue labels Oct 26, 2022
@potiuk potiuk mentioned this issue Oct 26, 2022
11 tasks
potiuk added a commit that referenced this issue Oct 26, 2022
Python 3.11 has been released as scheduled on October 25, 2022, and
this is the first attempt to see how far Airflow (mostly its
dependencies) is from being ready to officially support 3.11.

So far we had to exclude the following dependencies:

- [ ] Pyarrow dependency: apache/arrow#14499
- [ ] Google Provider: #27292
  and googleapis/python-bigquery#1386
- [ ] Databricks Provider:
  databricks/databricks-sql-python#59
- [ ] Papermill Provider: nteract/papermill#700
- [ ] Azure Provider: Azure/azure-uamqp-python#334
  and Azure/azure-sdk-for-python#27066
- [ ] Apache Beam Provider: apache/beam#23848
- [ ] Snowflake Provider:
  snowflakedb/snowflake-connector-python#1294
- [ ] JDBC Provider: jpype-project/jpype#1087
- [ ] Hive Provider: cloudera/python-sasl#30

We might decide to release Airflow for 3.11 with those providers
disabled in case they eventually lag behind, but for the
moment we want to work with all the projects in concert to be
able to release all providers (the Google provider requires quite
a lot of work, and likely the Google team stepping up and the
community helping with the migration to the latest Google Cloud
libraries)
potiuk added a commit that referenced this issue Oct 27, 2022
potiuk added a commit that referenced this issue Oct 27, 2022
potiuk added a commit that referenced this issue Oct 31, 2022
@kosteev
Contributor

kosteev commented Nov 1, 2022

Upgrading dependencies for the Google provider package can be tested with the Airflow System Tests and CI that are under construction at the moment. FYI @bhirsz

potiuk added a commit that referenced this issue Nov 24, 2022
@rafalbiegacz

It seems that python-bigquery-sqlalchemy already supports Python 3.11

@rafalbiegacz

It seems that google-api-python-client also supports 3.11.

@potiuk
Member Author

potiuk commented Jan 17, 2023

I will make a round of rebase/check again :)

potiuk added a commit to potiuk/airflow that referenced this issue Jan 19, 2023
@potiuk
Member Author

potiuk commented Mar 2, 2023

Cool. Time to try the 3.11 build again then.

@raphaelauv
Contributor

raphaelauv commented Mar 2, 2023

I didn't find a working constraint of google-cloud-aiplatform for apache-airflow-providers-google==8.10.0 with Python 3.11.

@r-richmond
Contributor

As of 2023-03-04, google-cloud-aiplatform is not marked as supporting Python 3.11 (source).

I've opened googleapis/python-aiplatform#2006 as it appears to have been unrequested.
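For anyone else chasing these down, "marked as supporting" here means the `Programming Language :: Python :: 3.11` trove classifier in the package's PyPI metadata (the same signal pyreadiness.org keys off). A minimal sketch of how to check it; the function names are mine, not from any project, and `fetch_pypi_info` needs network access while `declares_py311` is pure:

```python
# Check whether a distribution's PyPI metadata declares the
# Python 3.11 trove classifier.
import json
import urllib.request

PY311 = "Programming Language :: Python :: 3.11"


def declares_py311(info: dict) -> bool:
    """True if the metadata 'info' block lists the 3.11 classifier."""
    return PY311 in info.get("classifiers", [])


def fetch_pypi_info(package: str) -> dict:
    """Fetch the /pypi/<package>/json metadata block from PyPI."""
    url = f"https://pypi.org/pypi/{package}/json"
    with urllib.request.urlopen(url) as resp:
        return json.load(resp)["info"]


# Example (requires network):
# declares_py311(fetch_pypi_info("google-cloud-aiplatform"))
```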

@potiuk Question for you that I've wondered about after chasing down a few of these updates: has there been any thought given to breaking the google provider apart into extras?

e.g. apache-airflow-provider-google[bigquery,cloudstorage] or apache-airflow-provider-google[aiplatform] etc.

Rationale: it would allow users of the google provider to pick and choose which sub-features they want to use and pull in fewer dependencies. It would also let us leave certain sub-features behind in case Google supports them less or deprecates them (as they have been known to do).
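As a rough sketch of what this could look like in the provider's packaging metadata - the extra names and version specifiers below are purely illustrative, not the actual provider's dependency list:

```python
# Hypothetical extras_require mapping for the Google provider.
# setuptools would consume it via setup(..., extras_require=EXTRAS_REQUIRE),
# so `pip install apache-airflow-providers-google[bigquery]` would pull in
# only the BigQuery client, leaving e.g. aiplatform uninstalled.
EXTRAS_REQUIRE = {
    "bigquery": ["google-cloud-bigquery>=3.0.0"],
    "cloudstorage": ["google-cloud-storage>=2.0.0"],
    "aiplatform": ["google-cloud-aiplatform>=1.22.0"],
}
```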

@potiuk
Member Author

potiuk commented Mar 4, 2023

Not only thoughts. There is an issue for actually splitting the provider: #15933 - but this one is complex because of the common parts, so such a split provider would be difficult to maintain (we learned a lot about that when we added common.sql).

However, when it comes to extras, that could indeed be a better solution. I had not thought about it, but it might actually make things much easier for users and would let us pick and choose which extras in the Google provider we want enabled for which Python versions.

We even already have AirflowOptionalProviderFeatureException, which would be nice in this case - we could throw an appropriate error explaining that this or that extra is needed for this or that module.
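The pattern could look roughly like this. Note the exception class below is a local stand-in that mirrors `airflow.exceptions.AirflowOptionalProviderFeatureException` only so the snippet is self-contained, and `require_optional_feature` is a hypothetical helper name:

```python
# Sketch: import an optional dependency, or raise a provider-feature
# exception that tells the user which extra to install.
import importlib


class AirflowOptionalProviderFeatureException(Exception):
    """Stand-in for airflow.exceptions.AirflowOptionalProviderFeatureException."""


def require_optional_feature(module_name: str, extra: str):
    """Import an optional dependency or explain which extra provides it."""
    try:
        return importlib.import_module(module_name)
    except ImportError as err:
        raise AirflowOptionalProviderFeatureException(
            f"Module {module_name!r} requires the {extra!r} extra; install it "
            f"with: pip install 'apache-airflow-providers-google[{extra}]'"
        ) from err


# e.g. aiplatform = require_optional_feature("google.cloud.aiplatform", "aiplatform")
```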

I think I like this idea better than splitting the provider, but I have to think a bit about it; at first glance it looks like an easy solution to this problem.

@eladkal
Contributor

eladkal commented Mar 4, 2023

the issue with splitting the provider is mostly that no one from Google picked it up. Once someone picks it up and starts working on it, we will be able to overcome the tech difficulties. We don't know yet how the provider will be split, but we do know it must be done.

@potiuk
Member Author

potiuk commented Mar 4, 2023

> the issue with splitting the provider is mostly that no one from Google picked it up. Once someone picks it up and starts working on it, we will be able to overcome the tech difficulties. We don't know yet how the provider will be split, but we do know it must be done.

I am not so sure. Actually, using extras might be a way simpler approach that solves most of the pains of getting all the libraries in, I think, without the huge hassle of extracting common code and using it from multiple "google" providers. If we do split the Google provider, the maintenance pain of common.sql will absolutely pale in comparison to the problems we are going to have - and there were at least 4 or 5 traps in the common-code extraction and maintenance that were really painful to protect against and fix. If we find a way to solve most of the user problems with dependencies via extras, as suggested by @r-richmond (which I think is actually possible), then I see no reason to split the provider, to be honest.

Splitting the Google provider would be a massive undertaking, and if we do it, it will take us more than a few iterations on multiple providers to solve the teething problems we will not realise we have while splitting. Those problems will keep coming back for as long as the common part of the Google provider keeps evolving - we will keep breaking things in older versions of the "specific" providers whenever we release new common code. It is all but certain this will happen, and we have almost no way to protect against it.

Look how small common.sql "API" surface was and how many problems we had:

Not all of those - but most - were directly caused by the decision to extract common code for a number of SQL operators. And the main reason those errors affected users is that there is no way to test a new release of the "common" code against all possible releases of all possible providers that use it. You can at most test semi-thoroughly the latest versions of the providers and the common code together. This is what we do. That's why splitting the Google provider is SCARY: you will have an order of magnitude more of similar problems and we will have no way to avoid them. And even more: the Google common code will keep evolving at a much faster rate than the common.sql code did. Our problems with common.sql stopped the moment it stopped changing, but the Google common code will never stop changing. So the decision about splitting the Google provider is not as "light" as you think.

And that's why I am very, very sceptical about splitting it (otherwise I would have done it myself a long time ago). Of course, using extras does not solve "all" problems - but I think it solves most. It won't solve the case where you would like to use one provider version for one Google service and a different one for another. But, to be honest, if we get to the point where someone needs to do that, then we have a bigger issue, and this is one of those problems that leads to more issues than it solves. I would very strongly prefer the situation where users have to modify their DAGs for Google if they want to (for example) use new features from another service. Yes, it's a bit of a pain for them - but far, far less pain for everyone else (including them) in the future, where incompatibilities in the common code would cause even more problems.

@eladkal
Contributor

eladkal commented Mar 4, 2023

I don't fully agree, and I don't think it's the same case.
common.sql added a whole new functionality: a new generic operator. The issues were mostly around the new functionality, not around the new provider by itself.
The lesson I learned there is to add the new functionality first and give it some time before converting all the other providers to use it.

Back to the Google case. We are not adding anything new; this is more about re-organizing the existing code. To me it seems the main reason it's not split is the common folder, which is used by almost all of the google space and will be hard to break into individual providers. However, this folder is not changing that much. Check the commit history - when it does change, most of the commits are about styling.

@potiuk
Member Author

potiuk commented Mar 4, 2023

> Back to the Google case. We are not adding anything new; this is more about re-organizing the existing code. To me it seems the main reason it's not split is the common folder, which is used by almost all of the google space and will be hard to break into individual providers. However, this folder is not changing that much. Check the commit history - when it does change, most of the commits are about styling.

Those are all non-styling, potentially breaking changes to the common part of Google. It seems we have a substantial change in it almost every month:

Move help message to the google auth code (#29888) Jarek Potiuk* Yesterday 11:11
Keyfile dict can be dict not str (#29135) Daniel Standish* 25/01/2023, 19:39
Deprecate `delegate_to` param in GCP operators and update docs (#29088) Shahar Epstein* 23/01/2023, 23:19
Update old style typing (#26872) Pierre Jeambrun* 27/10/2022, 04:39
Enable string normalization in python formatting - providers (#27205) Daniel Standish* 23/10/2022, 22:17
Update google hooks to prefer non-prefixed extra fields (#27023) Daniel Standish* 22/10/2022, 21:41
Apply PEP-563 (Postponed Evaluation of Annotations) to non-core airflow (#26289) Jarek Potiuk* 13/09/2022, 19:20
Add deferrable big query operators and sensors (#26156) Phani Kumar* 08/09/2022, 23:17
Make GoogleBaseHook credentials functions public (#25785) Felix Uellendall* 19/08/2022, 11:54
Fix Flask Login user setting for Flask 2.2 and Flask-Login 0.6.2 (#25318) Jarek Potiuk* 27/07/2022, 00:01
Add test_connection method to `GoogleBaseHook` (#24682) Phani Kumar* 06/07/2022, 16:57
Upgrade FAB to 4.1.1 (#24399) Jarek Potiuk* 22/06/2022, 23:26
Cloud Storage assets & StorageLink update (#23865) Wojciech Januszek* 06/06/2022, 15:02
Add key_secret_project_id parameter which specifies a project with KeyFile (#23930) Maksim* 04/06/2022, 23:28
Update credentials when using ADC in Compute Engine (#23773) Maksim* 03/06/2022, 13:28
Ensure @contextmanager decorates generator func (#23103) Tzu-ping Chung* 30/05/2022, 09:24
TextToSpeech assets & system tests migration (AIP-47) (#23247) Bartłomiej Hirsz* 04/05/2022, 22:40
Change BaseOperatorLink interface to take a ti_key, not a datetime (#21798) Ash Berlin-Taylor* 01/03/2022, 15:29
Extract ClientInfo to module level (#21554) pierrejeambrun* 15/02/2022, 22:38
Dataproc metastore assets (#21267) Wojciech Januszek* 15/02/2022, 20:09
Google Cloud Composer opearators (#21251) Łukasz Wyszomirski* 11/02/2022, 13:41
Add optional features in providers. (#21074) Jarek Potiuk* 27/01/2022, 13:58
Fix setting of project ID in ``provide_authorized_gcloud`` (#20428) Jonas Grabber* 31/12/2021, 17:32
Add support in GCP connection for reading key from Secret Manager (#19164) Dragan Kesic* 14/11/2021, 22:49

@felicienveldema

I'm experiencing some issues upgrading the google-ads Python package. Version 18 has been deprecated since the beginning of this week, and higher versions depend on protobuf > 4.5.x.
This is all well and good, but I'm piggybacking on this issue as I eventually get stuck with apache-airflow-providers-google depending on google-cloud-secret-manager < 2.x, which depends on protobuf 3, which causes my predicament.

Is the google-cloud-secret-manager dependency needed, or could it easily be upgraded to the newer 2.x versions?
If a new issue needs to be opened, please let me know.

@r-richmond
Contributor

Is the google-cloud-secret-manager dependency needed or could it be easily upgraded to the newer 2.x versions?

I'm sure it is still needed. I'd recommend trying to upgrade that package first in a separate PR. FWIW, I've had several of these situations where I want package A upgraded but have to do packages B & C first.

@potiuk
Member Author

potiuk commented Mar 15, 2023

There is a WIP from Google team to upgrade the SDKs #30067

@r-richmond
Contributor

@potiuk Given #30067 (comment), I was curious whether there have been any additional conversations around extras vs. provider breakout. (I have a small preference towards extras since it seems easier/faster to implement given the conversations above.)

@potiuk
Member Author

potiuk commented May 16, 2023

> @potiuk Given #30067 (comment), I was curious whether there have been any additional conversations around extras vs. provider breakout. (I have a small preference towards extras since it seems easier/faster to implement given the conversations above.)

No - no discussions. And I think they are not needed.

I personally think that once we get it updated now and keep updating to new versions (which should happen pretty much automatically as soon as we remove pretty much all the upper-bound dependencies), the problem will all but disappear.

The vast majority of the problem came from the fact that we were half before and half after a huge backwards-incompatible change introduced across all Google Python libraries some 4 years ago. #30067 puts that dichotomy to an end.

I am actually going to actively chase and remove all the upper-bound limitations that we have elsewhere, because this is IMHO the only way we can keep our sanity long term. We already have a system in place that checks whether there are breaking changes in deps released in main, and for a long time we have been faster to detect and fix them than anyone else (see for example this issue from today, where our canary builds detected an alembic incompatibility and we fixed it before the first user reported it to us: #31313).
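To illustrate the pinning policy this implies (version numbers below are purely illustrative, not Airflow's actual constraints): an upper cap freezes a dependency at the last tested major, while a lower-bound-only specifier lets canary builds exercise each new release as soon as it ships.

```python
# Upper-capped vs. lower-bound-only PEP 440 specifiers.
UPPER_CAPPED = "google-cloud-bigquery>=2.0.0,<3.0.0"  # blocks new majors
LOWER_BOUND_ONLY = "google-cloud-bigquery>=3.0.0"     # picked up automatically


def has_upper_cap(specifier: str) -> bool:
    """Crude textual check for an upper cap in a specifier string."""
    return any(op in specifier for op in ("<", "~="))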

With Google eventually implementing (discussions on this are in progress) a System Dashboard similar to the Amazon one, we will get even further than that, because we will start detecting errors that impact working with the actual Google Cloud services.

@potiuk
Member Author

potiuk commented May 16, 2023

Having all that in place, I do not really see the need to split Google at all. Maybe the extras will save a bit of space when installing the provider, but there will be very little need to split it, IMHO.

@potiuk
Member Author

potiuk commented May 16, 2023

Just to explain why: a few days back there was a bit of a story about Amazon going back to a smarter monolith from microservices https://thenewstack.io/return-of-the-monolith-amazon-dumps-microservices-for-video-monitoring/ and this goes hand-in-hand with my observations (and the reason why we still have a monorepo for Airflow and providers).

Splitting things up into pieces looks cool, but in a number of cases it is not a silver bullet: while it adds isolation and decouples stuff, when there are hidden couplings it might bring way more cost than benefit. Maintaining and solving the problems that come up with such a split might easily cost more than the potential gains.

So once we get rid of the root cause of the problem (which in fact was not very related to the internal Google package structure, but more to the fact that we had a "half-baked" cake), we should carefully look at the needs and costs of any splitting approach and see whether any of it is needed.

IMHO we should not discuss splitting vs extras but extras vs. doing nothing at most.

@r-richmond
Contributor

IMHO we should not discuss splitting vs extras but extras vs. doing nothing at most.

Makes 💯 sense to me

maybe the extras will save a bit of space when installing the provider

Yes, my main interest stems from the desire to save space and, more importantly, to ignore Google libraries I don't use, particularly the ones that lag behind Python versions and other dependency updates.
