
[SSDF Issue] PW2.1: Review the security architecture #144

Open
zdtsw opened this issue May 5, 2022 · 13 comments

@zdtsw
Contributor

zdtsw commented May 5, 2022

Ref: [SSDF Epic] PW: Produce well secured software

Recording work has been done for PW2.1:

  • Task: Have 1) a qualified person (or people) who were not involved with the design and/or 2) automated processes instantiated in the toolchain review the
    software design to confirm and enforce that it meets all of the security requirements and satisfactorily addresses the identified risk information.
  • Examples:
    Example 1: Review the software design to confirm that it addresses applicable security requirements.
    Example 2: Review the risk models created during software design to determine if they appear to adequately identify the risks.
    Example 3: Review the software design to confirm that it satisfactorily addresses the risks identified by the risk models.
    Example 4: Have the software’s designer correct failures to meet the requirements.
    Example 5: Change the design and/or the risk response strategy if the security requirements cannot be met.
    Example 6: Record the findings of design reviews to serve as artifacts (e.g., in the software specification, in the issue tracking system, in the threat model).

For details, see https://nvlpubs.nist.gov/nistpubs/SpecialPublications/NIST.SP.800-218.pdf, page 21.

@zdtsw
Contributor Author

zdtsw commented May 6, 2022

I am working on it; self-assigning.

@zdtsw
Contributor Author

zdtsw commented May 10, 2022

PW.2.1

  • Task and examples: as quoted in the issue description above.

  • Work:

    • a list of diagrams in different systems under project "adoptium"
    • Findings/questions on Build and Release:
      • No risk models exist when temurin work was done (confirmed)
      • The systems we use and develop are complex => more risk of problems in the release chain:
        • Jenkins and GitHub Actions are both used for building and testing => does GitHub Actions have the same ability to build something and push it out to users?
        • JFrog, DockerHub and GitHub are all used as storage for releases.
        • Both the old openjdk and the new temurin/adoptium need to be supported (mainly for docker images).
        • Old release versions (e.g. jdk15, jdk16) are still left in the build system even though they are not actively in use.
      • The project practices CI/CD but lacks traceability:
        • No versioning of the source code: no tags on the git repos under the projects (adoptium, temurin-compliance, etc.) per release.
        • No commit SHA can be traced back for an old release. Release notes are needed internally (what bugs we fixed, what test cases are skipped, etc.) and externally (what has been fixed upstream, what we have added on top of that, what is exempted, etc.). [confirmed]: the old release-notes feature was lost when migrating to website2; there is a ticket on this.
        • The release tags made in the temurin*-binaries repos are not useful.
        • Jenkins build results and AQA test results come back through different channels (Slack notifications, TRSS) but do not get enough attention => real issues are only spotted at a later stage.
      • Most of the agents are deployed/updated with Ansible, but software versions are not locked down (mostly latest dependencies), making it hard to reproduce exactly the same old build.
      • How do we know the docker images we use have no vulnerabilities, e.g. the base image plus the "update all" done in "docker build"?
      • Tests on changes made to the adoptium repos need to be reviewed.
      • We do not have a "staging" environment; everything runs in "production", e.g. ci.adoptopenjdk.net:
        • jobs created from source code in git repos: ci-pipeline-jenkins, aqa-test
        • jobs manually created for testing purposes
        • jobs that are no longer needed but are still kept there
      • What is the backup plan? [SXA: ThinBackup plugin used nightly, along with backups taken onto another server]
        • Is geo-redundancy enabled? No.
        • Are storage snapshots taken regularly? Yes: nightly on the adoptium Jenkins controller; No: Jenkins agents, which are re-created by Ansible.
        • Do we have any build dependencies on the environment (e.g. build artifacts relying on previous builds, URLs, etc.)?
        • JFrog Artifactory: any SLA support? Need to confirm.
        • website2: if it is down, how can users download tarballs? Need to confirm [SXA: Only by going to GitHub or pulling from Homebrew or JFrog for rpms/debs. The API is also separate and does not require the website (although the website requires the API)]
        • If we store all tarballs in GitHub, including nightly and official releases, do we have an SLA with GitHub? Need to confirm.
        • DockerHub: any SLA support? Is a vulnerability scan done automatically by DockerHub? Need to confirm.
    • Findings on AQA(test related system):
      • TODO
    • Findings on the Adoptium part (git repos + build + infra):
      • The different levels of access are not clear:
        • How to add/modify/remove people from the project, git repos, Jenkins access, and JCK machines; [SXA: public Jenkins access is done via git groups - currently in the AdoptOpenJDK space, which the PMC controls. JCK machine access is controlled by Eclipse when people join or leave as committers to the TC project]
        • What is needed for the above access to be granted?
        • What is the process to get a vote/approval? [SXA: Committer status on adoptium is required for those repositories, and that is a vote that goes out to existing committers via a standard Eclipse process]
      • For temurin, most of the security-related activities are handled by various access controls:
        1. Access to write permission in the GitHub repos (https://github.com/adoptium/), including source code, running GitHub Actions, and issues (self-assign, review, merge, etc.)
        2. Access to Jenkins (https://ci.adoptopenjdk.net/), divided into 4 groups:
          • public (read-only) for some jobs
          • execution (build/rebuild) for some jobs [SXA: This is generally the AdoptOpenJDK*build and AdoptOpenJDK*build-triage groups, who can run the build jobs]
          • execution for release-related jobs: users are manually added to the job; the config is not under source control [SXA: Currently the Jenkins admins and named others (Haroon+Sophia), although that should be replaced with https://github.com/orgs/AdoptOpenJDK/teams/build-release in the short term]
          • admin for everything
      • For infrastructure, the process is not clear:
        1. Jenkins controller (https://ci.adoptopenjdk.net)
          • Day-0 operation not clear: who, how, and what; apart from us, who else has permission to log in to these servers? [SXA: A subset of the PMC members - SXA/MV/GA/TE - and some others from IBM who had worked with us (PS/GJ/MW/DG) currently have admin access to the server]
          • Day-1 operation: assumed to be done by 2 PMC members: deploying changes on the VM to set up the baseline(?)
          • Day-2 operation: not clear who does it and how often the OS and applications, including the Jenkins core (currently 2.263.3), are patched. Need more documentation to confirm. [SXA: The intention is to start looking at it on a weekly basis, but this has not been done for the last year]
          • Is it HA? Need to confirm [SXA: No]
        2. Jenkins agents:
          • Day-0 operation not clear: who, how, and what kind of security has been applied; who else can log in to such agents? Some sponsors? Public cloud providers? [SXA: In some cases we've left the cloud providers on, e.g. Marist by request, but usually they are removed and only the infrastructure team can log in to the agents]
          • Day-1 operation: assumed to be done by Ansible playbooks [SXA: Generally yes, although the dockerHost systems are not strictly in accordance with them, and the dynamic agents we have at Azure/AWS are probably not configured this way]
          • Day-2 operation: mostly by Ansible, plus some manual work done on the machines (e.g. docker host volumes), which can be hard to trace or reproduce on a new machine.
          • Should we publish all agent information from https://github.com/adoptium/infrastructure/blob/master/ansible/inventory.yml? [All machines that are configured and maintained by Ansible are in there, as this is used as the source of information for AWX. Machines deployed through the DockerStatic role are not in there - they are a bit more fluid in their creation and do not have the playbooks applied. Also, the Linux x64 and aarch64 machines are started on demand using docker images on DockerHub created from the dockerfiles in https://github.com/adoptium/infrastructure/tree/master/ansible/docker]
          • Does every type of agent work as a single node or a node pool? Single point of failure, need to confirm [SXA: No jobs should be running against a specific host - there should be redundancy everywhere to cover outages. Where possible this is across more than one cloud provider. temurin-build#1044 ("Ensure all builds can run on multiple machines") covers some of the remaining operations that do not have redundancy]
  • Findings on external systems:
    1. Monitoring system:
    - Nagios (need permission to access it): does anyone use it? [SXA: Needs work - we can discuss on Wednesday :-)]
    - Who is keeping an eye on whether agents (VMs and containers) are dead? [SXA: Pretty much just me, as jobs get queued up...]
    2. Logging system:
    - Do we have a system to log our build results (console)? Who accesses our system? [SXA: Nothing outside jenkins]
    - What is the log rotation? If a wrong build was made months ago, can we find out who did the build and with which commits, etc.? [SXA: Only via the metadata that is uploaded to GitHub for nightlies (they are from a timer, so no user is associated with them). Full console logs from release builds will generally be kept by Jenkins, as they are locked]
    3. Automation system:
    - AWX: is postgres backed up? Does it share the same access control as Nagios? [SXA: All such services have separate ACLs. AWX should be accessible to anyone in the infrastructure GitHub team]
    4. Backup plan: https://github.com/adoptium/infrastructure/blob/master/README.md#backups needs corrections and updates [SXA: See also infrastructure#1295 ("Collate and document backup strategy for our infrastructure machines") - not perfect, but we have a plan for each service]
    5. Are there any other systems not under source control in the Adoptium repos, apart from the compliance part?

@jiekang

jiekang commented May 30, 2022

The JFrog account is a "sponsored enterprise" account.

@zdtsw
Contributor Author

zdtsw commented May 31, 2022

What is the release process for the services we provide, and who is responsible for these services?

@zdtsw
Contributor Author

zdtsw commented May 31, 2022

Some AQA-related questions:
- Has the new release of run-aqa been announced? v2 was from last December, but our temurin-build still uses v1 => how do we make sure our internal systems get updated? Need input.
- Are PerfNext or SmartMedia running somewhere and accessible by the public, as a service we provide? Need to confirm.
- How is TRSS released when new commits land in the source code? => when do we run Ansible to deploy new changes in TRSS?
- TRSS runs on AWS, deployed by the "infrastructure" Ansible as trss.adoptopenjdk.net.
- Is any system used to monitor the TRSS service? Need to confirm.
@smlambert could you help answer these parts?

@sxa
Member

sxa commented Jun 6, 2022

What is the release process for the services we provide, and who is responsible for these services:
* Is https://blog.adoptium.net/ still updated? => running on Netlify

Updated via PRs to https://github.com/adoptium/blog.adoptium.net

* Should https://github.com/adoptium/adoptium.net be archived?

Now that the new version is live and seems to work, we do need to do that.

  • Is any system used to monitor the TRSS service? Need to confirm.

Once Nagios is back in a reasonable state we should add TRSS to it to monitor its health, as with any other systems not already covered :-)

@zdtsw
Contributor Author

zdtsw commented Jun 16, 2022

@gdams @karianna could you give input on the items below:

  • "No check/scan on docker base image and packages we use to build docker image"
    Has this been done in any way, or is there a plan to implement it in the future (existing issue ticket)?

  • "process of release installer for ca certification image"
    If this is a manual process, is there any documentation describing the steps?

@gdams
Member

gdams commented Jun 16, 2022

"No check/scan on docker base image and packages we use to build docker image"
Has this been done in any way, or is there a plan to implement it in the future (existing issue ticket)?

The docker images have a Snyk security scan run before they are published (this is done by the DockerHub folks rather than us).

@zdtsw
Contributor Author

zdtsw commented Jun 16, 2022

Here is the SSDF security review doc for public access: https://docs.google.com/document/d/1w3znf2X4y0yoiK2w1cNxSwu8ok7ibWCbD3t1oEs2wwU/edit#heading=h.vkwl1qx18vjg
Please review it and let me know if anything is missing

@smlambert
Contributor

re: #144 (comment)

  • Has the new release of run-aqa been announced? v2 was from last December, but our temurin-build still uses v1 => how do we make sure our internal systems get updated? Need input.
  • Are PerfNext or SmartMedia running somewhere and accessible by the public, as a service we provide? Need to confirm.
  • How is TRSS released when new commits land in the source code? => when do we run Ansible to deploy new changes in TRSS?
  • TRSS runs on AWS, deployed by the "infrastructure" Ansible as trss.adoptopenjdk.net.
  • Is any system used to monitor the TRSS service? Need to confirm.

Sorry, somehow missed this note earlier:

  • I was not aware that temurin-build uses run-aqa in any shape or form; where/how does it use it? I presume a bot is now monitoring all workflow .yml files for updates, which would include run-aqa updates (and switching to an immutable SHA instead of a tag).
  • No public instances of PerfNext or SmartMedia are running anywhere.
  • TRSS is updated via a Jenkins job, https://ci.adoptium.net/view/Test_grinder/job/TRSS_Code_Sync/, which runs weekly on Friday; no Ansible playbook is required to deploy the latest changes to TRSS. (In the future we hope to move the public TRSS instance into a container, which should make it even easier to manage/deploy/recover, but that is not currently in the 2Q plan.) This was working very well, but recent changes, possibly to user ids and permissions or from the recent Jenkins server upgrade, seem to have caused the most recent run to fail.
  • Scott Fryer has added TRSS to Nagios for monitoring.

@sxa
Member

sxa commented Jul 12, 2023

Noting that we intend to have a third party perform an audit covering parts of this, so that process will address some of these items.

@sxa
Member

sxa commented Oct 17, 2023

Note that we have volunteered as an Eclipse project to have an external party perform an analysis of our project security; this will be commencing soon.

@sxa
Member

sxa commented Dec 13, 2023

The audit described above started last week (Monday 15th December) and is progressing.
