PheBee is a phenotype-to-cohort query service that integrates structured biomedical ontologies and AWS-native infrastructure to support translational research. It enables researchers and clinicians to ask complex questions about phenotypic data in patient cohorts, such as:
- "Which subjects have a specific phenotype or any of its descendants?"
- "How frequently does a phenotype occur within a cohort?"
PheBee leverages ontologies like HPO (Human Phenotype Ontology), MONDO (Mondo Disease Ontology), and ECO (Evidence and Conclusion Ontology) to provide deep, hierarchical querying and evidence classification.
- Query patient cohorts based on ontological relationships
- Graph-based data storage in AWS Neptune
- RESTful API with OpenAPI spec and AWS Signature V4 authentication
- Serverless architecture powered by AWS SAM and Lambda
- Iceberg tables registered in AWS Glue Data Catalog, enabling integration with Lake Formation and other analytics tools
- Automated deployment and testing workflows
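One of the features above, AWS Signature V4 authentication, is normally handled for you by the AWS SDKs (e.g. botocore's `SigV4Auth`), but the signing chain itself is straightforward. The sketch below shows the core steps for a request with no query string; the host, path, and credentials are illustrative, and a real client should prefer the SDK:

```python
import datetime
import hashlib
import hmac
from urllib.parse import quote


def _sign(key: bytes, msg: str) -> bytes:
    return hmac.new(key, msg.encode("utf-8"), hashlib.sha256).digest()


def sigv4_headers(method, host, path, region, access_key, secret_key,
                  service="execute-api", payload=b"", now=None):
    """Build SigV4 headers for a request with no query string (sketch)."""
    now = now or datetime.datetime.utcnow()
    amz_date = now.strftime("%Y%m%dT%H%M%SZ")
    date_stamp = now.strftime("%Y%m%d")

    # Step 1: canonical request (headers block keeps its trailing newline)
    payload_hash = hashlib.sha256(payload).hexdigest()
    canonical_headers = f"host:{host}\nx-amz-date:{amz_date}\n"
    signed_headers = "host;x-amz-date"
    canonical_request = "\n".join(
        [method, quote(path), "", canonical_headers, signed_headers, payload_hash]
    )

    # Step 2: string to sign, scoped to date/region/service
    scope = f"{date_stamp}/{region}/{service}/aws4_request"
    string_to_sign = "\n".join([
        "AWS4-HMAC-SHA256", amz_date, scope,
        hashlib.sha256(canonical_request.encode("utf-8")).hexdigest(),
    ])

    # Step 3: derive the signing key via the HMAC chain, then sign
    k_date = _sign(("AWS4" + secret_key).encode("utf-8"), date_stamp)
    k_region = _sign(k_date, region)
    k_service = _sign(k_region, service)
    k_signing = _sign(k_service, "aws4_request")
    signature = hmac.new(k_signing, string_to_sign.encode("utf-8"),
                         hashlib.sha256).hexdigest()

    authorization = (
        f"AWS4-HMAC-SHA256 Credential={access_key}/{scope}, "
        f"SignedHeaders={signed_headers}, Signature={signature}"
    )
    return {"Host": host, "X-Amz-Date": amz_date, "Authorization": authorization}
```

In practice you would let boto3 or a SigV4-aware HTTP client produce these headers; the sketch is only meant to show what the API's authentication requires.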
PheBee uses a hybrid architecture combining knowledge graphs with data lake technologies to enable both semantic reasoning and analytical queries at scale.
AWS Neptune (Knowledge Graph)
- Stores ontology hierarchies (HPO, MONDO, ECO) as RDF triples
- Enables SPARQL queries for ontological reasoning and relationship traversal
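For example, "a phenotype or any of its descendants" maps naturally onto a SPARQL property path over the class hierarchy. The sketch below only builds the query string; the exact named graphs and predicate layout in PheBee's Neptune instance are assumptions, but OBO-style term IRIs and `rdfs:subClassOf` are standard:

```python
def descendants_query(term_curie: str) -> str:
    """Build a SPARQL query for a term and all of its descendants.

    term_curie: an ontology term such as "HP:0001250" (seizure); OBO terms
    map to IRIs like http://purl.obolibrary.org/obo/HP_0001250.
    """
    iri = "http://purl.obolibrary.org/obo/" + term_curie.replace(":", "_")
    return f"""
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
SELECT ?term WHERE {{
  ?term rdfs:subClassOf* <{iri}> .
}}
"""
```

Against Neptune this query would be posted to the SPARQL endpoint; the `*` in `rdfs:subClassOf*` matches zero or more subclass hops, so the root term itself is included in the result set.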
Apache Iceberg (Data Lake)
- Stores subject-term associations and clinical evidence as columnar data
- Queryable via AWS Athena for analytical workloads
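An analytical question like phenotype frequency then becomes plain SQL over the Iceberg table. A sketch of the query construction (the `subject_terms` table and column names are assumptions for illustration, not PheBee's actual schema):

```python
def phenotype_frequency_sql(term_iris, table="subject_terms"):
    """Build an Athena SQL query counting distinct subjects per term."""
    in_list = ", ".join(f"'{iri}'" for iri in term_iris)
    return (
        f"SELECT term_iri, COUNT(DISTINCT subject_id) AS n_subjects "
        f"FROM {table} "
        f"WHERE term_iri IN ({in_list}) "
        f"GROUP BY term_iri"
    )
```

The IRI list would typically come from the Neptune descendant query, so the counts cover the whole subtree of a phenotype; the SQL itself would then be submitted via boto3's `athena.start_query_execution`.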
DynamoDB (Caching Layer)
- Caches frequently accessed ontology metadata (term descendants, versions)
- Dramatically reduces load for common traversal patterns
- Handles versioning for ontology updates
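A natural layout for such a cache keys each descendant set by ontology version, so an ontology update writes new items instead of invalidating old ones. The key scheme and read-through pattern below are hypothetical (not PheBee's actual table design); `table` stands in for a boto3 DynamoDB `Table` resource:

```python
def descendants_cache_key(ontology: str, version: str, term: str) -> dict:
    """Composite DynamoDB key: one item per (ontology version, term)."""
    return {"pk": f"{ontology}#{version}", "sk": f"DESC#{term}"}


def get_descendants(table, ontology, version, term, compute):
    """Read-through cache: return the cached descendant list if present,
    otherwise compute it (e.g. via a Neptune SPARQL query) and store it."""
    key = descendants_cache_key(ontology, version, term)
    item = table.get_item(Key=key).get("Item")
    if item is not None:
        return item["descendants"]
    descendants = compute(term)
    table.put_item(Item={**key, "descendants": descendants})
    return descendants
```

Because the version is part of the partition key, a new ontology release simply populates fresh items while old queries pinned to the previous version keep working.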
S3 (Object Storage)
- Raw data staging for bulk imports (Phenopackets, NDJSON)
- Iceberg table storage (Parquet files)
- Ontology source files (OWL, OBO)
- Ontology Loading: OWL/OBO files → Neptune graph + DynamoDB cache
- Bulk Import: S3 NDJSON batches → Step Functions orchestration → Iceberg tables → Neptune graph
- Materialization: Evidence data is aggregated into dual-partitioned analytical tables for optimized query patterns
- Query Path:
- API Gateway → Lambda → Neptune (ontology traversal) + Athena (data queries)
- Results combined and returned via RESTful API
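The merge step in that query path is conceptually simple: the graph side answers "which terms count?", the lake side answers "who has which terms?". Schematically (all names are illustrative):

```python
def subjects_with_phenotype(descendants, subject_terms):
    """Given the descendant term set from Neptune and per-subject term
    lists from Athena, return subjects annotated with the phenotype or
    any of its descendants."""
    wanted = set(descendants)
    return sorted(s for s, terms in subject_terms.items() if wanted & set(terms))
```

The heavy lifting (hierarchy traversal, columnar scans) happens in Neptune and Athena; the Lambda layer only has to intersect and format the results.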
- Semantic reasoning requires graph traversal (Neptune)
- Analytical queries at scale need columnar storage (Iceberg/Athena)
- Hybrid approach gives best of both worlds: ontology intelligence + data lake performance
- Serverless components (Lambda, Step Functions) minimize operational overhead
- Open formats (RDF, Iceberg, Parquet) ensure data portability and interoperability
This project provides a samconfig.yaml.example file as a template for your deployment configuration.
To get started, copy it to create your own samconfig.yaml:
```bash
cp samconfig.yaml.example samconfig.yaml
```

Then, edit the file with your environment-specific values. Each section of the file contains deployment settings for a stack in a given environment. Here's what each field means:
```yaml
prod:                         # The environment name
  deploy:
    parameters:
      stack_name: phebee-prod # The name of the CloudFormation stack to be created or updated
      capabilities:
        - CAPABILITY_IAM       # Allows creation of IAM resources
        - CAPABILITY_NAMED_IAM # Allows creation of named IAM roles and policies
      parameter_overrides:
        - VpcId=      # The ID of your target VPC, which allows your Lambda functions and
                      # resources to connect securely within your private network
                      # Learn more: https://docs.aws.amazon.com/vpc/latest/userguide/what-is-amazon-vpc.html
        - SubnetId1=  # The first subnet ID, typically in the same availability zone as
                      # other services your app needs to access
        - SubnetId2=  # The second subnet ID, usually in a different availability zone
                      # for high availability and fault tolerance
                      # Learn more: https://docs.aws.amazon.com/vpc/latest/userguide/VPC_Subnets.html
      tags:
        - app=phebee  # Tags applied to the stack for resource tracking or cost management
```

Once filled in, this configuration will allow you to run:

```bash
sam deploy --config-env prod
```

This command will use the parameters defined in your samconfig.yaml without needing to specify them manually each time.
Before building or deploying PheBee, make sure you have:
- AWS CLI installed and configured
- AWS SAM CLI
- Python 3.9+
- pip and virtualenv (recommended)
- AWS credentials with appropriate IAM permissions for deploying a SAM app
To install the required Python dependencies for deployment:
```bash
pip install awscli aws-sam-cli
```

Then configure AWS:

```bash
aws configure
```

You can manually build and deploy the SAM application using the AWS SAM CLI.
```bash
sam build
```

This command compiles the application and its dependencies into `.aws-sam/build`.
```bash
sam deploy --config-env dev \
  --no-confirm-changeset \
  --resolve-s3 \
  --no-fail-on-empty-changeset
```

Optional flags:
- `--profile <your-profile>`: Use a named AWS profile
- `--stack-name <custom-stack>`: Deploy under a custom stack name
To check deployment status:
```bash
aws cloudformation describe-stacks --stack-name <your-stack-name>
```

To tear the stack down:

```bash
sam delete --stack-name <your-stack-name> --no-prompts
```

Integration tests validate the infrastructure and APIs by deploying the stack and exercising key endpoints.
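The same status check can be scripted with boto3, which is convenient in CI. The helper below separates response parsing (pure, easily testable) from the AWS calls; the stack name is illustrative:

```python
def stack_status(response: dict) -> str:
    """Extract StackStatus from a CloudFormation describe_stacks response."""
    return response["Stacks"][0]["StackStatus"]


def wait_until_created(stack_name: str) -> str:
    """Block until stack creation finishes, then return its final status."""
    import boto3  # requires configured AWS credentials
    cfn = boto3.client("cloudformation")
    cfn.get_waiter("stack_create_complete").wait(StackName=stack_name)
    return stack_status(cfn.describe_stacks(StackName=stack_name))
```

boto3's built-in `stack_create_complete` waiter polls `describe_stacks` with backoff, so there is no need to hand-roll a retry loop.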
Install dependencies:
```bash
pip install pytest boto3
```

Ensure your AWS credentials are configured (`aws configure`).
```bash
pytest -m integration -v
```

With profile or environment:
```bash
pytest -m integration --profile=dev --config-env=dev -v
```

Use an existing deployed stack:
```bash
# Using command-line flag
pytest -m integration --existing-stack <your-stack-name> -v

# Or create .phebee-test-stack file for persistent configuration
echo "your-stack-name" > .phebee-test-stack
pytest -m integration -v
```

See the Testing Guide for details.
Run a specific test:
```bash
pytest tests/integration/test_cloudformation_stack.py::test_cloudformation_stack -m integration -v
```

PheBee includes comprehensive performance testing infrastructure to evaluate bulk data ingestion throughput and API query latency at scale with realistic clinical data patterns.
- Realistic synthetic data generation with disease clustering and clinical documentation patterns
- Bulk import performance measurement for large-scale data ingestion
- API latency testing with 7 query patterns representing real-world use cases
- Reproducible benchmark datasets (1K-100K subjects) for manuscript evaluation
- Automated performance visualization scripts for publication-ready figures
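To give a flavor of what "disease clustering" means for the synthetic data, here is a toy generator in the same spirit: each subject is assigned a disease cluster and then draws phenotype terms associated with that cluster. The cluster names and term lists below are made up for illustration; PheBee's actual generator is more sophisticated:

```python
import random

# Hypothetical clusters: each disease co-occurs with a small set of terms
DISEASE_CLUSTERS = {
    "epilepsy": ["HP:0001250", "HP:0002069", "HP:0010818"],
    "ciliopathy": ["HP:0000510", "HP:0000077", "HP:0001407"],
}


def generate_subjects(n: int, seed: int = 42) -> list:
    """Generate n synthetic subjects with clustered phenotype terms."""
    rng = random.Random(seed)  # seeded for reproducible benchmark datasets
    subjects = []
    for i in range(n):
        cluster = rng.choice(sorted(DISEASE_CLUSTERS))
        terms = DISEASE_CLUSTERS[cluster]
        subjects.append({
            "subject_id": f"S{i:06d}",
            "cluster": cluster,
            "terms": rng.sample(terms, rng.randint(1, len(terms))),
        })
    return subjects
```

Seeding the generator is what makes the 1K-100K subject benchmark datasets reproducible: the same seed yields byte-identical cohorts across runs.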
For detailed instructions, see:
- Testing Guide - Complete testing documentation
- Performance Testing Guide - Performance evaluation methodology
We welcome contributions! Please open an issue or submit a pull request for bug reports, feature suggestions, or general improvements.
This project is licensed under the BSD 3-Clause License. See the LICENSE file for details.