PheBee is a phenotype-to-cohort query service that integrates structured biomedical ontologies and AWS-native infrastructure to support translational research. It enables researchers and clinicians to ask complex questions about phenotypic data in patient cohorts, such as:
- "Which subjects have a specific phenotype or any of its descendants?"
- "How frequently does a phenotype occur within a cohort?"
PheBee leverages ontologies like HPO (Human Phenotype Ontology), MONDO (Mondo Disease Ontology), and ECO (Evidence and Conclusion Ontology) to provide deep, hierarchical querying and evidence classification.
- Query patient cohorts based on ontological relationships
- Graph-based data storage in AWS Neptune
- RESTful API with OpenAPI spec and AWS Signature V4 authentication
- Serverless architecture powered by AWS SAM and Lambda
- Iceberg tables registered in AWS Glue Data Catalog, enabling integration with Lake Formation and other analytics tools
- Automated deployment and testing workflows
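One of the features above, AWS Signature V4 authentication, is normally handled for you by the AWS SDKs (e.g. botocore's `SigV4Auth`), but the signing chain itself is straightforward. The sketch below shows the core steps for a request with no query string; the host, path, and credentials are illustrative, and a real client should prefer the SDK:

```python
import datetime
import hashlib
import hmac
from urllib.parse import quote


def _sign(key: bytes, msg: str) -> bytes:
    return hmac.new(key, msg.encode("utf-8"), hashlib.sha256).digest()


def sigv4_headers(method, host, path, region, access_key, secret_key,
                  service="execute-api", payload=b"", now=None):
    """Build SigV4 headers for a request with no query string (sketch)."""
    now = now or datetime.datetime.utcnow()
    amz_date = now.strftime("%Y%m%dT%H%M%SZ")
    date_stamp = now.strftime("%Y%m%d")

    # Step 1: canonical request (headers block keeps its trailing newline)
    payload_hash = hashlib.sha256(payload).hexdigest()
    canonical_headers = f"host:{host}\nx-amz-date:{amz_date}\n"
    signed_headers = "host;x-amz-date"
    canonical_request = "\n".join(
        [method, quote(path), "", canonical_headers, signed_headers, payload_hash]
    )

    # Step 2: string to sign, scoped to date/region/service
    scope = f"{date_stamp}/{region}/{service}/aws4_request"
    string_to_sign = "\n".join([
        "AWS4-HMAC-SHA256", amz_date, scope,
        hashlib.sha256(canonical_request.encode("utf-8")).hexdigest(),
    ])

    # Step 3: derive the signing key via the HMAC chain, then sign
    k_date = _sign(("AWS4" + secret_key).encode("utf-8"), date_stamp)
    k_region = _sign(k_date, region)
    k_service = _sign(k_region, service)
    k_signing = _sign(k_service, "aws4_request")
    signature = hmac.new(k_signing, string_to_sign.encode("utf-8"),
                         hashlib.sha256).hexdigest()

    authorization = (
        f"AWS4-HMAC-SHA256 Credential={access_key}/{scope}, "
        f"SignedHeaders={signed_headers}, Signature={signature}"
    )
    return {"Host": host, "X-Amz-Date": amz_date, "Authorization": authorization}
```

In practice you would let boto3 or a SigV4-aware HTTP client produce these headers; the sketch is only meant to show what the API's authentication requires.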
PheBee uses a hybrid architecture combining knowledge graphs with data lake technologies to enable both semantic reasoning and analytical queries at scale.
AWS Neptune (Knowledge Graph)
- Stores ontology hierarchies (HPO, MONDO, ECO) as RDF triples
- Enables SPARQL queries for ontological reasoning and relationship traversal
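For example, "a phenotype or any of its descendants" maps naturally onto a SPARQL property path over the class hierarchy. The sketch below only builds the query string; the exact named graphs and predicate layout in PheBee's Neptune instance are assumptions, but OBO-style term IRIs and `rdfs:subClassOf` are standard:

```python
def descendants_query(term_curie: str) -> str:
    """Build a SPARQL query for a term and all of its descendants.

    term_curie: an ontology term such as "HP:0001250" (seizure); OBO terms
    map to IRIs like http://purl.obolibrary.org/obo/HP_0001250.
    """
    iri = "http://purl.obolibrary.org/obo/" + term_curie.replace(":", "_")
    return f"""
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
SELECT ?term WHERE {{
  ?term rdfs:subClassOf* <{iri}> .
}}
"""
```

Against Neptune this query would be posted to the SPARQL endpoint; the `*` in `rdfs:subClassOf*` matches zero or more subclass hops, so the root term itself is included in the result set.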
Apache Iceberg (Data Lake)
- Stores subject-term associations and clinical evidence as columnar data
- Queryable via AWS Athena for analytical workloads
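An analytical question like phenotype frequency then becomes plain SQL over the Iceberg table. A sketch of the query construction (the `subject_terms` table and column names are assumptions for illustration, not PheBee's actual schema):

```python
def phenotype_frequency_sql(term_iris, table="subject_terms"):
    """Build an Athena SQL query counting distinct subjects per term."""
    in_list = ", ".join(f"'{iri}'" for iri in term_iris)
    return (
        f"SELECT term_iri, COUNT(DISTINCT subject_id) AS n_subjects "
        f"FROM {table} "
        f"WHERE term_iri IN ({in_list}) "
        f"GROUP BY term_iri"
    )
```

The IRI list would typically come from the Neptune descendant query, so the counts cover the whole subtree of a phenotype; the SQL itself would then be submitted via boto3's `athena.start_query_execution`.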
DynamoDB (Caching Layer)
- Caches frequently accessed ontology metadata (term descendants, versions)
- Dramatically reduces load for common traversal patterns
- Handles versioning for ontology updates
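A natural layout for such a cache keys each descendant set by ontology version, so an ontology update writes new items instead of invalidating old ones. The key scheme and read-through pattern below are hypothetical (not PheBee's actual table design); `table` stands in for a boto3 DynamoDB `Table` resource:

```python
def descendants_cache_key(ontology: str, version: str, term: str) -> dict:
    """Composite DynamoDB key: one item per (ontology version, term)."""
    return {"pk": f"{ontology}#{version}", "sk": f"DESC#{term}"}


def get_descendants(table, ontology, version, term, compute):
    """Read-through cache: return the cached descendant list if present,
    otherwise compute it (e.g. via a Neptune SPARQL query) and store it."""
    key = descendants_cache_key(ontology, version, term)
    item = table.get_item(Key=key).get("Item")
    if item is not None:
        return item["descendants"]
    descendants = compute(term)
    table.put_item(Item={**key, "descendants": descendants})
    return descendants
```

Because the version is part of the partition key, a new ontology release simply populates fresh items while old queries pinned to the previous version keep working.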
S3 (Object Storage)
- Raw data staging for bulk imports (Phenopackets, NDJSON)
- Iceberg table storage (Parquet files)
- Ontology source files (OWL, OBO)
- Ontology Loading: OWL/OBO files → Neptune graph + DynamoDB cache
- Bulk Import: S3 NDJSON batches → Step Functions orchestration → Iceberg tables → Neptune graph
- Materialization: Evidence data is aggregated into dual-partitioned analytical tables for optimized query patterns
- Query Path:
- API Gateway → Lambda → Neptune (ontology traversal) + Athena (data queries)
- Results combined and returned via RESTful API
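The merge step in that query path is conceptually simple: the graph side answers "which terms count?", the lake side answers "who has which terms?". Schematically (all names are illustrative):

```python
def subjects_with_phenotype(descendants, subject_terms):
    """Given the descendant term set from Neptune and per-subject term
    lists from Athena, return subjects annotated with the phenotype or
    any of its descendants."""
    wanted = set(descendants)
    return sorted(s for s, terms in subject_terms.items() if wanted & set(terms))
```

The heavy lifting (hierarchy traversal, columnar scans) happens in Neptune and Athena; the Lambda layer only has to intersect and format the results.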
- Semantic reasoning requires graph traversal (Neptune)
- Analytical queries at scale need columnar storage (Iceberg/Athena)
- Hybrid approach gives best of both worlds: ontology intelligence + data lake performance
- Serverless components (Lambda, Step Functions) minimize operational overhead
- Open formats (RDF, Iceberg, Parquet) ensure data portability and interoperability
This project provides a samconfig.yaml.example file as a template for your deployment configuration.
To get started, copy it to create your own samconfig.yaml:
```bash
cp samconfig.yaml.example samconfig.yaml
```

Then, edit the file with your environment-specific values. Each section of the file contains deployment settings for a stack in a given environment. Here's what each field means:
```yaml
prod:                         # The environment name
  deploy:
    parameters:
      stack_name: phebee-prod # The name of the CloudFormation stack to be created or updated
      capabilities:
        - CAPABILITY_IAM       # Allows creation of IAM resources
        - CAPABILITY_NAMED_IAM # Allows creation of named IAM roles and policies
      parameter_overrides:
        - VpcId=      # The ID of your target VPC, which allows your Lambda functions and
                      # resources to connect securely within your private network
                      # Learn more: https://docs.aws.amazon.com/vpc/latest/userguide/what-is-amazon-vpc.html
        - SubnetId1=  # The first subnet ID, typically in the same availability zone as
                      # other services your app needs to access
        - SubnetId2=  # The second subnet ID, usually in a different availability zone
                      # for high availability and fault tolerance
                      # Learn more: https://docs.aws.amazon.com/vpc/latest/userguide/VPC_Subnets.html
      tags:
        - app=phebee  # Tags applied to the stack for resource tracking or cost management
```

Once filled in, this configuration will allow you to run:

```bash
sam deploy --config-env prod
```

This command will use the parameters defined in your samconfig.yaml without needing to specify them manually each time.
Before building or deploying PheBee, make sure you have:
- AWS CLI installed and configured
- AWS SAM CLI
- Python 3.9+
- pip and virtualenv (recommended)
- AWS credentials with appropriate IAM permissions for deploying a SAM app
To install the required Python dependencies for deployment:
```bash
pip install awscli aws-sam-cli
```

Then configure AWS:

```bash
aws configure
```

You can manually build and deploy the SAM application using the AWS SAM CLI.
```bash
sam build
```

This command compiles the application and its dependencies into `.aws-sam/build`.
```bash
sam deploy --config-env dev \
  --no-confirm-changeset \
  --resolve-s3 \
  --no-fail-on-empty-changeset
```

Optional flags:
- `--profile <your-profile>`: Use a named AWS profile
- `--stack-name <custom-stack>`: Deploy under a custom stack name
To check deployment status:
```bash
aws cloudformation describe-stacks --stack-name <your-stack-name>
```

To tear the stack down:

```bash
sam delete --stack-name <your-stack-name> --no-prompts
```

Integration tests validate the infrastructure and APIs by deploying the stack and exercising key endpoints.
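The same status check can be scripted with boto3, which is convenient in CI. The helper below separates response parsing (pure, easily testable) from the AWS calls; the stack name is illustrative:

```python
def stack_status(response: dict) -> str:
    """Extract StackStatus from a CloudFormation describe_stacks response."""
    return response["Stacks"][0]["StackStatus"]


def wait_until_created(stack_name: str) -> str:
    """Block until stack creation finishes, then return its final status."""
    import boto3  # requires configured AWS credentials
    cfn = boto3.client("cloudformation")
    cfn.get_waiter("stack_create_complete").wait(StackName=stack_name)
    return stack_status(cfn.describe_stacks(StackName=stack_name))
```

boto3's built-in `stack_create_complete` waiter polls `describe_stacks` with backoff, so there is no need to hand-roll a retry loop.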
Install dependencies:
```bash
pip install pytest boto3
```

Ensure your AWS credentials are configured (`aws configure`).
```bash
pytest -m integration -v
```

With profile or environment:
```bash
pytest -m integration --profile=dev --config-env=dev -v
```

Use an existing deployed stack:
```bash
# Using command-line flag
pytest -m integration --existing-stack <your-stack-name> -v

# Or create .phebee-test-stack file for persistent configuration
echo "your-stack-name" > .phebee-test-stack
pytest -m integration -v
```

See the Testing Guide for details.
Run a specific test:
```bash
pytest tests/integration/test_cloudformation_stack.py::test_cloudformation_stack -m integration -v
```

PheBee includes comprehensive performance testing infrastructure to evaluate bulk data ingestion throughput and API query latency at scale with realistic clinical data patterns.
- Realistic synthetic data generation with disease clustering and clinical documentation patterns
- Bulk import performance measurement for large-scale data ingestion
- API latency testing with 7 query patterns representing real-world use cases
- Reproducible benchmark datasets (1K-100K subjects) for manuscript evaluation
- Automated performance visualization scripts for publication-ready figures
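To give a flavor of what "disease clustering" means for the synthetic data, here is a toy generator in the same spirit: each subject is assigned a disease cluster and then draws phenotype terms associated with that cluster. The cluster names and term lists below are made up for illustration; PheBee's actual generator is more sophisticated:

```python
import random

# Hypothetical clusters: each disease co-occurs with a small set of terms
DISEASE_CLUSTERS = {
    "epilepsy": ["HP:0001250", "HP:0002069", "HP:0010818"],
    "ciliopathy": ["HP:0000510", "HP:0000077", "HP:0001407"],
}


def generate_subjects(n: int, seed: int = 42) -> list:
    """Generate n synthetic subjects with clustered phenotype terms."""
    rng = random.Random(seed)  # seeded for reproducible benchmark datasets
    subjects = []
    for i in range(n):
        cluster = rng.choice(sorted(DISEASE_CLUSTERS))
        terms = DISEASE_CLUSTERS[cluster]
        subjects.append({
            "subject_id": f"S{i:06d}",
            "cluster": cluster,
            "terms": rng.sample(terms, rng.randint(1, len(terms))),
        })
    return subjects
```

Seeding the generator is what makes the 1K-100K subject benchmark datasets reproducible: the same seed yields byte-identical cohorts across runs.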
For detailed instructions, see:
- Testing Guide - Complete testing documentation
- Performance Testing Guide - Performance evaluation methodology
We welcome contributions! Please open an issue or submit a pull request for bug reports, feature suggestions, or general improvements.
This project is licensed under the BSD 3-Clause License. See the LICENSE file for details.