DCApiClient(self,
base_url,
client_id,
client_secret,
uaa_url,
url_path_prefix='document-classification/v1/',
polling_threads=5,
polling_sleep=5,
polling_long_sleep=30,
polling_max_attempts=120,
logging_level=30)
This class provides an interface to access SAP Document Classification REST API from a Python application. The structure of values returned by all the methods is documented in the API reference: https://help.sap.com/viewer/ca60cd2ed44f4261a3ae500234c46f37/SHIP/en-US/c1045a561faf4ba0ae2b0e7713f5e6c4.html
- Argument base_url: The service URL taken from the service key (key 'url' in service key JSON)
- Argument client_id: The XSUAA client ID taken from the service key (key 'uaa.clientid' in service key JSON)
- Argument client_secret: The XSUAA client secret taken from the service key (key 'uaa.clientsecret' in service key JSON)
- Argument uaa_url: The XSUAA URL taken from the service key (key 'uaa.url' in service key JSON)
- Argument polling_threads: Number of threads used to poll for asynchronous DC APIs
- Argument polling_sleep: Number of seconds to wait between the polling attempts for most of the APIs, the minimal value is 0.2
- Argument polling_long_sleep: Number of seconds to wait between the polling attempts for model training and deployment operations, the minimal value is 0.2
- Argument polling_max_attempts: Maximum number of attempts used to poll for asynchronous DC APIs
- Argument logging_level: INFO level will log the operations progress, the default level WARNING should not produce any logs
DCApiClient.classify_document(document_path: str,
model_name,
model_version,
reference_id=None,
mime_type=None)
Submits request for document classification, checks the response and returns the reference ID for the uploaded document
- Argument document_path: Path to the PDF file on the disk
- Argument model_name: The name of the model that was successfully deployed to be used for the classification
- Argument model_version: The version of the model that was successfully deployed to be used for the classification
- Argument reference_id: In case the document reference ID has to be managed by the user, it can be specified. In this case the user is responsible for providing unique reference IDs for different documents
- Argument mime_type: The file type of the document uploaded
Returns: Object containing the reference ID of the classified document and the classification results
DCApiClient.classify_documents(documents_paths: typing.List[str],
model_name, model_version)
Submits requests for classification of multiple documents, checks the response and returns the reference ID for the classified documents
- Argument documents_paths: Paths to the PDF files on the disk
- Argument model_name: The name of the model that was successfully deployed to be used for the classification
- Argument model_version: The version of the model that was successfully deployed to be used for the classification
Returns: An iterator of objects containing the reference ID of the classified document and the classification results
DCApiClient.create_dataset()
Creates an empty dataset
Returns: Object containing the dataset id
DCApiClient.delete_dataset(dataset_id)
Deletes a dataset with a given ID
- Argument dataset_id: The ID of the dataset to delete
Returns: Object containing the ID of the deleted dataset and the number of documents deleted
DCApiClient.delete_training_document(dataset_id, document_id)
Deletes a training document from a dataset
- Argument dataset_id: The ID of the dataset where the document is located
- Argument document_id: The reference ID of the document
Returns: An empty object
DCApiClient.get_dataset_info(dataset_id)
Gets statistical information about a dataset with a given ID
- Argument dataset_id: The ID of the dataset
Returns: Summary information about the dataset that includes the number of documents in different processing stages
DCApiClient.get_datasets_info()
Gets summary information about the existing datasets
Returns: An array of datasets corresponding to the 'datasets' part of the json response
DCApiClient.get_dataset_documents_info(dataset_id,
top: int = None,
skip: int = None,
count: bool = None)
Gets the information about all the documents in a specific dataset
- Argument dataset_id: The ID of an existing dataset
- Argument top: Pagination: number of documents to be fetched in the current request
- Argument skip: Pagination: number of documents to skip for the current request
- Argument count: Flag to show count of number of documents in the dataset
Returns: Object that contains an array of the documents
DCApiClient.get_classification_documents_info(model_name, model_version)
Gets the information about recently classified documents
- Argument model_name: The name of the model against which the documents were classified
- Argument model_version: The version of the model against which the documents were classified
Returns: An array of document information correspond to the 'results' part of the json response. Each document information includes its reference ID and the classification status.
DCApiClient.upload_document_to_dataset(
dataset_id,
document_path: str,
ground_truth: typing.Union[str, dict],
document_id=None,
mime_type=None,
stratification_set=None)
Uploads a single document and its ground truth to a specific dataset
- Argument dataset_id: The ID of the dataset
- Argument document_path: The path to the PDF document
- Argument ground_truth: Path to the ground truth JSON file or an object representing the ground truth
- Argument document_id: The reference ID of the document
- Argument mime_type: The file type of the document
- Argument stratification_set: Defines a custom stratification set (training/validation/test)
Returns: Object containing information about the uploaded document
DCApiClient.upload_documents_directory_to_dataset(
dataset_id, path, file_extension='.pdf')
- Argument dataset_id: The ID of the dataset to upload the documents to
- Argument path: The path has to contain document data files and JSON file with GT with corresponding names
- Argument file_extension: The file format of the documents to be uploaded. Default is '.pdf'
Returns: An iterator with the upload results
DCApiClient.upload_documents_to_dataset(
dataset_id, documents_paths: typing.List[str],
ground_truths_paths: typing.List[typing.Union[str, dict]])
- Argument dataset_id: The ID of the dataset to upload the documents to
- Argument documents_paths: The paths of the PDF files
- Argument ground_truths_paths: The paths of the JSON files containing the ground truths
Returns: An iterator with the upload results
DCApiClient.train_model(model_name, dataset_id)
Trigger the process to train a new model version for documents classification, based on the documents in the specific dataset and wait until this process is finished. The process may take significant time to complete depending on the size of the dataset.
- Argument model_name: The name of the new model to train
- Argument dataset_id: The name of existing dataset containing enough documents for training
Returns: Object containing the statistical data about the trained model, including accuracy, recall and precision
DCApiClient.delete_trained_model(model_name, model_version)
Deletes an existing trained model
- Argument model_name: Name of the existing model to delete
- Argument model_version: Version of the existing model to delete
Returns: An empty object
DCApiClient.get_trained_models_info()
Gets information about all trained models
Returns: An array of trained models corresponding to the 'models' part of the json response . Each model information contains training status and training accuracy data.
DCApiClient.get_trained_model_info(model_name, model_version)
Gets information about a specific trained model
- Argument model_name: The name of the model
- Argument model_version: The version of the model
Returns: Object containing the training status and training accuracy data
DCApiClient.deploy_model(model_name, model_version)
Deploys a trained model to be available for inference
- Argument model_name: The name of the trained model
- Argument model_version: The version of the trained model
Returns: Object containing information about the deployed model serving
DCApiClient.get_deployed_models_info()
Gets information about all deployed model servings
Returns: An array of all deployed model servings corresponding to the 'deployments' part if the json response
DCApiClient.get_deployed_model_info(model_name_or_deployment_id,
model_version=None)
Gets information about a specific deployed model serving. This method can be called either with the ID of the deployed model or with the model name and version
- Argument model_name_or_deployment_id: ID of the deployed model or the model name, if the model name is provided, version has to be provided as well
- Argument model_version: The version of the deployed model
Returns: Object containing the information about the deployed model serving
DCApiClient.undeploy_model(model_name_or_deployment_id,
model_version=None)
Removes a deployment of the specific model serving. This method can be called either with the ID of the deployed model or with the model name and version
- Argument model_name_or_deployment_id: ID of the deployed model or the model name, if the model name is provided, version has to be provided as well
- Argument model_version: The version of the deployed model
Returns: An empty object
DoxApiClient(self,
base_url,
client_id,
client_secret,
uaa_url,
url_path_prefix='document-information-extraction/v1/',
polling_threads=5,
polling_sleep=5,
polling_max_attempts=60,
logging_level=30)
This class provides an interface to access SAP Document Information Extraction REST API from a Python application. The structure of the values returned by all the methods is documented in the API reference: https://help.sap.com/viewer/5fa7265b9ff64d73bac7cec61ee55ae6/SHIP/en-US/ded7d34e60f1422ba2e04e892a7f0e25.html
- Argument base_url: The service URL taken from the service key (key 'url' in service key JSON)
- Argument client_id: The XSUAA client ID taken from the service key (key 'uaa.clientid' in service key JSON)
- Argument client_secret: The XSUAA client secret taken from the service key (key 'uaa.clientsecret' in service key JSON)
- Argument uaa_url: The XSUAA URL taken from the service key (key 'uaa.url' in the service key JSON)
- Argument polling_threads: Number of threads used to poll for asynchronous APIs
- Argument polling_sleep: Number of seconds to wait between the polling attempts for APIs, the minimal value is 0.2
- Argument polling_max_attempts: Maximum number of attempts used to poll for asynchronous APIs
- Argument logging_level: INFO level will log the operations progress, the default level WARNING should not produce any logs
DoxApiClient.get_capabilities()
Gets the capabilities available for the service instance.
Returns: Dictionary with available extraction fields, enrichment and document types.
DoxApiClient.create_client(client_id, client_name)
Creates a new client for whom a document can be uploaded
- Argument client_id: The ID of the new client
- Argument client_name: The name of the new client
Returns: The API endpoint response as dictionary
DoxApiClient.create_clients(clients: list)
Creates one or more clients for whom documents can be uploaded
- Argument clients: A list of clients to be created. For the format of a client see documentation
Returns: The API endpoint response as dictionary
DoxApiClient.get_clients(top: int = 100,
skip: int = None,
client_id_starts_with: str = None)
Gets all existing clients filtered by the parameters
- Argument top: The maximum number of clients to get. Default is 100
- Argument skip: (optional) Index of the first client to get
- Argument client_id_starts_with: (optional) Filters the clients by the characters the ID starts with
Returns: List of existing clients as dictionaries corresponding to the 'payload' part of the json response
DoxApiClient.delete_client(client_id)
Deletes a client with the given client ID
- Argument client_id: The ID of the client to be deleted
Returns: The API endpoint response as dictionary
DoxApiClient.delete_clients(client_ids: list = None)
Deletes multiple clients with the given client IDs
- Argument client_ids: (optional) List of IDs of clients to be deleted. If no IDs are provided, all clients will be deleted
Returns: The API endpoint response as dictionary
DoxApiClient.post_client_capability_mapping(
client_id,
document_type: str = 'paymentAdvice',
file_type: str = 'Excel',
header_fields: typing.Union[str, typing.List[str]] = None,
line_item_fields: typing.Union[str, typing.List[str]] = None)
Post the client capability mapping for the Identifier API provided by the customer to the entity client DB
- Argument client_id: The ID of the client for which to upload the capability mapping
- Argument document_type: The document type of for which the mapping applies. Default is 'paymentAdvice'
- Argument file_type: The file type for which the mapping applies. Default is 'Excel'
- Argument header_fields: A list of mappings for the header fields. For the format see documentation
- Argument line_item_fields: A list of mappings for the line item fields. For the format see documentation
Returns: The API endpoint response as dictionary
DoxApiClient.post_client_capability_mapping_with_options(
client_id, options: dict)
Post the client capability mapping for the Identifier API provided by the customer to the entity client DB
- Argument client_id: The ID of the client for which to upload the capability mapping
- Argument options: The mapping that should be uploaded. For the format see documentation
Returns: The API endpoint response as dictionary
DoxApiClient.extract_information_from_document(
document_path: str,
client_id,
document_type: str,
mime_type: str = 'application/pdf',
header_fields: typing.Union[str, typing.List[str]] = None,
line_item_fields: typing.Union[str, typing.List[str]] = None,
template_id=None,
schema_id=None,
received_date=None,
enrichment=None,
return_null_values: bool = False)
Extracts the information from a document. The function will run until a processing result can be returned or a timeout is reached
- Argument document_path: The path to the document
- Argument client_id: The client ID for which the document should be uploaded
- Argument document_type: The type of the document being uploaded. For available document types see documentation
- Argument mime_type: Content type of the uploaded file. If 'unknown' is given, the content type is fetched automatically. Default is 'application/pdf'. The 'constants.py' file contains CONTENT_TYPE_[JPEG, PDF, PNG, TIFF, UNKNOWN] that can be used here.
- Argument header_fields: A list of header fields to be extracted. Can be given as list of strings or as comma separated string. If none are given, no header fields will be extracted. Will be ignored, if schema_id is provided
- Argument line_item_fields: A list of line item fields to be extracted. Can be given as list of strings or as comma separated string. If none are given, no line item fields will be extracted. Will be ignored, if schema_id is provided.
- Argument template_id: (optional) The ID of the template to be used for this document
- Argument schema_id: (optional) The ID of the schema to be used for the document. Only schema_id OR header_fields and line_item_fields can be used. If given, header_fields and line_item_fields are ignored
- Argument received_date: (optional) The date the document was received
- Argument enrichment: (optional) A dictionary of entities that should be used for entity matching
- Argument return_null_values: Flag if fields with null as value should be included in the response or not. Default is False
Returns: The extracted information of the document as dictionary
DoxApiClient.extract_information_from_document_with_options(
document_path: str,
options: dict,
mime_type: str = 'application/pdf',
return_null_values: bool = False)
Extracts the information from a document. The function will run until a processing result can be returned or a timeout is reached.
- Argument document_path: The path to the document
- Argument options: The options for processing the document as dictionary. It has to include at least a valid client ID and document type
- Argument mime_type: Content type of the uploaded file. If 'unknown' is given, the content type is fetched automatically. Default is 'application/pdf'. The 'constants.py' file contains CONTENT_TYPE_[JPEG, PDF, PNG, TIFF, UNKNOWN] that can be used here.
- Argument return_null_values: Flag if fields with null as value should be included in the response or not. Default is False
Returns: The extracted information of the document as dictionary
DoxApiClient.extract_information_from_documents(
document_paths: typing.List[str],
client_id,
document_type: str,
mime_type: str = 'application/pdf',
mime_type_list: typing.List[str] = None,
header_fields: typing.Union[str, typing.List[str]] = None,
line_item_fields: typing.Union[str, typing.List[str]] = None,
template_id=None,
schema_id=None,
received_date=None,
enrichment=None,
return_null_values: bool = False)
Extracts the information from multiple documents. The function will run until all documents have been processed or a timeout is reached. The given parameters will be used for all documents
- Argument document_paths: A list of paths to the documents
- Argument client_id: The client ID for which the documents should be uploaded
- Argument document_type: The type of the document being uploaded. For available document types see documentation
- Argument mime_type: Content type that is used for all uploaded files. If 'unknown' is given, the content type is fetched automatically. Default is 'application/pdf'. The 'constants.py' file contains CONTENT_TYPE_[JPEG, PDF, PNG, TIFF, UNKNOWN] that can be used here.
- Argument mime_type_list: A list of content types for each file to be uploaded. Has to have the same length as 'document_paths'. If this parameter is given, 'mime_type' will be ignored.
- Argument header_fields: A list of header fields to be extracted. Can be given as list of strings or as comma separated string. If none are given, no header fields will be extracted. Will be ignored, if schema_id is provided
- Argument line_item_fields: A list of line item fields to be extracted. Can be given as list of strings or as comma separated string. If none are given, no line item fields will be extracted. Will be ignored, if schema_id is provided.
- Argument template_id: (optional) The ID of the template to be used for the documents
- Argument schema_id: (optional) The ID of the schema to be used for the documents. Only schema_id OR header_fields and line_item_fields can be used. If given, header_fields and line_item_fields are ignored
- Argument received_date: (optional) The date the documents were received
- Argument enrichment: (optional) A dictionary of entities that should be used for entity matching. For the format see documentation
- Argument return_null_values: Flag if fields with null as value should be included in the responses or not. Default is False
Returns: An iterator with extracted information for successful documents and exceptions for failed documents. Use next(iterator) within a try-catch block to filter the failed documents.
DoxApiClient.extract_information_from_documents_with_options(
document_paths: typing.List[str],
options: dict,
mime_type: str = 'application/pdf',
mime_type_list: typing.List[str] = None,
return_null_values: bool = False)
Extracts the information from multiple documents. The function will run until all documents have been processed or a timeout is reached. The given options will be used for all documents
- Argument document_paths: A list of paths to the documents
- Argument options: The options for processing the documents as dictionary. It has to include at least a valid client ID and document type
- Argument mime_type: Content type that is used for all uploaded files. If 'unknown' is given, the content type is fetched automatically. Default is 'application/pdf'. The 'constants.py' file contains CONTENT_TYPE_[JPEG, PDF, PNG, TIFF, UNKNOWN] that can be used here.
- Argument mime_type_list: A list of content types for each file to be uploaded. Has to have the same length as 'document_paths'. If this parameter is given, 'mime_type' will be ignored.
- Argument return_null_values: Flag if fields with null as value should be included in the responses or not. Default is False
Returns: An iterator with extracted information for successful documents and exceptions for failed documents. Use next(iterator) within a try-catch block to filter the failed documents.
DoxApiClient.get_extraction_for_document(
document_id,
extracted_values: bool = None,
return_null_values: bool = False)
Gets the extracted information of an uploaded document by document ID. Raises an exception, when the document failed or didn't finish processing after the maximum number of requests
- Argument document_id: The ID of the document
- Argument extracted_values: (optional) Flag if the extracted values or the ground truth should be returned. If set
to
True
the extracted values are returned. If set toFalse
the ground truth is returned. If no ground truth is available, the extracted values will be returned either way. IfNone
is given, the ground truth is returned if available - Argument return_null_values: Flag if fields with null as value should be included in the response or not. Default is False
Returns: The extracted information of the processed document or the ground truth as dictionary
DoxApiClient.get_extraction_for_documents(
document_ids: list,
extracted_values: bool = None,
return_null_values: bool = False)
Gets the extracted information for multiple documents given their document IDs
- Argument document_ids: A list of IDs of documents
- Argument extracted_values: (optional) Flag if the extracted values or the ground truth should be returned. If set
to
True
the extracted values are returned. If set toFalse
the ground truth is returned. If no ground truth is available, the extracted values will be returned either way. IfNone
is given, the ground truth is returned if available - Argument return_null_values: Flag if fields with null as value should be included in the response or not. Default is False
Returns: An iterator with extracted information or ground truths for successful documents and exceptions for failed documents. Use next(iterator) within a try-catch block to filter the failed documents.
DoxApiClient.get_document_list(client_id=None)
Gets a list of document jobs filtered by the client ID
- Argument client_id: (optional) The client ID for which the document jobs should be get. Gets all document jobs if no client ID is given
Returns: A list of document jobs as dictionaries corresponding to the 'results' part of the json response
DoxApiClient.delete_documents(document_ids: list = None)
Deletes a list of documents or all documents
- Argument document_ids: (optional) A list of document IDs that shall be deleted. If this argument is not provided, all documents are deleted.
Returns: The API endpoint response as dictionary
DoxApiClient.upload_enrichment_data(client_id,
data,
data_type: str,
subtype: str = None)
Creates one or more enrichment data records. The function returns after all data was created successfully or raises an exception if something went wrong.
- Argument client_id: The client ID for which the data records shall be created.
- Argument data: A list of data to be uploaded. For the format of the data see documentation
- Argument data_type: The type of data which is uploaded. For the available data types see documentation
- Argument subtype: (optional) Only used for type 'businessEntity'. For the available subtypes see documentation
Returns: The API endpoint response as dictionary
DoxApiClient.get_enrichment_data(client_id,
data_type: str,
subtype: str = None,
top: int = None,
skip: int = None,
data_id=None,
system=None,
company_code=None)
Gets the enrichment data records filtered by the provided parameters
- Argument client_id: The ID of the client for which the enrichment data was created
- Argument data_type: The type of the data records. For the available data types see documentation
- Argument subtype: (optional) The subtype of the records. Only used for type 'businessEntity'. For the available subtypes see documentation
- Argument top: (optional) The maximum number records to be returned
- Argument skip: (optional) The index of the first record to be returned
- Argument data_id: (optional) The ID of a single data record. Only one will be returned
- Argument system: (optional) The system of a single record
- Argument company_code: (optional) The company code of a single record
Returns: A list of enrichment data records corresponding to the 'value' part of the json response. Returns a list with one item when data_id is given
DoxApiClient.delete_all_enrichment_data(data_type: str = None)
This endpoint is deleting all enrichment data records for the account
- Argument data_type: (Optional) The type of enrichment data that should be deleted. For the available data types see documentation
Returns: The API endpoint response as dictionary
DoxApiClient.delete_enrichment_data(client_id,
enrichment_records: list,
data_type: str,
subtype: str = None,
delete_async: bool = False)
Deletes the enrichment data records with the given IDs in the payload
- Argument client_id: The client ID for which the enrichment data was created
- Argument enrichment_records: A list of dictionaries with the form:
{'id':'', 'system':'', 'companyCode':''}
- Argument data_type: The type of enrichment data that should be deleted. For the available document data types see documentation
- Argument subtype: (optional) The subtype of the records that should be deleted. Only used for type 'businessEntity'. For the available subtypes see documentation
- Argument delete_async: Set to
True
to delete data records asynchronously. Asynchronous deletion should be used when deleting large amounts of data to improve performance. Default isFalse
Returns: The API endpoint response as dictionary
DoxApiClient.activate_enrichment_data(params=None)
Activates all enrichment data records for the current tenant
- Argument params: Optional. A dictionary, list of tuples or bytes to send as a query string.
Returns: The API endpoint response as dictionary
DoxApiClient.get_image_for_document(document_id, page_no: int)
Gets the image of a document page for the given document ID and page number
- Argument document_id: The ID of the document
- Argument page_no: The page number for which to get the image
Returns: The image of the document page in the PNG format as bytes
DoxApiClient.get_document_page_text(document_id, page_no: int)
Gets the text for a document page for the given document ID and page number
- Argument document_id: The ID of the document
- Argument page_no: The page number for which to get the text
Returns: The text for the page as a list of dictionaries corresponding to the 'value' part of the json response
DoxApiClient.get_document_text(document_id)
Gets the text for all pages of a document
- Argument document_id: The ID of the document
Returns: A dictionary mapping the page numbers to the text corresponding to the 'results' part of the json reponse
DoxApiClient.get_request_for_document(document_id)
Gets the request of a processed document
- Argument document_id: The ID of the document
Returns: The request of the document as dictionary
DoxApiClient.get_page_dimensions_for_document(document_id, page_no: int)
Gets the dimensions of a document page
- Argument document_id: The ID of the document
- Argument page_no: The page number for which to get the dimensions
Returns: The height and width of the document page as dictionary
DoxApiClient.get_all_dimensions_for_document(document_id)
Gets the dimensions of all document pages
- Argument document_id: The ID of the document
Returns: The height and width of all document pages as dictionary with page number as key corresponding to the 'results' part of the json response
DoxApiClient.post_ground_truth_for_document(
document_id, ground_truth: typing.Union[str, dict])
Saves the ground truth for a document
- Argument document_id: The ID of the document
- Argument ground_truth: Path to the ground truth JSON file or an object representing the ground truth
Returns: The API endpoint response as dictionary
DoxApiClient.post_confirm_document(document_id,
data_for_retraining=False)
Sets the document status to confirmed
- Argument document_id: The ID of the document
- Argument data_for_retraining: Indicates whether the document should be allowed for retraining. Default is False
Returns: The API endpoint response as dictionary