Public Connectors SDK Resources

The connectors SDK public Github repository provides resources to help developers build their own Fusion SDK connectors. Some of the resources include documentation and getting started guides, as well as example connectors.

For developing a Java SDK based connector, check out the Java SDK README.

The repository includes a Gradle project, which wraps each known plugin with a common set of tasks and dependencies.

See the Simple Connector example for instructions on how to build, deploy and run.

Fusion SDK Connectors Overview

The connectors architecture in Fusion 4 and later is designed to be scalable. Depending on the connector, jobs can be scaled by adding instances of just the connector. The fetching process for these types also supports distributed fetching, so that many instances can contribute to the same job.

Connectors can be hosted within Fusion, or can run remotely. In the hosted case, these connectors are cluster aware. This means that when a new instance of Fusion starts up, the connectors on other Fusion nodes become aware of the new connectors, and vice versa. This simplifies scaling connectors jobs.

In the remote case, connectors become clients of Fusion. These clients run a lightweight process and communicate to Fusion using an efficient messaging format. This option makes it possible to put the connector wherever the data lives. In some cases, this might be required for performance or security and access reasons.

The communication of messages between Fusion and a remote connector or hosted connector are identical; Fusion sees them as the same kind of connector. This means you can implement a connector locally, connect to a remote Fusion for initial testing, and when done, upload the exact same artifact (a zip file) into Fusion, so Fusion can host it for you. The ability to run the connector remotely makes the development process much quicker.

Connectors SDK support matrix

Fusion release	SDK version
5.11.x - 5.12.x	4.2.1
5.9.x - 5.10.x	4.2.0
5.8.0 - 5.8.x	4.1.4
5.6.x - 5.7.x	4.1.3
5.5.1-1 - 5.5.1-x	4.1.2
5.5.1 - 5.5.x	4.1.2
5.5.0	4.1.1
5.4.4 - 5.4.x	4.1.0
5.4.0 - 5.4.3	4.0.0
5.3.0 - 5.3.x	3.0.0
5.2.1 - 5.2.x	2.0.3
5.2.0	2.0.2
5.1.2 - 5.1.x	2.0.1
5.1.0 - 5.1.1	2.0.0
5.0.2	2.0.0-pre-release
4.2.6	1.5.0
4.2.4 - 4.2.5	1.4.0
4.2.2 - 4.2.3	1.3.0
4.2.1	1.2.0
4.2.0	1.1.0

Java SDK

The Java SDK provides components for making it simple to build a connector in Java. Whether the plugin is a true crawler or a simple iterative fetcher, the SDK supports both.

The Java SDK includes a set of base classes and interfaces. It also provides the Gradle build utilities for packaging up a connector, and a connector client application that can run your connector remotely.

Many of the base features needed for a connector are provided by Fusion itself. When a connector first connects to Fusion, it sends along its name, type, schema, and other metadata. This connection then stays open, and the two systems can communicate bi-directionally as needed.

This makes it possible for Fusion to manage configuration data, the job state, scheduling, and encryption for example. The Fusion Admin UI also takes care of the view or presentation, by making use of the connector’s schema.

This client-based approach decouples connectors from Fusion, which allows hot deployment of connectors through a simple connect call.

Distributed Data Store

The data persisted by the connectors framework is distributed across the Fusion cluster. Each node holds its primary partition of the data, as well as backups of other partitions. If a node goes down during a crawl, the data store remains intact and usable. Connector implementations do not need to be concerned with this layer, because it is all handled by Fusion.

Server Side Processing

An important point to consider when building a connector is that the server does not guarantee ordering of emitted items such as Candidates, Documents, Deletes, etc., when processing. Therefore, any connector logic that depends on precise ordering of processing (including index-pipeline and Solr commits) may produce incorrect results. For example, when a document replace is immediately followed by a delete-by-query, and the delete-by-query depends on the document replace to be fully processed and committed. If the document commit has not yet occurred, then the delete-by-query may result in the wrong items being deleted.

CrawlDB fields

Core fields required for any connector include: id and state_s.
Connector specific values include the "fields" and "metadata" properties, which result in Solr document prefixed fields: field_ and meta_, respectively.

Field Name	Field Description	Example value
id	Unique candidate indentifier	content:/app
jobId_s	Unique job identifier. All items processed in the new job will have a different jobId.	KTPbmHYTqm
blockId_s	A BlockId identifies a series of 1 or more Jobs, and the lifetime of a BlockId spans from the start of a crawl to the crawls completion.When a Job starts and the previous Job did not complete (failed or stopped), the previous Job’s BlockId is reused. The same BlockId will be reused until the crawl successfully completes.BlockIds are used to quickly identify items in the CrawlDb which may not have been fully processed (complete).	KwhuWW7wya
state_s	State transition. Possible values (FetchInput, Document, Skip, Error, Checkpoint, ACI(AccessControItem), Delete, FetchResult).	Document
targetPhase_s	Name of the phase this item is emitted to.	content
sourcePhase_s	Name of the phase an item was emitted from.	content
isTransient_b	Flag to indicate that the item should be removed from CrawDB after it has been processed.	false
isLeafNode_b	This flag is used to prioritize the processing leaf node instead of nested nodes to avoid emitting of too many Candidates.	false
createdAt_l	Item created timestamp.	1566508663611
createdAt_tdt	Item created ISO date.	2019-08-22T21:17:43.611Z
modifiedAt_l	Timestamp value which is updated when item changes its state. Also, if purge stray items feature is enabled in the connector plugin, this field is used to determine whether the item is stray or not, then the item is deleted if it’s a stray item.	1566508665709
modifiedAt_tdt	ISO date value which is updated when item changes its state. It serves same purpose as modifiedAt_l.	2019-08-22T21:17:45.709Z
fetchInput_id_s	FetchInput Id.	/app

Name		Name	Last commit message	Last commit date
Latest commit History 92 Commits
assets/images		assets/images
java-sdk		java-sdk
.gitignore		.gitignore
README.asciidoc		README.asciidoc

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

assets/images

assets/images

java-sdk

java-sdk

.gitignore

.gitignore

README.asciidoc

README.asciidoc

Repository files navigation

Public Connectors SDK Resources

Fusion SDK Connectors Overview

Connectors SDK support matrix

Java SDK

Distributed Data Store

Server Side Processing

CrawlDB fields

About

Releases

Contributors 18

Languages

lucidworks/connectors-sdk-resources

Folders and files

Latest commit

History

Repository files navigation

Public Connectors SDK Resources

Fusion SDK Connectors Overview

Connectors SDK support matrix

Java SDK

Distributed Data Store

Server Side Processing

CrawlDB fields

About

Resources

Stars

Watchers

Forks

Languages