[ aws . kendra ]

batch-put-document

Description

Adds one or more documents to an index.

The BatchPutDocument operation enables you to ingest inline documents or a set of documents stored in an Amazon S3 bucket. Use this operation to ingest your text and unstructured text into an index, to add custom attributes to the documents, and to attach an access control list to the documents added to the index.

The documents are indexed asynchronously. You can track the progress of the batch using Amazon CloudWatch. Any error messages related to processing the batch are sent to your Amazon CloudWatch log.

See also: AWS API Documentation

See ‘aws help’ for descriptions of global parameters.

Synopsis

  batch-put-document
    --index-id <value>
    [--role-arn <value>]
    --documents <value>
    [--cli-input-json | --cli-input-yaml]
    [--generate-cli-skeleton <value>]

Options

--index-id (string)

The identifier of the index to add the documents to. You need to create the index first using the CreateIndex operation.

--role-arn (string)

The Amazon Resource Name (ARN) of a role that is allowed to run the BatchPutDocument operation. For more information, see IAM Roles for Amazon Kendra.

--documents (list)

One or more documents to add to the index.

Documents can include custom attributes. For example, ‘DataSourceId’ and ‘DataSourceSyncJobId’ are custom attributes that provide information on the synchronization of documents running on a data source. Note that ‘DataSourceSyncJobId’ can be treated as optional, because Amazon Kendra will use the ID of a running sync job.

Documents have the following file size limits.

  • 5 MB total size for inline documents

  • 50 MB total size for files from an S3 bucket

  • 5 MB extracted text for any file

For more information about file size and transaction per second quotas, see Quotas.
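The limits above can be checked locally before a batch is submitted. The following is a minimal sketch, not part of the CLI: the helper names are hypothetical, and it assumes that 1 MB means 1024 * 1024 bytes and that each limit is applied per document. The 5 MB extracted-text limit is not covered, because extracted text is only known after the service processes the file.

```python
# Assumption: 1 MB = 1024 * 1024 bytes; confirm against the Quotas page.
MB = 1024 * 1024
INLINE_LIMIT_BYTES = 5 * MB   # inline documents (Blob)
S3_LIMIT_BYTES = 50 * MB      # files read from an S3 bucket


def within_inline_limit(content: bytes) -> bool:
    """Return True if inline content fits the 5 MB inline limit."""
    return len(content) <= INLINE_LIMIT_BYTES


def within_s3_limit(size_in_bytes: int) -> bool:
    """Return True if a file of this size fits the 50 MB S3 limit."""
    return size_in_bytes <= S3_LIMIT_BYTES


if __name__ == "__main__":
    print(within_inline_limit(b"small inline document"))
    print(within_s3_limit(60 * MB))
```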

(structure)

A document in an index.

Id -> (string)

A unique identifier of the document in the index.

Title -> (string)

The title of the document.

Blob -> (blob)

The contents of the document.

Documents passed to the Blob parameter must be base64 encoded. Your code might not need to encode the document file bytes if you’re using an Amazon Web Services SDK to call Amazon Kendra operations. If you are calling the Amazon Kendra endpoint directly using REST, you must base64 encode the contents before sending.
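As an illustration of the encoding step, the sketch below base64 encodes document bytes and places them in a document structure. The function name, document ID, and title are hypothetical. Whether you need this step at all depends on how you call the service, as described above.

```python
import base64
import json


def inline_document(doc_id: str, title: str, content: bytes,
                    content_type: str = "PLAIN_TEXT") -> dict:
    """Build a document structure with base64-encoded inline content."""
    return {
        "Id": doc_id,
        "Title": title,
        # b64encode returns bytes; decode to str so the value is JSON-serializable.
        "Blob": base64.b64encode(content).decode("ascii"),
        "ContentType": content_type,
    }


# Hypothetical document used only for illustration.
doc = inline_document("doc-1", "Example", b"Hello, Kendra")
print(json.dumps(doc, indent=2))
```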

S3Path -> (structure)

Information required to find a specific file in an Amazon S3 bucket.

Bucket -> (string)

The name of the S3 bucket that contains the file.

Key -> (string)

The name of the file.

Attributes -> (list)

Custom attributes to apply to the document. Use the custom attributes to provide additional information for searching, to provide facets for refining searches, and to provide additional information in the query response.

(structure)

A custom attribute value assigned to a document.

Key -> (string)

The identifier for the attribute.

Value -> (structure)

The value of the attribute.

StringValue -> (string)

A string, such as “department”.

StringListValue -> (list)

A list of strings.

(string)

LongValue -> (long)

A long integer value.

DateValue -> (timestamp)

A date expressed as an ISO 8601 string.

Include the time zone in the ISO 8601 date-time format. For example, 2012-03-25T12:30:10+01:00 is the ISO 8601 date-time format for March 25th 2012 at 12:30PM (plus 10 seconds) in Central European Time.
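One way to produce a DateValue with an explicit UTC offset is to format a timezone-aware datetime. This is a minimal sketch using only the Python standard library. The attribute key "published_at" is a hypothetical name, and the fixed +01:00 offset is chosen to match the example in the text.

```python
from datetime import datetime, timedelta, timezone

# Fixed UTC+01:00 offset, matching the example in the text.
cet = timezone(timedelta(hours=1))
stamp = datetime(2012, 3, 25, 12, 30, 10, tzinfo=cet)

# isoformat() on an aware datetime emits the extended ISO 8601 form
# with the offset appended.
date_value = stamp.isoformat()

# "published_at" is a hypothetical attribute key.
attribute = {"Key": "published_at", "Value": {"DateValue": date_value}}
print(attribute)
```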

AccessControlList -> (list)

Information on user and group access rights, which is used for user context filtering.

(structure)

Provides user and group information for document access filtering.

Name -> (string)

The name of the user or group.

Type -> (string)

The type of principal.

Access -> (string)

Whether to allow or deny access to the principal.

DataSourceId -> (string)

The identifier of the data source the principal should access documents from.

HierarchicalAccessControlList -> (list)

The list of principal lists that define the hierarchy of document access for users.

(structure)

Information that defines the hierarchy of document access for users.

PrincipalList -> (list)

A list of principal lists that define the hierarchy of document access for users. Each hierarchical list specifies which user or group has allow or deny access for each document.

(structure)

Provides user and group information for document access filtering.

Name -> (string)

The name of the user or group.

Type -> (string)

The type of principal.

Access -> (string)

Whether to allow or deny access to the principal.

DataSourceId -> (string)

The identifier of the data source the principal should access documents from.
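The two access control shapes described above can be assembled as follows. This is a sketch under stated assumptions: the helper name and the user and group names are hypothetical, and the allowed values for Type and Access are taken from the JSON syntax on this page.

```python
from typing import Optional

# Allowed values as listed in the JSON syntax for this option.
VALID_TYPES = {"USER", "GROUP"}
VALID_ACCESS = {"ALLOW", "DENY"}


def principal(name: str, type_: str, access: str,
              data_source_id: Optional[str] = None) -> dict:
    """Build one principal entry, rejecting values outside the listed enums."""
    if type_ not in VALID_TYPES:
        raise ValueError(f"Type must be one of {sorted(VALID_TYPES)}")
    if access not in VALID_ACCESS:
        raise ValueError(f"Access must be one of {sorted(VALID_ACCESS)}")
    entry = {"Name": name, "Type": type_, "Access": access}
    if data_source_id is not None:
        entry["DataSourceId"] = data_source_id
    return entry


# Flat list: one entry per user or group (names are hypothetical).
access_control_list = [
    principal("alice", "USER", "ALLOW"),
    principal("contractors", "GROUP", "DENY"),
]

# Hierarchical list: each element wraps its own PrincipalList.
hierarchical_access_control_list = [
    {"PrincipalList": [principal("engineering", "GROUP", "ALLOW")]},
    {"PrincipalList": [principal("bob", "USER", "DENY")]},
]
```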

ContentType -> (string)

The file type of the document in the Blob field.

JSON Syntax:

[
  {
    "Id": "string",
    "Title": "string",
    "Blob": blob,
    "S3Path": {
      "Bucket": "string",
      "Key": "string"
    },
    "Attributes": [
      {
        "Key": "string",
        "Value": {
          "StringValue": "string",
          "StringListValue": ["string", ...],
          "LongValue": long,
          "DateValue": timestamp
        }
      }
      ...
    ],
    "AccessControlList": [
      {
        "Name": "string",
        "Type": "USER"|"GROUP",
        "Access": "ALLOW"|"DENY",
        "DataSourceId": "string"
      }
      ...
    ],
    "HierarchicalAccessControlList": [
      {
        "PrincipalList": [
          {
            "Name": "string",
            "Type": "USER"|"GROUP",
            "Access": "ALLOW"|"DENY",
            "DataSourceId": "string"
          }
          ...
        ]
      }
      ...
    ],
    "ContentType": "PDF"|"HTML"|"MS_WORD"|"PLAIN_TEXT"|"PPT"
  }
  ...
]
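To show how the syntax above fits together, the sketch below builds a one-document list that points to a file in S3 and serializes it to JSON. The bucket name, object key, document ID, and attribute keys are all hypothetical placeholders. ContentType is left out because it describes the file type of the Blob field.

```python
import json

# All identifiers below are hypothetical placeholders.
documents = [
    {
        "Id": "doc-s3-1",
        "Title": "Quarterly report",
        "S3Path": {"Bucket": "example-bucket", "Key": "reports/q1.pdf"},
        "Attributes": [
            {"Key": "department", "Value": {"StringValue": "finance"}},
            {"Key": "tags", "Value": {"StringListValue": ["report", "q1"]}},
            {"Key": "page_count", "Value": {"LongValue": 42}},
        ],
    }
]

payload = json.dumps(documents, indent=2)
print(payload)
```

If the output is saved to a file such as documents.json, it could then be supplied with --documents file://documents.json, assuming the file is in the current directory.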

--cli-input-json | --cli-input-yaml (string) Reads arguments from the JSON string provided. The JSON string follows the format provided by --generate-cli-skeleton. If other arguments are provided on the command line, those values will override the JSON-provided values. It is not possible to pass arbitrary binary values using a JSON-provided value as the string will be taken literally. This may not be specified along with --cli-input-yaml.

--generate-cli-skeleton (string) Prints a JSON skeleton to standard output without sending an API request. If provided with no value or the value input, prints a sample input JSON that can be used as an argument for --cli-input-json. Similarly, if provided yaml-input it will print a sample input YAML that can be used with --cli-input-yaml. If provided with the value output, it validates the command inputs and returns a sample output JSON for that command.

Output

FailedDocuments -> (list)

A list of documents that were not added to the index because the document failed a validation check. Each document contains an error message that indicates why the document couldn’t be added to the index.

If there was an error adding a document to an index, the error is reported in your Amazon CloudWatch log. For more information, see Monitoring Amazon Kendra with Amazon CloudWatch Logs.

(structure)

Provides information about a document that could not be indexed.

Id -> (string)

The unique identifier of the document.

ErrorCode -> (string)

The type of error that caused the document to fail to be indexed.

ErrorMessage -> (string)

A description of the reason why the document could not be indexed.
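A script that captures the command's JSON output can list the rejected documents as sketched below. The sample response is hypothetical and only mirrors the fields described in this section. The function name is also an assumption.

```python
import json

# Hypothetical response text, shaped like the Output section.
sample_response = """
{
  "FailedDocuments": [
    {
      "Id": "doc-2",
      "ErrorCode": "InvalidRequest",
      "ErrorMessage": "Example message"
    }
  ]
}
"""


def failed_documents(response_text: str) -> list:
    """Return (Id, ErrorCode, ErrorMessage) tuples for each failed document."""
    response = json.loads(response_text)
    return [
        (item.get("Id"), item.get("ErrorCode"), item.get("ErrorMessage"))
        for item in response.get("FailedDocuments", [])
    ]


for doc_id, code, message in failed_documents(sample_response):
    print(f"{doc_id}: {code}: {message}")
```

An empty FailedDocuments list only means that every document passed validation. Because indexing is asynchronous, later processing errors are reported in the CloudWatch log rather than in this response.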