AWS for Industries

Analyzing historical mining data with Amazon Bedrock

The development of a new mine is a long process. Figures from S&P Global1 show that the average lead time from discovery to operations now exceeds 15 years. To meet soaring demand for critical minerals, the mining industry is increasingly targeting previously mined or explored areas and re-evaluating the viability of these deposits using new exploration and mining techniques.

These historical workings come with extensive records, including field reports, geophysical surveys, soil samples, and drilling data. Historical mining data offers a trove of information on the nature and recoverability of an ore body. However, sorting through these massive data sets has traditionally been resource-intensive.

Applying generative artificial intelligence (gen AI) technology to the analysis of these records presents a powerful new way of making use of the valuable information they contain. This post demonstrates how to use a large language model (LLM) to effectively summarize unstructured historical records with Amazon Bedrock.

Extracting text versus context

Initial digitization efforts focused on creating digital copies of historical documents, but this approach is prohibitively time-consuming because it still requires reviewing and summarizing every page of every document. Modern optical character recognition (OCR) software improves on this by making content more searchable. Services like Amazon Textract augment OCR capabilities with machine learning to identify and extract specific data from large numbers of similarly structured documents. However, these systems cannot infer context from the text they digitize, so they often rely on subsequent machine analysis or human review.

A large language model is a type of advanced artificial intelligence model that can perform a wide range of language-related tasks. The training process for an LLM involves exposing the model to a vast amount of data to establish a mathematical relationship describing the likelihood of one word appearing after another. An LLM uses this knowledge to make predictions, or generate text, based on a relatively small number of inputs, in this case historical records.

While this means that the quality of the generated output depends on the data the model was exposed to during training, the sheer volume of data used provides an incredibly robust foundation. Where models start to reach their limits in domain-specific applications, techniques such as Retrieval Augmented Generation (RAG) and fine-tuning exist to improve the quality and traceability of model output.

It’s important to remember that the LLM is generating entirely new text rather than quoting the source verbatim, so inaccuracies, also known as hallucinations, may occur. There’s no direct relationship between the summary and the source, and the results may differ each time the model is run, so text generated by an LLM should not be relied on without verification.

Using the Converse API to query a document

The Amazon Bedrock Converse API is a unified interface for interacting with text-based generative AI models on Amazon Bedrock. The API allows developers to send messages and receive responses using a consistent format across different models. I used the Converse API to put together a Jupyter notebook that analyzes a mine owner’s report filed in 1939.
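
To illustrate, here is a minimal sketch of such a call using the AWS SDK for Python (Boto3). The model ID, AWS Region, and local file handling are placeholders for this sketch rather than the exact code from the notebook; the scanned report is passed to the model as a document content block alongside the text prompt.

import boto3

# Amazon Bedrock Runtime client; the Region is a placeholder
bedrock_runtime = boto3.client("bedrock-runtime", region_name="us-east-1")

# Read the scanned report; the file name matches the example document
with open("CyrusPima528.pdf", "rb") as f:
    document_bytes = f.read()

response = bedrock_runtime.converse(
    modelId="anthropic.claude-3-sonnet-20240229-v1:0",  # any Converse-compatible model
    messages=[
        {
            "role": "user",
            "content": [
                {
                    "document": {
                        "name": "CyrusPima528",
                        "format": "pdf",
                        "source": {"bytes": document_bytes},
                    }
                },
                {"text": "Provide a summary of the document"},
            ],
        }
    ],
)

# The generated summary is returned as a text content block
print(response["output"]["message"]["content"][0]["text"])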

Fine tuning outputs with prompts

Prompts are the inputs you provide to the model to generate an appropriate response for a given task or instruction. The more specific and detailed the request, the better the generated response will be.

For example, if I ask the model to “Provide a summary of the document”, it returns:

This document is a Mine Owner’s Report filed with the Arizona Department of Mineral Resources in 1939 for the Cyrus Mine, located in Pima County, Arizona. The key details provided are:

and then it continues with an additional 190 words describing the report.

Claude, the large language model this example uses, is chatty when not given the proper inputs. If I provide a more detailed prompt requesting a concise summary such as “Provide a brief analysis of the document. Include the primary mineral being mined and whether mining has occurred”, it returns:

This document is a mine owner’s report for the Cyrus Mine located in Pima County, Arizona. The primary minerals of interest are copper, gold, silver, and molybdenum. No active mining operations are mentioned, but the owner C.P. Thomas indicates plans to run a 600-foot tunnel along the vein or sink a 40-foot shaft to the 200 level and drift 200 feet to further explore and access the mineralized contact vein between limestone and porphyry that is exposed at the surface. The mine appears to be an exploration prospect at this stage.

This is a better response, but still very detailed. To improve it further, I use a process known as role prompting. I set Claude’s role to “a geologist for a mining company” and provide a specific structure for the output, such as “Provide a brief analysis of the document. Follow the format: Name. Location. Mineralization. Total mineral recovered.” The model then returns:

Cyrus Mine. Pima County, Arizona. Mineralization: Contact vein between limestone and porphyry, containing copper, gold, silver, and molybdenum. Total mineral recovered: None mentioned, appears to be an undeveloped prospect.

Providing the model with more direction through detailed prompts helps it generate responses with the exact details you are looking for, making it faster and easier to review extensive historical documents.
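
Building on the earlier sketch (and reusing its hypothetical client and document bytes), role prompting could be expressed through the Converse API’s system parameter, with the structured instruction carried in the user message:

# The system prompt sets the model's role; the user message carries the
# document plus the structured instruction from the example above
response = bedrock_runtime.converse(
    modelId="anthropic.claude-3-sonnet-20240229-v1:0",  # placeholder model ID
    system=[{"text": "You are a geologist for a mining company."}],
    messages=[
        {
            "role": "user",
            "content": [
                {"document": {"name": "CyrusPima528", "format": "pdf",
                              "source": {"bytes": document_bytes}}},
                {"text": "Provide a brief analysis of the document. Follow the format: "
                         "Name. Location. Mineralization. Total mineral recovered."},
            ],
        }
    ],
)

print(response["output"]["message"]["content"][0]["text"])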

Creating structured output

Now that we know how to frame input questions to generate an effective document summary, how can we integrate these outputs with other systems for further analysis? Claude’s tool use functionality allows us to generate output in a format we define. For this example, we ask Claude to provide the output in JSON, a widely supported data interchange format that’s readable and easy to use. The input is:

“Provide a summary of the document and provide the output in JSON
Follow the format provided below:
file_name – filename
mine_name – mine name
summary – a single line analysis in the following format: Name, Location, Mineralization, Total mineral recovered.
location – location
first_mined – date of first mining activity. n/a if unknown
recovery – total ore recovered. 0 if unknown
primary_mineralization – primary mineral
primary_concentration – primary mineral concentration
secondary_mineralization – an array of secondary minerals identified”

The tool then outputs:

{
  "file_name": "CyrusPima528.pdf",
  "mine_name": "Cyrus Mine",
  "summary": "Cyrus Mine. Pima County, Arizona. Contact veins between limestone and porphyry, showing gossan outcrop, garnet, limonite of iron, quartz, sulfide ores assaying 6-25% copper, 4-150 oz/ton gold, 2-20 oz silver, and up to 3% molybdenum.",
  "location": "Pima County, Arizona",
  "first_mined": "n/a",
  "recovery": 0,
  "primary_mineralization": "Copper",
  "primary_concentration": "6-25%",
  "secondary_mineralization": ["Gold", "Silver", "Molybdenum"]
}

While this approach works, the model isn’t guaranteed to consistently generate valid JSON. See Forcing JSON with tool use for a more detailed example of this technique.
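
For completeness, here is a rough sketch of how tool use could be configured with the Converse API to constrain the output to the JSON structure above. The tool name and schema are illustrative, and the sketch reuses the client and document bytes from earlier; forcing the tool choice means the model returns its answer as the tool’s input, which arrives already parsed.

# Define a "tool" whose input schema mirrors the JSON structure we want.
# Forcing the model to call this tool constrains its output to valid JSON.
tool_config = {
    "tools": [
        {
            "toolSpec": {
                "name": "record_mine_summary",  # illustrative tool name
                "description": "Record a structured summary of a historical mine report.",
                "inputSchema": {
                    "json": {
                        "type": "object",
                        "properties": {
                            "file_name": {"type": "string"},
                            "mine_name": {"type": "string"},
                            "summary": {"type": "string"},
                            "location": {"type": "string"},
                            "first_mined": {"type": "string"},
                            "recovery": {"type": "number"},
                            "primary_mineralization": {"type": "string"},
                            "primary_concentration": {"type": "string"},
                            "secondary_mineralization": {
                                "type": "array",
                                "items": {"type": "string"},
                            },
                        },
                        "required": ["file_name", "mine_name", "summary"],
                    }
                },
            }
        }
    ],
    # Force the model to respond through this tool rather than free text
    "toolChoice": {"tool": {"name": "record_mine_summary"}},
}

response = bedrock_runtime.converse(
    modelId="anthropic.claude-3-sonnet-20240229-v1:0",  # placeholder model ID
    messages=[
        {
            "role": "user",
            "content": [
                {"document": {"name": "CyrusPima528", "format": "pdf",
                              "source": {"bytes": document_bytes}}},
                {"text": "Summarize the document using the record_mine_summary tool."},
            ],
        }
    ],
    toolConfig=tool_config,
)

# The structured result arrives as the tool call's input, already parsed as a dict
for block in response["output"]["message"]["content"]:
    if "toolUse" in block:
        print(block["toolUse"]["input"])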

Going forward

Using Amazon Bedrock to summarize unstructured historical records is especially useful when reviewing thousands of pages of documents, simplifying and accelerating the process of finding the information you are looking for. Going forward, we can expand the tool further through multi-modal analysis and model fine-tuning.

Generative AI presents a step change in how the mining industry can liberate valuable information from historical records. Take a look at the example notebook on GitHub and see how you can use Amazon Bedrock today.

1 http://www.spglobal.com/marketintelligence/en/news-insights/research/discovery-to-production-averages-15-7-years-for-127-mines

Benjamin Weber

Ben is a Principal Solutions Architect at AWS specializing in mining and industrial software. With over two decades of technology experience, he provides technical leadership to support the cloud journey of technology providers across industries such as mining, agriculture, semiconductors, and aerospace.