
A more manual way to potentially do what autoregression does in language models

We would have some database of documents, each of which is vectorized in some fashion.

We then present some question, which is itself vectorized. From there, we select documents by their cosine similarity to the question. The chosen documents are passed to the LLM, which generates an answer.

Cosine similarity compares the vector embeddings to see which documents are most related to the question.
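As a rough sketch of that selection step (the sentence-transformers library and the all-MiniLM-L6-v2 model are stand-ins here, not a prescribed choice), it might look something like this:

import numpy as np
from sentence_transformers import SentenceTransformer

embedder = SentenceTransformer("all-MiniLM-L6-v2")           # placeholder embedding model

documents = ["doc one ...", "doc two ...", "doc three ..."]  # the vectorized database
doc_vectors = embedder.encode(documents)                     # one vector per document

question = "How does gravity work?"
q_vector = embedder.encode([question])[0]

# cosine similarity between the question and every document
scores = doc_vectors @ q_vector / (
    np.linalg.norm(doc_vectors, axis=1) * np.linalg.norm(q_vector))
top_k = np.argsort(scores)[::-1][:2]      # indices of the most related documents
context = [documents[i] for i in top_k]   # handed to the LLM to generate an answer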

But this is not quite what we want, because there could be some combination of ideas that has not been explored yet, and so those ideas are not yet related to the question.

For example, if you restricted the data used to train a large language model to a cut-off point of, say, 1900, it would not relate the word 'relativity' to gravity. Perhaps the most related words would be Newtonian ones, because at the time that was the dominant paradigm.

In our example, we are implementing a knowledge cut-off point, say 1900. When we ask the question 'how does gravity work?', we will get a bunch of answers related to the then-dominant paradigm of gravitational physics. If our model was only trained on Newtonian physics, it would only give an answer relevant to Newtonian physics.

But we have the benefit of hindsight. We can look at the past and think 'how did they not see it?' Most of the time this is a difficulty for us, because we mistakenly project our own ways of thinking and behaving onto people in the past. But in this case I believe it can be an advantage.

Say we had a dataset of one billion documents, ranging from the year 1500 to 2023, and each document had a tag for the year it originated from. To start, we might train a model only on knowledge from the years 1500-1600. The model's knowledge representations should then reflect only that state of knowledge. When asked 'who was Isaac Newton?', the model should hallucinate. When asked 'why do things fall?', maybe some Aristotelian answer is given.
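A minimal sketch of that slicing, assuming each document is stored as a dict with a "text" field and a "year" tag (the time_slice helper and the sample entries are purely illustrative):

documents = [
    {"text": "De revolutionibus orbium coelestium ...", "year": 1543},
    {"text": "Philosophiae Naturalis Principia Mathematica ...", "year": 1687},
    {"text": "On the Electrodynamics of Moving Bodies ...", "year": 1905},
]

def time_slice(docs, start_year, end_year):
    # keep only documents that originate inside the knowledge window
    return [d for d in docs if start_year <= d["year"] <= end_year]

corpus_1500_to_1600 = time_slice(documents, 1500, 1600)   # the first model trains only on this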

However, while we have the benefit of hindsight, too much of it is not great. If we ask a question like 'How does gravity work?' to a modern LLM like ChatGPT, it will use words and phrases that we know now, but that only became known to us once those ideas came along.

So we need some way to restrict the vocabulary by time. We can use the embeddings of modern LLMs, but allow them only the vocabulary of someone in the 1500s. The LLM now has a restricted vocabulary, but the knowledge representations are still there, so it can give an answer that is explainable to the 1500s model.
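One way such a time-restricted vocabulary could be built is to collect only the tokens that actually appear in the period's documents. This is a sketch under assumptions: the Hugging Face gpt2 tokenizer stands in for the modern model's tokenizer, and period_vocabulary is a hypothetical helper.

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")   # stand-in for the modern model's tokenizer

def period_vocabulary(docs):
    # token ids that actually occur in the period's documents
    allowed = set()
    for d in docs:
        allowed.update(tokenizer.encode(d["text"]))
    return allowed

# e.g. built from the 1500-1600 slice of the corpus
vocab_1500s = period_vocabulary([{"text": "Of the motions of the celestial spheres ..."}])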

The modern embedding model with restricted vocabulary might reply to the question 'how does gravity work?' with an answer such as:

I pray this missive finds you in good health. It is my humble desire to convey to you a novel and remarkable concept pertaining to the study of gravitation. A gentleman by the name of Albert Einstein has postulated a theory which he refers to as the General Theory of Relativity. In essence, this theory purports that the presence of mass warps the fabric of space and time, which he collectively refers to as the "space-time continuum." In simpler terms, the greater the mass of an object, the more it bends the surrounding space and time, much akin to a heavy object resting upon a taut canvas.

This is like a time traveller coming back in time and advancing the intellectual knowledge of the people of the time.

Generalizing

Now, it's all well and good to be able to ask a question like that, but if we cannot generalise, it's not important. The documents restricted to some time period will have been vectorized. What is typically done here is some type of unsupervised clustering, such as grouping by cosine similarity. Instead of this, we could run an unsupervised policy search over the vectors. Without the benefit of hindsight this wouldn't be of much use, as we would only be able to label the correct answers as being relevant to the current state of knowledge.

However, when we introduce the benefit of hindsight, we can evaluate the policy that is produced against the modern LLM's answer. The policy will base its utilities on this modern answer instead of on the one from the current state of knowledge.
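One hedged sketch of such a utility: embed both the policy's candidate answer and the modern LLM's answer, and use their cosine similarity as the reward signal. The hindsight_reward helper and the embedding model are placeholders, not a prescribed design.

import numpy as np
from sentence_transformers import SentenceTransformer

embedder = SentenceTransformer("all-MiniLM-L6-v2")   # placeholder embedding model

def hindsight_reward(candidate_answer, modern_answer):
    # utility of a candidate = how close it lands to the modern LLM's answer
    a, b = embedder.encode([candidate_answer, modern_answer])
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))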

What we might see here is the development of a policy that learns some method of deduction from history. We would want it to replicate the human ability to reason. This has its limitations, since we are bound to recreate human ways of generative thinking, but we do the novel part quite well.

To do this, we could start training lots of LLMs, for example:

LLM[0] = 1500:1505

LLM[1] = 1500:1510

LLM[n] = 1500:2023
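A sketch of building that family, reusing the train_model placeholder from the example further below (train_time_family is a hypothetical helper, and the corpus format is the year-tagged one described earlier):

def train_time_family(documents, start=1500, end=2023, step=5):
    # LLM[0] sees 1500-1505, LLM[1] sees 1500-1510, ..., LLM[n] sees 1500-2023
    cutoffs = list(range(start + step, end, step)) + [end]
    models = []
    for cutoff in cutoffs:
        window = [d for d in documents if start <= d["year"] <= cutoff]
        models.append(train_model(window))
    return models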

We also need a list of important propositions which we can identify as legitimate problems that invited legitimate conjecture. We could build this by selecting a list of around 1,000 ground-breaking papers and books through the ages. The following is a broad and probably unsatisfactory example:

How can I provide a well-supported, natural explanation for the vast diversity and complexity of life on Earth, by understanding the mechanisms underlying species adaptation, variation, and ultimately, their origin?

But it should help to serve the point. The problems, framed as questions, can also be limited in time. Even the greatest of minds would have found it near impossible to invent a bicycle without first knowing what a wheel is. Therefore a good training problem would perhaps be only a few years ahead of the current state of knowledge. However, it's possible that there have been very long periods between jumps in knowledge where the prerequisites were not lacking and yet the connections were never made.

This is the proposal for my research.

Example scenario

  • We take some number of documents written between the years 1900 and 1905
  • We build an embedding model for all of the documents in that temporal dataset.
  • The vectors represent the knowledge from that dataset.
  • A selection of important questions is compiled from a time period just beyond the model's knowledge window.
  • The questions are asked using the vocabulary of the knowledge representation from 1900 to 1905.
  • At this point we have some question like 'why does gravity act this way in certain circumstances?' An LLM trained only on that data will probably come up with some hallucination as to why that is the case.
  • The trick is trying to get the model to make the jump: it has to formulate a conjecture to answer the question. It is faced with a problem it does not know the solution to, and we have to convince the model to come up with some novel solution.
  • Getting the LLM to hallucinate a solution that is actually correct is quite difficult. Given the massive number of combinations of ideas to draw from, it will either keep selecting the ideas closest in proximity to the question or just spew nonsense.
  • However, an explanation can be given in the vocabulary of the time-restricted model. We could have some legitimate questions within the time window whose answers had not yet been resolved. Say we have the question 'how do you account for the variety of the species?' → in the corpus of knowledge, the connections needed to answer it have not yet been made.
  • However, if a large language model trained on data from 1900 → 1910 contains the relevant theory to answer the question, then we can give some answer. This answer can be given in a restricted vocabulary; the knowledge representations are still in the embeddings. It's like saying 'explain the concept of X without using the words [A, B, C]'.
  • The answer to the question is synthetically generated, or found, as if it came from a time traveller. That answer will have some specific vectorization.
  • With the vectorized answer you can find the cosine similarity within the documents.
  • These documents can then be given to the time-restricted AI as context. We then try to optimize the LLM to produce an answer like the real answer, using the documents given as context (see the sketch after this list).
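Pulling the scenario together, a hedged end-to-end sketch might look like this. Every name here (hindsight_step, time_restricted_llm, the embedding model) is a placeholder rather than a real implementation:

import numpy as np
from sentence_transformers import SentenceTransformer

embedder = SentenceTransformer("all-MiniLM-L6-v2")   # placeholder embedding model

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def hindsight_step(question, period_docs, time_traveller_answer, time_restricted_llm, k=5):
    # 1. vectorize the "time traveller" answer (already in period vocabulary)
    answer_vec = embedder.encode([time_traveller_answer])[0]

    # 2. find the period documents most similar to that answer
    doc_vecs = embedder.encode([d["text"] for d in period_docs])
    top_k = np.argsort([cosine(v, answer_vec) for v in doc_vecs])[::-1][:k]
    context = [period_docs[i]["text"] for i in top_k]

    # 3. the time-restricted model answers with that context; training would
    #    push this output toward the time traveller's answer
    return time_restricted_llm(question, context)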

Example

We are going to call our first model Rontgen, after Wilhelm Rontgen, the winner of the 1901 Nobel Prize in Physics.

documents_1900_to_1901 = [
    {"text": "This is some text"},
    {"text": "This is some text"},
    {"text": "This is some text"},
    {"text": "This is some text"},
]
rontgen = train_model(documents_1900_to_1901)  # LLM trained on time-period data only

From here we could use the time-restricted LLM, but it would not be of much use, because its ability to generate a correct solution to the question is limited. It would of course be interesting to ask a time-travelling LLM these questions, because it does not suffer from some of the problems that an LLM attempting to act from the past does. If you ask GPT-4 to talk like a medieval peasant, it's like a US actor in the 1950s trying to do an Irish accent: perhaps convincing to Americans, but not convincing to an Irish person.

Rontgen is restricted to the years 1900 to 1901 and has a time-limited vocabulary. The time-limited vocabulary is important.

Say we take some more documents and train a new LLM, this time called Planck. Planck is an LLM trained on data from 1900 to 1918. When you ask Rontgen a question it does not know the answer to, it will hallucinate. But the same question, asked of Planck, will produce a coherent answer. This coherence represents a legitimate advancement in knowledge that is captured in the embeddings and the LLM. Perhaps the Planck model contains new words that are not in the Rontgen model. If you tried to introduce the Planck terminology to the Rontgen knowledge representation, it would be related, but only loosely. But you can restrict the Planck model's vocabulary. The Planck LLM is then forced to explain an idea in its embedding space with a limited vocabulary. This is similar to getting ChatGPT to 'explain it to me like I'm five'. The model is restricted, but should still be able to successfully explain an idea within those limitations.

# We generate an answer
answer = planck_answer_with_vocab_restriction(question, rontgen_vocab)
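One plausible way such a function could work is to mask Planck's output distribution so that only tokens in Rontgen's vocabulary can ever be generated. This is a sketch under assumptions: the gpt2 checkpoint stands in for the Planck model, the tokenizer is assumed to be shared, and the constraint is applied through Hugging Face's prefix_allowed_tokens_fn hook. In practice rontgen_vocab would be the set of token ids attested in the 1900-1901 corpus, built the same way as the period vocabulary sketched earlier.

from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")      # shared tokenizer (assumed)
planck = AutoModelForCausalLM.from_pretrained("gpt2")  # stands in for the 1900-1918 model

def planck_answer_with_vocab_restriction(question, rontgen_vocab, max_new_tokens=200):
    allowed_ids = sorted(rontgen_vocab)   # only tokens attested in Rontgen's period

    def restrict(batch_id, input_ids):
        # whitelist applied at every generation step
        return allowed_ids

    inputs = tokenizer(question, return_tensors="pt")
    output = planck.generate(
        **inputs,
        max_new_tokens=max_new_tokens,
        prefix_allowed_tokens_fn=restrict,
    )
    return tokenizer.decode(output[0][inputs["input_ids"].shape[1]:])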