Digital Immortality: Teaching AI to Be Your Future Self
Disclaimer: This article focuses on the practical side of building a “Digital Identity”. The ethical, philosophical, and social questions are broad topics that fall outside its scope.
Have you ever considered having a digital representation or a virtual double in the AI world?
AIG is not implemented yet as of November 2024, but the big LLMs like GPT-4, Claude, and Gemini are already smart enough to support a meaningful conversation, so it is time to think about your digital identity, which can be used in several diverse, ordinary, and slightly strange ways at the moment:
1. To be your virtual representative and day-to-day assistant: a second identity that helps with the daily routine such as email conversations, scheduling, and basic finance management, and acts as a personal adviser who knows you well enough to give valuable advice.
2. To be your footprint for future generations: your great-great-grandchildren could chat with your digital identity, learn more about you, and keep the link between generations alive.
How can this happen in practice?
If we look at how modern AI/ML science has evolved, there are a few different approaches to implementing a Digital Identity:
RAG. Modern AI language models can now be augmented with specialized knowledge through a technique called RAG (Retrieval-Augmented Generation). This approach combines the model’s inherent capabilities with an external knowledge base stored in a vector database. The knowledge base contains information, essentially your personal knowledge, your “personality”, broken down into digestible chunks of text, which are converted into mathematical representations called embeddings. When the AI needs to access specific information or maintain consistent personality traits, it can retrieve and reference the relevant knowledge snippets from the database. This architecture allows the AI to draw upon a curated pool of knowledge while maintaining the fluidity and natural language understanding of large language models. Think of it as giving the AI instant access to a personal library of information about you, which it can seamlessly integrate into its responses. For example, the knowledge can be structured as “Question”/“Answer” pairs, or it can be unstructured text such as your diary or autobiography, or even video and audio for multimodal AI models:
{
"Question": "What are my favorite food?",
"Answer": " I like BBQ, Texas style steaks"
}
Here is a very high-level diagram of the RAG architecture (local or cloud based):
The user creates a prompt, the query is enriched with relevant context from the Knowledge Base, and the LLM responds to the prompt using that context.
Example of an enriched request (a minimal code sketch of this flow follows the example):
{
Answer the question: "Do you like steaks?" using this context: "I like BBQ, Texas style steaks"
}
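To make this flow concrete, here is a minimal, illustrative RAG sketch in Python. It assumes the sentence-transformers and chromadb libraries; the knowledge snippets and the call_llm() helper are hypothetical placeholders for whatever data and LLM API you actually use.

```python
# Minimal RAG sketch: embed personal knowledge chunks, retrieve the relevant ones,
# and enrich the prompt before calling an LLM.
# Assumes the sentence-transformers and chromadb packages; knowledge snippets and
# the call_llm() helper are hypothetical placeholders.
import chromadb
from sentence_transformers import SentenceTransformer

embedder = SentenceTransformer("all-MiniLM-L6-v2")   # turns text chunks into embeddings
kb = chromadb.Client().create_collection("personal_knowledge")

# 1. Load the "personality" knowledge base as digestible chunks.
chunks = [
    "I like BBQ, Texas style steaks",
    "I prefer short, informal emails",
    "My favorite hobby is landscape photography",
]
kb.add(ids=[str(i) for i in range(len(chunks))],
       documents=chunks,
       embeddings=embedder.encode(chunks).tolist())

def call_llm(prompt: str) -> str:
    # Placeholder: swap in your real LLM API call (OpenAI, Anthropic, a local model...).
    return f"[LLM answer for prompt: {prompt}]"

def ask(question: str) -> str:
    # 2. Retrieve the snippets most relevant to the question.
    hits = kb.query(query_embeddings=embedder.encode([question]).tolist(), n_results=2)
    context = " ".join(hits["documents"][0])
    # 3. Enrich the prompt with the retrieved context, exactly as in the example above.
    return call_llm(f'Answer the question: "{question}" using this context: "{context}"')

print(ask("Do you like steaks?"))
```

The same pattern scales from a handful of snippets to thousands of chunks taken from diaries, emails, or interview answers; only the content of the knowledge base changes.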
Fine-tuning represents a powerful approach to customizing large language models (LLMs) for specific use cases. This process involves additional training of an existing model using your specialized dataset, essentially teaching the model to become an expert in your domain.
Key Advantages:
- Streamlined Architecture: Unlike RAG systems, fine-tuned models don’t require external knowledge bases or vector databases, as the domain knowledge becomes integrated into the model’s parameters
- Enhanced Performance: When trained on high-quality, domain-specific data, these models can deliver more accurate and contextually appropriate responses
- Faster Inference: Without the need to query external databases, fine-tuned models can potentially offer quicker response times
Technical Challenges: The fine-tuning process, while powerful, comes with significant technical demands:
- Substantial Computing Resources: Training modern LLMs requires access to high-performance hardware, including specialized GPUs and large amounts of memory
- Storage Infrastructure: Managing the training process necessitates robust storage systems to handle massive datasets
- Technical Expertise: Success depends heavily on machine learning expertise, including understanding of model architecture, training dynamics, and optimization techniques
- Cost Considerations: The computational intensity of fine-tuning can translate into significant infrastructure and operational expenses
While fine-tuning offers a path to superior performance for specific applications, organizations must carefully weigh these technical requirements against their available resources and expertise.
With this approach we take an already pre-trained model and continue training it on a dataset of personal details, so we are not training a brand-new model from scratch but simply adding a “personality” to an existing one.
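As a rough illustration only, a parameter-efficient fine-tuning (LoRA) run over such a personal dataset could look like the sketch below. The base model name, file name, and hyperparameters are assumptions, and it relies on the Hugging Face transformers, peft, and datasets libraries; a real setup would add evaluation, checkpointing, and careful data cleaning.

```python
# Illustrative LoRA fine-tuning sketch (not a production recipe).
# Assumptions: a small open base model, a local personal_dataset.jsonl file with a
# "text" field per line, and the Hugging Face transformers / peft / datasets libraries.
import torch
from datasets import load_dataset
from peft import LoraConfig, get_peft_model
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer, TrainingArguments)

BASE = "meta-llama/Llama-3.2-1B"            # assumed base model; any causal LM works
tok = AutoTokenizer.from_pretrained(BASE)
tok.pad_token = tok.eos_token               # Llama-style tokenizers have no pad token

model = AutoModelForCausalLM.from_pretrained(BASE, torch_dtype=torch.bfloat16)
# Train only small LoRA adapter matrices instead of all model parameters.
model = get_peft_model(model, LoraConfig(r=16, lora_alpha=32,
                                         target_modules=["q_proj", "v_proj"],
                                         task_type="CAUSAL_LM"))

# Each line: {"text": "Question: ...\nAnswer: ..."} built from the personal dataset.
ds = load_dataset("json", data_files="personal_dataset.jsonl", split="train")
ds = ds.map(lambda ex: tok(ex["text"], truncation=True, max_length=512),
            remove_columns=ds.column_names)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="digital-identity-lora",
                           num_train_epochs=3,
                           per_device_train_batch_size=2,
                           learning_rate=2e-4),
    train_dataset=ds,
    data_collator=DataCollatorForLanguageModeling(tok, mlm=False),
)
trainer.train()
model.save_pretrained("digital-identity-lora")   # saves only the small adapter weights
```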
Preparing the Dataset
One of the main questions is what the dataset must contain for the training to produce an adequate result.
Let's look at the options we have and the techniques we can use.
Explicit Consent & Data Collection:
- Ask users to voluntarily share their preferences and information
- Use transparent surveys and questionnaires
- Allow users to review and control what data is stored
Classification of data to be collected
1. Professional Identity:
- Career background and expertise areas
- Industry specializations
- Professional goals and aspirations
- Preferred communication style in work context
- Common work-related tasks and challenges
2. Personal Identity:
- Family details
- Personal preferences (food, sports, hobbies, etc.)
- Behavioral preferences (parties, travel, etc.)
- Personal history details
- etc
3. Learning & Problem-Solving:
- Preferred learning style (visual, text, interactive)
- Knowledge areas of interest
- Decision-making approach
- Problem-solving methodology preferences
- Information processing style (detail vs big picture)
4. Communication Preferences:
- Preferred level of formality
- Response length preferences (concise vs detailed)
- Preferred explanation style (examples, analogies, step-by-step)
- Language preferences
- Feedback style preferences
5. Digital Interaction Style:
- Preferred content formats
- Task organization methods
- Technology comfort level
- Tool preferences
- Response time expectations
6. Personal Development Goals:
- Skills wanting to develop
- Topics interested in learning
- Personal growth objectives
- Productivity preferences
- Time management style
Best practices for collecting a personal dataset (a small sketch for turning the collected answers into a training file follows this list):
- Make all questions optional
- Explain how data will be used
- Allow users to update preferences
- Provide clear privacy policies
- Include data deletion options
- Start with essential questions and allow progressive profiling
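As a small illustration of what this collection can produce, the sketch below turns optional questionnaire answers into a JSONL file; the file name, categories, and record format are assumptions, but the same file can serve both as RAG knowledge chunks and as fine-tuning input.

```python
# Illustrative sketch: turn (optional) questionnaire answers into a JSONL training file.
# The file name, categories, and record format are assumptions, not a fixed standard.
import json

profile_answers = [
    {"category": "Personal Identity",
     "question": "What is my favorite food?",
     "answer": "I like BBQ, Texas style steaks"},
    {"category": "Communication Preferences",
     "question": "How formal should my emails be?",
     "answer": "Friendly but professional, short paragraphs"},
]

with open("personal_dataset.jsonl", "w", encoding="utf-8") as f:
    for item in profile_answers:
        if not item.get("answer"):        # all questions are optional, skip empty ones
            continue
        record = {"category": item["category"],
                  "text": f"Question: {item['question']}\nAnswer: {item['answer']}"}
        f.write(json.dumps(record, ensure_ascii=False) + "\n")
```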
Security warning: since you are collecting private information, you should be extra careful with data protection. Never share the dataset, Knowledge Base, or fine-tuned models with untrusted third parties, and encrypt data at rest and in transit using strong cryptography (e.g., AES, RSA, SSL/TLS/HTTPS).
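For encryption at rest, one simple option is the Fernet recipe (AES based) from the Python cryptography package. This is a minimal sketch with placeholder file names; key management (secrets manager, rotation, access control) is deliberately left out.

```python
# Sketch: encrypt the personal dataset at rest with Fernet (AES based) from the
# `cryptography` package. Key management and file names are placeholders.
from cryptography.fernet import Fernet

key = Fernet.generate_key()      # store in a secrets manager, never alongside the data
cipher = Fernet(key)

with open("personal_dataset.jsonl", "rb") as f:
    encrypted = cipher.encrypt(f.read())

with open("personal_dataset.jsonl.enc", "wb") as f:
    f.write(encrypted)

# Decrypt only inside a trusted environment:
plaintext = cipher.decrypt(encrypted)
```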
Why can't I “train” the LLM just by using a prompt?
At the moment the main limitation of modern LLMs is the relatively short context window. Let's compare the most advanced models:
- Claude 3 Opus: 200K tokens
- Claude 3 Sonnet: 200K tokens
- Anthropic Claude 2.1: 200K tokens
- Google Gemini Ultra: 128K tokens
- GPT-4 Turbo: 128K tokens
- Mistral Large: 128K tokens
Let's break down the approximate text capacity of 200K tokens in English:
General Approximations for 200K tokens:
- Words: ~150,000 words (the token-to-word ratio in English is roughly 4 tokens per 3 words)
This translates to approximately:
- 300–400 pages of a typical book
- 500–600 pages of regular typed text
- A small novel
- ~150 academic papers (1000 words each)
- ~100 detailed blog posts
Common Document Equivalents:
- 50–75 research papers with references
- 3–4 PhD theses
- Several technical documentation manuals
Practical Examples:
- The complete Shakespeare play “Hamlet” (~30K words)
- First two Harry Potter books combined
It seems not so bad, but in practice chatbots like ChatGPT send the entire chat history with every prompt. This is a memory-hungry operation: the context window fills up very quickly, you have to start a new chat, and the chat history is lost. So, at least with the current state of LLMs, the context is still pretty small for a detailed Personal Identity Profile.
Let me explain why even a large 200K token context window can fill up quickly during chat interactions:
Complete Conversation Storage:
- Each message includes metadata and formatting
- Both user and assistant messages are stored
- Timestamps and session information
- System prompts and instructions
- Model’s personality and behavior guidelines
- Previous context and referenced information
Assistant’s Internal Processing:
- Detailed reasoning steps (even if not shown)
- Multiple iterations of thought processes
- Analysis of previous context
- Maintaining consistency with past responses
- Tracking conversation thread and topics
Real-world Chat Volume Example:
Per typical conversation turn:
- User message: 50-100 tokens
- Assistant thinking: 100-200 tokens
- Assistant response: 200-500 tokens
- Metadata/formatting: 50-100 tokens
Total per turn: ~400-900 tokens
In active usage:
- 30 conversation turns per hour
- 8 hours of active chat
- ~(400-900) × 30 × 8 = 96,000-216,000 tokens
Additional Overhead:
- Quoted text and references
- Examples and explanations
- Lists and structured data
- Embedded instructions and guidelines
This shows how even a “day’s worth” of meaningful conversation can easily exceed 200K tokens, especially in professional or technical discussions where context preservation is crucial.
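You can sanity-check this arithmetic yourself. The sketch below uses OpenAI's tiktoken tokenizer to count tokens in real text and then models how a 200K-token window fills up when the whole history is resent each turn; the per-turn size is the rough midpoint of the estimates above.

```python
# Sketch: why a 200K-token window fills up when the whole history is resent each turn.
# Assumes the tiktoken package; per-turn size is the rough midpoint of the estimate above.
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")
print(len(enc.encode("Do you like steaks?")))     # token count of a real message

CONTEXT_WINDOW = 200_000
TOKENS_PER_TURN = 650            # midpoint of the ~400-900 tokens-per-turn estimate

history_tokens = 0
for turn in range(1, 1000):
    history_tokens += TOKENS_PER_TURN            # the whole history is sent again next turn
    if history_tokens > CONTEXT_WINDOW:
        print(f"History no longer fits after {turn} turns "
              f"(~{turn / 30:.1f} hours at 30 turns per hour)")
        break
```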
Summary
As of today, it is technically possible to create a functional representation of your Digital Identity or a digital twin (via Knowledge Base or fine-tuned LLM model as shown in the article). However, your digital counterpart wouldn’t possess the ability to think in the way we understand the process or make groundbreaking innovations. Still, this marks a significant first step toward what we might call a “digital super-you” or “your AI-generated self (AIG).”
The primary challenges are clear:
- Developing a powerful and intelligent large language model (LLM or AIG).
- Gathering a comprehensive and complete dataset that encapsulates a 360-degree view of your personality.
Let’s briefly delve into the challenges surrounding the dataset:
1. Complexity of Personality
- Human personality is dynamic and multifaceted, influenced by genetics, environment, experiences, emotions, and social contexts.
- Capturing all aspects of personality would require data about every thought, emotion, decision, and behavior, as well as their underlying causes, which is both technically and ethically unfeasible.
2. Limitations of Data Collection
- Data diversity: Human behavior changes across different situations, times, and cultures. No dataset can comprehensively account for all contexts.
- Measurement challenges: While tools like psychometric tests, social media behavior analysis, and biometrics can provide insights, they are incomplete and may introduce biases or inaccuracies.
- Privacy concerns: Collecting deeply personal data raises significant ethical and legal issues, such as consent, misuse, and potential harm.
3. Neuroscience and Technology Gaps
- The brain is still not fully understood. Although technologies like fMRI, EEG, and brain-computer interfaces provide insights into neural activity, they can’t fully decode thoughts or emotions.
- AI models can simulate personality traits or behaviors based on available data, but they lack the subjective experiences and consciousness of a human being.
4. Ethical Considerations
- Collecting a “complete dataset” would involve intrusive monitoring, violating fundamental rights to privacy and autonomy.
- Representation of such data risks misuse, such as manipulation or exploitation, leading to ethical dilemmas.
Current Approximations
While a true 360-degree dataset is unattainable, partial representations can be built using:
- Psychometric assessments: Tools like the Big Five Personality Test or MBTI.
- Digital footprints: Social media activity, browsing history, or purchasing patterns.
- AI models: Simulations based on observed behaviors or input data (e.g., chatbots mimicking communication styles).
The emergence of highly personalized artificial intelligence, or ‘digital twins,’ appears to be an imminent reality rather than a distant possibility. As these AI counterparts become more sophisticated and prevalent, society faces a crucial challenge: establishing frameworks for meaningful and secure interactions between humans and their digital alter egos.
This technological evolution will require humanity to develop new paradigms for:
- Understanding the boundaries between human and artificial intelligence
- Creating safe and ethical guidelines for human-AI relationships
- Maintaining privacy and security in an interconnected world
- Balancing the benefits of personalized AI with potential risks
- Building trust while preserving human autonomy
As we stand on the threshold of this transformation, proactively addressing these considerations becomes essential for ensuring a harmonious integration of digital twins into our daily lives.