Digital Immortality: Teaching AI to Be Your Future Self
Disclaimer: This article focuses on the practical side of building a “Digital Identity”. The ethical, philosophical, and social questions are broad topics that fall outside its scope.
Have you ever considered having a digital representation or a virtual double in the AI world?
AIG is not implemented yet as of November 2024, but the big LLMs like GPT-4, Claude, and Gemini are already smart enough to support a meaningful conversation, so it is time to think about your digital identity, which can be used in several diverse, ordinary, and slightly strange ways at the moment:
1. To be your virtual representative and day-to-day assistant: a second identity that helps with the daily routine such as email conversations, scheduling, and basic finance management, and acts as a personal adviser who knows you well enough to give valuable advice.
2. To be your footprint for future generations: your great-great-grandchildren could chat with your digital identity, learn more about you, and keep the link between generations alive.
How can this happen in practice?
If we look at how modern AI/ML science has evolved, there are a few different approaches to implementing a Digital Identity:
RAG. Modern AI language models can now be augmented with specialized knowledge through a technique called RAG (Retrieval-Augmented Generation). This approach combines the model’s inherent capabilities with an external knowledge base stored in a vector database. The knowledge base contains information, essentially your personal knowledge, your “personality”, broken down into digestible chunks of text, which are converted into mathematical representations called embeddings. When the AI needs to access specific information or maintain consistent personality traits, it can retrieve and reference the relevant knowledge snippets from the database. This architecture allows the AI to draw upon a curated pool of knowledge while maintaining the fluidity and natural language understanding of large language models. Think of it as giving the AI instant access to a personal library of information about you, which it can seamlessly integrate into its responses. For example, the knowledge can be structured as “Question”/“Answer” pairs, or it can be unstructured text such as your diary or autobiography, or even video and audio for multimodal AI models:
{
"Question": "What are my favorite food?",
"Answer": " I like BBQ, Texas style steaks"
}
Here is a very high-level diagram of the RAG architecture (local or cloud based):
The user creates a prompt, the query is enriched with relevant context from the Knowledge Base, and the LLM responds to the prompt using that context.
Example of an enriched request (a minimal code sketch of this flow follows the example):
{
Answer the question: "Do you like steaks?" using this context: "I like BBQ, Texas style steaks"
}
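To make this flow concrete, here is a minimal, illustrative RAG sketch in Python. It assumes the sentence-transformers and chromadb libraries; the knowledge snippets and the call_llm() helper are hypothetical placeholders for whatever data and LLM API you actually use.

```python
# Minimal RAG sketch: embed personal knowledge chunks, retrieve the relevant ones,
# and enrich the prompt before calling an LLM.
# Assumes the sentence-transformers and chromadb packages; knowledge snippets and
# the call_llm() helper are hypothetical placeholders.
import chromadb
from sentence_transformers import SentenceTransformer

embedder = SentenceTransformer("all-MiniLM-L6-v2")   # turns text chunks into embeddings
kb = chromadb.Client().create_collection("personal_knowledge")

# 1. Load the "personality" knowledge base as digestible chunks.
chunks = [
    "I like BBQ, Texas style steaks",
    "I prefer short, informal emails",
    "My favorite hobby is landscape photography",
]
kb.add(ids=[str(i) for i in range(len(chunks))],
       documents=chunks,
       embeddings=embedder.encode(chunks).tolist())

def call_llm(prompt: str) -> str:
    # Placeholder: swap in your real LLM API call (OpenAI, Anthropic, a local model...).
    return f"[LLM answer for prompt: {prompt}]"

def ask(question: str) -> str:
    # 2. Retrieve the snippets most relevant to the question.
    hits = kb.query(query_embeddings=embedder.encode([question]).tolist(), n_results=2)
    context = " ".join(hits["documents"][0])
    # 3. Enrich the prompt with the retrieved context, exactly as in the example above.
    return call_llm(f'Answer the question: "{question}" using this context: "{context}"')

print(ask("Do you like steaks?"))
```

The same pattern scales from a handful of snippets to thousands of chunks taken from diaries, emails, or interview answers; only the content of the knowledge base changes.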
Fine-tuning represents a powerful approach to customizing large language models (LLMs) for specific use cases. This process involves additional training of an existing model using your specialized dataset, essentially teaching the model to become an expert in your domain.
Key Advantages:
- Streamlined Architecture: Unlike RAG systems, fine-tuned models don’t require external knowledge bases or vector databases, as the domain knowledge becomes integrated into the model’s parameters
- Enhanced Performance: When trained on high-quality, domain-specific data, these models can deliver more accurate and contextually appropriate responses
- Faster Inference: Without the need to query external databases, fine-tuned models can potentially offer quicker response times
Technical Challenges: The fine-tuning process, while powerful, comes with significant technical demands:
- Substantial Computing Resources: Training modern LLMs requires access to high-performance hardware, including specialized GPUs and large amounts of memory
- Storage Infrastructure: Managing the training process necessitates robust storage systems to handle massive datasets
- Technical Expertise: Success depends heavily on machine learning expertise, including understanding of model architecture, training dynamics, and optimization techniques
- Cost Considerations: The computational intensity of fine-tuning can translate into significant infrastructure and operational expenses
While fine-tuning offers a path to superior performance for specific applications, organizations must carefully weigh these technical requirements against their available resources and expertise.
With this approach we take an already pre-trained model and continue training it on a dataset of personal details, so we are not training a brand-new model from scratch but simply adding a “personality” to an existing one.
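As a rough illustration only, a parameter-efficient fine-tuning (LoRA) run over such a personal dataset could look like the sketch below. The base model name, file name, and hyperparameters are assumptions, and it relies on the Hugging Face transformers, peft, and datasets libraries; a real setup would add evaluation, checkpointing, and careful data cleaning.

```python
# Illustrative LoRA fine-tuning sketch (not a production recipe).
# Assumptions: a small open base model, a local personal_dataset.jsonl file with a
# "text" field per line, and the Hugging Face transformers / peft / datasets libraries.
import torch
from datasets import load_dataset
from peft import LoraConfig, get_peft_model
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer, TrainingArguments)

BASE = "meta-llama/Llama-3.2-1B"            # assumed base model; any causal LM works
tok = AutoTokenizer.from_pretrained(BASE)
tok.pad_token = tok.eos_token               # Llama-style tokenizers have no pad token

model = AutoModelForCausalLM.from_pretrained(BASE, torch_dtype=torch.bfloat16)
# Train only small LoRA adapter matrices instead of all model parameters.
model = get_peft_model(model, LoraConfig(r=16, lora_alpha=32,
                                         target_modules=["q_proj", "v_proj"],
                                         task_type="CAUSAL_LM"))

# Each line: {"text": "Question: ...\nAnswer: ..."} built from the personal dataset.
ds = load_dataset("json", data_files="personal_dataset.jsonl", split="train")
ds = ds.map(lambda ex: tok(ex["text"], truncation=True, max_length=512),
            remove_columns=ds.column_names)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="digital-identity-lora",
                           num_train_epochs=3,
                           per_device_train_batch_size=2,
                           learning_rate=2e-4),
    train_dataset=ds,
    data_collator=DataCollatorForLanguageModeling(tok, mlm=False),
)
trainer.train()
model.save_pretrained("digital-identity-lora")   # saves only the small adapter weights
```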
Preparing the Dataset
One of the main questions is what the dataset must contain for the training to produce an adequate result.
Let's look at the options we have and the techniques we can use.
Explicit Consent & Data Collection:
- Ask users to voluntarily share their preferences and information
- Use transparent surveys and questionnaires
- Allow users to review and control what data is stored
Classification of data to be collected
1. Professional Identity:
- Career background and expertise areas
- Industry specializations
- Professional goals and aspirations
- Preferred communication style in work context
- Common work-related tasks and challenges
2. Personal Identity:
- Family details
- Personal preferences (food, sports, hobbies, etc.)
- Behavioral preferences (parties, travel, etc.)
- Personal history details
- etc
3. Learning & Problem-Solving:
- Preferred learning style (visual, text, interactive)
- Knowledge areas of interest
- Decision-making approach
- Problem-solving methodology preferences
- Information processing style (detail vs big picture)
4. Communication Preferences:
- Preferred level of formality
- Response length preferences (concise vs detailed)
- Preferred explanation style (examples, analogies, step-by-step)
- Language preferences
- Feedback style preferences
5. Digital Interaction Style:
- Preferred content formats
- Task organization methods
- Technology comfort level
- Tool preferences
- Response time expectations
6. Personal Development Goals:
- Skills wanting to develop
- Topics interested in learning
- Personal growth objectives
- Productivity preferences
- Time management style
Best practices for collecting a personal dataset (a small sketch for turning the collected answers into a training file follows this list):
- Make all questions optional
- Explain how data will be used
- Allow users to update preferences
- Provide clear privacy policies
- Include data deletion options
- Start with essential questions and allow progressive profiling
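As a small illustration of what this collection can produce, the sketch below turns optional questionnaire answers into a JSONL file; the file name, categories, and record format are assumptions, but the same file can serve both as RAG knowledge chunks and as fine-tuning input.

```python
# Illustrative sketch: turn (optional) questionnaire answers into a JSONL training file.
# The file name, categories, and record format are assumptions, not a fixed standard.
import json

profile_answers = [
    {"category": "Personal Identity",
     "question": "What is my favorite food?",
     "answer": "I like BBQ, Texas style steaks"},
    {"category": "Communication Preferences",
     "question": "How formal should my emails be?",
     "answer": "Friendly but professional, short paragraphs"},
]

with open("personal_dataset.jsonl", "w", encoding="utf-8") as f:
    for item in profile_answers:
        if not item.get("answer"):        # all questions are optional, skip empty ones
            continue
        record = {"category": item["category"],
                  "text": f"Question: {item['question']}\nAnswer: {item['answer']}"}
        f.write(json.dumps(record, ensure_ascii=False) + "\n")
```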
Security warning: since you are collecting private information, you should be extra careful with data protection. Never share the dataset, Knowledge Base, or fine-tuned models with untrusted third parties, and encrypt data at rest and in transit using strong cryptography (e.g., AES, RSA, SSL/TLS/HTTPS).
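For encryption at rest, one simple option is the Fernet recipe (AES based) from the Python cryptography package. This is a minimal sketch with placeholder file names; key management (secrets manager, rotation, access control) is deliberately left out.

```python
# Sketch: encrypt the personal dataset at rest with Fernet (AES based) from the
# `cryptography` package. Key management and file names are placeholders.
from cryptography.fernet import Fernet

key = Fernet.generate_key()      # store in a secrets manager, never alongside the data
cipher = Fernet(key)

with open("personal_dataset.jsonl", "rb") as f:
    encrypted = cipher.encrypt(f.read())

with open("personal_dataset.jsonl.enc", "wb") as f:
    f.write(encrypted)

# Decrypt only inside a trusted environment:
plaintext = cipher.decrypt(encrypted)
```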
Why can't I “train” the LLM just by using a prompt?
At the moment the main limitation of modern LLMs is the relatively short context window. Let's compare the most advanced models:
- Claude 3 Opus: 200K tokens
- Claude 3 Sonnet: 200K tokens
- Anthropic Claude 2.1: 200K tokens
- Google Gemini Ultra: 128K tokens
- GPT-4 Turbo: 128K tokens
- Mistral Large: 128K tokens
Let's break down the approximate text capacity of 200K tokens in English:
General Approximations for 200K tokens:
- Words: ~150,000 words (the token-to-word ratio in English is roughly 4 tokens per 3 words)
This translates to approximately:
- 300–400 pages of a typical book
- 500–600 pages of regular typed text
- A small novel
- ~150 academic papers (1000 words each)
- ~100 detailed blog posts
Common Document Equivalents:
- 50–75 research papers with references
- 3–4 PhD theses
- Several technical documentation manuals
Practical Examples:
- The complete Shakespeare play “Hamlet” (~30K words)
- First two Harry Potter books combined
It seems not so bad, but in practice chatbots like ChatGPT send the entire chat history with every prompt. This is a memory-hungry operation: the context window fills up very quickly, you have to start a new chat, and the chat history is lost. So, at least with the current state of LLMs, the context is still pretty small for a detailed Personal Identity Profile.
Let me explain why even a large 200K token context window can fill up quickly during chat interactions:
Complete Conversation Storage:
- Each message includes metadata and formatting
- Both user and assistant messages are stored
- Timestamps and session information
- System prompts and instructions
- Model’s personality and behavior guidelines
- Previous context and referenced information
Assistant’s Internal Processing:
- Detailed reasoning steps (even if not shown)
- Multiple iterations of thought processes
- Analysis of previous context
- Maintaining consistency with past responses
- Tracking conversation thread and topics
Real-world Chat Volume Example:
Per typical conversation turn:
- User message: 50-100 tokens
- Assistant thinking: 100-200 tokens
- Assistant response: 200-500 tokens
- Metadata/formatting: 50-100 tokens
Total per turn: ~400-900 tokens
In active usage:
- 30 conversation turns per hour
- 8 hours of active chat
- ~(400-900) × 30 × 8 = 96,000-216,000 tokens
Additional Overhead:
- Quoted text and references
- Examples and explanations
- Lists and structured data
- Embedded instructions and guidelines
This shows how even a “day’s worth” of meaningful conversation can easily exceed 200K tokens, especially in professional or technical discussions where context preservation is crucial.
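You can sanity-check this arithmetic yourself. The sketch below uses OpenAI's tiktoken tokenizer to count tokens in real text and then models how a 200K-token window fills up when the whole history is resent each turn; the per-turn size is the rough midpoint of the estimates above.

```python
# Sketch: why a 200K-token window fills up when the whole history is resent each turn.
# Assumes the tiktoken package; per-turn size is the rough midpoint of the estimate above.
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")
print(len(enc.encode("Do you like steaks?")))     # token count of a real message

CONTEXT_WINDOW = 200_000
TOKENS_PER_TURN = 650            # midpoint of the ~400-900 tokens-per-turn estimate

history_tokens = 0
for turn in range(1, 1000):
    history_tokens += TOKENS_PER_TURN            # the whole history is sent again next turn
    if history_tokens > CONTEXT_WINDOW:
        print(f"History no longer fits after {turn} turns "
              f"(~{turn / 30:.1f} hours at 30 turns per hour)")
        break
```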
Summary
As of today, it is technically possible to create a functional representation of your Digital Identity or a digital twin (via Knowledge Base or fine-tuned LLM model as shown in the article). However, your digital counterpart wouldn’t possess the ability to think in the way we understand the process or make groundbreaking innovations. Still, this marks a significant first step toward what we might call a “digital super-you” or “your AI-generated self (AIG).”
The primary challenges are clear:
- Developing a powerful and intelligent large language model (LLM or AIG).
- Gathering a comprehensive and complete dataset that encapsulates a 360-degree view of your personality.
Let’s briefly delve into the challenges surrounding the dataset:
1. Complexity of Personality
- Human personality is dynamic and multifaceted, influenced by genetics, environment, experiences, emotions, and social contexts.
- Capturing all aspects of personality would require data about every thought, emotion, decision, and behavior, as well as their underlying causes, which is both technically and ethically unfeasible.
2. Limitations of Data Collection
- Data diversity: Human behavior changes across different situations, times, and cultures. No dataset can comprehensively account for all contexts.
- Measurement challenges: While tools like psychometric tests, social media behavior analysis, and biometrics can provide insights, they are incomplete and may introduce biases or inaccuracies.
- Privacy concerns: Collecting deeply personal data raises significant ethical and legal issues, such as consent, misuse, and potential harm.
3. Neuroscience and Technology Gaps
- The brain is still not fully understood. Although technologies like fMRI, EEG, and brain-computer interfaces provide insights into neural activity, they can’t fully decode thoughts or emotions.
- AI models can simulate personality traits or behaviors based on available data, but they lack the subjective experiences and consciousness of a human being.
4. Ethical Considerations
- Collecting a “complete dataset” would involve intrusive monitoring, violating fundamental rights to privacy and autonomy.
- Representation of such data risks misuse, such as manipulation or exploitation, leading to ethical dilemmas.
Current Approximations
While a true 360-degree dataset is unattainable, partial representations can be built using:
- Psychometric assessments: Tools like the Big Five Personality Test or MBTI.
- Digital footprints: Social media activity, browsing history, or purchasing patterns.
- AI models: Simulations based on observed behaviors or input data (e.g., chatbots mimicking communication styles).
The emergence of highly personalized artificial intelligence, or ‘digital twins,’ appears to be an imminent reality rather than a distant possibility. As these AI counterparts become more sophisticated and prevalent, society faces a crucial challenge: establishing frameworks for meaningful and secure interactions between humans and their digital alter egos.
This technological evolution will require humanity to develop new paradigms for:
- Understanding the boundaries between human and artificial intelligence
- Creating safe and ethical guidelines for human-AI relationships
- Maintaining privacy and security in an interconnected world
- Balancing the benefits of personalized AI with potential risks
- Building trust while preserving human autonomy
As we stand on the threshold of this transformation, proactively addressing these considerations becomes essential for ensuring a harmonious integration of digital twins into our daily lives.