Wisdom of My Crowd - Everything About AI

Everything about AI: research, engineering, application, and some fun stuff


Research


What should you learn?

Learn compilers, distributed systems, formal methods, HPC, etc. instead of data science for the next decade.

AI is massively crowded, and there is no low-hanging fruit left in model building now.


EE 2 AI - Yi

For all young folks who are interested in intelligence, I highly recommend studying five subjects and understanding how they are related: compressive sensing (sparse coding), information theory, control theory, game theory, and optimization. The rest is mostly realization...


AI and Control Theory

It strikes me that AI alignment/safety is like controllability, and AI interpretability is like observability. These are both classical concepts from system theory that seem to be largely unknown to AI researchers.
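
For readers without a systems background, a quick refresher (standard linear-systems material, nothing LLM-specific): for the textbook system

    \dot{x} = Ax + Bu, \qquad y = Cx

controllability asks whether \mathrm{rank}\,[B \;\; AB \;\; \cdots \;\; A^{n-1}B] = n, i.e. whether inputs can steer the internal state anywhere (the alignment analogue: can we drive the system to the behavior we want?). Observability asks whether the stacked matrix [C;\; CA;\; \cdots;\; CA^{n-1}] has rank n, i.e. whether the internal state can be reconstructed from outputs alone (the interpretability analogue: can we infer what is happening inside from what the system does?).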


Hurdles to truly revolutionary AI

  1. Compression - solved ✅

  2. Search - unsolved ❌

  3. Deduction - unsolved ❌

Some think (2) and (3) are already solved but that's just because we're training on a lot of data.


Transformer - Pedro

A transformer is a differentiable rewriting system, and rewriting systems are Turing-complete and closely matched to the workings of language. So it's not surprising transformers learned on a lot of data are powerful.
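
To make "rewriting system" concrete, here is a toy string-rewriting loop in Python (purely illustrative; it says nothing about how transformers are actually implemented). The point is just that repeatedly applying local rewrite rules is a surprisingly general model of computation:

    # Toy string-rewriting system: repeatedly apply the first matching rule.
    # These two rules implement unary addition: "111+11" -> "11111".
    RULES = [
        ("1+1", "11"),  # merge two unary numbers across a '+'
        ("+1", "1"),    # absorb a leftover '+'
    ]

    def rewrite(s: str, max_steps: int = 1000) -> str:
        for _ in range(max_steps):
            for lhs, rhs in RULES:
                if lhs in s:
                    s = s.replace(lhs, rhs, 1)  # one local rewrite step
                    break
            else:
                return s  # no rule applies: normal form reached
        raise RuntimeError("did not terminate")

    print(rewrite("111+11"))  # '11111' (3 + 2 = 5 in unary)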


Second bitter lesson

I propose a Second Bitter Lesson:

“In LLMs, optimizing directly for the thing you want is more powerful than hoping it emerges.”

This was the lesson from RLHF: base models are hopeless text predictors that nobody would ever use; with InstructGPT/GPT-3.5, everyone was shocked by the capability jump, which came from simply rewarding the outputs we want over the ones we don’t.

In the Process Supervision paper (and the presumed Q* breakthrough), we get better reasoning when we … reward good reasoning, rather than merely rewarding final answers that look correct.

There are probably many more examples of this principle, and I expect it to become increasingly important as LLMs get better. I suspect you can get an LLM to do basically anything you want by providing sufficiently fine-grained reward signals at scale.

And yet, this is a bitter lesson because people are continually shocked when it works. They’ll come up with complicated schemes before trying the thing that’s obvious and simple when you take a step back.


Not much structure in the structured data

One of the biggest ironies in Data Science is that we call tabular data “structured data”, while all other data modalities (text, audio, images, video, etc.) are considered “unstructured data.”

The reason this is ironic is that tabular data can in principle be anything: there is no a priori structure or relationship between the different columns/features in your dataset.

By contrast, the other data modalities have an enormous amount of intrinsic local structure. Most of the time you can’t just jumble the words of a text around, and you definitely can’t arbitrarily permute the pixels of an image.

And this smooth local structure is the primary reason why neural networks perform well on those data modalities while they struggle with tabular data.

Neural networks have an inductive bias toward smoothness and locality that becomes a liability when dealing with datasets that possess no such structure.


Engineering


Using AI for coding

LLMs for coding work best when you know exactly what the solution should look like but don't want to type it out and cross-check all the details line by line.

If you couldn't build it yourself, the LLM isn't going to magically uplevel you. It's just going to make slop in bulk.

The right way is:

Step 1 is to break down your problem such that you can get useful snippets

Step 2 is to use your programming skill to combine them


AI engineers

"AI engineers" specialize in repurposing large language models for general reasoning and code generation. They build the layer between ML models and end products.

The key differences from other roles are:

  • Compared to ML engineers, AI engineers focus more on the application layer than on research, training, and inference.

  • Compared to software engineers using AI tools, AI engineers specialize specifically in integrating and customizing large language models like GPT-4 into products. They have expertise in techniques like retrieval-augmented generation (RAG), prompt engineering, etc.

So in summary, AI engineers have a specialized focus on leveraging foundation models to build AI-powered applications, sitting between basic research and software development.


Prompt Engineering: searching a key-value store with the right key

My interpretation of prompt engineering is this:

  1. An LLM is a repository of millions of vector programs mined from human-generated data, learned implicitly as a by-product of language compression. A "vector program" is just a very non-linear function that maps part of the latent space onto itself.

  2. When you're prompting, you're fetching one of these programs and running it on an input -- part of your prompt serves as a kind of "program key" (as in a database key) and part serves as program argument(s). For example, in "write this paragraph in the style of Shakespeare: {my paragraph}", the part "write this paragraph in the style of X: Y" is a program key, with arguments X=Shakespeare and Y={my paragraph}.

  3. The program fetched by your key may or may not work well for the task at hand. There's no reason why it should be optimal. There are lots of related programs to choose from.

  4. Prompt engineering is a search over many keys to find a program that is empirically more accurate for what you're trying to do (a sketch follows after this list). It's no different from trying different keywords when searching for a Python library.

  5. Everything else is unnecessary anthropomorphism on the part of the prompter. You're not talking to a human who understands language the way you do. Stop pretending you are.
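
A minimal sketch of point 4, as promised above: treat prompt engineering as an empirical search over candidate "keys," scored on a small eval set. Here llm and score are hypothetical stand-ins for your model call and your task metric; fill them in with whatever you use.

    # Prompt engineering as key search: try several phrasings of the same
    # "program key" and keep whichever scores best on a held-out eval set.
    def llm(prompt: str) -> str:
        ...  # hypothetical: call your model of choice here

    def score(output: str, expected: str) -> float:
        ...  # hypothetical: task metric, e.g. exact match or similarity

    CANDIDATE_KEYS = [
        "Rewrite this paragraph in the style of {author}: {text}",
        "Imitate {author}'s prose when rewriting: {text}",
        "As {author} would have written it: {text}",
    ]

    def best_key(eval_set: list[tuple[dict, str]]) -> str:
        # eval_set holds (template arguments, expected output) pairs
        def avg(key: str) -> float:
            return sum(score(llm(key.format(**args)), exp)
                       for args, exp in eval_set) / len(eval_set)
        return max(CANDIDATE_KEYS, key=avg)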


Poor man's RLHF

  1. Have user indicate when model is correct
  2. Store associated (input, output) in embedding index
  3. At inference time, retrieve nearest K previous inputs
  4. Put these top K (input, output) pairs into the context as few-shot examples (a minimal sketch follows below)
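
A minimal sketch of the recipe, with a toy bag-of-bytes embed function standing in for a real sentence embedder, and brute-force cosine similarity standing in for a real vector index:

    import numpy as np

    def embed(text: str) -> np.ndarray:
        # toy stand-in: bag-of-bytes vector; swap in a real embedding model
        v = np.zeros(256)
        for b in text.encode():
            v[b] += 1.0
        n = np.linalg.norm(v)
        return v / n if n else v

    # Steps 1-2: store (input, output) pairs the user marked as correct
    memory: list[tuple[np.ndarray, str, str]] = []

    def record_good_example(inp: str, out: str) -> None:
        memory.append((embed(inp), inp, out))

    # Steps 3-4: retrieve the K nearest past inputs, prepend as few-shot examples
    def build_prompt(query: str, k: int = 3) -> str:
        q = embed(query)
        ranked = sorted(memory, key=lambda m: -float(m[0] @ q))  # cosine sim
        shots = "".join(f"Input: {i}\nOutput: {o}\n\n" for _, i, o in ranked[:k])
        return f"{shots}Input: {query}\nOutput:"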

Options for customizing LLM output

  1. Pre-train on domain data (eg legal docs)
  2. Fine-tune on supervised tasks
  3. RL w/ reward model (RM)
     3a. If the RM is trained to predict what a human says is good -> RLHF
     3b. If the RM is trained to predict what an AI says is good -> RLAIF
  4. prompt w/ domain context

Knobs for RAG

Data scientist's approach to RAG:
What are some of the "hyperparameters" in RAG? (A config sketch follows the list.)

  • Indexing algo
  • embedding models
  • alpha for hybrid search
  • re-ranking models
  • LLMs
  • prompt engineering vs. fine-tuning
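
One concrete (purely illustrative) way to operationalize this: treat the knobs as a single config object you sweep over, just like any other hyperparameter search. Every value below is a made-up placeholder, not a recommendation:

    from dataclasses import dataclass

    @dataclass
    class RAGConfig:
        index_algo: str = "hnsw"                # indexing algorithm
        embedding_model: str = "some-embedder"  # placeholder name
        hybrid_alpha: float = 0.5               # 0 = pure lexical, 1 = pure vector
        reranker: str | None = "some-cross-encoder"
        llm: str = "some-chat-model"
        strategy: str = "prompt"                # "prompt" vs. "finetune"

    # sweep configs the same way you would sweep learning rates
    candidates = [RAGConfig(hybrid_alpha=a) for a in (0.25, 0.5, 0.75)]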

DL training tips

If you lower your batch_size by a ratio of x, decrease your learning_rate by the same ratio x.

And the other way around.
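
As a one-liner (this is the linear-scaling heuristic; for Adam-style optimizers a square-root scaling is sometimes used instead, so treat it as a starting point, not a law):

    # Linear scaling heuristic: shrink batch_size by x -> shrink learning_rate by x
    base_batch_size, base_lr = 256, 3e-4
    new_batch_size = 64
    new_lr = base_lr * (new_batch_size / base_batch_size)  # 7.5e-05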

weight_decay tip:

When training with high weight_decay, don't read the loss value blindly. Always measure your real metric alongside it.

It might look like the loss isn't going anywhere while your metric still improves.


DL training baseline

Starter pack for a strong deep learning baseline model:

  • Prepare a good train/validation data split
  • Browse recent Kaggle top solutions for similar problems and identify the most popular foundation model (e.g., DeBERTa for NLP or EfficientNet for CV)
  • Tune the learning rate and number of epochs, optimizing the validation metric

These simple steps will already take you quite far. You can subsequently focus on tuning more advanced hyperparameters.


Building with LLMs: prompting, evals, agents, LLMOps

  • Prompt engineering, RAG, and finetuning can all be leveraged, but which one you use and when depends on the task at hand. There is a flow that works great, but it's important to start with a simple baseline using prompt engineering (e.g., few-shot learning). Then use RAG if the context needs knowledge enrichment. Finetuning helps when you want the model to perform a specific task, such as converting an email to a specific tone or a block of text to a specific structure.

  • Evaluation is hard! Regardless of how you use and apply LLMs, it won't matter much if you are not consistently evaluating. Hence it's important to start with a simple baseline that allows you to put together a good evaluation pipeline for the future. You should always be evaluating, even after you launch your AI application.

  • AI assistants are here, and they are the next wave of AI innovation. Building the most powerful and robust personalized assistants will be challenging, but they can help transform our personal and professional lives. It's also a great opportunity and time for businesses to think about what can be done with personalized AI assistants and how to leverage state-of-the-art LLMs and large-scale data. This is not easy, but it's easier to get started today with the introduction of new features like GPTs.

  • LLMOps (LLM operations) is going to be a challenge and something we need to start thinking about now. Building with LLMs is becoming more complex: it's no longer just a simple LLM call returning a text completion. The LLM APIs we use now involve complex components like retrievers, threads, prompt chains, and access to tools, all of which need to be logged and monitored. Meanwhile, access to LLM APIs is becoming cheaper.

  • For AI businesses, it's important to think about ways to quickly adapt and embrace AI strategically. You should do so responsibly, always trying to innovate on product experience, and always putting your users first. Not everything will require AI but it's important you experiment, explore, and measure the success of AI applications. It's still a good time to get involved but it has to happen now. No more sitting back.


RAG + FT

The deeper I go into LLM use cases, the more the need for customization.

RAG and finetuned models bridge that gap. But these solutions are not easy to get right. RAG only works if your retriever is effective and finetuning only makes sense if the data quality is good.

That being said, I see a lot of synergies with these two approaches for enabling even better customization of LLMs.

Example: a finetuned model can get you the right tone/style for a customer-success chatbot, but its usability improves further given optimal context, which RAG can provide.

This is why I typically advise dev teams to break a task down into smaller subtasks which could enable using a combination of approaches that enrich your LLM-powered solution. Many such cases.


LoRA

I ran hundreds if not thousands of LoRA & QLoRA experiments to finetune open-source LLMs, and here’s what I learned:

  1. Despite the inherent randomness of LLM training (or when training models on GPUs in general), the outcomes remain remarkably consistent across multiple runs.

  2. QLoRA presents a trade-off that might be worthwhile if you're constrained by GPU memory. It offers 33% memory savings at the cost of a 33% increase in runtime.

  3. When finetuning LLMs, the choice of optimizer shouldn't be a major concern. While SGD on its own is suboptimal, there's minimal variation in outcomes whether you employ AdamW, SGD with a scheduler, or AdamW with a scheduler.

  4. While Adam is often labeled a memory-intensive optimizer due to its introduction of two new parameters for every model parameter, this doesn't significantly affect the peak memory demands of the LLM. This is because the majority of the memory is allocated for large matrix multiplications rather than retaining extra parameters.

  5. For static datasets, iterating multiple times as done in multi-epoch training might not be beneficial. It often deteriorates the results, probably due to overfitting.

  6. If you're incorporating LoRA, ensure it's applied across all layers, not just to the Key and Value matrices, to maximize model performance.

  7. Adjusting the LoRA rank is essential, and so is selecting an apt alpha value. A good heuristic is setting alpha at twice the rank's value (see the sketch after this list).

  8. 7B models can be finetuned efficiently within a few hours on a single GPU with 14 GB of RAM. With a static dataset, optimizing an LLM to excel across all benchmark tasks is unattainable; addressing this requires diverse data sources, or perhaps LoRA is not the ideal tool.
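
Points 6 and 7 expressed with Hugging Face's peft library (a sketch, as promised above; target module names vary by architecture, and the ones below are typical for Llama-style models):

    from peft import LoraConfig

    rank = 16
    config = LoraConfig(
        r=rank,
        lora_alpha=2 * rank,  # point 7: alpha at twice the rank
        # point 6: apply LoRA to all linear layers, not just K/V projections;
        # these module names are illustrative for Llama-style models
        target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                        "gate_proj", "up_proj", "down_proj"],
        lora_dropout=0.05,
        task_type="CAUSAL_LM",
    )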


Speculative decoding

The trick that allows ChatGPT to generate text so fast.

TL;DR:

  1. Generate a few steps ahead with a small model.
  2. Check them all (in one parallel pass) with the large model.
  3. If they all agree, you just saved yourself many slow steps of large-model generation.

But:

  • You need a separate small model for this that is both fast enough and accurate enough.
  • It is not easy to keep two models in memory and actually end up saving time rather than blowing up memory.

So instead: Train special heads on top of the frozen powerful model you already have and use them for the same purpose.

What an elegant idea. Well done, wow.
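
A minimal greedy sketch of steps 1-3 (real implementations verify all draft positions in one batched forward pass and use a probabilistic accept/reject rule that preserves the large model's distribution; draft_model and big_model are hypothetical callables returning the next token id):

    def speculative_step(prompt: list[int], draft_model, big_model, k: int = 4) -> list[int]:
        # 1. The small model proposes k tokens autoregressively (cheap).
        draft, ctx = [], list(prompt)
        for _ in range(k):
            t = draft_model(ctx)
            draft.append(t)
            ctx.append(t)

        # 2. The large model checks each proposed position (on a GPU this
        #    is a single parallel forward pass, not k separate calls).
        accepted, ctx = [], list(prompt)
        for t in draft:
            if big_model(ctx) != t:
                break  # 3. first disagreement: discard the rest of the draft
            accepted.append(t)
            ctx.append(t)

        # Always emit one token from the big model, so progress is guaranteed
        # even when the very first draft token is rejected.
        accepted.append(big_model(prompt + accepted))
        return prompt + accepted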


Data engineering skills

Top five skills to break into data engineering:

  • data modeling
      - Dimensional data modeling: what analysts use
      - Relational data modeling: what software engineers use
      - One Big Table data modeling: a newer approach that is sometimes appropriate

  • distributed compute
      - The difference between map and reduce
      - Shuffle tuning
      - Parallelism management
      - Tools: Spark, BigQuery, Trino, and Snowflake

  • data quality
      - Automated data quality checks
      - Pipeline specification (e.g. Airbnb's MIDAS process)
      - Data contracts (e.g. the write-audit-publish pattern)
      - Tools: Great Expectations, Gable, Deequ

  • data governance
      - Handling data retention appropriately
      - Handling PII appropriately
      - Data lineage (i.e. knowing where data came from)
      - Tools: Apache Atlas

  • data architecture
      - Real-time vs. batch processing
      - Low-latency serving layers
      - Data lake vs. data lakehouse vs. data warehouse
      - Orchestration tools: Flink, Airflow, Mage, Prefect, Spark Structured Streaming
      - Data-at-rest tools: Iceberg, Delta Lake, Hudi


Pandas and Polars

Polars and pandas work together like peanut butter and jelly.

With Pandas 2.0 supporting PyArrow, you can do something like this:

  1. Do exploratory work in Pandas
  2. Zero-copy convert to Polars for heavy computations
  3. Zero-copy back to Pandas for further analysis

They solve different problems, so just use them together.
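
A sketch of that round trip (zero-copy requires Arrow-backed data, so pyarrow must be installed; the file and column names are made up):

    import pandas as pd
    import polars as pl

    # 1. Exploratory work in pandas, using Arrow-backed dtypes from pandas 2.0
    df = pd.read_csv("data.csv", dtype_backend="pyarrow")

    # 2. Hand off to Polars for the heavy computation (zero-copy for Arrow columns)
    heavy = pl.from_pandas(df).group_by("user_id").agg(pl.col("amount").sum())

    # 3. Back to pandas for further analysis and plotting
    result = heavy.to_pandas(use_pyarrow_extension_array=True)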


AI/ML vs SW

With some exceptions, the biggest impacts in AI come from people who are experts at both software and machine learning.

Though most people expect the opposite, it’s generally much faster to learn ML than software.

So great software engineers tend to have outsized potential in AI.


ML in the details

Machine learning engineering is primarily about patience, attention to detail, and thinking deeply about small things. The day-to-day can be quite tedious & frustrating — but the results of proper execution make it worth it.


Things that Kagglers have known forever but AI people are just rediscovering

  • Knowledge distillation
  • Pseudo labeling
  • DeBERTa
  • The dominance of XGBoost
  • How papers overfit on benchmarks
  • Data generation
  • How bad low-quality data is
  • Boosting with SVMs/XGBoost
  • Importance of postprocessing
  • Importance of rapid iteration
  • Just how much you can do on 1 GPU

(Model) size matters

Lately I have been exclusively using 70B models locally for assistance and code completion. They simply follow directions significantly better and use context correctly. Smaller models are good for automation and narrow tasks, but not very good as “intelligent assistants.”


Footgun

The biggest footgun when fine-tuning and running inference with LLMs is incorrect prompt templates, EOS/BOS tokens, and other formatting details (spaces, etc.).

I’ve learned to be paranoid because I always find mistakes.

Repeating @johnowhitaker’s advice: decode your tokens right before calling your model’s forward pass as a debugging step (a sketch follows below).
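
A sketch of that debugging step with a Hugging Face tokenizer (assumes the transformers library; the model name is illustrative):

    from transformers import AutoTokenizer

    tok = AutoTokenizer.from_pretrained("some-model")  # illustrative name

    ids = tok("### Instruction:\nSummarize the text below.")["input_ids"]

    # Paranoia check: print exactly what the model will see,
    # special tokens (BOS/EOS), template markers, stray spaces and all.
    print(tok.decode(ids, skip_special_tokens=False))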


Applications


AI applications

LLM apps are moving past standard chatbots:

  • legal document generation
  • RFP response generator and RFP summarizer
  • on-the-fly dashboards and insights
  • contract creator
  • customer support and tech support bot
  • marketing and sales email generator
  • financial document analyzer
  • predictive model creator or AI building AI

AI startup

If you’re building an AI startup, you should aim for your solution to be 10X better than the alternative. It should be obvious to the user that there’s simply no going back to what they were doing before. Switching costs in software are high, and the status quo is strong.


Build with AI

Trained AI models are not products. All the best models will ultimately be commoditized as open weights/source, and the cost of inference is rapidly approaching zero.

So where does value get captured in AI? Domain-specific products with deep industry context and polish.

There is also a huge amount of engineering involved in building compelling products out of existing open models and approaches. This is why OpenAI may never need to innovate at the architecture or optimization level at all and can focus on product quality; note that this is already largely true, given they did not invent the Transformer.


New paradigm

LLMs are intuition machines, but theirs is a different kind of intuition from the human one. Humans, with their limited context windows, are abstracting all the time. That isn't what LLMs do with their huge context windows: LLMs brute-force pattern matching in a different way than humans do.

So just as a von Neumann computer is far more capable than a human at following logical steps, a deep learning system is far more capable at following inductive heuristics. But both lack the ability to create new and more useful abstractions.

What is clear is that LLMs are tools complementary to our human cognition. They augment our thinking and allow us to think deeper and harder about concepts. This is why it's important to know the Patterns for Generative AI.

We are in an entirely new reality now! Our human intuition is being augmented with AI intuition, and the synergy is an entirely different kind of cognition, one that will catalyze civilization into a world of abundance rather than scarcity. Problems that humanity has been unable to solve for centuries will be solved by this new kind of "centaur" cognition.


AI assistant

The near future of AI is to serve as a universal assistant. Whatever you create on a computer -- slides, code, spreadsheets, docs, tunes, 3D environments, etc. -- you will be able to leverage a digital assistant to help you with boilerplate, filling in details, autocomplete, etc.


Getting useful values out of current AI

Our current ML models probably won't get us to AGI without some serious changes.

But that's not going to stop me from getting as much value as possible out of them in the meantime.


Building on top of AI

All the enterprise AI startups going after verticals have the right idea. Pick a market, deeply understand the workflows, build simple software to model the workflows, and use AI to augment the human judgment involved. Huge opportunity in 1,000’s of categories.


Value of AI

The value of AI lies in generalization, not special-casing. It's easy to make an AI demo that works well in one situation. It's hard to make an AI system that actually creates value in the real world. That requires adaptability. Robustness. Reliability.


Use whatever tools you can to make AI work

We don’t need even bigger and better LLMs right now. We just need the current apps and systems built around them to stop hallucinating, and to hand off processes and functions that can be done better by other means to bespoke subsystems.


What types of tasks can AI help with?

I don't think LLMs only solve "typing" (i.e. autocomplete). The way I see it, they solve a broad category of automation problems, where

  1. The task medium is natural language
  2. Many examples of the task were featured in the training data
  3. You don't need >90% accuracy

Domains AI can help

AI is strong today in bounded domains - coding assist, driver assist, translation, transcription, and stock art.


What goes into building an AI application

  • Query rewriting, lexical+embedding retrieval, ranking
  • Lots of extraction (key phrase, name, date)
  • Plugins (calendar, summarization, context, etc.)
  • Good ol' heuristics (date/name boosting, recency)
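
A toy sketch of how those pieces compose at ranking time (the weights, boosts, and field names are all made up; real systems tune them):

    import math
    import time

    def rank(docs: list[dict], lexical: dict, semantic: dict,
             query_names: set[str]) -> list[dict]:
        """Blend lexical + embedding scores with good ol' heuristics.

        lexical / semantic map doc id -> retrieval score (e.g. BM25 / cosine);
        query_names holds names extracted from the query, for name boosting.
        """
        def score(d: dict) -> float:
            s = 0.6 * lexical.get(d["id"], 0.0) + 0.4 * semantic.get(d["id"], 0.0)
            if query_names & set(d.get("names", [])):
                s *= 1.5  # name-match boost
            age_days = (time.time() - d["timestamp"]) / 86400
            s *= math.exp(-age_days / 30)  # recency decay
            return s

        return sorted(docs, key=score, reverse=True)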

AI use cases: data engineering

Data cleaning / generation is probably the biggest practical application of LLMs


AI use cases: seq2seq

Despite the fact that most LLMs have chat capability, and many are even finetuned to chat, this capability is useless in a commercial B2C or B2B setting. Multi-stage chats are unreliable: they quickly diverge from the business objective, hallucination compounds with each stage of the dialogue, and the beginning of the dialogue becomes partially or fully forgotten by the model.

This is why LLMs' primary use is, and will likely remain, single-stage: classification, regression, entity recognition, question answering, data labeling, paraphrasing, and machine translation. These are all use cases where sequence-to-sequence models like Flan-T5 XXL work best. Chat LLMs are fun but useless for business.


Best AI systems: deep exploration of information

The best LLM-powered products I have used all have something in common: they enable deeper exploration of information.

If you are building with LLMs, think beyond just enhancing style or transforming text... think about how LLMs might enable a completely new experience that encourages deeper exploration and learning.

This builds stickiness and pushes toward more innovative AI products. I regularly encourage AI startups to do this when I sense they are not thinking deeply about how to apply LLMs to their use cases. Don't be afraid to experiment; today is the right time to experiment.


Innovate with AI

The most exciting part of the recent AI announcements is the unique use cases they will unlock.

So many people are focused on quickly using LLMs and these multimodal systems to replace old systems. That's great, but simply doing so won't be enough long term.

Every modality added to the stack unlocks the opportunity for more complex innovation. But it also gives everyone else the same power to innovate. In other words, I think it's mostly a waste of time and resources to build something that merely improves an older system or solution incrementally using AI.

The goal should be to unlock new experiences and use cases. That's the power of the AI systems we are building today. I would suggest the same for people trying to apply AI in the workplace. Don't be afraid that AI will replace you, because it won't; it will replace boring and repetitive tasks regardless. AI should be seen as an enabler. It will unlock new opportunities and new ways to work and be creative. Search for those opportunities instead of worrying about some boring tasks being delegated to these AI systems. In my view, no matter the profession, there is always something creative, unique, and innovative that every human can contribute, and you can leverage AI to help you. Creativity is our key differentiator; accelerate it with AI.

This whole approach is more sustainable long-term, enables you to create your own path, and avoids getting squashed by these big players with every incremental change or feature. You don't want to be in that race given the velocity of things right now.

This is easier said than done. It requires patience, resisting the urge to be relevant in the short term, and a willingness to experiment and fail, while building insights and a strong team/community for the long term.

I tell many of my peers and many companies I advise that this is the beginning, not the end. On the surface, it might feel like we have solved it all; this couldn't be further from the truth. There is no need to rush to commit if it isn't clear AI could be transformative to your business right away. Think about how AI can unlock new and unique experiences. That's the way.


AI Fun


The psychology of AI alarmists

Elon Musk: Savior complex. Needs something to save the world from.

Geoff Hinton: Ultra-leftist, world-class eccentric.

Yoshua Bengio: Hopelessly naive idealist.

Stuart Russell: His only impactful application ever was to nuclear test monitoring. Fixated on nuclear analogies and regulation ever since.

Max Tegmark: Shallow, opportunist, publicity seeker.

Yuval Harari: Clueless purveyor of vacuous nonsense.

Gary Marcus: Hates deep learning. When it became impossible to say it doesn’t work, switched to saying it’s too dangerous.

Nick Bostrom: Confuses his armchair with the real world.

Sam Altman: Seeming earnest about AI risks is part of the sales pitch.


"Hello World" of ML/AI

2013: RandomForestClassifier on Iris

2015: XGBoost on Titanic

2017: MLPs on MNIST

2019: AlexNet on CIFAR-10

2021: DistilBERT on IMDb movie reviews

2023: Llama 2 on Alpaca 50k?


The bitter lesson

gpu poors fiddle with RAG reranking

gpu riches train 10m context window with perfect recall

always need both but that lesson sure is bitter