Apple has published a technical paper detailing the models that it developed to power Apple Intelligence, the range of generative AI features headed to iOS, macOS, and iPadOS over the next few months.
Ethical Training Practices
In the paper, Apple pushes back against accusations that it took an ethically questionable approach to training some of its models, reiterating that it didn’t use private user data and drew on a combination of publicly available and licensed data for Apple Intelligence.
Data Sources
“[The] pre-training data set consists of … data we have licensed from publishers, curated publicly available or open-sourced datasets and publicly available information crawled by our web crawler, Applebot,” Apple writes in the paper. “Given our focus on protecting user privacy, we note that no private Apple user data is included in the data mixture.”
The Pile Controversy
In July, Proof News reported that Apple used a data set called The Pile, which contains subtitles from hundreds of thousands of YouTube videos, to train a family of models designed for on-device processing. Many YouTube creators whose subtitles were swept up in The Pile weren’t aware of and didn’t consent to this; Apple later released a statement saying that it didn’t intend to use those models to power any AI features in its products.
Apple Foundation Models (AFM)
The technical paper, which peels back the curtain on the models Apple first revealed at WWDC 2024 in June, called Apple Foundation Models (AFM), emphasizes that the training data for the AFM models was sourced in a “responsible” way — or responsible by Apple’s definition, at least.
Training Data Details
The AFM models’ training data includes publicly available web data as well as licensed data from undisclosed publishers. According to The New York Times, Apple reached out to several publishers toward the end of 2023, including NBC, Condé Nast, and IAC, about multi-year deals worth at least $50 million to train models on publishers’ news archives. Apple’s AFM models were also trained on open source code hosted on GitHub, specifically Swift, Python, C, Objective-C, C++, JavaScript, Java, and Go code.
Licensing Issues
Training models on code without permission, even open code, is a point of contention among developers. Some open source codebases aren’t licensed at all, developers argue, while others don’t allow AI training in their terms of use. But Apple says that it “license-filtered” for code to try to include only repositories with minimal usage restrictions, like those under an MIT, ISC, or Apache license.
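In rough terms, license-filtering of this kind amounts to checking each repository’s declared license against an allowlist of permissive licenses. The sketch below is a hypothetical illustration of that idea; the SPDX-style identifiers and the repository data structure are assumptions for the example, not details from Apple’s paper.

```python
# Hypothetical sketch of license-based filtering, in the spirit of the
# "license-filtered" step Apple describes. Repo structure and license
# identifiers are illustrative assumptions, not Apple's actual pipeline.

# SPDX-style identifiers for the permissive licenses the paper names
# (MIT, ISC, Apache).
PERMISSIVE_LICENSES = {"MIT", "ISC", "Apache-2.0"}

def license_filter(repos):
    """Keep only repositories whose declared license is permissive."""
    return [r for r in repos if r.get("license") in PERMISSIVE_LICENSES]

repos = [
    {"name": "swift-utils", "license": "MIT"},
    {"name": "gpl-tool", "license": "GPL-3.0"},
    {"name": "parser", "license": "Apache-2.0"},
    {"name": "unlicensed-snippets", "license": None},  # no license at all
]

kept = license_filter(repos)
print([r["name"] for r in kept])  # → ['swift-utils', 'parser']
```

Note that a filter like this drops both copyleft-licensed code and entirely unlicensed code, which is the conservative behavior you would want if the goal is minimal usage restrictions.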
Mathematics Skills Enhancement
To boost the AFM models’ mathematics skills, Apple specifically included in the training set math questions and answers from webpages, math forums, blogs, tutorials, and seminars, according to the paper. The company also tapped “high-quality, publicly-available” data sets (which the paper doesn’t name) with “licenses that permit use for training … models,” filtered to remove sensitive information.
Training Data Volume
All told, the training data set for the AFM models weighs in at about 6.3 trillion tokens. (Tokens are bite-sized pieces of data that are generally easier for generative AI models to ingest.) For comparison, that’s less than half the number of tokens — 15 trillion — that Meta used to train its flagship text-generating model, Llama 3.1 405B.
Fine-Tuning and Mitigation
Apple sourced additional data, including data from human feedback and synthetic data, to fine-tune the AFM models and attempt to mitigate any undesirable behaviors, like spouting toxicity.
Ethical AI and Competitive Pressures
“Our models have been created with the purpose of helping users do everyday activities across their Apple products, grounded in Apple’s core values, and rooted in our responsible AI principles at every stage,” the company says.
Legal and Ethical Challenges
There’s no smoking gun or shocking insight in the paper — and that’s by careful design. Rarely are papers like these very revealing, owing to competitive pressures but also because disclosing too much could land companies in legal trouble.
Fair Use Doctrine Debate
Some companies that train models by scraping public web data assert that their practice is protected by fair use doctrine. But it’s a matter that’s very much up for debate and the subject of a growing number of lawsuits.
Webmaster Control and Artist Concerns
Apple notes in the paper that it allows webmasters to block its crawler from scraping their data. But that leaves individual creators in the lurch. What’s an artist to do if, for example, their portfolio is hosted on a site that refuses to block Apple’s data scraping?
Courtroom Battles and Future Positioning
Courtroom battles will decide the fate of generative AI models and the way they’re trained. For now, though, Apple’s trying to position itself as an ethical player while avoiding unwanted legal scrutiny.