Research indicates that the leading large language models can exhibit a bizarre feature: They can fake their safety alignment to appear harmless, helpful and truthful, hiding toxic behavior.
The Trump administration is looking to develop a process that would have the federal government review the safety of powerful artificial intelligence models before approving their release, according to a report in The New York Times on May 4, 2026. The move would stand in contrast to the administration’s generally anti-regulatory approach to industry and comes in the wake of Anthropic voluntarily postponing the release of its latest AI model, Mythos.
Anthropic was concerned because when it tested Mythos, the model found thousands of vulnerabilities in operating systems and web browsers. The implication was that if a cybercriminal or hostile foreign agent had Mythos, they could penetrate computer systems worldwide and compromise the basic computer code underlying public safety, national economies and military security.
As a result, Anthropic gave limited access only to about 50 companies and organizations managing critical infrastructure as part of its Project Glasswing. The initiative aims to help governments and corporations close software loopholes Mythos has identified. When Anthropic sought to broaden the number of organizations with access to Mythos, the White House objected.
Security experts, meanwhile, have expressed concern that AI researchers in nations such as China, Russia, Iran and North Korea might soon create similarly powerful AI models and use them to threaten or attack other countries, or to create chaos in those countries’ economies.
Major Challenges
As a computer scientist who works on computer security and malware, I have found that it is difficult even to define what safety measures the field should take to make models safe to use. Yet the future of many industries, critical infrastructure, national security and human well-being seems to depend on achieving AI models that are truthful, ethical and reasonable.
The first of these challenges, truthfulness and factual accuracy, came to light when OpenAI’s ChatGPT burst onto the scene in 2022. People worldwide realized that the output of large language models does not necessarily reflect a truthful reality. The goal for AI companies was coherent writing that read as if a human wrote it. If an output was factually flawed, programmers wrote it off as a “hallucination” by the model.
Since AI programs led to some legal catastrophes and stock market panic, AI companies have made at least some effort to ensure that their models avoid falsehoods and inaccuracies.
Nonetheless, false information stated confidently within a sea of solid-sounding text can take on a life of its own. Because of the consequences, research is underway on how to engineer truthfulness into models, or at least prevent hallucination.
Truthfulness and grounding in reality are part of a larger and more general concern about safe AI models. The very pace of their advancement may pose a threat.
Troubling Breaches By AI Bots
Numerous incidents in the past two years show that large language models have already caused harm.
The National Law Review uncovered multiple cases in 2024 and 2025 of teenagers and children using chatbots to explore self-harm, in some cases with lethal consequences. Lawsuits have since been filed claiming that the chatbots encouraged suicide.
In 2025, investigators at cybersecurity company ESET Research discovered a program called PromptLock. It uses large language models to generate ransomware that executes attacks and decides autonomously whether to steal files or encrypt them for ransom.
Anthropic engineers revealed that a group they suspected of being sponsored by the Chinese government used Anthropic’s Claude model to launch a “highly sophisticated espionage campaign” that attempted to infiltrate roughly 30 targets around the world and “succeeded in a small number of cases.” Anthropic said it disrupted the campaign by banning the accounts involved, notifying affected organizations and coordinating with authorities.
In 2024 Microsoft and OpenAI warned that foreign agencies in Russia, Iran, China and other countries used AI tools and large language models to automate attacks and to increase attack sophistication.
Finally, whistleblowers have filed reports about governments using AI tools for real-time decision-making in both military and civilian arenas. In my view, this could lead to a completely new level of potential harm to innocent people.
How to Lessen the Danger
These incidents, and the broad variety of dangers they present, raise the question of whether society should encourage clearer, bolder safety principles for AI corporations and the governments that employ their technology. Are there reliable technical solutions that could keep AI from being used maliciously?
But it may be extremely difficult, if not impossible, to provide a guarantee of safety against malicious users. In 2025 researchers from the U.S. and Europe showed that any filtering safety method imposed on an existing AI model is unreliable.
This means that judgment about truth and safe behavior must be baked into the model, not added later. Sure enough, recent findings show that the leading AI models were 100% successful at circumventing imposed safety measures, a capability known as jailbreaking.
Today there are no definitive answers about what safe AI looks like. I think it’s fair to assert that software engineers do not know how to build reliable protections into AI models. Nor do members of Congress, who in April met to consider special bills on AI ethics and safety.
Steps Forward
Some basic steps could help users and regulators assess the ethical and safety standards in an AI program. Large language models that are open, rather than proprietary, are easier to assess. Knowing which data a model is trained on helps.
Also, AI companies could clearly define their ethics principles. Governments could clearly define and enforce legal constraints that reflect the expectations of society, without being influenced by AI campaigners.
Any vast set of challenges can appear like a mountain: foreboding, encased in moving mist, insurmountable. But as mountain climbers will tell you, clarity in strategy, careful planning and a collaborative persistence can help you scale the peak.
The Conversation is a nonprofit, independent news organization dedicated to unlocking the knowledge of experts for the public good. We publish trustworthy and informative articles written by academic experts for the general public and edited by our team of journalists.