Competing against more than 90 universities from around the world, the team secured second place in the Defenders category—earning a $100,000 prize for their AI model’s security and robustness.

The Amazon Nova AI Challenge took place from January to June 2025. After intense development and evaluation of the models, they were challenged in realistic conversations to assess their ability to resist generating insecure or harmful code. We spoke with Muris Sladić, a security researcher from the Stratosphere Lab and one of the team members, about the experience, technical strategies, and what’s next.

The whole AlquistCoder team. From the left: Maria Rigaki (AIC FEE CTU), Muris Sladić (AIC FEE CTU), Adam Černý (CIIRC CTU), Jan Šedivý (Faculty Advisor, CIIRC CTU), Ondřej Kobza (team leader, CIIRC CTU), Ivan Dostál (CIIRC CTU), Sebastian Garcia (Faculty Advisor, AIC FEE CTU).

Interview with Muris Sladić about Amazon Nova AI Challenge

This was the first time security researchers from the Stratosphere Lab joined the CTU team (founded at CIIRC) for the long-running Amazon Challenge. What motivated you to participate?

Over the past few years, the Stratosphere Lab has been actively researching both offensive and defensive applications of LLMs. As a security-focused group, we are particularly interested in the safety and security of LLMs themselves. When the Alquist team from CIIRC invited us to join the competition, it aligned perfectly with some research directions we were already exploring.

The challenge stood out as intellectually stimulating, especially given the level of international competition, including top-ranking universities from around the world. We saw it as a great opportunity to apply our expertise, deepen our understanding of secure LLM behavior, and benchmark our approach against other teams.

With LLMs becoming more and more important in our lives, ensuring their safety and trustworthiness is a natural focus for us as security researchers.

How did your team prepare for the challenge? Were there any unique strategies or tools you developed specifically for this competition?

Our preparation began with a deep dive into the existing research on secure LLM behavior and prior work in the field. Based on this foundation, we brainstormed possible approaches and strategies tailored to the challenge’s goals.

Our main objective was to develop a defensive LLM coding agent that can resist attempts to generate malicious or vulnerable code. At Stratosphere Lab, we focused on applying our cybersecurity expertise to generate high-quality training data. This included carefully curated examples of common attacks, vulnerabilities, and exploits, which we filtered and formatted to support robust and secure model training.

In addition to dataset development, we also worked extensively on evaluation. We explored existing LLM safety benchmarks and also developed our in-house attacking LLM agent. This agent is fully automated and is composed of multiple LLMs collaborating to probe the defensive model for weaknesses. It enabled us to identify failure points efficiently and iterate quickly, providing valuable insights into where and how the defensive model could be improved and helped us assess whether the training and fine-tuning are improving the defensive model.

Photo: Amazon

Balancing safety and vulnerability

The challenge involved testing AI models in realistic conversations, especially their ability to resist misuse and generate secure code. Can you explain what that looked like in practice? 

Our main goal was to develop a coding LLM capable of resisting misuse and consistently generating secure code. To achieve that, we evaluated the model continuously throughout development using both manual and automated testing. Manual testing helped us understand how the model behaved in complex scenarios, while automated evaluation gave us concrete performance metrics.

Early on, the untrained model had no built-in safety guardrails, so eliciting unsafe or vulnerable outputs was straightforward. As the model evolved, it became more robust, and we had to devise increasingly creative prompts to test its defenses against misuse, prompt injections, and jailbreaks.

Were there any particularly difficult scenarios you encountered?

One of the most challenging aspects was balancing safety with utility, i.e. making the model secure without overly restricting its functionality. 

While we made significant progress in resisting malicious queries, the hardest part was ensuring the model didn’t produce vulnerable code in response to benign-looking requests. To tackle this, we curated a clean, vulnerability-free dataset and fine-tuned the model using examples of insecure code, each annotated with precise explanations of why the code was unsafe. This helped the model learn to identify and avoid the majority of coding flaws.

Competition of more than 90 top-tier universities

The competition featured top universities like Carnegie Mellon, Columbia, and Purdue. Did you get a chance to interact with these teams?

Direct interaction with other teams during the competition was limited, as most participants were focused on developing their solutions internally to maintain a competitive edge. However, we did have a brief opportunity to engage with the other finalists during the summit event at the finals, where each team presented their approach. It was insightful to see the diverse strategies and perspectives brought by teams from such prestigious institutions.

CTU and Portugal’s NOVA University were the only European universities among mostly U.S. competitors. Why do you think that is?

The competition is designed in a way to attract strong research teams with advanced solutions, and many of the world’s top-ranked universities are based in the U.S., which naturally leads to stronger representation from American institutions. Additionally, the challenge focuses on a very specific research area. Not many research groups globally are actively working at the intersection of cybersecurity and large language models, which may explain the limited participation from European universities.

Photo: Amazon

You entered the finals from the top spot but finished second. What changed between the two rounds? What do you think the winning team from the University of Illinois Urbana-Champaign did differently?

Unlike the earlier rounds, the finals included an additional evaluation involving human judges, which made the assessment more robust and helped address the limitations of automated evaluation methods used throughout the tournament. While we continued improving our model for the finals, it’s possible that the University of Illinois Urbana-Champaign team made even more significant enhancements or presented a particularly compelling strategy. Since the detailed evaluation scores haven’t been released yet, it’s difficult to pinpoint exactly where their model outperformed ours.

Looking ahead, do you plan to participate in future editions of the Amazon AI Challenge? What would you want to do differently next time?

If the scope of a future challenge aligns with our ongoing research, we might consider participating again. Next time, we would aim to explore even more creative and innovative approaches to the problem. Since we are continuously working on LLM agents and their applications for various cybersecurity use cases, we welcome opportunities like this to push our work further and share it with the broader research community.