That Morning 50,000 Users Couldn't Log In: My First Production Crash
June 15th, 2021, 6:47 AM. I was having phở for breakfast at my favorite spot in District 1 when my phone exploded. Not literally - though that would have been less stressful than what actually happened.
17 Slack notifications. 8 missed calls. 23 WhatsApp messages. All variations of the same theme: “THE APP IS BROKEN. NOBODY CAN LOGIN. FIX IT NOW.”
I threw 50,000 VND on the table and ran. Literally ran. Four blocks to the office, laptop bouncing in my backpack, already sweating through my shirt in the Saigon morning humidity. By the time I reached the office, Crashlytics showed 47,000 crashes in the last hour.
This is the story of the worst morning of my career, and paradoxically, the day I became a real engineer.
The Calm Before the Storm
Let me rewind two days. We had just pushed version 3.7.0 to production. It was a minor release - some UI tweaks, performance improvements, and one seemingly innocent change: migrating from SharedPreferences to DataStore for better async handling.
// The migration code that seemed so simple
class PreferencesMigration(private val context: Context) {
    suspend fun migrate() {
        val sharedPrefs = context.getSharedPreferences("user_prefs", Context.MODE_PRIVATE)
        val dataStore = context.dataStore
        // Migrate all preferences
        sharedPrefs.all.forEach { (key, value) ->
            when (value) {
                is String -> dataStore.edit { prefs ->
                    prefs[stringPreferencesKey(key)] = value
                }
                is Int -> dataStore.edit { prefs ->
                    prefs[intPreferencesKey(key)] = value
                }
                is Boolean -> dataStore.edit { prefs ->
                    prefs[booleanPreferencesKey(key)] = value
                }
                // ... handle other types
            }
        }
        // Clear old SharedPreferences after successful migration
        sharedPrefs.edit().clear().apply()
    }
}
I had tested it thoroughly. On my device. On the QA team’s devices. On the beta program with 500 users. Everything worked perfectly.
What I didn’t test was what happens when 50,000 users try to migrate simultaneously at 6:30 AM - peak login time for our banking app.
The Crime Scene
When I finally got to my desk and opened Crashlytics, the stack trace made no sense:
Fatal Exception: java.lang.IllegalStateException: SharedPreferences file /data/data/com.bankingapp/shared_prefs/user_prefs.xml already exists but is not readable
    at android.app.SharedPreferencesImpl.loadFromDisk(SharedPreferencesImpl.java:115)
    at android.app.SharedPreferencesImpl.<init>(SharedPreferencesImpl.java:73)
    at android.app.ContextImpl.getSharedPreferences(ContextImpl.java:419)
    at com.bankingapp.data.PreferencesMigration.migrate(PreferencesMigration.kt:8)
    at com.bankingapp.MainActivity.onCreate(MainActivity.kt:47)
“Already exists but is not readable”? How is that possible?
I started digging through the crash reports. They all had something in common - they were from users who had been using the app for over a year. New installations were fine. Recent users were fine. But our loyal, long-term users? Completely locked out.
The Investigation
First instinct: roll back. But our product manager shut that down immediately. “We can’t roll back. The new version fixes a critical security vulnerability. We need to fix forward.”
Great. No pressure.
I started by trying to reproduce the issue. I installed the old version, added a bunch of preferences, then updated to the new version. It worked fine. I tried with different amounts of data. Still fine. I was about to scream when our junior developer, Minh, asked a simple question:
“What if the file permissions are wrong?”
File permissions. On Android. Each app has its own sandbox, so permissions shouldn’t matter, right? Wrong.
I opened a root shell on our test device (yes, we had rooted test devices for exactly this purpose) and checked:
banking_test:/ $ su
banking_test:/ # cd /data/data/com.bankingapp/shared_prefs/
banking_test:/data/data/com.bankingapp/shared_prefs # ls -la
-rw-rw---- 1 u0_a142 u0_a142 4096 Jun 15 06:30 user_prefs.xml
-rw-rw---- 1 u0_a142 u0_a142 512 Jun 15 06:30 user_prefs.xml.bak
Normal. But then I checked a crashed user’s device (we had remote debug access for some power users who opted in):
-rw------- 1 root root 4096 Jun 15 06:30 user_prefs.xml
-rw-rw---- 1 u0_a245 u0_a245 512 Jun 15 06:30 user_prefs.xml.bak
The file was owned by root! How the hell did that happen?
The Eureka Moment
I was staring at the migration code when it hit me. The clear() operation:
// Clear old SharedPreferences after successful migration
sharedPrefs.edit().clear().apply()
In certain Android versions (particularly custom ROMs popular in Vietnam like BKAV or Viettel), when you clear SharedPreferences while another process is reading them, the file gets recreated with wrong permissions. It’s a race condition that only happens under specific circumstances:
- User opens app (migration starts)
- Our background sync service also starts (reads SharedPreferences)
- Migration completes and calls clear()
- OS recreates the file but assigns wrong permissions
- Next access fails
But why only old users? Because they had accumulated lots of preferences, making the migration take longer, increasing the window for the race condition.
The Hot Fix
We needed a fix that could be deployed immediately. No time for elegant solutions. I wrote the hackiest code of my career:
class EmergencyPreferencesFix {
    fun fixPermissions(context: Context): Boolean {
        return try {
            // Try to access preferences normally
            val prefs = context.getSharedPreferences("user_prefs", Context.MODE_PRIVATE)
            prefs.getString("test", null)
            true
        } catch (e: Exception) {
            // If failed, try alternative approach
            tryAlternativeAccess(context)
        }
    }

    private fun tryAlternativeAccess(context: Context): Boolean {
        // Nuclear option: delete and recreate
        val prefsFile = File(context.filesDir.parent, "shared_prefs/user_prefs.xml")
        val backupFile = File(context.filesDir.parent, "shared_prefs/user_prefs.xml.bak")
        return try {
            // Try to read backup
            if (backupFile.exists() && backupFile.canRead()) {
                // Parse XML manually (yes, really)
                val prefs = parsePreferencesXml(backupFile)
                // Delete corrupted file
                prefsFile.delete()
                backupFile.delete()
                // Recreate with correct permissions
                val newPrefs = context.getSharedPreferences("user_prefs", Context.MODE_PRIVATE)
                val editor = newPrefs.edit()
                prefs.forEach { (key, value) ->
                    when (value) {
                        is String -> editor.putString(key, value)
                        is Int -> editor.putInt(key, value)
                        is Boolean -> editor.putBoolean(key, value)
                        is Float -> editor.putFloat(key, value)
                        is Long -> editor.putLong(key, value)
                    }
                }
                editor.apply()
                true
            } else {
                // Last resort: start fresh
                prefsFile.delete()
                backupFile.delete()
                // Create new preferences with default values
                initializeDefaultPreferences(context)
                true
            }
        } catch (e: Exception) {
            // If everything fails, at least log it
            FirebaseCrashlytics.getInstance().recordException(e)
            false
        }
    }

    private fun parsePreferencesXml(file: File): Map<String, Any> {
        // I'm not proud of this code
        val prefs = mutableMapOf<String, Any>()
        try {
            val content = file.readText()
            // Regex parsing XML because XmlPullParser wasn't working
            // (Yes, I know, please don't judge)
            val stringPattern = "<string name=\"(.+?)\">(.+?)</string>".toRegex()
            val intPattern = "<int name=\"(.+?)\" value=\"(.+?)\" />".toRegex()
            val boolPattern = "<boolean name=\"(.+?)\" value=\"(.+?)\" />".toRegex()
            stringPattern.findAll(content).forEach {
                prefs[it.groupValues[1]] = it.groupValues[2]
            }
            intPattern.findAll(content).forEach {
                prefs[it.groupValues[1]] = it.groupValues[2].toInt()
            }
            boolPattern.findAll(content).forEach {
                prefs[it.groupValues[1]] = it.groupValues[2].toBoolean()
            }
        } catch (e: Exception) {
            // Silent fail, we'll use defaults
        }
        return prefs
    }
}
I’m not proud of this code. Parsing XML with regex? Manually recreating SharedPreferences? It’s everything they tell you not to do. But it worked.
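For anyone tempted to copy the regex approach: here is a minimal sketch of the same recovery step done with a real XML parser. It is an illustration, not the code from the hot fix. It assumes the usual `<map>` layout that SharedPreferences writes to disk and uses the JVM's `javax.xml.parsers` API; the function name `parsePrefsXml` is hypothetical.

```kotlin
import java.io.ByteArrayInputStream
import javax.xml.parsers.DocumentBuilderFactory
import org.w3c.dom.Element

// Hypothetical helper: parse the <map> document that SharedPreferences
// writes to disk into a plain key/value map.
fun parsePrefsXml(xml: String): Map<String, Any> {
    val doc = DocumentBuilderFactory.newInstance()
        .newDocumentBuilder()
        .parse(ByteArrayInputStream(xml.toByteArray(Charsets.UTF_8)))

    val result = mutableMapOf<String, Any>()
    val children = doc.documentElement.childNodes
    for (i in 0 until children.length) {
        // Skip whitespace and any other non-element nodes
        val node = children.item(i) as? Element ?: continue
        val name = node.getAttribute("name")
        if (name.isEmpty()) continue

        val raw = node.getAttribute("value")
        val parsed: Any? = when (node.tagName) {
            "string" -> node.textContent // entities such as &amp; are decoded here
            "int" -> raw.toIntOrNull()
            "long" -> raw.toLongOrNull()
            "float" -> raw.toFloatOrNull()
            "boolean" -> when (raw) { "true" -> true; "false" -> false; else -> null }
            else -> null // unknown element types are ignored in this sketch
        }
        if (parsed != null) result[name] = parsed
    }
    return result
}
```

Unlike the regex version, this decodes escaped characters in string values and skips malformed numbers instead of throwing on them. A file that is not well-formed XML still throws, so the caller needs the same "fall back to defaults" path as before.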
The Deployment Drama
9:30 AM. We had a fix. But how do you deploy to 50,000 angry users who can’t even open the app?
Our solution was creative: we deployed a special version (3.7.1) that didn’t require login for the first screen. It would:
- Show a “Maintenance” message
- Run the fix in the background
- Auto-restart the app when fixed
But Google Play review takes hours, sometimes days. We couldn’t wait.
That’s when our Head of Engineering made the call: “Deploy through our CDN.”
We had an emergency update mechanism built into the app (for exactly this kind of situation) that could download and apply patches without going through the Play Store. It was meant for critical security fixes, but this qualified.
class EmergencyPatcher(private val context: Context) {
    // downloadPatch(), verifySignature() and restartApp() are helpers not shown here
    fun checkAndApplyPatch() {
        val patchUrl = "https://cdn.bankingapp.com/emergency/patch_3.7.1.jar"
        // Download patch
        val patchFile = downloadPatch(patchUrl)
        // Verify signature (CRITICAL for security)
        if (!verifySignature(patchFile)) {
            return
        }
        // Load patch using DexClassLoader
        val dexLoader = DexClassLoader(
            patchFile.absolutePath,
            context.cacheDir.absolutePath,
            null,
            this.javaClass.classLoader
        )
        // Replace broken class with patched version
        val patchedClass = dexLoader.loadClass("com.bankingapp.EmergencyPreferencesFix")
        val fixMethod = patchedClass.getMethod("fixPermissions", Context::class.java)
        // Apply fix (Class.newInstance() is deprecated, so go through the constructor)
        val fix = patchedClass.getDeclaredConstructor().newInstance()
        val result = fixMethod.invoke(fix, context) as Boolean
        if (result) {
            // Restart app
            restartApp()
        }
    }
}
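The `verifySignature` call carries most of the safety of that patcher, and its implementation is not shown in this post. As a minimal sketch of the idea, assuming a detached RSA signature is shipped alongside the patch and the matching public key is bundled with the app, the check could look like this. The function name `verifyPatchSignature`, its parameters, and the choice of `SHA256withRSA` are all assumptions for illustration.

```kotlin
import java.security.GeneralSecurityException
import java.security.PublicKey
import java.security.Signature

// Hypothetical helper: returns true only if signatureBytes is a valid
// SHA256withRSA signature over patchBytes for the given public key.
fun verifyPatchSignature(
    patchBytes: ByteArray,
    signatureBytes: ByteArray,
    publicKey: PublicKey
): Boolean =
    try {
        val verifier = Signature.getInstance("SHA256withRSA")
        verifier.initVerify(publicKey)
        verifier.update(patchBytes)
        verifier.verify(signatureBytes)
    } catch (e: GeneralSecurityException) {
        // A malformed signature or an unusable key counts as "not verified"
        false
    }
```

The important design choice is that every failure mode, including exceptions, ends in `false`, so a broken or tampered download can never be loaded by accident.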
By 10:15 AM, we pushed the patch. Within 30 minutes, crash rates started dropping.
The Clean-Up
By noon, 90% of affected users were fixed. But we still had 5,000 users whose apps were so broken they couldn’t even download the patch. For them, we had to get creative.
We sent SMS messages (we’re a bank, we have everyone’s phone number) with a link to download a standalone fixer app:
“BankingApp: We detected an issue with your app. Please install this fix: https://fix.bankingapp.com/repair”
The repair app was simple - it just needed permission to access the main app’s data directory and fix the permissions. Not elegant, but effective.
The Post-Mortem
Two days later, when everyone could breathe again, we had the post-mortem. The room was tense. I expected to be fired.
Instead, our CTO said something I’ll never forget: “This is the best mistake we’ve ever made.”
He explained: “We learned more about our system in these 6 hours than in the past year. We discovered:
- Our emergency patch system actually works
- Our monitoring needs improvement
- Our rollout process has gaps
- Our team can handle a crisis”
The lessons we implemented:
1. Staged Rollouts Are Not Enough
We were doing staged rollouts (1% → 5% → 20% → 100%), but over days. The issue manifested within hours. Now we have “canary periods” - 1% for at least 6 hours during peak usage before proceeding.
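To make the canary rule concrete, here is a minimal sketch of a promotion gate. The stage percentages and the 6-hour soak come from the process described above; the 99.5% crash-free threshold, the `CanaryStatus` type, and the use of `0` to mean "halt the rollout" are assumptions for illustration only.

```kotlin
// Snapshot of the current rollout stage (hypothetical type).
data class CanaryStatus(
    val rolloutPercent: Int,       // current stage, e.g. 1, 5, 20 or 100
    val peakHoursObserved: Double, // hours spent at this stage during peak usage
    val crashFreeRate: Double      // fraction of crash-free sessions, 0.0..1.0
)

val ROLLOUT_STAGES = listOf(1, 5, 20, 100)

// Returns the percentage the rollout should be at next.
// 0 means "halt": the release is too unhealthy to keep serving.
fun nextRolloutPercent(
    status: CanaryStatus,
    minPeakHours: Double = 6.0,
    minCrashFreeRate: Double = 0.995 // assumed threshold
): Int {
    if (status.crashFreeRate < minCrashFreeRate) return 0
    // Not enough peak-time soak yet: stay at the current stage
    if (status.peakHoursObserved < minPeakHours) return status.rolloutPercent
    // Healthy and soaked: move to the next stage, or stay at 100
    return ROLLOUT_STAGES.firstOrNull { it > status.rolloutPercent } ?: 100
}
```

With these defaults, a release at 1% with only two peak hours observed stays at 1%, and a release whose crash-free rate drops to 90% is halted regardless of how long it has soaked.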
2. Test on Real User Data
Our QA environment had clean data. Real users had years of accumulated cruft. We now have a “chaos testing” environment with data copied from production (anonymized, of course).
3. Race Conditions Are Everywhere
We added extensive synchronization around SharedPreferences operations:
object PreferenceManager {
    private val lock = Any()

    fun getPreferences(context: Context): SharedPreferences {
        synchronized(lock) {
            return context.getSharedPreferences("user_prefs", Context.MODE_PRIVATE)
        }
    }

    fun migratePreferences(context: Context) {
        synchronized(lock) {
            // Migration code here
        }
    }
}
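One variation worth sketching: let ordinary reads run in parallel and make only the migration exclusive. The example below models the preferences as an in-memory map so it runs anywhere; `GuardedPrefs` is a hypothetical class, not part of our codebase. Keep in mind that a `ReentrantReadWriteLock` only coordinates threads inside a single process, so on its own it would not stop a race with a separate process.

```kotlin
import java.util.concurrent.locks.ReentrantReadWriteLock
import kotlin.concurrent.read
import kotlin.concurrent.write

// Hypothetical in-memory stand-in for a preferences file.
class GuardedPrefs(initial: Map<String, Any> = emptyMap()) {
    private val lock = ReentrantReadWriteLock()
    private val values = initial.toMutableMap()

    // Many readers may hold the read lock at the same time
    fun get(key: String): Any? = lock.read { values[key] }

    // The migration takes the write lock, so no reader sees a half-migrated state
    fun <T> migrate(block: (MutableMap<String, Any>) -> T): T =
        lock.write { block(values) }
}
```

A migration then looks like `prefs.migrate { it["theme_v2"] = it.getValue("theme"); it.remove("theme") }`, and any `get` call issued meanwhile waits until the block finishes.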
4. Always Have a Rollback Plan
“We can’t roll back” should never be the answer. We now maintain compatibility layers:
class PreferencesCompat {
    fun getValue(key: String): Any? {
        return try {
            // Try new DataStore
            getFromDataStore(key)
        } catch (e: Exception) {
            try {
                // Fallback to SharedPreferences
                getFromSharedPreferences(key)
            } catch (e: Exception) {
                // Return default
                getDefaultValue(key)
            }
        }
    }
}
The Human Cost
What the post-mortem didn’t capture was the human side. Our customer support team received over 10,000 calls that morning. One support agent, Linh, told me she had an elderly customer crying on the phone because he thought his money was gone.
That hit hard. For us, it was a technical problem. For users, it was their life savings apparently vanishing.
I personally called 50 affected users to apologize. Most were understanding. One businessman said he missed a critical transfer and lost a deal. We compensated him, but you can’t really compensate for lost opportunities.
The Silver Lining
Three months later, something interesting happened. We had another production issue - a third-party service went down. But this time, we were ready. The emergency response plan kicked in:
- Alert triggered within 30 seconds
- War room assembled in 5 minutes
- Root cause identified in 15 minutes
- Fix deployed through emergency channel in 45 minutes
- Full resolution in under 2 hours
The muscle memory from that horrible morning had turned into institutional knowledge.
What I Really Learned
Technical lessons aside, that morning taught me some fundamental truths:
1. Humility: No matter how much you test, production will surprise you. Stay humble.
2. Communication: During the crisis, clear communication saved us. We over-communicated - Slack, email, SMS, even phone calls.
3. Team: Minh, the junior who suggested checking permissions, got promoted. Good ideas can come from anywhere.
4. Users First: Every technical decision has human consequences. Those 50,000 crashes were 50,000 people unable to access their money.
5. Post-Mortems Are Not Blame Games: Our blameless post-mortem culture meant we could be honest about what went wrong.
The Code That Haunts Me
You know what the real fix was? The one we deployed in version 3.8.0 after proper testing?
class SafePreferencesMigration {
    suspend fun migrate(context: Context) {
        // Don't clear immediately
        val oldPrefs = context.getSharedPreferences("user_prefs", Context.MODE_PRIVATE)
        val newPrefs = context.getSharedPreferences("user_prefs_v2", Context.MODE_PRIVATE)
        // Migrate to NEW file
        oldPrefs.all.forEach { (key, value) ->
            when (value) {
                is String -> newPrefs.edit().putString(key, value).apply()
                is Int -> newPrefs.edit().putInt(key, value).apply()
                is Boolean -> newPrefs.edit().putBoolean(key, value).apply()
                // ... other types
            }
        }
        // Keep old file around for 30 days as backup
        // Mark it as migrated
        oldPrefs.edit().putBoolean("MIGRATED_TO_V2", true).apply()
    }
}
That’s it. Use a different file name. Don’t delete the old one immediately. Such a simple solution that would have prevented everything.
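The same pattern can be modelled without Android at all, which makes it easy to unit test. In this minimal sketch, two mutable maps stand in for the old and new preference files. The helper names `migrateKeepingBackup` and `backupExpired` are hypothetical; the `MIGRATED_TO_V2` flag and the 30-day retention come from the code above.

```kotlin
import java.util.concurrent.TimeUnit

val MIGRATED_FLAG = "MIGRATED_TO_V2"

// Copies everything from oldPrefs into newPrefs without touching the old data.
// Returns true when newPrefs is safe to use. Running it twice is harmless.
fun migrateKeepingBackup(
    oldPrefs: MutableMap<String, Any>,
    newPrefs: MutableMap<String, Any>
): Boolean {
    // Already migrated: do not overwrite values written since then
    if (oldPrefs[MIGRATED_FLAG] == true) return true

    val snapshot = oldPrefs.toMap()
    newPrefs.putAll(snapshot)

    // Verify the copy before marking anything as done
    val complete = snapshot.all { (key, value) -> newPrefs[key] == value }
    if (!complete) return false

    // Only now flag the old store; its data stays in place as a backup
    oldPrefs[MIGRATED_FLAG] = true
    return true
}

// The backup becomes eligible for deletion once the retention period has passed.
fun backupExpired(
    migratedAtMillis: Long,
    nowMillis: Long,
    retentionDays: Long = 30
): Boolean = nowMillis - migratedAtMillis >= TimeUnit.DAYS.toMillis(retentionDays)
```

The ordering is the whole point: copy, verify, mark, and only much later delete. At no step is the old data the only thing standing between a user and a login screen.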
One Year Later
I still wake up sometimes at 6:47 AM with a spike of anxiety. It’s like PTSD for developers. But I’m also grateful for that morning. It transformed me from a developer who wrote code to an engineer who understood systems.
We now have a tradition. Every June 15th at 6:47 AM, the team that was there that morning meets for phở. We call it “Crash Day.” We share war stories, laugh about the regex XML parser, and remind ourselves that we survived.
Last Crash Day, Minh (now a senior engineer) raised his beer and said, “To the crashes that make us better engineers.”
I’ll drink to that.
Epilogue
That emergency patch system we used? It’s now a core feature. We can push critical fixes to users within minutes. It’s saved us three times since then.
The regex XML parser? It’s still in the codebase. There’s a comment above it:
/**
 * DO NOT REMOVE THIS CODE
 * Yes, it's horrible. Yes, it parses XML with regex.
 * But it saved 50,000 users on June 15, 2021.
 * Sometimes, bad code that works is better than good code that doesn't.
 *
 * If you must refactor this, please test with:
 * - Corrupted XML files
 * - Files with root permissions
 * - Files with special characters in values
 * - Files larger than 5MB
 * - Files that are currently being written to
 *
 * May the force be with you.
 */
It’s a monument to that morning. A reminder that perfect is the enemy of good, especially at 6:47 AM with 50,000 users locked out of their banking app.
Would I do anything differently? Absolutely. Would I trade the experience? Never.
If you’re dealing with a production crisis right now, remember: breathe, communicate, and focus on the users. The code can be fixed. The architecture can be improved. But user trust, once lost, is hard to regain.
And always, ALWAYS, test your SharedPreferences migrations.