How many Exceptions can a single request generate?

Goal:

Filter the facility visits returned by the Census endpoint.

Deployment #1

Troubling Symptoms

  • Database degradation
  • Out of memory errors!
  • The hard drives are full!?

Analyzing

Errors in Facility Config

Can’t get Facility null

This connection is closed.

Transaction cannot proceed: STATUS ROLLBACK

Abort of action invoked while multiple threads active within it.

The CDI Rabbit Hole

Hypothesis

Something is being improperly shared between CDI and Static classes!

We wanted it to work like this:

Desired Categorizer Behavior

sequenceDiagram
  participant StaticRequest
  participant StaticCategorizer
  participant CDICategorizer
  participant CdiRequest
  activate StaticCategorizer
  activate CDICategorizer
  StaticRequest->>StaticCategorizer: Categorize this
  activate StaticRequest
  StaticCategorizer-->>StaticRequest: Categorized
  deactivate StaticRequest
  CdiRequest->>CDICategorizer: Categorize this
  activate CdiRequest
  CDICategorizer-->>CdiRequest: Categorized
  deactivate CdiRequest
  StaticRequest->>StaticCategorizer: Categorize this
  activate StaticRequest
  StaticCategorizer-->>StaticRequest: Categorized
  deactivate StaticRequest
  CdiRequest->>CDICategorizer: Categorize this
  activate CdiRequest
  CDICategorizer-->>CdiRequest: Categorized
  deactivate CdiRequest
  deactivate StaticCategorizer
  deactivate CDICategorizer

But instead it worked like this:

Bad Categorizer Behavior

sequenceDiagram
  participant StaticRequest
  participant StaticCategorizer
  participant CDICategorizer
  participant CdiRequest
  activate StaticCategorizer
  StaticRequest->>StaticCategorizer: Categorize this
  activate StaticRequest
  StaticCategorizer-->>StaticRequest: Categorized
  deactivate StaticRequest
  Note over CDICategorizer: Never Used!
  CdiRequest->>StaticCategorizer: Categorize this
  activate CdiRequest
  StaticCategorizer-->>CdiRequest: Categorized
  deactivate StaticCategorizer
  deactivate CdiRequest
  Note left of StaticCategorizer: CDI cleanup
  StaticRequest->>StaticCategorizer: Categorize this
  activate StaticRequest
  StaticCategorizer-->>StaticRequest: NPE
  deactivate StaticRequest
  CdiRequest->>StaticCategorizer: Categorize this
  activate CdiRequest
  StaticCategorizer-->>CdiRequest: NPE
  deactivate CdiRequest

Because the dependencies looked like this:

graph TB
  subgraph
    subgraph
      StaticRequest-->StaticDependencyA
      StaticRequest-->StaticDependencyB
      StaticDependencyB-->StaticConfigAccess
      StaticConfigAccess-->threadLocalConnection
      StaticConfigAccess-.->threadLocalCdiConnection
    end
    InjectedDependencyA
    subgraph CdiRequest
      CdiRequest-->InjectedDependencyB
      CdiRequest-->InjectedDependencyA
    end
    InjectedDependencyB-->StaticConfigAccess
    InjectedConfigAccess-->threadLocalConnection
  end

But forget all that!

Deployment #2

Deploying

. . .

No immediate failures!

Need to roll back for a different, unrelated issue

“Well at least it ran cleanly before the rollback!”

. . .

“The load on the web monoliths is skyrocketing” - Doug

The Actual Problem

After adding some logging…

DB Activity on Census page

1 visit :  2 DB queries

2 visits:  4 DB queries

3 visits:  27 DB queries

4 visits: ~100 DB queries

5 visits: ~250 DB queries

Without my code, the log volume dropped, but the logs were still very active.

Divergent Expectations

Example image

Example image

Example image

How bad could it be in practice?

10 Visits

2,000 Queries

100 Visits

2,000,000 Queries

5,000 Visits

250,000,000,000 Queries

10,000 Visits

2,000,000,000,000 Queries


1 microsecond per stacktrace

~23 days
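
The figures above are consistent with roughly 2·n³ queries for n visits (2 × 10³ = 2,000 and 2 × 10,000³ = 2 × 10¹²). The sketch below reproduces the slide numbers under that fitted formula, assuming one stack trace per query at the quoted 1 microsecond each. The class and method names are illustrative only and are not taken from the real code base.

```java
import java.time.Duration;

class QueryGrowth {
    // Assumed model fitted to the slide figures: about 2 * n^3 queries for n visits.
    static long estimatedQueries(long visits) {
        return 2L * visits * visits * visits;
    }

    // Time spent if every query produces one stack trace costing 1 microsecond.
    static Duration stackTraceTime(long queries) {
        return Duration.ofNanos(queries * 1_000L);
    }

    public static void main(String[] args) {
        for (long visits : new long[] {10, 100, 5_000, 10_000}) {
            long queries = estimatedQueries(visits);
            System.out.printf("%,d visits -> %,d queries, %d full days of stack traces%n",
                    visits, queries, stackTraceTime(queries).toDays());
        }
    }
}
```

For 10,000 visits this gives 2 × 10¹² microseconds, which is 2,000,000 seconds, or a little over 23 days.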

n^2

Predicate<T>

Function that takes a single T and returns a boolean
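
A minimal sketch of `java.util.function.Predicate` in use, including the `or` combinator that appears later in the calling code. The predicates themselves are made-up examples.

```java
import java.util.function.Predicate;

class PredicateBasics {
    // A predicate is just a function from T to boolean.
    static final Predicate<String> isBlank = s -> s.trim().isEmpty();
    static final Predicate<String> isLong = s -> s.length() > 10;

    // Predicates compose: or() builds a new predicate from two existing ones.
    static final Predicate<String> isBlankOrLong = isBlank.or(isLong);

    public static void main(String[] args) {
        System.out.println(isBlank.test("   "));
        System.out.println(isBlankOrLong.test("hi"));
        System.out.println(isBlankOrLong.test("hello world!"));
    }
}
```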

The Predica(te)ment

List<Entity> unfilteredItems;

Predicate<Entity> requiresPermission =
  (entity) -> 
    logic.sensitiveFields(unfilteredItems).matches(entity);

List<Entity> privateData =
    unfilteredItems.stream()
      .filter(requiresPermission)
      .collect(Collectors.toList());

The Predica(te)ment

List<Entity> privateData(List<Entity> unfilteredItems) {

  Predicate<Entity> requiresPermission =
    (entity) -> 
      logic.sensitiveFields(unfilteredItems).matches(entity);

  return unfilteredItems.stream()
      .filter(requiresPermission)
      .collect(Collectors.toList());
}

The Predica(te)ment

             privateData(List<Entity> unfilteredItems) {


    (entity) -> 
      logic.sensitiveFields(unfilteredItems)




}

What I said

Call sensitiveFields for each item. Use it once.


What I meant

Call sensitiveFields once. Use it for all entities.

What I said

Predicate<Entity> requiresPermission =
  (entity) -> 
    logic.sensitiveFields(unfilteredItems).matches(entity);

What I meant

Predicate<Entity> requiresPermission =
    logic.sensitiveFields(unfilteredItems)::matches;

Could also be:

Validator validator = 
  logic.sensitiveFields(unfilteredItems);

Predicate<Entity> requiresPermission =
  (entity) -> 
    validator.matches(entity);
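
The difference between the two forms can be made visible by counting calls. In the sketch below, `sensitiveFields` and `Validator` are hypothetical stand-ins for the real `logic.sensitiveFields(...)` and its return type, with a counter added for illustration. In the lambda form the lookup sits in the lambda body, so it runs on every `test` call. In the bound method reference form the receiver expression is evaluated once, when the predicate is created.

```java
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Predicate;
import java.util.stream.Collectors;

class PredicateCapture {
    // Hypothetical stand-in for the type returned by logic.sensitiveFields(...).
    interface Validator {
        boolean matches(String entity);
    }

    // Counts how often the (expensive) lookup runs.
    static final AtomicInteger calls = new AtomicInteger();

    // Hypothetical stand-in for logic.sensitiveFields(...); the matching rule is invented.
    static Validator sensitiveFields(List<String> items) {
        calls.incrementAndGet();
        return entity -> entity.startsWith("private");
    }

    // "What I said": the lookup is inside the lambda body, so it runs once per element.
    static int callsWithLambda(List<String> items) {
        calls.set(0);
        Predicate<String> requiresPermission =
                entity -> sensitiveFields(items).matches(entity);
        items.stream().filter(requiresPermission).collect(Collectors.toList());
        return calls.get();
    }

    // "What I meant": the receiver of the method reference is evaluated once, up front.
    static int callsWithMethodReference(List<String> items) {
        calls.set(0);
        Predicate<String> requiresPermission = sensitiveFields(items)::matches;
        items.stream().filter(requiresPermission).collect(Collectors.toList());
        return calls.get();
    }

    public static void main(String[] args) {
        List<String> items = Arrays.asList("private-a", "public-b", "private-c");
        System.out.println("lambda: " + callsWithLambda(items) + " lookups");
        System.out.println("method reference: " + callsWithMethodReference(items) + " lookups");
    }
}
```

With three items, the lambda form performs three lookups and the method reference form performs one.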

n^3

Calling code

List<Entity> endpoint(List<Entity> bulkItems) {

  Predicate<Entity> otherPredicate = ...

  Predicate<Entity> primaryRequirement =
    (entity) -> 
      logic.allowed(bulkItems).contains(entity);

  return bulkItems.stream()
      .filter(primaryRequirement.or(otherPredicate))
      .collect(Collectors.toList());
}

Calling code

             endpoint(List<Entity> bulkItems) {




    (entity) -> 
      logic.allowed(bulkItems)




}
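
One plausible way the nesting reaches cubic cost, sketched with counters. Everything here is an assumption made for illustration: that `allowed` internally runs the per-element predicate from the previous slides, and that each `sensitiveFields` call costs one query per item. The real implementations are not shown in the slides, and the constant factor seen in the logs is not modelled.

```java
import java.util.List;
import java.util.concurrent.atomic.AtomicLong;
import java.util.function.Predicate;
import java.util.stream.Collectors;
import java.util.stream.IntStream;

class NestedPredicates {
    static final AtomicLong queries = new AtomicLong();

    // Assumption: building the check costs one query per item (n).
    static Predicate<Integer> sensitiveFields(List<Integer> items) {
        queries.addAndGet(items.size());
        return item -> item % 2 == 0; // invented rule, only here to have something to filter on
    }

    // Inner layer: rebuilds the check for every element (n * n).
    static List<Integer> allowed(List<Integer> items) {
        Predicate<Integer> requiresPermission =
                item -> sensitiveFields(items).test(item);
        return items.stream().filter(requiresPermission).collect(Collectors.toList());
    }

    // Outer layer: re-runs the whole inner layer for every element (n * n * n).
    static List<Integer> endpoint(List<Integer> items) {
        Predicate<Integer> primaryRequirement =
                item -> allowed(items).contains(item);
        return items.stream().filter(primaryRequirement).collect(Collectors.toList());
    }

    // Runs the endpoint on n items and reports how many queries were counted.
    static long queriesFor(int n) {
        queries.set(0);
        List<Integer> items = IntStream.range(0, n).boxed().collect(Collectors.toList());
        endpoint(items);
        return queries.get();
    }

    public static void main(String[] args) {
        for (int n : new int[] {1, 5, 10}) {
            System.out.println(n + " items -> " + queriesFor(n) + " queries");
        }
    }
}
```

Under these assumptions 5 items cost 125 queries and 10 items cost 1,000: each layer that passes the whole collection into a per-element lambda multiplies the work by n.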

But

  • Why was it so hard to find?
  • How was it able to kill entire machines?
  • Why didn’t the existing code fail?

Exception, what Exception?

2 methods below my code queried the DB

FacilityLogic

Facility get(String facilityId) {
  try {
    return DB.getFacility(facilityId);
  } catch (HorribleDbException ex) {
    log.error("Error getting facility");
    return null; // Everything is totally fine!
  }
}

FacilityConfigLogic

String configByName(Facility facility, String configName) {
  try { // facility is null!
    return DB.getConfig(facility, configName); 
  } catch (NullPointerException ex) {
    log.error("Can't get value for facility", facility); 
    return null; // Just keep swimming
  } 
}
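
For contrast, a minimal sketch of a version that gives up instead of returning `null`. `Db`, `Facility`, and `HorribleDbException` are hypothetical stand-ins for the types on the slides, and wrapping in `IllegalStateException` is just one reasonable choice, not necessarily what the real fix did.

```java
class FailFastFacilityLogic {
    // Hypothetical stand-ins for the types used on the slides.
    static class Facility {
        final String id;
        Facility(String id) { this.id = id; }
    }

    static class HorribleDbException extends Exception {
        HorribleDbException(String message) { super(message); }
    }

    interface Db {
        Facility getFacility(String facilityId) throws HorribleDbException;
    }

    private final Db db;

    FailFastFacilityLogic(Db db) {
        this.db = db;
    }

    Facility get(String facilityId) {
        try {
            return db.getFacility(facilityId);
        } catch (HorribleDbException ex) {
            // Keep the cause and stop here: callers never receive a null Facility,
            // so nothing downstream has to catch a NullPointerException.
            throw new IllegalStateException("Error getting facility " + facilityId, ex);
        }
    }
}
```

The request now fails at the first database error with the original cause attached, rather than continuing with a `null` facility and producing a second, misleading exception further down.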

Is your code acting like a pig?

Example image5

Be more like a Koala

Example image5

Die when you eat the wrong thing!

Take-aways!

Don’t pass collections into lambdas that will be applied to that same collection.

(Don’t nest loops.)

Accept defeat when you encounter a fatal Exception.

(Don’t swallow Exceptions)

If you don’t have a test that can trigger the original problem, you don’t truly know what’s wrong.

(Create a failing test before writing the “fix”)